Designing Health Endpoints – RFC v.1.4

Change Log

changelog:
  - version: 0.1
    description: 
      - Initial draft released, covering core design principles for health endpoints in Azure Functions.
      - Provided basic implementation examples, including simple health checks for application status and dependency availability.
      - Introduced fundamental best practices for security such as API key validation and performance optimisation like dependency check timeouts.
      - Included initial guidelines for response structure and status codes with recommendations for RESTful API design.
      - Discussed basic integration with Azure Application Insights for logging health check events.
  - version: 1.2
    description: 
      - Added OAuth 2.0 integration using Azure Active Directory to secure health endpoints and prevent unauthorised access.
      - Expanded error-handling strategies to include detailed error messages for each dependency failure like databases and external APIs.
      - Introduced API mocking and test strategies for simulating dependency failures, enabling earlier testing in development and CI/CD pipelines.
      - Enhanced the documentation section to provide clearer examples of health endpoint versioning and integration with third-party monitoring systems.
  - version: 1.3
    description: 
      - Introduced advanced content-based routing, enabling health endpoints to dynamically select backends based on real-time conditions.
      - Expanded the response caching strategies, allowing health endpoint results to be cached to reduce the load on external dependencies.
      - Added support for token-based authentication using third-party identity providers such as Auth0 and Okta for securing health endpoints.
      - Enhanced performance optimisation guidelines by introducing techniques for minimising the response time of health checks in high-traffic environments.
      - Added example of an Azure Functions Health Endpoint implementation.
  - version: 1.4
    description: 
      - Strengthened API security by introducing mutual TLS (mTLS) for verifying client-server communication in sensitive environments.
      - Improved quota management and rate limiting policies, enabling better control of how often health endpoints can be accessed.
      - Expanded on distributed tracing using Azure Monitor and Application Insights, providing guidance for tracing individual health check requests across microservices.
      - Added a section on real-time monitoring and proactive alerting, focusing on setting up dynamic alerts for health endpoint failures or performance degradation.

1. Introduction

This document explores the design and implementation of health endpoints using Azure Functions, providing best practices, code examples, and real-world integration patterns with Azure Application Insights. Health endpoints are critical for maintaining the availability and reliability of cloud-native applications. They allow developers and operations teams to monitor system health, dependencies, and external services in real-time.

This RFC describes the practical implementation of a health endpoint in an Azure Function, how to structure health responses, and how to monitor these endpoints using Application Insights. The focus will be on cloud-native applications, microservices, and distributed systems where health endpoints are essential for ensuring high uptime and proactive issue detection. Health endpoints will also be positioned within CI/CD pipelines, automated testing, and real-time alerting.


2. Purpose

The primary goal of this RFC is to establish a comprehensive framework for designing, implementing, and monitoring health endpoints within Azure Functions. The secondary goal is to explain how Azure Application Insights can be leveraged to monitor these health endpoints, track their telemetry, and set up real-time alerts for any failures or performance degradations.

In this RFC, we aim to provide development teams with practical guidance on building robust, secure, and scalable health endpoints. The health endpoint design presented here will account for dependency checks, real-time telemetry, and API observability. With modern DevOps practices in mind, this RFC also highlights how health endpoints can be integrated into CI/CD pipelines for automatic validation during deployments.


3. Definitions

  • Health Endpoint: An API endpoint that provides real-time information about the operational status of an application or its dependencies.
  • Liveness Probe: An API endpoint that verifies whether an application is running and responsive; a failing liveness probe typically signals the platform to restart the instance.
  • Readiness Probe: An API endpoint that verifies whether an application is ready to serve requests.
  • Azure Application Insights: A service within Azure Monitor that provides monitoring, telemetry, and alerting for cloud-native applications.
  • CI/CD: Continuous Integration and Continuous Deployment pipelines that automate testing, building, and deploying applications.

4. Health Endpoint Architecture

A health endpoint in Azure Functions typically operates as an HTTP-triggered function. The endpoint is responsible for checking the availability and health of the application itself as well as any external dependencies. The architecture should be flexible enough to support various types of health checks, including database connections, external API calls, and message queue monitoring.

Example of Health Endpoint Architecture:

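[FunctionName("HealthCheck")]
public static IActionResult Run(
    // The Functions host prepends the default "api" route prefix, so this is served at /api/v1/health.
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/health")] HttpRequest req,
    ILogger log)
{
    var healthy = CheckDatabase() && CheckExternalAPI();
    var body = new { status = healthy ? "healthy" : "unhealthy", timestamp = DateTime.UtcNow };

    // Return 503 when unhealthy so monitors can rely on the status code alone.
    return healthy ? new OkObjectResult(body) : new ObjectResult(body) { StatusCode = 503 };
}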

private static bool CheckDatabase()
{
    // Simulate database health check
    return true;
}

The example above illustrates a health endpoint that checks both the database and an external API (CheckExternalAPI is shown in section 20). The response is structured as JSON and reports the application's health at the time of the request.


5. HTTP Method and Routing Considerations

Health endpoints should adhere to RESTful design principles and typically use HTTP GET, since health checks are read-only operations. A common convention is to expose them on a /health route, possibly with a version segment for backward compatibility, which makes the endpoint discoverable and easy to integrate with monitoring systems. Note that the Azure Functions host prepends a route prefix (api by default, configurable in host.json), so a Route value of v1/health is served at /api/v1/health.

Example of RESTful Routing for Health Endpoints:

[FunctionName("HealthCheckV1")]
public static IActionResult HealthCheckV1(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/health")] HttpRequest req) // served at /api/v1/health
{
    return new OkObjectResult(new { status = "healthy" });
}

[FunctionName("HealthCheckV2")]
public static IActionResult HealthCheckV2(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v2/health")] HttpRequest req) // served at /api/v2/health
{
    return new OkObjectResult(new { status = "healthy", version = "v2" });
}

By versioning the health endpoints (/api/v1/health vs /api/v2/health), teams can maintain backward compatibility while introducing new features or improvements to the health checks.


6. Structuring the Response

The health endpoint response should be structured in a way that makes it easy for monitoring systems to parse and understand the health of the application. The response typically includes the overall status of the application, a timestamp, and any relevant information about dependencies.

Example of a Structured JSON Response:

{
  "status": "unhealthy",
  "timestamp": "2024-10-08T12:00:00Z",
  "details": {
    "database": "healthy",
    "externalAPI": "unhealthy"
  }
}

In this example, the health endpoint response includes the overall status of the system, the current timestamp, and details about specific dependencies such as the database and external APIs.
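
One way to keep this shape consistent across functions is to build it from a small typed model. The following is a minimal sketch using System.Text.Json; the HealthReport type and its member names are assumptions made for illustration and are not part of any Azure SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical model mirroring the JSON structure shown above.
public sealed class HealthReport
{
    public string Status { get; }
    public DateTime Timestamp { get; }
    public IReadOnlyDictionary<string, string> Details { get; }

    public HealthReport(IReadOnlyDictionary<string, bool> checks, DateTime timestampUtc)
    {
        // The overall status is healthy only if every dependency is healthy.
        Status = checks.Values.All(ok => ok) ? "healthy" : "unhealthy";
        Timestamp = timestampUtc;
        Details = checks.ToDictionary(c => c.Key, c => c.Value ? "healthy" : "unhealthy");
    }

    // 200 when healthy, 503 otherwise; excluded from the JSON body.
    [JsonIgnore]
    public int HttpStatusCode => Status == "healthy" ? 200 : 503;

    // Serialises property names in camelCase; dictionary keys are left as given.
    public string ToJson() =>
        JsonSerializer.Serialize(this, new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase
        });
}
```

A function could then return the serialised report together with HttpStatusCode, so that monitoring systems can rely on either the body or the status code.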


7. Health Checks vs Liveness/Readiness Probes

Health checks, liveness probes, and readiness probes serve different purposes in cloud-native systems. While health checks provide a general overview of the system’s health, liveness probes focus on ensuring the application is running and not stuck. Readiness probes verify if the application is ready to handle incoming traffic.

Example of Liveness Probe:

[FunctionName("LivenessProbe")]
public static IActionResult RunLivenessProbe(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/liveness")] HttpRequest req) // served at /api/v1/liveness
{
    return new OkResult(); // Returns 200 OK if the service is alive
}

Example of Readiness Probe:

[FunctionName("ReadinessProbe")]
public static IActionResult RunReadinessProbe(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/readiness")] HttpRequest req) // served at /api/v1/readiness
{
    return CheckDatabase() ? new OkResult() : new StatusCodeResult(503); // 503 if not ready
}

8. Handling Dependencies in Health Endpoints

A well-designed health endpoint should also check the health of the system’s dependencies, such as databases, external APIs, or message queues. These checks ensure that the application is not only running but also able to connect to its required services.

Example of Dependency Checks in a Health Endpoint:

[FunctionName("HealthCheckWithDependencies")]
public static IActionResult Run(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/health")] HttpRequest req)
{
    var isDatabaseHealthy = CheckDatabase();
    var isExternalApiHealthy = CheckExternalAPI();
    var isHealthy = isDatabaseHealthy && isExternalApiHealthy;

    var body = new {
        status = isHealthy ? "healthy" : "unhealthy",
        dependencies = new {
            database = isDatabaseHealthy ? "healthy" : "unhealthy",
            externalAPI = isExternalApiHealthy ? "healthy" : "unhealthy"
        }
    };

    // 200 when every dependency is healthy, 503 otherwise.
    return isHealthy ? new OkObjectResult(body) : new ObjectResult(body) { StatusCode = 503 };
}

This example demonstrates how to include checks for multiple dependencies (e.g., databases, external APIs) in a single health endpoint.


9. Advanced Dependency Checks

In some cases, a simple “healthy” or “unhealthy” status might not be sufficient for understanding the health of an external dependency. For more advanced checks, the health endpoint can return additional metrics or performance data related to the dependency, such as response times or error rates.

Example of Advanced Dependency Check:

// Requires: using System.Diagnostics;
private static bool CheckDatabaseWithMetrics(out double responseTimeMs)
{
    var stopwatch = Stopwatch.StartNew();
    bool healthy;
    try
    {
        // Time the real connectivity check (CheckDatabase is shown in section 19).
        healthy = CheckDatabase();
    }
    catch (Exception)
    {
        healthy = false;
    }

    // Report the measured time whether the check passed or failed.
    responseTimeMs = stopwatch.Elapsed.TotalMilliseconds;
    return healthy;
}

This function adds an extra layer of detail by including the response time of a dependency, which can be returned in the health check response for more granular monitoring.


10. Azure Application Insights Integration

Azure Application Insights allows you to monitor the health of your API and its dependencies in real time. By adding custom telemetry to your health endpoint, you can track when the health status changes, identify performance bottlenecks, and set up alerts for specific failures.

Example of Logging Custom Telemetry:

_logger.LogInformation($"Health check performed. Status: {status}. Timestamp: {DateTime.UtcNow}");

When this telemetry is sent to Application Insights, it allows you to monitor the health of your system via dashboards and alerts in the Azure Portal.


11. Custom Telemetry for Health Checks

Custom telemetry in health endpoints can track specific events, such as when a database becomes unavailable or when an external API response time exceeds a threshold. These events can be logged and monitored in Azure Application Insights for further analysis.

Example of Logging Custom Telemetry for Dependency:

_logger.LogInformation($"Database health: {databaseStatus}. Response time: {responseTimeMs} ms.");

This telemetry can be queried in Application Insights to create detailed reports and identify patterns of degradation or failure in system dependencies.


12. Performance Monitoring with Application Insights

By integrating health endpoints with Azure Application Insights, you can monitor the performance of your health checks in real time. This includes tracking the response times of dependencies, detecting slowdowns, and identifying patterns of failure.

Example of Querying Performance Metrics in Application Insights:

requests
| where timestamp > ago(1h)
| where name == "HealthCheck"
| summarize avg(duration) by bin(timestamp, 1m)

This query returns the average response time of the health check endpoint over the last hour, allowing you to monitor performance trends.


13. Real-Time Alerts and Failure Detection

One of the key benefits of using Azure Application Insights with health endpoints is the ability to set up real-time alerts. For example, you can create an alert that triggers if the health endpoint returns a 500 Internal Server Error, or if the response time of a specific dependency exceeds a threshold.

Example of Setting Up an Alert for Health Endpoint Failures:

  1. Go to Application Insights in the Azure Portal.
  2. Navigate to the Alerts section.
  3. Set up a new alert rule based on a specific condition (e.g., when the health check returns a 500 status).

This ensures that your team is immediately notified when critical issues arise.


14. Health Endpoint Security Concerns

Health endpoints can expose sensitive information about the internal workings of your application, such as the status of databases or external APIs. For this reason, health endpoints should be secured, especially in production environments. Access can be restricted using API keys, OAuth tokens, or by limiting access to specific IP addresses.

Example of Securing a Health Endpoint:

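[FunctionName("SecureHealthCheck")]
public static IActionResult Run(
    // AuthorizationLevel.Function makes the host reject requests that lack a valid function key.
    [HttpTrigger(AuthorizationLevel.Function, "get", Route = "v1/health")] HttpRequest req)
{
    // Optional second check; ValidateApiKey is an application-defined helper, not an Azure Functions API.
    if (!ValidateApiKey(req))
    {
        return new UnauthorizedResult();
    }

    return new OkObjectResult(new { status = "healthy" });
}

This implementation restricts access in two layers: the function key enforced by the Functions host (sent in the x-functions-key header or the code query parameter), and an application-level API key check. Together they ensure that only authorized clients can query the health of the system.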
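
The ValidateApiKey helper in the example above is not defined in this document. The following is a minimal sketch of the comparison it might perform, assuming the caller has already read the presented key from the request (for example from a hypothetical x-api-key header) and the expected key from application settings. The ApiKeyValidator name is an assumption. Both values are hashed first so that the constant-time comparison always sees inputs of equal length.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; not part of the Azure Functions SDK.
public static class ApiKeyValidator
{
    // Compares the presented key with the expected key in constant time,
    // so that response timing does not reveal how many characters matched.
    public static bool IsValid(string presentedKey, string expectedKey)
    {
        if (string.IsNullOrEmpty(presentedKey) || string.IsNullOrEmpty(expectedKey))
        {
            return false;
        }

        using (var sha256 = SHA256.Create())
        {
            byte[] presented = sha256.ComputeHash(Encoding.UTF8.GetBytes(presentedKey));
            byte[] expected = sha256.ComputeHash(Encoding.UTF8.GetBytes(expectedKey));
            return CryptographicOperations.FixedTimeEquals(presented, expected);
        }
    }
}
```

A plain string equality check would also work functionally; the constant-time variant is a defensive choice for endpoints reachable from untrusted networks.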


15. API Versioning for Health Endpoints

API versioning is important for health endpoints, particularly when rolling out changes that could impact monitoring systems. By versioning the health endpoint, you ensure that existing systems continue to function as expected, while newer systems can take advantage of improved health checks.

Example of API Versioning in Health Endpoints:

[FunctionName("HealthCheckV1")]
public static IActionResult RunV1(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/health")] HttpRequest req) // served at /api/v1/health
{
    return new OkObjectResult(new { status = "healthy" });
}

[FunctionName("HealthCheckV2")]
public static IActionResult RunV2(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v2/health")] HttpRequest req) // served at /api/v2/health
{
    return new OkObjectResult(new { status = "healthy", version = "v2", timestamp = DateTime.UtcNow });
}

Versioning allows for backward compatibility, ensuring that older systems are not disrupted by changes to the health endpoint.


16. Error Handling in Health Endpoints

Health endpoints must handle errors gracefully, particularly when one or more dependencies are unavailable. Instead of simply returning an error, the health endpoint should provide detailed information about which component failed and why.

Example of Error Handling in a Health Endpoint:

try
{
    var isDatabaseHealthy = CheckDatabase();
    if (!isDatabaseHealthy)
    {
        throw new Exception("Database connection failed");
    }
}
catch (Exception ex)
{
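    // Pass the exception itself so the stack trace is captured, and use a message template for structured logging.
    _logger.LogError(ex, "Health check failed: {Reason}", ex.Message);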
    return new StatusCodeResult(500);
}

In this example, a detailed error message is logged, and the health check returns a 500 Internal Server Error to indicate failure.
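
To report which component failed and why, each check can be run in isolation and its failure reason recorded. The following is a minimal sketch under that assumption; the DependencyChecker type and the reason strings are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for collecting per-dependency failure reasons.
public static class DependencyChecker
{
    // Runs each named check and records a reason for every failure.
    // A check that throws is treated as unhealthy instead of crashing the endpoint.
    public static Dictionary<string, string> CollectFailures(
        IEnumerable<KeyValuePair<string, Func<bool>>> checks)
    {
        var failures = new Dictionary<string, string>();
        foreach (var check in checks)
        {
            try
            {
                if (!check.Value())
                {
                    failures[check.Key] = "check returned unhealthy";
                }
            }
            catch (Exception ex)
            {
                failures[check.Key] = ex.Message;
            }
        }
        return failures;
    }
}
```

An empty result means every dependency passed. Exception messages can reveal internal details, so it may be preferable to log them and return only the dependency names to callers.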


17. Scaling Health Endpoints in Distributed Systems

In a distributed system, multiple instances of an application may be running across different regions or nodes. Health endpoints should be designed to handle this scale by checking the health of individual instances as well as their dependencies.

Example of Health Endpoint in a Distributed System:

[FunctionName("DistributedHealthCheck")]
public static IActionResult Run(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/health")] HttpRequest req)
{
    var nodeHealth = CheckNodeHealth();
    var databaseHealth = CheckDatabase();

    // WEBSITE_INSTANCE_ID is set by Azure App Service; fall back to the machine name elsewhere.
    var node = Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID") ?? Environment.MachineName;
    var overallHealth = nodeHealth && databaseHealth ? "healthy" : "unhealthy";
    return new OkObjectResult(new { status = overallHealth, node });
}

In this scenario, each node reports its own health status, allowing monitoring systems to track the health of each instance independently.


18. Automating Health Checks in CI/CD Pipelines

Health endpoints can be integrated into CI/CD pipelines to ensure that new deployments are fully operational before they are promoted to production. By running automated health checks as part of the deployment process, teams can catch potential issues early.

Example of CI/CD Pipeline Integration:

jobs:
  - job: HealthCheck
    steps:
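      # --fail makes curl exit non-zero on HTTP 4xx/5xx, which fails the step when the endpoint is unhealthy.
      - script: curl --fail --silent --show-error https://myapp.azurewebsites.net/api/v1/health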
        displayName: "Run Health Check"
        condition: eq(variables['Agent.JobStatus'], 'Succeeded')

This pipeline step triggers a health check immediately after deployment, ensuring the system is functioning correctly before the deployment is marked as successful.
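
A freshly deployed function app may need a few seconds before it responds, so a single request can fail even when the deployment is sound. The following is a minimal sketch of a retry loop that a post-deployment smoke test could use. The DeploymentGate name and its parameters are assumptions, and the probe is passed in as a delegate so the loop does not depend on a particular HTTP client.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for post-deployment smoke tests.
public static class DeploymentGate
{
    // Calls the probe until it reports healthy or the attempts run out.
    // Returns true as soon as one attempt succeeds, false otherwise.
    public static async Task<bool> WaitUntilHealthyAsync(
        Func<Task<bool>> probe, int maxAttempts, TimeSpan delayBetweenAttempts)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                if (await probe())
                {
                    return true;
                }
            }
            catch (Exception)
            {
                // A failed request counts as "not healthy yet"; keep retrying.
            }

            if (attempt < maxAttempts)
            {
                await Task.Delay(delayBetweenAttempts);
            }
        }
        return false;
    }
}
```

In practice the probe would issue a GET request to the health endpoint and return whether the response status indicates success.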


19. Health Checks for Database Connectivity

Database connectivity is a critical component of most cloud applications. Health endpoints should check the availability and performance of the database by running a lightweight query to ensure that the database is accessible.

Example of Database Connectivity Check:

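// Requires: using System.Data.SqlClient; (or Microsoft.Data.SqlClient)
private static bool CheckDatabase()
{
    try
    {
        // Open a connection and run a trivial query to confirm the database responds.
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT 1", connection))
        {
            connection.Open();
            command.ExecuteScalar();
        }
        return true;
    }
    catch (Exception)
    {
        // Any failure to connect or query is reported as unhealthy.
        return false;
    }
}

This code runs a lightweight query against the database. If the connection opens and the query completes, the database is marked as healthy.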


20. Health Checks for External API Integrations

Many applications rely on third-party APIs. Health endpoints should verify that external APIs are responsive and performing within acceptable limits.

Example of External API Health Check:

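// Reuse a single HttpClient; creating one per call can exhaust sockets under load.
private static readonly HttpClient httpClient = new HttpClient();

private static bool CheckExternalAPI()
{
    try
    {
        // Blocking call kept so the synchronous examples in this document compile; prefer async/await in new code.
        var response = httpClient.GetAsync("https://api.external-service.com/status").GetAwaiter().GetResult();
        return response.IsSuccessStatusCode;
    }
    catch (Exception)
    {
        return false; // Network errors count as unhealthy.
    }
}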

This code sends an HTTP request to an external API and checks whether the API responds successfully. If the external API is down or slow to respond, the health endpoint should reflect this.


21. Health Checks for Message Queues

Message queues are often a core part of cloud-native architectures. Health endpoints should verify that the message queue is available and able to process messages.

Example of Message Queue Health Check:

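// Requires: using Azure.Storage.Queues;
private static bool CheckQueue()
{
    try
    {
        var queueClient = new QueueClient(connectionString, queueName);
        // Exists() is read-only; a health check should not create resources as a side effect.
        return queueClient.Exists().Value;
    }
    catch (Exception)
    {
        return false;
    }
}

This code checks whether the message queue is reachable by asking the storage service whether the queue exists. It confirms availability only; verifying that messages are actually being processed would need an additional check.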


22. Monitoring Storage and Disk Usage

Applications may rely on storage systems, such as Azure Blob Storage, to store data. Health endpoints should verify that storage is available and that there is sufficient space for new data.

Example of Storage Health Check:

private static bool CheckBlobStorage()
{
    try
    {
        var container = new BlobServiceClient(connectionString).GetBlobContainerClient(containerName);
        return container.Exists().Value;
    }
    catch (Exception) // An unreachable storage account throws instead of returning false.
    {
        return false;
    }
}

This code checks the availability of an Azure Blob Storage container. If the container does not exist or is unavailable, the health endpoint should reflect this. The check covers availability only and does not measure remaining capacity.


23. Handling Timeouts in Health Checks

Timeouts are a common issue when dealing with external dependencies in health endpoints. Health checks should include timeouts to prevent them from hanging indefinitely.

Example of Handling Timeouts in Health Checks:

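// Shared client with a five-second timeout for all requests it sends.
private static readonly HttpClient timeoutClient = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };

private static async Task<bool> CheckExternalAPIWithTimeout()
{
    try
    {
        var response = await timeoutClient.GetAsync("https://api.external-service.com/status");
        return response.IsSuccessStatusCode;
    }
    catch (Exception ex) when (ex is TaskCanceledException || ex is HttpRequestException)
    {
        // HttpClient signals an elapsed timeout with TaskCanceledException; both cases count as unhealthy.
        return false;
    }
}

By setting a timeout on the HTTP request, this health check prevents long delays when an external service is slow or unresponsive, and reports the dependency as unhealthy when the timeout elapses.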
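
When several dependencies are checked in one request, running them one after another adds their timeouts together. The following is a minimal sketch of running the checks concurrently under a single overall deadline; the TimedChecks type and its signature are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for running dependency checks with a shared deadline.
public static class TimedChecks
{
    // Runs all checks concurrently. A check that does not finish before the
    // timeout, or that throws, is reported as unhealthy (false).
    public static async Task<Dictionary<string, bool>> RunAllAsync(
        IReadOnlyDictionary<string, Func<CancellationToken, Task<bool>>> checks,
        TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);

        async Task<KeyValuePair<string, bool>> RunOne(
            string name, Func<CancellationToken, Task<bool>> check)
        {
            try
            {
                Task<bool> work = check(cts.Token);
                // The delay task completes only when the deadline cancels the token.
                Task deadline = Task.Delay(Timeout.Infinite, cts.Token);
                Task first = await Task.WhenAny(work, deadline);
                return new KeyValuePair<string, bool>(name, first == work && await work);
            }
            catch (Exception)
            {
                return new KeyValuePair<string, bool>(name, false);
            }
        }

        var results = await Task.WhenAll(checks.Select(c => RunOne(c.Key, c.Value)));
        return results.ToDictionary(r => r.Key, r => r.Value);
    }
}
```

Each check receives the cancellation token so that well-behaved checks can stop early, while checks that ignore the token are still reported as unhealthy once the deadline passes.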


24. Testing Health Endpoints

Health endpoints should be rigorously tested to ensure they provide accurate and reliable information. Unit tests should cover scenarios where the system is healthy, as well as cases where dependencies fail.

Example of Unit Test for Health Endpoint:

[TestMethod]
public void TestHealthCheck()
{
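    // Passing null works only while Run ignores its request and logger arguments.
    var result = HealthCheckFunction.Run(null, null) as ObjectResult;
    Assert.IsNotNull(result);
    Assert.AreEqual(200, result.StatusCode);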
}

This unit test verifies that the health endpoint returns a 200 OK status when the system is healthy.


25. Best Practices for Logging Health Checks

Health endpoints should log each check for auditing and monitoring purposes. These logs provide insight into the health history of the application and help identify patterns of degradation.

Example of Logging Health Checks:

_logger.LogInformation($"Health check at {DateTime.UtcNow}: Status: {status}");

By logging each health check, teams can trace the health status of the application over time and investigate issues when failures occur.


26. Monitoring with Azure Monitor

Azure Monitor provides a unified platform for monitoring the health of your applications, including health checks. By sending telemetry data from your health endpoints to Azure Monitor, you can track the performance and availability of your system in real time.

Example of Monitoring Health Checks in Azure Monitor:

traces
| where message startswith "Health check performed"
| summarize count() by bin(timestamp, 1h)

This query tracks the number of health checks performed over time, allowing you to monitor the frequency and consistency of health checks.


27. Multi-Region Health Monitoring

For applications deployed in multiple regions, health endpoints should report the health of each region individually. This allows teams to monitor the health of the application in each region and detect region-specific issues.

Example of Multi-Region Health Check:

[FunctionName("RegionHealthCheck")]
public static IActionResult Run(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/health/{region}")] HttpRequest req, string region) // served at /api/v1/health/{region}
{
    var regionHealth = CheckRegionHealth(region);
    return new OkObjectResult(new { status = regionHealth ? "healthy" : "unhealthy", region });
}

This health check reports the health of a specific region, providing insight into regional availability.


28. Health Endpoint Documentation

Health endpoints should be documented as part of the overall API documentation. This ensures that external systems and teams know how to query the health endpoint and interpret its responses.

Example of Health Endpoint Documentation:

GET /api/v1/health

Returns the current health status of the application.
Response:
{
  "status": "healthy",
  "timestamp": "2024-10-08T12:00:00Z"
}

This documentation provides details about the health endpoint’s route, method, and response structure.


29. Role of Health Endpoints in Microservices

In a microservices architecture, each service may expose its own health endpoint. These endpoints are critical for orchestrators like Kubernetes to manage the lifecycle of services, such as scaling, restarting, or terminating instances based on their health.

Example of Microservice Health Check:

[FunctionName("ServiceHealthCheck")]
public static IActionResult Run(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/service/health")] HttpRequest req) // served at /api/v1/service/health
{
    return new OkObjectResult(new { status = "healthy", service = "inventory" });
}

This health check reports the health of a single service, such as an inventory service in a microservices architecture.


30. Advanced Application Insights Querying

Azure Application Insights allows you to perform advanced queries on health endpoint telemetry data, enabling you to analyze trends and detect anomalies in system health.

Example of Advanced Query for Health Endpoint Telemetry:

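requests
| where name startswith "HealthCheck"
| summarize avgDurationMs = avg(duration), p95DurationMs = percentile(duration, 95) by bin(timestamp, 5m)

This query returns the average and 95th-percentile response time of health check requests over five-minute intervals, helping you identify performance bottlenecks. It reads from the requests table because traces records do not carry a duration column.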


31. Dynamic Health Endpoint Responses

Health endpoint responses can be dynamic, adjusting based on the current state of the system and its dependencies. For example, if one dependency is slow to respond, the health endpoint might still return “healthy” but include a warning in the response.

Example of Dynamic Health Response:

var response = new {
    status = "healthy",
    warnings = new List<string>()
};

if (databaseResponseTime > 1000) {
    response.warnings.Add("Database response time is slow.");
}

return new OkObjectResult(response);

This dynamic response includes a warning about slow database response times, even though the overall system is still considered healthy.
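
An alternative to attaching warnings to a "healthy" status is an explicit intermediate state. The following is a minimal sketch that classifies each dependency and reports the worst state overall; the HealthState enum, the "degraded" state, and the threshold parameter are assumptions, not part of the design above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Ordered from best to worst so that Max() yields the worst state.
public enum HealthState { Healthy, Degraded, Unhealthy }

// Hypothetical helper for combining per-dependency states.
public static class HealthAggregator
{
    // Classifies one dependency from its reachability and response time.
    public static HealthState Classify(bool reachable, double responseTimeMs, double slowThresholdMs)
    {
        if (!reachable)
        {
            return HealthState.Unhealthy;
        }
        return responseTimeMs > slowThresholdMs ? HealthState.Degraded : HealthState.Healthy;
    }

    // The overall state is the worst state among all dependencies;
    // with no dependencies the result is Healthy.
    public static HealthState Overall(IEnumerable<HealthState> states) =>
        states.DefaultIfEmpty(HealthState.Healthy).Max();
}
```

With a 1000 ms threshold, a reachable database that answers in 1500 ms would be classified as Degraded, which lets monitoring systems distinguish slow dependencies from failed ones.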


32. Impact of Health Endpoints on SLA and SLO

Health endpoints play a key role in Service Level Agreements (SLA) and Service Level Objectives (SLO). By monitoring the health of your application and its dependencies, you can ensure that you meet your SLAs and SLOs for availability and performance.

Example of Monitoring SLA Compliance with Health Checks:

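requests
| where name == "HealthCheck"
| summarize failureRatePercent = 100.0 * countif(success == false) / count()

This query calculates the percentage of failed health check requests, helping you monitor compliance with your SLAs. The 100.0 factor forces floating-point arithmetic; dividing two integer counts would truncate the result to zero.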
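
To relate a failure rate to an availability target, it helps to express the target as an error budget. For example, a 99.9% target over 30 days leaves roughly 43.2 minutes of allowed downtime. The following is a minimal sketch of that arithmetic; the SloMath type and its method names are assumptions.

```csharp
using System;

// Hypothetical helper for SLO arithmetic.
public static class SloMath
{
    // Availability as the percentage of health checks that succeeded.
    public static double AvailabilityPercent(long totalChecks, long failedChecks)
    {
        if (totalChecks <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(totalChecks));
        }
        return 100.0 * (totalChecks - failedChecks) / totalChecks;
    }

    // Downtime allowed within a period for a given availability target.
    public static TimeSpan ErrorBudget(double targetPercent, TimeSpan period) =>
        TimeSpan.FromTicks((long)(period.Ticks * (100.0 - targetPercent) / 100.0));
}
```

Comparing the measured availability with the target shows how much of the budget remains for the current period.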


33. Integrating Health Endpoints with API Gateways

Health endpoints can be integrated with API Gateways to provide centralized monitoring of your application’s health. API Gateways can periodically query the health endpoint and use the results to route traffic only to healthy instances.

Example of Health Endpoint Integration with API Gateway:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: health-ingress
spec:
  rules:
  - host: "myapp.com"
    http:
      paths:
      - path: "/api/v1/health"
        pathType: Exact
        backend:
          service:
            name: myapp-service
            port:
              number: 80

This Kubernetes Ingress configuration routes requests for /api/v1/health to the application's service, giving the API Gateway or an external monitor a stable URL to probe.


34. Future Extensions for Health Endpoint Design

As health monitoring evolves, health endpoints will likely become more sophisticated, providing detailed metrics, dependency tracking, and even self-healing capabilities. Future extensions to health endpoint design may include AI-driven anomaly detection, automated scaling based on health metrics, and deeper integration with observability platforms.

One potential extension could be self-healing health endpoints, where the endpoint not only reports issues but also takes action to resolve them. For example, if a database connection fails, the health endpoint could attempt to reconnect or trigger a failover process automatically.

Example of Future Health Endpoint with Self-Healing:

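var databaseHealthy = CheckDatabase();
if (!databaseHealthy) {
    TriggerFailover(); // Hypothetical helper that starts the failover process.
}
return new OkObjectResult(new {
    status = databaseHealthy ? "healthy" : "recovering",
    selfHealing = databaseHealthy ? "idle" : "activated"
});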

This example illustrates how health endpoints could evolve to become more proactive, not only reporting failures but also attempting to fix them automatically.