How a wrongly configured health check can bring down your service, and what to do about it
Health Checks
In software engineering, a health check is a mechanism that produces a simple report of whether an application instance is working as expected, so that an external system can take action on the instances that are not.
Typically, this “report” is exposed on an HTTP endpoint that is checked frequently by a monitoring or orchestration system. In the modern DevOps landscape, that system is usually Kubernetes, and the mechanism that checks these endpoints is called a “probe”.
Liveness & Readiness
There are two commonly used probes: the liveness probe and the readiness probe.
The liveness probe checks whether the application is running at all. If it is not, Kubernetes restarts the container. For this check you want something very simple, such as an HTTP endpoint that returns an empty 200 response: any valid response means the application is “live”.
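As a minimal sketch (using Flask here; the endpoint path and port are arbitrary choices for the example, not something Kubernetes prescribes), a liveness endpoint can be as simple as:

```python
# Minimal liveness endpoint: it only proves the process is up and able to
# answer HTTP requests. Sketch using Flask; path and port are illustrative.
from flask import Flask

app = Flask(__name__)

@app.route("/healthz/live")
def live():
    # Any successful response is enough for the liveness probe.
    return "", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```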
The readiness probe checks whether the application can handle requests (the whole concept revolves around HTTP services). The readiness endpoint is typically where you add checks for your infrastructure dependencies (database, caching services, message brokers). The idea is that if an application instance loses connectivity to part of the infrastructure, it should be removed from the pool that your inbound reverse proxy routes to, so that the remaining healthy instances handle the traffic until connectivity is restored.
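To illustrate, a readiness endpoint that also pings a dependency might look like the sketch below. The Redis host, port, and endpoint path are assumptions made for the example, not values taken from any particular setup:

```python
# Readiness endpoint that also verifies an infrastructure dependency.
# Sketch only: the Redis connection details and the path are placeholders.
import redis
from flask import Flask

app = Flask(__name__)
cache = redis.Redis(host="redis", port=6379, socket_timeout=1)

@app.route("/healthz/ready")
def ready():
    try:
        cache.ping()  # fails if Redis is unreachable
    except redis.exceptions.RedisError:
        # A non-2xx status tells the probe this instance is not ready,
        # so it gets pulled out of the load-balancing pool.
        return "redis unavailable", 503
    return "", 200
```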
This approach might seem fine, especially for relatively simple applications that cannot function if any of their dependencies is down. But let's explore what happens when you start adding complexity to your application and end up with non-critical dependencies.
The Controversy
Imagine an application that is horizontally scaled (multiple instances running) and uses a caching service (e.g., Redis) for a specific flow. A health check for Redis is added to the readiness endpoint. Now, imagine that Redis fails to respond correctly to the health check. All instances are marked unhealthy as a result. The reverse proxy removes all instances from the pool, causing all HTTP requests to fail, even though the actual issue only affects the specific flow that uses Redis.
Completely removing the Redis health check is not the right solution either, because it creates a different problem: without it, there is no safeguard against deploying an instance with an invalid Redis configuration.
The Startup Probe
Moving the Redis health check to the startup probe addresses the issue described above. The startup probe is a Kubernetes feature that runs at container startup and keeps retrying until it succeeds (if it never does, the container is eventually restarted); once it succeeds, it stops running and the liveness and readiness probes take over. With this approach, the application only starts serving if it can initially connect to Redis, which preserves the safeguard against deployments with an invalid Redis configuration.
Once the application has started successfully, meaning Redis was reachable at least initially, the startup probe no longer runs, so a temporary Redis failure will not get the application marked as unhealthy while it is up. The readiness probe can then be simplified to check only critical dependencies or internal application health, reducing the risk of false negatives caused by non-critical service failures.
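A rough sketch of this split, under the same assumptions as the earlier examples (Flask, a placeholder Redis host, hypothetical endpoint paths): the Kubernetes startupProbe would point at the startup endpoint, and the readinessProbe at the now dependency-free readiness endpoint.

```python
# Sketch of the split described above: the startup endpoint verifies the
# Redis configuration once, while readiness stays dependency-free.
# Paths and connection details are illustrative assumptions.
import redis
from flask import Flask

app = Flask(__name__)
cache = redis.Redis(host="redis", port=6379, socket_timeout=1)

@app.route("/healthz/startup")
def startup():
    try:
        cache.ping()  # probed repeatedly, but only until it first succeeds
    except redis.exceptions.RedisError:
        return "redis not reachable yet", 503
    return "", 200

@app.route("/healthz/ready")
def ready():
    # No Redis check here: a transient Redis outage after startup
    # no longer removes the instance from the pool.
    return "", 200
```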
This way the application keeps serving requests even if Redis becomes temporarily unavailable, avoiding unnecessary downtime for the parts of the application that do not depend on it. Using the startup probe like this preserves the benefits of health checks at deployment time while avoiding the pitfall of marking the entire application unhealthy because of a non-critical dependency failure: a balance between ensuring proper initialization and maintaining availability in the face of transient issues.