Automatically filtering out healthchecks on ECS and Kubernetes
Health-checks are peculiar things
Healthchecks are a monitoring technique with a special flavor: they are fired off at regular, frequent intervals (sometimes every 10 seconds, sometimes every minute) by orchestration platforms and monitoring tools. Most healthchecks are HTTP-based, and the returned HTTP response is checked based on its status code and (sometimes) its content. But really, the only healthchecks a person needs to know about are those that fail, which usually lead to containers being torn down and other disruptive infrastructure changes.
Issues with health-checks in Lumigo
Given that Lumigo's pricing model is based on the number of requests we process, the large number of successful healthchecks that every container workload undergoes leads to undesirable consumption of quota for data that is effectively not useful. Moreover, successful healthchecks create noise in the Explore and Transactions views, degrading the overall experience.
Cutting the Gordian knot
Luckily, one can often recognize HTTP requests that are healthchecks pretty easily! Both AWS ELB health-checks and Kubernetes ones (including EKS) come in with specific `User-Agent` headers. Lumigo now automatically drops, in the data processing pipeline of the Lumigo platform, all the spans that:
- Carry the `User-Agent` HTTP header with values that are known to be health-checks, specifically `ELB-HealthChecker/*` (AWS ELB, often used with Amazon ECS) and `kube-probe/<kubelet_version>` (Kubernetes, including Amazon EKS)
- Return an HTTP status code that denotes a successful response (i.e., `2xx`, like `200 OK`, `201 Created`, etc.). Failed health-checks are kept, because if a health-check fails (e.g., returning HTTP status code `500`), usually something bad is about to happen to your containers :-)
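The two conditions above can be sketched as a small predicate. This is only an illustration of the rule, not Lumigo's actual pipeline code: the helper name `is_successful_healthcheck`, its parameters, and the exact regular expressions are assumptions made for this example.

```python
import re
from typing import Optional

# Assumed patterns for the two User-Agent families named above.
# How Lumigo matches them internally is not described in this post.
HEALTHCHECK_USER_AGENTS = (
    re.compile(r"^ELB-HealthChecker/"),  # AWS ELB, often used with Amazon ECS
    re.compile(r"^kube-probe/"),         # Kubernetes, including Amazon EKS
)


def is_successful_healthcheck(user_agent: Optional[str], status_code: int) -> bool:
    """Hypothetical helper: True if a span looks like a successful health-check."""
    if not user_agent:
        return False
    known_agent = any(p.match(user_agent) for p in HEALTHCHECK_USER_AGENTS)
    successful = 200 <= status_code <= 299  # any 2xx status
    return known_agent and successful
```

Under these assumptions, a `kube-probe/1.27` request answered with `200` would be dropped, while the same request answered with `500` would be kept.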
What do you need to do on your end?
Nothing. It just works with every version of tracers we released so far for containers and all HTTP OpenTelemetry instrumentations we have ever seen. Enjoy :-)
P.S.: Matching health-checks by path (e.g., `/health`) sounds like a good solution on paper, but in practice it leads to very annoying false positives (i.e., HTTP calls that are NOT related to health-checks). Moreover, healthcheck paths are configurable, and practitioners do make use of that configurability, which would lead to false negatives (health-checks we let through). `User-Agent` headers, on the other hand, are far less often changed by healthcheck systems, so matching on them is usually rather reliable for this use-case.
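To make the trade-off concrete, here is a minimal sketch comparing the two matching strategies on a handful of requests. The sample traffic, the `Request` type, and both matcher functions are invented for illustration and are not taken from Lumigo's implementation.

```python
from typing import NamedTuple


class Request(NamedTuple):
    path: str
    user_agent: str
    is_healthcheck: bool  # ground truth, known only because we made up the data


# Hypothetical sample traffic.
REQUESTS = [
    Request("/health", "kube-probe/1.27", True),
    Request("/internal/ready", "ELB-HealthChecker/2.0", True),  # custom probe path
    Request("/health", "Mozilla/5.0", False),                   # real user hitting /health
]


def matches_by_path(req: Request) -> bool:
    return req.path == "/health"


def matches_by_user_agent(req: Request) -> bool:
    return req.user_agent.startswith(("ELB-HealthChecker/", "kube-probe/"))


# Requests each strategy classifies incorrectly.
path_errors = [r for r in REQUESTS if matches_by_path(r) != r.is_healthcheck]
user_agent_errors = [r for r in REQUESTS if matches_by_user_agent(r) != r.is_healthcheck]
```

On this made-up sample, path matching misclassifies the custom probe path (a false negative) and the user request to `/health` (a false positive), while user-agent matching classifies all three correctly.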