Release Notes

Improvement
2 years ago

Automatically filtering out healthchecks on ECS and Kubernetes

Health-checks are peculiar things

Healthchecks are a monitoring technique with a special flavor: they are fired off at regular, frequent intervals (sometimes every 10 seconds, sometimes every minute) by orchestration platforms and monitoring tools. Most healthchecks are HTTP-based, and the returned HTTP response is checked based on the status code and (sometimes) the content. But really, the only healthchecks a person needs to know about are those that fail, which usually lead to containers being torn down and other disruptive infrastructure changes.

Issues with health-checks in Lumigo

Given that Lumigo's pricing model is based on the number of requests we process, the large number of successful healthchecks that every container workload undergoes leads to undesirable consumption of quota for data that is effectively not useful. Moreover, successful healthchecks add noise to the Explore and Transactions views, degrading the overall experience.

Cutting the Gordian knot

Luckily, one can often recognize HTTP requests that are healthchecks pretty easily! Both AWS ELB health-checks and Kubernetes ones (including EKS) come in with specific User-Agent headers. The data processing pipeline in the Lumigo platform now automatically drops all the spans that:

  1. Carry a User-Agent HTTP header with a value that is known to belong to health-checks, specifically ELB-HealthChecker/* (AWS ELB, often used with Amazon ECS) and kube-probe/* (Kubernetes, including Amazon EKS)
  2. Return an HTTP status code that denotes a successful response (i.e., `2xx`, like `200 OK`, `201 Created`, etc.). This is because if a health-check fails (e.g., returning HTTP status code 500), usually something bad is about to happen to your containers :-)
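
To make the two conditions concrete, here is a minimal sketch of the rule in Python. It is an illustration only, not Lumigo's actual pipeline code: the names `HEALTHCHECK_USER_AGENT_PREFIXES` and `is_successful_healthcheck`, as well as the shape of the span dictionaries, are hypothetical and chosen for the example.

```python
# Illustrative sketch only; NOT Lumigo's actual implementation.
# All names and the span layout below are hypothetical.

# User-Agent prefixes named in this post as known health-check senders.
HEALTHCHECK_USER_AGENT_PREFIXES = ("ELB-HealthChecker/", "kube-probe/")


def is_successful_healthcheck(user_agent, status_code):
    """Return True when a span looks like a health-check that succeeded."""
    if not user_agent:
        return False
    # Condition 1: the User-Agent starts with a known health-check prefix.
    known_agent = user_agent.startswith(HEALTHCHECK_USER_AGENT_PREFIXES)
    # Condition 2: the response status is in the 2xx (successful) range.
    succeeded = 200 <= status_code <= 299
    return known_agent and succeeded


# Hypothetical spans: only the first one matches both conditions.
spans = [
    {"user_agent": "ELB-HealthChecker/2.0", "status": 200},
    {"user_agent": "kube-probe/1.27", "status": 500},
    {"user_agent": "curl/8.4.0", "status": 200},
]

# Keep everything that is not a successful health-check.
kept = [
    span for span in spans
    if not is_successful_healthcheck(span["user_agent"], span["status"])
]
```

In this sketch the failed Kubernetes probe is kept on purpose, since failing health-checks are exactly the ones worth seeing.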

What do you need to do on your end?

Nothing. It just works with every version of the tracers we have released so far for containers and all HTTP OpenTelemetry instrumentations we have ever seen. Enjoy :-)

P.S.: Matching health-checks by path (e.g., /health) sounds like a good solution on paper, but in practice it leads to very annoying false positives (i.e., HTTP calls that are unrelated to health-checks). Moreover, healthcheck paths are configurable, and practitioners do make use of that configurability, which would lead to false negatives (health-checks we let through). User-Agent headers, on the other hand, are far less often changed by healthcheck systems, so matching on them is usually rather reliable for this use case.