Observability

Apr 16, 2021

Observability deals with collecting, aggregating and visualising all the events needed to measure the state of a service. Getting insightful information about the internals of a service not only helps with visibility, but also helps in setting the correct alerts, so that you can define actions for critical alerts and detect preventable problems.

While debugging, the following pillars help in monitoring the critical state of services:

Logging
Metrics and Alerting
Profiling and Tracing

Logging

A developer's relationship with logs is a love-hate relationship. Even though they spend most of their time together, developers tend to ignore what they are logging. But in times of trouble, they always refer to the logs to check the issue.

Logs, by definition, are all the events generated by your system or its processes that are not part of the process's direct use case. Let’s say you have a program that prints “Hello World”. You may have some test messages printed as well. Those messages are log events meant to help debug the code.

There are various log levels that can be generated:

a. TRACE

b. DEBUG

c. ERROR

d. INFO

Generally, by default, info logs are enabled in production and debug logs are enabled in staging. Trace logs are enabled temporarily, when you want to inspect the state at each instruction.
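As a rough illustration of per-environment defaults, here is a minimal sketch using Python's standard logging module. The environment names, the DEFAULT_LEVELS mapping and the helper functions are assumptions made for this example. Python has no built-in TRACE level, so the sketch registers a custom one below DEBUG.

```python
import io
import logging

# Python's logging module has no TRACE level; register a custom one below DEBUG (10).
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

# Hypothetical mapping from deployment environment to its default log level.
DEFAULT_LEVELS = {
    "production": logging.INFO,
    "staging": logging.DEBUG,
    "troubleshooting": TRACE,  # enabled only temporarily
}


def build_logger(environment, stream):
    """Create a logger whose threshold depends on the environment."""
    logger = logging.getLogger(f"demo.{environment}")
    logger.setLevel(DEFAULT_LEVELS[environment])
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def emit_sample(environment):
    """Log one event per level and return the lines that were actually written."""
    stream = io.StringIO()
    logger = build_logger(environment, stream)
    logger.log(TRACE, "entering handler")
    logger.debug("cache miss for key user:42")
    logger.info("request served")
    logger.error("upstream timed out")
    return stream.getvalue().splitlines()
```

With these assumptions, emit_sample("production") keeps only the INFO and ERROR lines, while emit_sample("staging") also keeps the DEBUG line.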

Benefits:

Request and Response Body: These details can help in debugging customer-related queries when you have a separate client service.

Logging Dashboard: In general, there is a dashboard built on log events using some filtering and aggregation. It is generally preferred over a metrics dashboard when you need insights over a lot of data that you cannot store as metrics. One example is finding the number of food deliveries a delivery person can do while present in some area. When customer care sees on the dashboard that the person is in a low-demand area, it can help them decide to move to another location.

Logging Alerting: It is helpful to build alerts on certain keywords and log levels. Let’s say a segmentation fault or a panic shows up in the logs; you can get alerted and figure out what the issue is.
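The keyword and level alerting idea can be sketched roughly as follows. The "LEVEL message" line format, the keyword list and the find_alerts helper are all assumptions for illustration; a real logging platform would have its own query and alert rules.

```python
import re

# Hypothetical alert rules: keywords to look for and levels that always alert.
ALERT_KEYWORDS = ("segmentation fault", "panic")
ALERT_LEVELS = {"ERROR"}

# Assumes each log line looks like "LEVEL message".
LINE_PATTERN = re.compile(r"^(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def find_alerts(lines):
    """Return (reason, line) pairs for lines matching a keyword or an alerting level."""
    alerts = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        lowered = match.group("message").lower()
        keyword_hit = next((k for k in ALERT_KEYWORDS if k in lowered), None)
        if keyword_hit is not None:
            alerts.append((f"keyword:{keyword_hit}", line))
        elif match.group("level") in ALERT_LEVELS:
            alerts.append((f"level:{match.group('level')}", line))
    return alerts
```

Keyword matches take priority over level matches here, so a line is reported once even if it satisfies both rules.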

Cons:

Logs are generally high in scale and can fluctuate a lot in volume. Log volume increases in direct proportion to system load. Building a cost-effective, on-demand system is a challenging problem, and it requires some engineers to monitor the in-house logging service so that it can handle this huge scale.

Not Realtime: Most shipping agents take some time to ship logs to the logging platform. This can be mitigated by tuning the log agent's queue size or wait time.

Auditing: Some logs are meant to be monitored for audit purposes, but there can be situations where sensitive data gets added to logs. Adding a masking feature to the log pipeline can introduce extra delay.
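To make the masking idea concrete, here is a minimal sketch of a masking step built as a filter on Python's standard logging module. The two regular expressions, the placeholder strings and the mask_and_format helper are assumptions for this example and are far from a complete definition of sensitive data. Every record passes through the regex substitutions, which is where the extra delay comes from.

```python
import io
import logging
import re

# Hypothetical patterns for sensitive values; real pipelines need a broader, audited list.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d{13,16}\b")


class MaskingFilter(logging.Filter):
    """Rewrite each record so sensitive values never reach the handler's output."""

    def filter(self, record):
        message = record.getMessage()  # message with arguments already merged in
        message = EMAIL.sub("<email>", message)
        message = CARD.sub("<card>", message)
        record.msg = message
        record.args = ()  # arguments are already merged into the message
        return True  # keep the record, only its content changed


def mask_and_format(message, *args):
    """Log one message through the masking filter and return what was written."""
    stream = io.StringIO()
    logger = logging.getLogger("demo.masking")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(MaskingFilter())
    logger.addHandler(handler)
    logger.info(message, *args)
    return stream.getvalue().strip()
```

For example, mask_and_format("payment by %s with card %s", "jane@example.com", "4111111111111111") writes "payment by <email> with card <card>".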

Metrics and Alerting

Metrics are numbers that are tagged with multiple labels and recorded over time. They can be represented as time series data.

Types of Metrics:
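One way to picture a multi-tagged time series is a store keyed by metric name plus its set of tags, where each key holds a list of (timestamp, value) points. The MetricStore class, the metric name and the tag names below are hypothetical and only meant as a sketch, not a description of any particular metrics system.

```python
from collections import defaultdict


class MetricStore:
    """Toy in-memory store: one time series per (metric name, tag set)."""

    def __init__(self):
        self._series = defaultdict(list)

    def record(self, name, value, timestamp, **tags):
        """Append a (timestamp, value) point to the series identified by name and tags."""
        key = (name, frozenset(tags.items()))
        self._series[key].append((timestamp, value))

    def query(self, name, **tags):
        """Return all points, sorted by time, from series matching the given tags."""
        wanted = set(tags.items())
        points = []
        for (series_name, series_tags), values in self._series.items():
            if series_name == name and wanted <= series_tags:
                points.extend(values)
        return sorted(points)

    def series_count(self):
        """Number of distinct series currently stored."""
        return len(self._series)


store = MetricStore()
store.record("http_requests", 120, 1000, service="orders", region="eu")
store.record("http_requests", 80, 1000, service="orders", region="us")
store.record("http_requests", 130, 1060, service="orders", region="eu")
eu_points = store.query("http_requests", region="eu")
```

In this sketch, recording more points for an existing tag combination adds to a series but does not create a new one; the number of series only grows when a new tag combination appears.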

a. Business Metrics

b. Infrastructure Metrics

c. Service Metrics

d. Dependencies or External Metrics

Benefits:

Constant Size: Generally, under high load, the volume of metrics doesn’t grow the way logs do. Most changes happen when you add more infrastructure to your service.

Aggregating: It is easy to query the average, median or P99 of the data and get an overall picture of the health of services over time.

Trends: Metrics help in finding historical information about the data, such as how much a service's resource usage has increased or decreased over a longer period.
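The average, median and P99 aggregations mentioned above can be sketched in a few lines. The summarize helper and the latency values are assumptions for illustration, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Aggregate raw latency samples into the three summary numbers."""
    return {
        "average": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical latency samples in milliseconds, with one slow outlier.
summary = summarize([10, 20, 30, 40, 1000])
```

With this sample the average is 220 while the median is 30, which shows why looking at the median and P99 alongside the average gives a fuller picture of service health.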

Cons:

Critical System: Metrics should appear in real time, and the metrics system can become a single point of failure for the visibility of services if there are no alerts on the other pillars of observability.

Resource Intensive: Collecting a large number of metrics can consume some of the service's resources.

Profiling and Tracing

<TODO>

Written by Satyam Mittal