

Understanding the problem
In a cloud-native environment where your application is powered by multiple services with a complex dependency graph, a single developer shipping a bug in one of the services can turn into a cascading failure post deployment. This is why we need checks that monitor the vitals of a service & tell us during deployment whether we can rely upon a dependency or need to think of a fallback option. Health checks & deep checks are techniques that every service running in production should have configured in order to provide visibility into the service's health.
Let's set the stage here to better understand this problem. We have two services in our system: order-service & payment-service. The client invokes an endpoint through an API gateway to create an order. The request is routed to order-service, which depends on payment-service for processing payments.

Internally, order-service depends upon a Postgres database to store data, RabbitMQ for sending notification events & payment-service for processing payments. payment-service itself depends on a Postgres database to store payment records. You can view the code for these services on this GitHub branch.
Degradation in any of these dependencies, whether due to a bug or an infra issue, will lead to customer impact. Therefore we should have proper checks to detect such issues, ideally during deployment as well as at regular intervals, so that our system can reduce (or better, avoid) the impact on customers.
Health-checks vs deep-checks
Health checks give you a high-level signal of whether the service is up & running. Though this doesn't necessarily mean that the service is ready to perform all operations as expected. In order to verify the complete functionality of a service, we need to configure a deep check, which is more holistic.
Consider the order-service, which creates an order record by performing the following operations (a minimal sketch of this flow follows the list):
- Calling payment-service to record a payment
- Persisting an order record in the Postgres database
- Sending an event to a RabbitMQ queue for notification
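To make this concrete, here is a minimal Java sketch of what such a createOrder flow could look like. The PaymentClient, OrderRepository & OrderRecord types & the queue name are hypothetical placeholders for illustration, not the actual classes in the linked repo, and error handling is omitted.

import java.util.UUID;

import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Service;

// Hypothetical sketch of the order-creation flow; PaymentClient, OrderRepository & OrderRecord
// stand in for the real collaborators in the repo.
@Service
public class OrderService {

    private final PaymentClient paymentClient;      // HTTP client wrapping payment-service calls
    private final OrderRepository orderRepository;  // Spring Data repository backed by Postgres
    private final RabbitTemplate rabbitTemplate;    // RabbitMQ publisher for notification events

    public OrderService(PaymentClient paymentClient,
                        OrderRepository orderRepository,
                        RabbitTemplate rabbitTemplate) {
        this.paymentClient = paymentClient;
        this.orderRepository = orderRepository;
        this.rabbitTemplate = rabbitTemplate;
    }

    public UUID createOrder(String customerRef, double amount, UUID productId) {
        // 1. Call payment-service to record a payment
        UUID paymentId = paymentClient.createPayment(amount);
        // 2. Persist the order record in the Postgres database
        OrderRecord order = orderRepository.save(new OrderRecord(customerRef, productId, paymentId, amount));
        // 3. Send a notification event to the RabbitMQ queue
        rabbitTemplate.convertAndSend("order.notifications", order.getId().toString());
        return order.getId();
    }
}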
Now if the order-service is up & running, its health check can pass: it is able to invoke the health-check endpoint of payment-service, execute a query on its database & verify its connection with RabbitMQ. From the service's perspective, it is healthy as all its dependencies are reachable. Yet in reality, order-service might be unable to invoke the payment-service endpoint to create a new payment due to a bug in the payment-service URL configuration. So if we rely only on health checks, we will expose customer traffic to this instance, resulting in failing requests.
To avoid the above scenario, we also need to configure a deep check for our services that is much more comprehensive than a health check. The deep check can invoke the deep-check endpoints of dependent services & execute a test end-to-end operation mimicking the core operation of creating an order. Through the deep check we replicate & verify an operation that closely resembles what a customer is going to experience.
As you might have guessed, health checks are lightweight compared to deep checks, so we should use each in the right place. A deep check is used primarily during a deployment to ensure the new service instance is prepared to start consuming customer traffic, while a health check is invoked at regular intervals by an API gateway or a service registry. If a health check fails, the control layer listening to these health checks will begin a graceful shutdown & start a new service instance. If the deep check fails, the existing customer traffic won't be switched over to the new instance. Both checks have their own use cases & assess the health of our service at different levels of granularity.
Seeing things in action
Now that we have an overview of both health checks & deep checks, let's see them in action. As part of the demo, both services are built using Spring Boot & use Docker containers for managing their dependencies. We will first see how we can achieve this functionality using the actuator & then how we can extend the health checks in a Kubernetes environment.
Using Spring Boot
Spring Boot actuator provides operational endpoints that you can extend to add both health checks as well as deep checks. For example, in the case of payment-service, a simple health check only needs to verify the connection with its database:
public Health health() {
    try {
        paymentRepository.count();
        return Health.up()
                .withDetail("database", "PostgreSQL connection healthy")
                .withDetail("service", "Payment service operational")
                .build();
    } catch (Exception e) {
        return Health.down()
                .withDetail("database", "PostgreSQL connection failed")
                .withDetail("Error", e.getMessage())
                .withDetail("service", "Payment service not operational")
                .build();
    }
}
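For context, such a health() method typically lives inside a HealthIndicator bean that actuator picks up automatically; a minimal sketch of that wrapper (the class name & injected repository are assumptions, not the repo's actual names) could look like this:

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Actuator derives the health component name from the bean name minus the
// "HealthIndicator" suffix, so this indicator would show up as "paymentService".
@Component
public class PaymentServiceHealthIndicator implements HealthIndicator {

    private final PaymentRepository paymentRepository;  // Spring Data repository for payment records

    public PaymentServiceHealthIndicator(PaymentRepository paymentRepository) {
        this.paymentRepository = paymentRepository;
    }

    @Override
    public Health health() {
        // Body as shown above: ping the database & report UP or DOWN with details
        return Health.up().build();
    }
}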
While a comprehensive deep check for order-service will invoke the deep-check endpoint of its dependent service, i.e. payment-service, as well as perform a test operation that goes through the expected customer flow:
public Health health() {
    try {
        // First, verify the payment-service deep check
        PaymentDeepCheckResponse paymentDeepCheckResponse = restClient.get()
                .uri(paymentDeepCheckUrl)
                .retrieve()
                .body(PaymentDeepCheckResponse.class);
        if (!"UP".equals(Objects.requireNonNull(paymentDeepCheckResponse).status())) {
            return Health.down()
                    .withDetail("paymentDeepCheck", "Payment Service deep check reported DOWN status")
                    .withDetail("service", "Order service deep check failed")
                    .build();
        }
        // Then perform the deep check itself by creating a test order end to end
        UUID testOrderId = orderService.createOrder(
                testPrefix + UUID.randomUUID(),
                100.0,
                UUID.randomUUID());
        return Health.up()
                .withDetail("id for test order", testOrderId.toString())
                .withDetail("service", "Order service deep check passed")
                .withDetail("paymentDeepCheck", paymentDeepCheckResponse.details().service())
                .build();
    } catch (Exception e) {
        return Health.down()
                .withDetail("deepCheck", "Failed to create test order")
                .withDetail("Error", e.getMessage())
                .withDetail("service", "Order service deep check failed")
                .build();
    }
}
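The response body is deserialized into a PaymentDeepCheckResponse. Its real definition lives in the linked repo; assuming it mirrors the actuator health payload (a top-level status plus a details block), it could be a record along these lines, which is the shape the details().service() accessor above expects:

// Assumed shape of the payment-service deep-check response; the repo's actual DTO may differ.
public record PaymentDeepCheckResponse(String status, Details details) {
    public record Details(String database, String service) { }
}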
It is pretty evident that a health check is going to be lightweight compared to a deep check, as the former performs fewer operations. Once you have implemented these checks, you can easily configure them in the application config as below:
management:
  endpoints:
    web:
      exposure:
        include: health
  endpoint:
    health:
      show-details: always
      group:
        liveness:
          include: orderServiceHealth
          show-details: always
        deep:
          include: orderServiceDeepCheck
          show-details: always
Here is a quick demo of these checks in action where we have both services running locally & we invoke both the health-check as well as the deep-check endpoints on both services. Initially, when both services are operational, we can see that both deep checks & health checks are successful. Then we bring down the payment service, which in turn starts failing the health checks. This quick feedback can be used in the form of heartbeats to take action & resolve the failure by either restarting the service pod or deploying a new service instance.
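To make the deployment-time usage concrete, here is a hedged Java sketch of a gate that polls the order-service deep-check endpoint before traffic is switched over; the path & port come from the demo manifests, while the class name, retry count & wait interval are illustrative assumptions:

import org.springframework.web.client.RestClient;

// Hypothetical deployment gate: only allow the traffic switch once the deep check reports UP.
// Actuator returns 2xx when the check is UP & 503 when it is DOWN, which retrieve() surfaces as an exception.
public class DeepCheckGate {

    public static void main(String[] args) throws InterruptedException {
        RestClient restClient = RestClient.create();
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                restClient.get()
                        .uri("http://localhost:8001/actuator/health/orderServiceDeepCheck")
                        .retrieve()
                        .toBodilessEntity();
                System.out.println("Deep check passed, safe to switch traffic to the new instance");
                return;
            } catch (Exception e) {
                System.out.println("Deep check attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(5_000);  // back off before retrying
            }
        }
        System.out.println("Deep check never passed, keep traffic on the old instance");
    }
}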
You can check out all the code for this demo on this GitHub branch.
Using Kubernetes
Kubernetes provides us with probes that can be configured to hit our health-check & deep-check endpoints. We specifically take a look at the liveness probe, which tells whether a service instance is alive, & the readiness probe, which tells whether the service is ready to accept traffic. The probes can be configured in the service deployment manifest as below. For order-service, the readiness probe points at the deep-check endpoint while the liveness probe points at the health-check endpoint. The manifest also provides configurations that can be tuned based upon your app's startup & bootstrap time.
readinessProbe:
  httpGet:
    port: 8001
    path: /actuator/health/orderServiceDeepCheck
  initialDelaySeconds: 30   # Initial wait period
  periodSeconds: 10         # Interval between checks
livenessProbe:
  httpGet:
    port: 8001
    path: /actuator/health/orderServiceHealth
  initialDelaySeconds: 60
  periodSeconds: 30
Now let's see a demo where we first deploy our services in Kubernetes with probes enabled & then introduce a failure to see how Kubernetes prevents a bad deployment from going through by checking the readiness probe.
You will notice that there are multiple logs for both the deep check as well as the health check in the order-service pod. These are triggered by the Kubernetes probes: the readiness probe ensures that the deployed pod is ready to serve traffic, & the liveness probe then keeps hitting the health-check endpoint to ensure that the pod remains in a healthy state. We also do a manual invocation of the deep-check endpoint, for which we can see the associated log.
Now let's see how the readiness probe prevents a bad deployment from going through. In our case, we have added a simple Spring filter that rejects every request with an internal server error as a fault injection.
public class FaultInjectionFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain filterChain)
            throws ServletException, IOException {
        // Deliberately never calls filterChain.doFilter(...), so every request,
        // including the actuator health endpoints, is rejected with a 500 response
        response.setStatus(HttpStatus.INTERNAL_SERVER_ERROR.value());
        response.setContentType(APPLICATION_JSON_VALUE);
        response.getWriter().write("{\"status\":\"FAILURE\",\"failureReason\":\"Injected failure\"}");
    }
}
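The post doesn't show how this filter gets wired in. One hedged way to do it (the property name & registration details below are assumptions, not the repo's setup) is to register it conditionally, so the same build can be deployed with or without the injected fault:

import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.boot.web.servlet.FilterRegistrationBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.Ordered;

// Hypothetical wiring for the fault-injection filter: only registered when the flag is enabled.
@Configuration
public class FaultInjectionConfig {

    @Bean
    @ConditionalOnProperty(name = "fault-injection.enabled", havingValue = "true")
    public FilterRegistrationBean<FaultInjectionFilter> faultInjectionFilter() {
        FilterRegistrationBean<FaultInjectionFilter> registration =
                new FilterRegistrationBean<>(new FaultInjectionFilter());
        registration.addUrlPatterns("/*");                   // reject every incoming request
        registration.setOrder(Ordered.HIGHEST_PRECEDENCE);   // run before other filters
        return registration;
    }
}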
If there were no readiness probes, this deployment would go through & Kubernetes would route customer traffic onto this pod, resulting in failures. But now that we have our readiness probe configured, Kubernetes waits for the pod to become ready before switching traffic over to the new instance. If the readiness probe fails, it keeps retrying up to the configured threshold.
In our case we can see that the readiness probes are failing:

While the old deployment remains intact & keeps serving customer traffic, Kubernetes continues retrying the readiness probe on the new deployment; since the injected fault also fails the liveness probe, the new pod keeps getting restarted.

You can view the code for Kubernetes readiness probe on this branch & the failure injection scenario on this GitHub branch.
Checks for health & readiness are core essentials of running a service in a cloud environment, as you need to ensure that a new config change or a new deployment doesn't end up creating downtime for your application. This can be done by inspecting the dependencies of each individual service & building guardrails around them. Once you have configured these checks, you can also integrate them with monitoring tools such as Prometheus so that you have clear visibility into the health of your service.
Hope this post was helpful. Happy learning.