How to build resilient microservices

Resilient microservices require resilient interservice communications. These patterns and Ballerina can help

Increasingly, developers rely on a microservices architecture to build an application as a suite of fine-grained, narrowly focused, and independent services, each developed and deployed independently. While the microservices approach fosters agility, it also brings new challenges, because these services have to interact with each other and with other systems, such as web APIs and databases, via network calls. And because the network is inherently unreliable, such interactions are susceptible to failure at any time.

Therefore, the resiliency of a microservices-based application—i.e., its ability to recover from certain types of failure and remain functional—depends heavily on how well the application handles inter-service communication over an unreliable network. In other words, the resiliency of such an application depends on how resiliently you implement microservices communications.

Given the sheer proliferation of microservices, building microservices communications over the network is becoming one of the hardest aspects of realizing a microservices architecture. At the same time, numerous resiliency patterns have emerged to address this challenge. These include timeout, retry, circuit breaker, fail-fast, bulkhead, transactions, load balancing, failover, and guaranteed delivery.

Let’s look at the commonly used microservices resiliency patterns. Then we’ll examine how these patterns can be applied using Ballerina, an emerging open source programming language that is optimized for integration and designed for building microservices and cloud-native applications. Finally, we will discuss the role of service meshes in offloading inter-service communication functions.

Resilience patterns for inter-service communication

A common mistake in distributed computing is assuming that the network will be reliable. A more prudent approach is planning for the eventual network failure. In this context, it is valuable to consider several patterns related to inter-application communication over an unreliable network, which Michael Nygard discusses in his book, Release It. Because microservices communications always take place over the network, those patterns can be directly applied to microservices architecture.

Timeout

Timeout is one of the most common resiliency patterns. With synchronous inter-service communication, one service (the caller) sends a request and waits for a response within a bounded time. The timeout pattern defines when the caller should stop waiting for that response.

So, when you call another service or system using a synchronous messaging protocol such as HTTP, you can specify how long you are willing to wait for a response, and if that timeout is reached, you can define specific logic to handle the incident. In this way, a timeout helps a service isolate the misbehavior or anomaly of another system or service, so that it does not become your service's problem.

Retry

A given service can suffer intermittent or transient failures and outages. The key concept behind the retry pattern is to obtain the expected response, despite network disruption, by invoking the same service one or more additional times. As part of the service invocation, you may specify when a given request should be retried, the number of retries for a given back-end service, and the interval between retries.

Circuit breaker

If a given microservice keeps failing, continuing to invoke it may cause further damage and cascading failures. A circuit breaker is a wrapper component you use when invoking external services: it prevents further invocations once failures reach a certain threshold.

As shown in the following figure, under normal circumstances the circuit is in the closed state, and the invocation of microservice B takes place normally. However, when failures match the circuit breaker's opening criteria, the circuit goes to an open state, preventing further invocations.

[Image: circuit breaker (WSO2)]

When an invocation fails, the circuit breaker stays in the closed state and updates the failure count. Based on that count or the frequency of failures, it opens the circuit. When the circuit is open, the real invocation of the external service is prevented, and the circuit breaker generates an error and returns immediately. After the circuit has been open for a certain period, we can apply a self-resetting behavior by trying the service invocation again and resetting the breaker should it succeed. This time interval is known as the circuit reset timeout. We can therefore identify the different states and state transitions of a circuit breaker instance, shown in the following state diagram.

[Image: circuit breaker states (WSO2)]

So, when the reset timeout is reached, the circuit moves to the half-open state, in which the circuit breaker allows one or more trial invocations of the external service. The circuit breaker changes back to the closed state if the trial succeeds, or back to the open state if it fails.

By design, the circuit breaker is a mechanism for gracefully degrading a system's functionality when a dependency is failing or performing poorly, preventing further damage and cascading failures. We can tune the circuit breaker with various back-off mechanisms, timeouts, reset intervals, error codes that trigger the open state, error codes to ignore, and so on.

Fail-fast

Detecting failures as quickly as possible is central to building resilient services. In the fail-fast pattern, the fundamental concept is that a fast failure response is much better than a slow failure response. Therefore, as part of both the service’s business logic and the network communications, we should try to detect potential errors as quickly as possible.

There are many ways to detect failures, and they vary from one use case to another. In certain situations, just by looking at the content of a message we can decide that the request is not valid. In other cases, we can check system resources—thread pools, connections, socket limits, databases, etc.—and the state of the downstream components in the request lifecycle. Fail-fast, together with timeout, helps us develop a microservices-based application that is stable and responsive.

Bulkhead

Bulkhead is a mechanism for partitioning your application so that an error occurring within a partition is localized to that partition only. This way, the error won’t bring the entire system to an unstable state; only the partition will fail. By design, a microservices-based application is good at isolating failures because each service focuses on a fine-grained business capability that is fully isolated from the rest of the system. If we can partition the service into multiple sub-components—e.g. components that are used for reading operations versus components that are used for update operations—we can independently design each component such that, in the event of a failure of one component, the other components still function as expected. For example, read operations can still be used even if update operations are suspended. 

Failover and load balancing

The ideas behind the failover and load balancing patterns are quite simple. Load balancing distributes the load across multiple microservice instances, while failover reroutes a request to an alternate service if a given service has failed. Various failover and load balancing techniques can be applied at different levels of software application development. In the context of microservices, we will focus only on client-side failover and load balancing, where a service that invokes other services uses failover or load balancing logic to keep service functionality highly available. With client-side load balancing (or failover), instead of relying on another component to distribute (or pick up) the load, the client decides where to send the traffic, using service discovery or a predefined set of endpoints and an algorithm such as round-robin.

Reliable delivery

Given that communications take place over an unreliable network, there are scenarios in which we need to make sure that message delivery between two services or systems is reliable, meaning no messages are lost in the communications between them. Usually, a third-party component such as a message broker is used to realize this pattern. Most existing message brokers and messaging protocols, such as AMQP (Advanced Message Queuing Protocol), provide much of the semantics required for reliable delivery use cases.
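As one hedged illustration of handing delivery off to a broker, the sketch below uses Ballerina's ballerinax/rabbitmq connector to publish a message to a RabbitMQ queue; the broker address, queue name, and payload are assumptions, not part of the original article.

```ballerina
import ballerinax/rabbitmq;

public function main() returns error? {
    // Connect to a local RabbitMQ broker (host and port are assumptions).
    rabbitmq:Client rmqClient = check new (rabbitmq:DEFAULT_HOST, rabbitmq:DEFAULT_PORT);

    // Declare the queue and hand the message to the broker, which then
    // takes over responsibility for delivering it to the consuming service.
    check rmqClient->queueDeclare("OrderQueue");
    check rmqClient->publishMessage({
        content: "order-12345".toBytes(),
        routingKey: "OrderQueue"
    });
}
```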

Transaction resiliency

Building transactions within and across service boundaries is a common requirement in developing microservices-based applications. When you build a microservice, you will often need transactions within the service boundary, which may include the service's private database and other resources. Such transactions, known as local transactions because they operate within the scope of an individual service, are relatively easy to implement.

When the transaction boundary spans multiple services and systems, we need to use distributed transactions to implement such use cases. There are certain patterns commonly used to distribute transactions across microservices. These include two-phase commits and sagas, an approach in which distributed transactions are built using the compensation operations of each participant service, which are coordinated by a central execution coordinator running in a distributed persistent layer. However, it is a good practice to avoid distributed transactions whenever possible and only use them when absolutely necessary for your use cases.

Using the Ballerina programming language for microservices

There are diverse technologies that allow us to build the microservice resiliency patterns we have discussed. Depending on the technology you use to build your microservices, you need to select a matching resiliency stack, for example Hystrix for Java microservices.

In our example here, we will use Ballerina, a new open source programming language that is designed for building microservices and cloud-native applications. Ballerina aims to fill the gap between integration products and general-purpose programming languages by making it easy to write programs that integrate and orchestrate across distributed microservices and endpoints in a type-safe and resilient manner.

Let’s try to use Ballerina to implement some of the microservice resilience patterns that we’ve discussed above. The source code for all of the following examples is available in this GitHub repo.

Timeout in Ballerina

For the timeout use case, suppose we have a legacy service that takes too long to respond, while the microservice we are developing is designed to time out after 3 seconds. To implement this with Ballerina, you just need to set the timeout as part of your outbound endpoint.

[Image: Ballerina timeout example, part 1 (WSO2)]
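Since the original snippet is published as an image, here is a minimal sketch of the same idea in current Ballerina syntax; the URL and endpoint name are placeholder assumptions, while the 3-second value follows the text above.

```ballerina
import ballerina/http;

// Outbound endpoint for the slow legacy backend; the URL is a placeholder.
// `timeout` is given in seconds, so the client gives up waiting after 3 seconds.
final http:Client legacyServiceEP = check new ("http://localhost:9090/legacy",
    timeout = 3
);
```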

In the event of a timeout scenario, an error will be thrown as the result of the invocation, and you will need to explicitly handle the business logic of a timeout-related scenario.

[Image: Ballerina timeout example, part 2 (WSO2)]
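Continuing the sketch above, the error returned by the client call can be handled explicitly; the service path, resource name, and fallback message are illustrative assumptions.

```ballerina
service /product on new http:Listener(8080) {

    resource function get details() returns string {
        do {
            // `check` surfaces a timeout (or any other failure) as an error.
            string response = check legacyServiceEP->get("/details");
            return response;
        } on fail error err {
            // Business logic for the timeout case; this sketch simply
            // degrades to a canned response.
            return "The backend did not respond in time; please try again later.";
        }
    }
}
```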

Retry in Ballerina

For a Retry use case, imagine you have a service that intermittently fails, and retrying the same request may return the expected results. You can implement such an endpoint at the Ballerina level by adding the retry configuration to your HTTP client endpoint.

[Image: Ballerina retry example (WSO2)]
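Again, the published snippet is an image; a minimal sketch of a retry-enabled client in current Ballerina syntax might look like the following, where the URL and the retry values are illustrative assumptions.

```ballerina
import ballerina/http;

// Endpoint for a backend that fails intermittently; the URL is a placeholder.
final http:Client flakyServiceEP = check new ("http://localhost:9090/inventory",
    retryConfig = {
        count: 3,            // give up after 3 retries
        interval: 2,         // wait 2 seconds before the first retry
        backOffFactor: 2.0,  // double the interval on every subsequent retry
        maxWaitInterval: 10  // never wait more than 10 seconds between retries
    }
);
```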

Circuit breaker in Ballerina

The circuit breaker implementation with Ballerina is quite similar to the timeout. Here we augment our HTTP client code with the circuit breaker configuration. You can define the various conditions that move the circuit breaker between the open, closed, and half-open states. When the circuit is open, the invocation of the endpoint will throw an error, and we can handle the circuit-open behavior as part of that error-handling logic. In this example, we will set a default message as the response.

[Image: Ballerina circuit breaker example, part 1 (WSO2)]
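As a hedged sketch of such a configuration in current Ballerina syntax (the URL, thresholds, and status codes are illustrative, not taken from the original image):

```ballerina
import ballerina/http;

// Backend endpoint wrapped with a circuit breaker.
final http:Client orderServiceEP = check new ("http://localhost:9090/orders",
    circuitBreaker = {
        rollingWindow: {
            timeWindow: 10,            // evaluate failures over a 10-second window
            bucketSize: 2,             // split the window into 2-second buckets
            requestVolumeThreshold: 0  // minimum requests before the stats apply
        },
        failureThreshold: 0.3,         // open the circuit when 30% of calls fail
        resetTime: 10,                 // move to half-open after 10 seconds
        statusCodes: [500, 502, 503]   // responses treated as failures
    }
);
```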

In the event of a “circuit open” scenario, an error will be thrown without invoking the back-end service. You can write the business logic for the scenario as part of your error-handling logic.

[Image: Ballerina circuit breaker example, part 2 (WSO2)]
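Continuing the sketch, the error raised while the circuit is open can be handled like any other client error; the service, resource, and default message below are assumptions.

```ballerina
service /orders on new http:Listener(8080) {

    resource function get status() returns string {
        do {
            // While the circuit is open this call fails immediately,
            // without reaching the back-end service.
            string response = check orderServiceEP->get("/status");
            return response;
        } on fail error err {
            // Default response used for the circuit-open (or any other) failure.
            return "Order status is temporarily unavailable.";
        }
    }
}
```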

Failover and load balancing in Ballerina

Failover scenarios can be implemented with Ballerina by using a dedicated failover endpoint and specifying the failover URLs and failover parameters. The sample use case below mimics failover behavior across three endpoints located in different geographical regions.

[Image: Ballerina failover example (WSO2)]
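A minimal sketch of such a failover endpoint in current Ballerina syntax, with placeholder regional URLs and illustrative failover parameters:

```ballerina
import ballerina/http;

// Failover across three regional deployments; URLs are placeholders.
final http:FailoverClient regionalEP = check new (
    targets = [
        {url: "http://us-east.example.com/service"},
        {url: "http://eu-west.example.com/service"},
        {url: "http://ap-south.example.com/service"}
    ],
    failoverCodes = [500, 502, 503, 504],  // responses that trigger failover
    interval = 2                           // seconds to wait between attempts
);
```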

Similarly, we can build load balancing scenarios using Ballerina’s load balancing endpoint. Here also, we can define a set of load balancing endpoint URLs and other parameters, such as the load balancing algorithm.

[Image: Ballerina load balancing example (WSO2)]
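A corresponding sketch of a load-balancing endpoint, again with placeholder URLs; round robin is the module's default algorithm, so no explicit rule is configured in this sketch.

```ballerina
import ballerina/http;

// Round-robin load balancing across three instances of the same service.
final http:LoadBalanceClient catalogEP = check new (
    targets = [
        {url: "http://10.0.0.1:9090/catalog"},
        {url: "http://10.0.0.2:9090/catalog"},
        {url: "http://10.0.0.3:9090/catalog"}
    ]
);
```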

Transactions in Ballerina

Using Ballerina’s transaction capabilities, you can build a distributed transaction based on two-phase commit. For example, let’s take a distributed transaction scenario for travel booking. Suppose that there is a travel management microservice responsible for booking the airline and hotel for a trip, and everything has to be done in a single transaction.

If we build our travel management service using Ballerina, invocation of the airline and hotel service can be done inside a “transaction” block as shown in the following code snippet. Since the travel management service initiates the transaction, it is known as the initiator. The other services that participate in the same transaction, i.e., airline and hotel services, are known as participants.

[Image: Ballerina transactions example, part 1 (WSO2)]
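Since the original snippet is an image, here is a rough sketch of what the initiator side could look like in current Ballerina syntax; the endpoint URLs, paths, and function name are assumptions, and how the remote HTTP calls are enlisted as transaction participants depends on the Ballerina version and its transaction coordination setup.

```ballerina
import ballerina/http;

// Participant endpoints of the travel-booking transaction; URLs are placeholders.
final http:Client airlineEP = check new ("http://localhost:9091/airline");
final http:Client hotelEP = check new ("http://localhost:9092/hotel");

// Initiator side: either both bookings are committed or the whole
// transaction is rolled back.
function bookTrip(json trip) returns error? {
    transaction {
        // A failed `check` exits the block and rolls the transaction back.
        json airlineBooking = check airlineEP->post("/book", trip);
        json hotelBooking = check hotelEP->post("/book", trip);
        check commit;
    }
}
```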

Airline and hotel services implement their business logic related to the same transaction inside a “transaction” block as well. Based on the business logic, they can abort the transaction at any time, which results in the rollback of the same transaction across other parties. The following code snippet shows the sample code of a participant airline service or hotel service.

[Image: Ballerina transactions example, part 2 (WSO2)]

The current transaction implementation is based on two-phase commit, but there are ongoing efforts within the Ballerina project to create a compensation-based transaction implementation that will allow you to build a saga pattern with Ballerina.

Other resiliency patterns for microservices

Other resiliency patterns, such as fail-fast, are quite straightforward to implement, as most of the required error handling is built into the frameworks or languages (e.g. Ballerina error handling). Patterns such as bulkhead are often implemented at the service boundary and deployment levels. Guaranteed delivery is usually implemented with a message broker.

Service meshes for resilient microservices communication

Service mesh is an emerging pattern for addressing some of the concerns related to resilient inter-service communication. One of the main challenges we face in realizing inter-service communication is that the same commodity features required for resilient inter-service communication have to be duplicated across multiple languages and frameworks (e.g. a circuit breaker implementation for Java, Node, Go, etc.). The key idea behind using a service mesh is to offload these capabilities to a separate service mesh layer.

[Image: service mesh (WSO2)]

At a high level, a service mesh is an inter-service communication infrastructure. With a service mesh, a given microservice won’t communicate directly with the other microservices. Rather, all service-to-service communication takes place on top of a software component called a “service mesh proxy” or “sidecar proxy.” The sidecar is co-located with the service in the same virtual machine or Kubernetes pod, providing capabilities such as circuit breaking, timeouts, and observability (among others) for your microservices. The configuration of the sidecars is managed by a separate, central component called the control plane.

With the service mesh, microservice developers can focus more on the business logic while most of the work related to network communications is offloaded to the service mesh.

For instance, you no longer need to worry about circuit breaking when your microservice calls another service. Service mesh is still a fairly new architectural pattern and has yet to be widely adopted for production use cases.

Kasun Indrasiri is the director of integration architecture at WSO2. He is a key member of WSO2’s architecture team that drives the development efforts of WSO2’s integration platform.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2018 IDG Communications, Inc.