Circuit Breakers: Break, before it breaks you!

From the title, it might look like this blog is about electrical circuits and 100 ways to save electricity, but that’s not the case. The concept we are going to talk about is similar in spirit, though.

With digitisation growing day by day, more and more people use web services on the internet in one way or another. With that, the challenges of scaling those services and keeping response times within SLAs also grow. Social networking websites, e-commerce websites, gaming websites, audio and video streaming applications and many others serve thousands of requests per second and respond within a few milliseconds.

How do they do this? By following patterns that make their systems resilient, fault tolerant and highly scalable. And guess what? Circuit Breaker is one such pattern.

Circuit Breaker Pattern:

Remote calls: any call to a software component that runs elsewhere and requires communication over the network can be called a remote call. Such calls have a high chance of failure due to network issues, hardware failures, etc. Even when they don’t fail, they may hang for a while and respond very slowly. In such cases you often retry a few times and hope that the service will respond. This works when the problem is transient, but it can be dangerous in cases like the one below if you keep hammering the service.
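To make the "retry a few times" idea concrete, here is a minimal sketch of a bounded retry that pauses a little longer after every failure instead of hammering the service. The class name BoundedRetry, its parameters and the doubling delay are assumptions made for illustration; they are not part of any library mentioned in this post.

```java
import java.util.concurrent.Callable;

final class BoundedRetry {
    /**
     * Runs remoteCall up to maxAttempts times and doubles the pause after each failure.
     * Rethrows the last failure when every attempt has failed.
     */
    static <T> T call(Callable<T> remoteCall, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return remoteCall.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // back off before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

Even a polite retry like this only helps with transient problems. If the service stays down, every client still spends all of its attempts on it, which is exactly the situation described next.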

A service has 100 clients, and each of them hits the service very frequently (1000 req/sec) to get data. So the service needs to handle 100 * 1000 = 100,000 req/sec on average. Suddenly, a few nodes of the service run into problems and start dying. Soon after, a few clients see exceptions while consuming the service and keep retrying. As the number of healthy nodes goes down, the remaining nodes take on more load and tend to die as well. This cascading effect can be fatal for the service if it doesn’t recover in time. But the problem is that the clients are not giving the service time to recover; they just keep retrying and hitting timeouts. This is the problem on the server side.
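A quick back-of-the-envelope calculation shows why the remaining nodes suffer. The sketch below spreads the 100 * 1000 req/sec evenly over the healthy nodes. The cluster size of 10 nodes is purely an assumption for illustration (the scenario above does not give one), and the extra traffic caused by retries is not modelled at all, so real numbers would be worse.

```java
final class LoadPerNode {
    /** Requests per second each healthy node must absorb, rounded up, with load spread evenly. */
    static long requestsPerNode(int clients, int requestsPerClientPerSec, int healthyNodes) {
        if (healthyNodes <= 0) {
            throw new IllegalArgumentException("no healthy nodes left");
        }
        long total = (long) clients * requestsPerClientPerSec;
        return (total + healthyNodes - 1) / healthyNodes;
    }

    public static void main(String[] args) {
        // Assumed cluster of 10 nodes, losing one node at a time.
        for (int healthy = 10; healthy >= 5; healthy--) {
            System.out.println(healthy + " healthy nodes: "
                    + requestsPerNode(100, 1000, healthy) + " req/sec per node");
        }
    }
}
```

Under these assumptions each node handles 10,000 req/sec when all 10 are healthy, 12,500 when two are gone and 20,000 when only five remain.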

On the client side, since the service is not responding, your threads are blocked until a timeout is thrown (a connection timeout can be hundreds to thousands of ms). The chances of failure are high, and they keep growing as every client hammers the service and its performance keeps degrading. You end up wasting your resources on a service that is dying, and you still give a failure response to your own clients, just a lot more slowly. This impacts your clients and, in turn, their clients. The failure cascades until the majority of your system is down (or slow) and, in the end, your customers are unhappy. This is the last thing you want.
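How many threads does a slow service tie up? A rough estimate comes from Little’s law: calls in flight = arrival rate * time per call. The sketch below applies it to one client sending 1000 req/sec. The 12 ms healthy latency and the 1000 ms timeout are assumed values for illustration, and the helper name is hypothetical.

```java
final class BlockedThreads {
    /** Calls in flight at a steady rate (Little's law), rounded up to whole threads. */
    static long threadsInFlight(long requestsPerSec, long latencyMillis) {
        return (requestsPerSec * latencyMillis + 999) / 1000;
    }

    public static void main(String[] args) {
        System.out.println("healthy (12 ms per call): " + threadsInFlight(1000, 12) + " threads busy");
        System.out.println("hanging (1000 ms timeout): " + threadsInFlight(1000, 1000) + " threads busy");
        System.out.println("failing fast (1 ms): " + threadsInFlight(1000, 1) + " thread busy");
    }
}
```

With these numbers the same traffic needs about 12 threads when the service is healthy and about 1000 when every call waits for the timeout, which is how a dying dependency exhausts the thread pool of an otherwise healthy client.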

How do we deal with this problem? Before answering, think about when the problem started. It started when a few nodes of the service went down, but it got worse because the clients didn’t give the service enough of a chance to recover. So what was needed? A fail-fast strategy could have helped. But how?

When you see that a service has not been responding for a while and some percentage of requests (above a chosen threshold) are failing, you simply assume that subsequent calls are going to fail as well, and you fail on behalf of the service. You fail fast, without even calling the service, which saves a lot of time and resources (your threads are no longer waiting on this service) that can be used for something else. On the server side, this reduces the number of requests, which gives the service some time to recover. After some time, you try hitting the service again to see if it is back to normal. If yes, that’s awesome. If not, you keep failing fast and hope the service recovers as soon as possible.
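The "some percentage of requests are failing" check can be sketched in a few lines. The class below is a hypothetical illustration, not code from any library: it counts outcomes since the last reset and reports when the failure rate crosses a threshold, but only after a minimum number of requests, so that one early failure does not trip it. A real implementation would also limit the count to a time window.

```java
final class FailFastDecision {
    private final int minimumRequests;          // do not judge on too few samples
    private final double failureRateThreshold;  // for example 0.5 for 50%
    private int total;
    private int failed;

    FailFastDecision(int minimumRequests, double failureRateThreshold) {
        this.minimumRequests = minimumRequests;
        this.failureRateThreshold = failureRateThreshold;
    }

    /** Records the outcome of one call to the service. */
    void record(boolean success) {
        total++;
        if (!success) {
            failed++;
        }
    }

    /** True when enough calls were seen and the share of failures reached the threshold. */
    boolean shouldFailFast() {
        return total >= minimumRequests
                && (double) failed / total >= failureRateThreshold;
    }

    /** Forget everything, for example once the service looks healthy again. */
    void reset() {
        total = 0;
        failed = 0;
    }
}
```

The implementation later in this post uses a simpler rule, a fixed number of failures inside a time window, which is easier to follow but reacts the same way under light and heavy traffic.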

Wait, but what about the end customers? They are unhappy again. Yes, that’s true. But the big difference is that you don’t make them wait for a long time and then show them a failure message. Instead, you fail very fast, which is a better experience for the customer than wasting their time and still being unable to serve them. Remember, customers nowadays are very impatient. They will go to some other website if yours is slower by just a few milliseconds.

The circuit breaker pattern helps you achieve exactly this.

In general it is a very simple pattern. You wrap your function/method/behaviour (for example a service or DB call) in another function that keeps track of its failures and successes and opens or closes the circuit for you. As with an electrical circuit, a closed circuit lets calls flow through to the service and an open circuit blocks them. Sounds simple enough. It is simple.

As coders, we understand this better by seeing some code than by reading literature. Let’s look at a very basic version, JUST to get a feel for what’s happening.

Your function:

String getStringFromExternalService() throws Exception {
    return simpleStringService.getString();
}

This is a simple function which calls the service named SimpleStringService.
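Since a wrapper has to decide when (and whether) to run this function, it helps to hand the function over as a Callable. Here is a minimal sketch of that step. The SimpleStringService interface and the ExternalServiceCaller class are hypothetical stand-ins, because the post does not show the real client.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for the remote client used in this post.
interface SimpleStringService {
    String getString() throws Exception;
}

final class ExternalServiceCaller {
    private final SimpleStringService simpleStringService;

    ExternalServiceCaller(SimpleStringService simpleStringService) {
        this.simpleStringService = simpleStringService;
    }

    String getStringFromExternalService() throws Exception {
        return simpleStringService.getString();
    }

    // The same call packaged as a Callable: nothing runs until someone invokes call().
    Callable<String> asCallable() {
        return this::getStringFromExternalService;
    }
}
```

Whoever holds the Callable controls the call, so a wrapper can run it, time it, count its failures or skip it altogether.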

Below is a very simple implementation of the circuit breaker pattern.

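import java.util.concurrent.Callable;

import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.Setter;
import lombok.ToString;

@Getter
@Setter
@ToString
@EqualsAndHashCode
public class CircuitBreaker<T> {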
    
    //current number of failures.
    private int currentFailure; 

    //threshold value for failures.
    private int thresholdNumOfFailures = 5; 
    
    //window size for the threshold failures.
    private int slidingWindowSizeinSeconds=10; 
    
    //Successful no. of calls to close the circuit.
    private int numOfSuccessesToCloseCircuit=1;
    
    //Flag to indicate the circuit state - OPEN/CLOSE ?
    private boolean isCircuitOpen = false;

    //First failure captured.(timestamp)
    private long firstFailure;
    
    //Time between an attempt to close the circuit and
    // when the circuit breaker was opened.
    private int sleepTimeInSec=1;
    
    //Time when circuit breaker was opened.
    private long circuitBreakerOpeningTime;
    
    //The actual behaviour which needs to be wrapped.
    private Callable<T> callable;

    //Some methods.
}

 

Now let’s implement the basic wrapping function and the logic to open and close the circuit.

public T execute() throws Exception {
    if (isCircuitOpen) {
        if (System.currentTimeMillis() < circuitBreakerOpeningTime + sleepTimeInSec * 1000L) {
            // Still within the sleep time: fail fast without calling the service.
            throw new CircuitBreakerException("Circuit is open, cannot reach service");
        }
        // Sleep time is over: let one trial call through.
        try {
            T result = callable.call();
            // Looks like the service is back, closing the circuit.
            resetCircuitBreaker();
            return result;
        } catch (Exception e) {
            // Still failing: keep the circuit open and restart the sleep time.
            circuitBreakerOpeningTime = System.currentTimeMillis();
            throw new CircuitBreakerException("Circuit is open, cannot reach service");
        }
    }
    try {
        return callable.call();
    } catch (Exception e) {
        recordFailure();
        throw e;
    }
}

//Some helper methods.
private void recordFailure() {
    // Start a new window on the first failure, or when the previous window has expired.
    if (currentFailure == 0 || !isWithinSlidingWindow(firstFailure)) {
        firstFailure = System.currentTimeMillis();
        currentFailure = 0;
    }
    currentFailure++;
    if (currentFailure >= thresholdNumOfFailures) {
        circuitBreakerOpeningTime = System.currentTimeMillis();
        isCircuitOpen = true;
    }
}

private void resetCircuitBreaker() {
    isCircuitOpen = false;
    currentFailure = 0;
}

private boolean isWithinSlidingWindow(final long firstFailure) {
    return System.currentTimeMillis() <= firstFailure + slidingWindowSizeinSeconds * 1000L;
}

So basically, while the circuit is OPEN the callable is not executed, except for an occasional trial call once the sleep time has passed.

Running this through a simple simulator gives the result below.

I am from external service
Call to external service took 12ms
I am from external service
Call to external service took 11ms
I am from external service
Call to external service took 11ms
I am from external service
Call to external service took 13ms
java.lang.Exception: External Service is down
Call to external service took 106ms
java.lang.Exception: External Service is down
Call to external service took 101ms
java.lang.Exception: External Service is down
Call to external service took 102ms
java.lang.Exception: External Service is down
Call to external service took 102ms
java.lang.Exception: External Service is down
Call to external service took 105ms
com.geeknarrator.CircuitBreakerException
Call to external service took 1ms
com.geeknarrator.CircuitBreakerException
Call to external service took 2ms
com.geeknarrator.CircuitBreakerException
Call to external service took 1ms
com.geeknarrator.CircuitBreakerException
Call to external service took 1ms
com.geeknarrator.CircuitBreakerException
Call to external service took 2ms
com.geeknarrator.CircuitBreakerException
Call to external service took 3ms
Check if service is back to normal...
java.lang.Exception: External Service is down
Call to external service took 106ms
Keep the circuit open...
com.geeknarrator.CircuitBreakerException
Call to external service took 2ms
com.geeknarrator.CircuitBreakerException
Call to external service took 3ms
com.geeknarrator.CircuitBreakerException
Call to external service took 1ms
Check if service is back to normal...
I am from external service
Call to external service took 13ms
Looks like the service is back, Closing the circuit...
Circuit is closed..
I am from external service
Call to external service took 13ms
I am from external service
Call to external service took 13ms
I am from external service
Call to external service took 13ms

The simulator log above clearly shows that we failed very fast when the circuit was OPEN (1 to 3 ms instead of about 100 ms), as we didn’t call the service.
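The simulator itself is not shown in this post, so here is a minimal stand-in you can run to see the same behaviour. Everything in it is an assumption made for illustration: MiniBreaker is a compact, single-threaded breaker without a sliding window, it throws IllegalStateException instead of CircuitBreakerException, and the clock is simulated so that every run gives the same result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

// Compact breaker used only by this simulation (hypothetical, not the class above).
final class MiniBreaker<T> {
    private final Callable<T> callable;
    private final LongSupplier clock;   // current time in ms, injected so the run is repeatable
    private final int threshold;        // failures needed to open the circuit
    private final long sleepMillis;     // how long to fail fast before a trial call
    private int failures;
    private boolean open;
    private long openedAt;

    MiniBreaker(Callable<T> callable, LongSupplier clock, int threshold, long sleepMillis) {
        this.callable = callable;
        this.clock = clock;
        this.threshold = threshold;
        this.sleepMillis = sleepMillis;
    }

    T execute() throws Exception {
        if (open) {
            if (clock.getAsLong() < openedAt + sleepMillis) {
                throw new IllegalStateException("Circuit is open, failing fast");
            }
            try {
                T result = callable.call();   // trial call
                open = false;
                failures = 0;
                return result;
            } catch (Exception e) {
                openedAt = clock.getAsLong(); // still down: restart the sleep time
                throw new IllegalStateException("Trial call failed, circuit stays open", e);
            }
        }
        try {
            return callable.call();
        } catch (Exception e) {
            if (++failures >= threshold) {
                open = true;
                openedAt = clock.getAsLong();
            }
            throw e;
        }
    }
}

final class BreakerSimulation {
    // One call every 300 ms of simulated time; the service is down from tick 2 to tick 8.
    static List<String> run() {
        long[] now = {0};
        boolean[] serviceUp = {true};
        int[] realCalls = {0};
        Callable<String> service = () -> {
            realCalls[0]++;
            if (!serviceUp[0]) {
                throw new Exception("External Service is down");
            }
            return "I am from external service";
        };
        MiniBreaker<String> breaker = new MiniBreaker<>(service, () -> now[0], 3, 1000);
        List<String> log = new ArrayList<>();
        for (int tick = 0; tick < 14; tick++) {
            now[0] = tick * 300L;
            serviceUp[0] = tick < 2 || tick > 8;
            try {
                log.add(tick + ": " + breaker.execute());
            } catch (Exception e) {
                log.add(tick + ": " + e.getClass().getSimpleName() + " - " + e.getMessage());
            }
        }
        log.add("real calls to the service: " + realCalls[0] + " of 14");
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this run the circuit opens after the third failure, the next three calls fail fast, one trial call fails and keeps the circuit open, and a later trial call succeeds and closes it. Only 8 of the 14 calls actually reach the service.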

Now, do you need to implement your own circuit breaker? The answer is NO. You don’t want to reinvent the wheel and then spend time testing it. There are open-source libraries that give you this feature without changing your code much.

Hystrix, from Netflix, is a widely used library with huge support, great documentation and testing at massive scale. It brings a lot of features with very little code, makes your highly distributed system fault tolerant and saves you from failures cascading to different parts of your system.

Javanica is another great library, which supports annotation-based Hystrix integration. Instead of writing additional code, you just annotate your methods with the @HystrixCommand annotation and get all the features out of the box.

Both are well documented: see the Hystrix documentation and the hystrix-javanica documentation.

Some important points to note based on my personal experience:

  • Before implementing the circuit breaker pattern, you should always have a good retry strategy, a fallback implementation and well-tuned timeouts for your services.
  • Circuit breakers are mainly useful if your service has a really high load (~1000 req/sec).
  • In most cases the default Hystrix configuration works, but you may need to tune it to get better performance.
  • The Javanica annotation for Hystrix doesn’t support CompletableFuture (that is, methods that return a CompletableFuture). In that case you might want to move the inner code to another method that returns something else and put the Hystrix annotation on that method.
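The first point mentions timeouts and fallback logic. As a rough illustration of how the two fit together, here is a plain-Java sketch (no Hystrix involved) that gives a call a time budget and returns a fallback value when the call is too slow or fails. The class name, the parameters and the idea of a cached fallback value are all assumptions for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class TimeoutWithFallback {
    /** Runs remoteCall with a time budget; returns the fallback value if it is slow or fails. */
    static <T> T call(Callable<T> remoteCall, long timeoutMillis, Supplier<T> fallback) {
        // One executor per call keeps the sketch short; a real service would share a pool.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback.get();
        } finally {
            executor.shutdownNow(); // also interrupts a call that is still running
        }
    }
}
```

A call such as TimeoutWithFallback.call(() -> service.getString(), 200, () -> "cached value"), where service is a hypothetical client, never blocks the caller for much longer than 200 ms, which is the behaviour a circuit breaker builds on.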

Hopefully you liked this article. Please subscribe, like, share it with your friends and post your feedback in the comments. Keep reading.

Cheers,

Kaivalya Apte

 
