Failure resilient model using circuit breakers for Microservices


This article has been sitting in my drafts for a long time. I gave it as a talk roughly a year ago at Rootconf, but it got buried under other things.

If you’re wondering: why microservices? Does everyone need microservices? It makes a lot of sense for a big company not to deploy everything as a single monolithic application (imagine the horrors). That said, this isn’t something I aim to answer in this blog post. For further reading, I think this presentation is very good and might give some insight into the topic.

Nevertheless, you don’t need to already be on microservices to read further; you might be able to use some of this model in your application right away! The model used throughout this post is a world-facing application that works with a set of microservices. Our system therefore has two kinds of components: the main application, which serves the customer, and the dependencies that the application consumes.

Our guiding principles

We will start with a simple principle that we will focus upon:

Always design for failures

Always assume that one or all of your microservices WILL go down. We aim to do the best we can for our customer when that scenario arises. Let’s list some guiding rules for our principle:

  • Don’t fail if your microservice goes down: We want to serve our customers even if a microservice is unavailable.
  • The application should not wait forever for a microservice: Customers are impatient, don’t make them wait.
  • Contain and isolate failures: Limit the number of requests that are affected by any failure, and isolate failures of different microservices from each other.
  • Respect the service when it is slow: A service being down isn’t someone else’s problem, it’s your company’s; we want to make sure we don’t make things worse for the struggling service.
  • Fail fast – recover fast: Our application should fail with as few requests affected as possible, and recover as soon as the failing service is back up.

If you think about it, the first two points in the list above are fairly straightforward to solve: simply add timeouts to your service calls and serve a fallback when they fire, and you have taken care of them.
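To make that concrete, here is a minimal sketch of a timed-out service call with a fallback, assuming the dependency is an HTTP service reached with the requests library (the endpoint and the fallback shape are made up for illustration):

```python
import requests

def call_service_a(payload):
    """Call Service A, but never wait more than 1 second for a response."""
    try:
        # The timeout covers both connecting and reading the response.
        resp = requests.get("http://service-a.internal/price",  # hypothetical endpoint
                            params=payload, timeout=1.0)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Serve a sensible default instead of failing the whole page.
        return {"price": None, "source": "fallback"}
```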

A simple example with just timeouts

Let’s quickly look at what happens if we don’t design our application for failures. We build a simple multithreaded application that makes requests to our remote services: an HTTP handler in front, and a set of application threads that run our application code and asynchronously call remote services A and B.
[Figure: Application]

We will pick some numbers for our application.

Number of threads: 100
Average response time from each service: 100ms
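As a rough sketch of the design in the diagram (not actual code from the talk), every request fans out to both services from one shared pool of 100 application threads. Here call_service_a is the helper from the earlier sketch, while call_service_b and render_page are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

# One shared pool for all remote calls -- the naive design.
APP_POOL = ThreadPoolExecutor(max_workers=100)

def handle_request(payload):
    # Fan out to both services concurrently from the same shared pool.
    future_a = APP_POOL.submit(call_service_a, payload)
    future_b = APP_POOL.submit(call_service_b, payload)
    # If Service A slows down, these futures keep holding shared threads,
    # and every other request in the application starves along with them.
    return render_page(future_a.result(), future_b.result())
```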

Failing service A

[Figure: Service A not working]
Let’s think about the happy case first. Since both services are called concurrently, each request holds two threads for 100 ms, so we can process 50 requests in every 100 ms. That means our application serves up to 50 / 0.1 = 500 requests per second.
Now consider the (likely) scenario where Service A has trouble responding to our requests. We were smart enough to add 1-second timeouts to Service A calls.

Under these stressed conditions, every Service A remote call now takes 1 second before timing out; we can ignore the time taken by Service B in this example. So we have 100 threads occupied for roughly 1 second per request, and our throughput quickly drops to about 100 / 1 = 100 RPS (a little less if we count Service B’s 100 ms), a fifth of our original RPS.
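A quick back-of-the-envelope check of those numbers, treating each request as the thread-seconds it consumes (this is just Little’s law rearranged; purely illustrative):

```python
THREADS = 100

def max_rps(thread_seconds_per_request):
    # Little's law: concurrency = throughput x latency, rearranged for throughput.
    return THREADS / thread_seconds_per_request

print(max_rps(0.1 + 0.1))  # healthy: A and B each hold a thread for 100 ms -> 500 RPS
print(max_rps(1.0 + 0.1))  # A timing out after 1 s -> ~91 RPS, roughly a fifth of before
```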
Going back to our rules above:

  • We didn’t fail when a microservice was unavailable.
  • We didn’t wait forever; the customer was given a fallback after 1 second.
  • We were unable to contain and isolate failures: Each and every request was affected, and Service B, which was operating normally, couldn’t be served to the customer, so we failed to isolate the failures from each other.
  • We didn’t respect Service A: we kept sending it as many requests as possible, piling more and more load onto an already struggling service.
  • We failed fast, but we wouldn’t recover fast: when Service A comes back up, it first has to work through the backlog of requests before our application recovers completely.

Overall, we did not design for failures: every customer was affected, and a large majority saw their requests rejected outright.

Circuit breaker mechanism

Next, let’s introduce the circuit breaker mechanism and an uncomplicated way to implement it. Just like its electrical namesake, it stops sending requests to a service when the service is unavailable, by “opening” the circuit. We go one step better than an electrical breaker: we close the circuit again automatically whenever the service becomes available.
[Figure: Circuit breaker]

The diagram above should give you an idea of what is happening. When everything is healthy, the circuit is closed and all requests flow through our circuit breaker setup to the service. When Service A is unavailable, the circuit opens and we start rejecting requests without even sending them to the remote service. Both our application and the remote service benefit from this.
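Before wiring this into thread pools, here is what the breaker’s state machine might look like on its own, as a minimal (and not thread-safe) illustrative class; the thresholds and the CircuitBreaker name are mine, not from any particular library:

```python
import time

class CircuitBreaker:
    """Minimal count-based breaker: closed = calls flow, open = calls are rejected."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()              # open: fail fast, no remote call at all
            # reset_timeout has passed: let one trial call through ("half-open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker (open the circuit)
            return fallback()
        self.failures = 0
        self.opened_at = None                       # success: close the circuit again
        return result
```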

Implementation of a simple circuit breaker

In the most trivial implementation, we simply create a separate thread pool per service in order to isolate each set of remote service calls.
[Figure: Circuit breaker implementation]
We again pick numbers for our threads. If we need to serve 500 RPS at an average latency of 100 ms, each of the services needs a thread pool of size 500 × 0.1 = 50.
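Here is a sketch of such a per-service pool in Python, using a semaphore to cap the number of in-flight plus queued calls so that a saturated pool rejects immediately instead of queueing forever; the ServicePool name and its parameters are made up for illustration:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ServicePool:
    """Bulkhead for one dependency: its own threads, a bounded queue, fail-fast rejection."""

    def __init__(self, name, workers=50, queue_size=50):
        self.pool = ThreadPoolExecutor(max_workers=workers, thread_name_prefix=name)
        # Permits cover both running and queued calls; once they run out, we reject.
        self.permits = threading.Semaphore(workers + queue_size)

    def call(self, fn, fallback, *args, timeout=1.0, **kwargs):
        if not self.permits.acquire(blocking=False):
            return fallback()                     # pool saturated: fail fast, skip the remote call
        future = self.pool.submit(fn, *args, **kwargs)
        # Give the permit back when the worker actually finishes, not when we stop waiting.
        future.add_done_callback(lambda _f: self.permits.release())
        try:
            return future.result(timeout=timeout)  # the customer never waits longer than this
        except Exception:                          # timed out, or the call itself failed
            return fallback()

service_a = ServicePool("service-a", workers=50, queue_size=50)
service_b = ServicePool("service-b", workers=50, queue_size=50)
```

Each remote call then goes through its own pool, e.g. `service_a.call(call_service_a, fallback_fn, payload)` with a hypothetical `fallback_fn`, so a slow Service A can only ever tie up its own 50 threads.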
Let’s repeat the case where Service A goes down. What happens? With our 1-second timeouts, the Service A thread pool fills up with calls waiting to time out. This time, though, the rest of the application is unaffected and keeps running as before. Every second, 50 requests (the size of the Service A thread pool) get blocked and delay their customers by 1 second; the remaining 450 requests simply skip calling Service A and the application continues running.
Let’s think again about our rules:

  • We didn’t fail when a microservice was unavailable.
  • We didn’t wait forever; 10% of customers got a fallback after 1 second, while the other 90% got a fallback immediately.
  • We contained and isolated the failure of Service A: Only 10% of requests were affected, and the application kept serving its content.
  • We reduced the load on Service A, and possibly helped it recover.
  • We failed fast, and recovered immediately: when Service A comes back up, the thread pool holds at most 50 requests, so it empties immediately and we recover.

We designed for failure scenarios and gave a good experience to our customers without adding load on any of our systems.

Some more details about a real-world implementation

  • Each of these separate thread pools should have a fixed queue size (note that an unbounded queue is the path to doom). Use Little’s law to tune the number of threads. Queue size is trickier and might need some trial and error, but unless you set it far too high (10x the thread pool size) or far too low (10% of the thread pool size), it will most likely not give you any trouble.
  • We also like to add retries to timed-out service calls, which makes the customer experience better. Assuming our timeouts are above our P99 latencies, we retry only about 1% of requests, which adds negligible load on our dependencies (see the sketch after this list).
  • Respond to customers even faster, and shrink the set of affected customers: a simple way to do that is to keep the service-call timeout higher (say 3 seconds in the example above) and the application-level timeout at 1 second. The service-call thread pools then fill up and start rejecting even faster, without denting the customer experience, and fewer customers end up affected.
  • Testing this setup is important. One easy way to simulate failures is to use iptables to drop incoming packets from the service you want to fail (or a tool like tc/netem to add latency to them).
  • Monitor your setup for the number of retries and rejections. The retry count tells us how much additional load we are adding to the service, and the rejection count tells us exactly what the customer experience looks like.
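For the retry point above, a small illustrative wrapper; the attempt count and the assumption that the underlying call raises TimeoutError on a timeout are mine:

```python
def call_with_retry(fn, fallback, attempts=2, **kwargs):
    """Retry only timed-out calls; with timeouts above P99, this retries ~1% of traffic."""
    for _ in range(attempts):
        try:
            return fn(**kwargs)
        except TimeoutError:
            continue          # retry once more on a timeout
        except Exception:
            break             # any other failure goes straight to the fallback
    return fallback()
```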

References:
  • Netflix’s Hystrix implementation using semaphores is something everybody should read about.
  • Martin Fowler’s introduction to circuit breakers.

Credits: Alex Koturanov, my mentor at Amazon, who always guided me with his depth of knowledge. We worked closely together on putting this fault-tolerant model in place.
