Netflix Guide to Microservices

Netflix's microservices talk is one of the best resources for learning how systems scale.

It paints a clear picture of how an organization scales along with its technology and how it deals with the problems that come along the way. Out of many takeaways, I'm sharing a few learnings below.

Why did they go microservices? In the early days they had a monolith, as of course everyone does, but as they shipped faster and served a growing customer base, the monolith became a problem: it was hard to debug what went wrong, what caused an outage, and so on.

What does a typical microservice contain? It has a logic layer, a database, a cache to save DB calls, and an application (client) layer to access the service, so these are the four basic components required. Of course, you can add or remove pieces based on your needs.
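
As a minimal sketch of those four pieces, here is a hypothetical service with a cache-aside read path (check the cache first, fall back to the database). The class and method names are illustrative assumptions, not Netflix's actual code.

```python
# Minimal sketch of a microservice's internals: logic layer, DB, cache,
# and a thin application-facing API. All names are illustrative only.

class MovieService:
    def __init__(self, db, cache):
        self.db = db          # persistent store (here a plain dict stands in)
        self.cache = cache    # in-memory cache to save DB calls

    def get_movie(self, movie_id):
        """Application-layer entry point: cache-aside read."""
        record = self.cache.get(movie_id)
        if record is not None:
            return record                  # cache hit, no DB call
        record = self.db.get(movie_id)     # cache miss, hit the database
        if record is not None:
            self.cache[movie_id] = record  # populate cache for next time
        return record


# Usage, with plain dicts standing in for a real DB and cache
db = {"m1": {"title": "The Irishman"}}
service = MovieService(db=db, cache={})
print(service.get_movie("m1"))  # first call reads the DB
print(service.get_movie("m1"))  # second call is served from the cache
```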

Usually, there are four areas you should take care of while building toward microservices: Dependency, Scale, Variance, and Change.

Let's zoom into each.

Dependency

a. Inter-service communication: When two services communicate, network latency, hardware failure, or anything else can disrupt the call, and the service caught on the wrong end can go down, leading to cascading failures.

They built Hystrix, an internal library that wraps calls to dependencies, tracks their health, and short-circuits calls to unhealthy ones so failures don't cascade. https://netflixtechblog.com/introducing-hystrix-for-resilience-engineering-13531c1ab362
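
Hystrix itself is a Java library; the sketch below only illustrates the circuit-breaker idea it popularized: after enough consecutive failures, stop calling the dependency for a cooldown period and serve a fallback instead. The names and thresholds here are assumptions for illustration, not Hystrix's API.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: open after N consecutive failures,
    retry only after a cooldown. Thresholds are illustrative."""

    def __init__(self, max_failures=3, reset_after_s=30):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open (and not yet cooled down), skip the call entirely.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
            self.failures = 0      # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            return fallback()


breaker = CircuitBreaker()
# breaker.call(lambda: call_recommendation_service(user_id),
#              fallback=lambda: POPULAR_TITLES)  # hypothetical names
```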

b. Testing: How will you test every interaction between services? You might have two services today and ten tomorrow; you won't test N permutations for each release. The solution is to identify the set of critical services that must stay up in any case, so users can at least do the basic tasks, and test that critical path.

In Netflix's case, the minimal requirement was that a user is shown a basic set of recommendations and popular movies and is able to play them.
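
A hedged sketch of what a critical-path check might look like; the endpoint names are made up stand-ins for the real recommendation and playback services.

```python
import urllib.request

# Hypothetical health endpoints for the critical services; real ones differ.
CRITICAL_ENDPOINTS = {
    "recommendations": "https://api.example.com/recommendations/health",
    "playback": "https://api.example.com/playback/health",
}

def critical_path_is_up(timeout_s=2):
    """Return True only if every critical service answers its health check."""
    for name, url in CRITICAL_ENDPOINTS.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                if resp.status != 200:
                    return False
        except OSError:
            # Connection errors, timeouts, and HTTP errors all count as down.
            return False
    return True
```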

c. Client Libraries: Of the four components of a microservice, the application (client) layer usually contains common code that each service then adapts to its own usage. That leaves two options: duplicate that code in every service, or pull it into one shared library that everything depends on, which starts to look like a small monolith again. There were many discussions and no ideal answer, but they decided to keep a shared client library with only the minimum code required, so it doesn't get overloaded.

d. Persistence: a simple eventually consistent model. The reason, I suppose, is that the workload is read-heavy: movies, recommendations, and user data can be synced across replicas later. CAP theorem reference 👆

Scale

a. Stateless Service

  • they don't have a DB or cache to store their own data

  • they can access config or metadata

  • they handle short-lived user requests, with no session held on the instance

  • they can go down and be replaced by another instance in no time.

Autoscaling handles the provisioning and replacement of these kinds of services automatically.

With Chaos Monkey, they got comfortable with instances going down; losing a node is not an issue for them.

https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d
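
To make the "losing a node is a non-event" idea concrete, here is a toy, purely simulated sketch of the Chaos Monkey principle: randomly kill an instance and let autoscaling bring up a replacement. Nothing here touches real infrastructure; the fleet and the replacement logic are assumptions for illustration.

```python
import random

# Toy fleet: instance id -> healthy flag. Purely simulated, no real infra.
fleet = {f"i-{n}": True for n in range(5)}

def chaos_kill(fleet):
    """Randomly terminate one healthy instance, Chaos Monkey style."""
    victim = random.choice([i for i, up in fleet.items() if up])
    fleet[victim] = False
    print(f"terminated {victim}")

def autoscale(fleet, desired=5):
    """Launch replacements until the fleet is back at the desired size."""
    alive = sum(fleet.values())
    for n in range(desired - alive):
        new_id = f"i-new-{len(fleet)}"
        fleet[new_id] = True
        print(f"launched {new_id}")

chaos_kill(fleet)
autoscale(fleet)
assert sum(fleet.values()) == 5  # capacity restored; the loss was a non-event
```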

b. Stateful Service

  • they have a database and a cache

  • if a node goes down, its data is gone, so these are critical services.

They once had an outage when a dependency lived in a single zone, US East 1, which took down service for users across North America, including Canada, during the holiday season. They couldn't recover quickly because the cached data took a long time to refill.

They built EVCache, which works on a simple principle: write to every zone for redundancy and read from the local zone for a fast response; if the local copy is unavailable, read from a neighbouring zone.

https://netflixtechblog.com/announcing-evcache-distributed-in-memory-datastore-for-cloud-c26a698c27f7
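
A minimal sketch of that read/write pattern, with an in-memory dict per zone standing in for the real memcached-backed stores; the zone names and client API are made up for illustration.

```python
class ZoneReplicatedCache:
    """Toy version of the EVCache idea: write everywhere, read locally,
    fall back to neighbouring zones on a local miss."""

    def __init__(self, zones, local_zone):
        self.stores = {z: {} for z in zones}   # one store per zone
        self.local_zone = local_zone

    def set(self, key, value):
        # Write to every zone so any zone can serve reads on its own.
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        # Prefer the local zone for the fastest response.
        if key in self.stores[self.local_zone]:
            return self.stores[self.local_zone][key]
        # Fall back to neighbours if the local copy is missing.
        for zone, store in self.stores.items():
            if zone != self.local_zone and key in store:
                return store[key]
        return None


cache = ZoneReplicatedCache(zones=["us-east-1a", "us-east-1b"],
                            local_zone="us-east-1a")
cache.set("user:42:recs", ["Stranger Things", "Dark"])
del cache.stores["us-east-1a"]["user:42:recs"]  # simulate a local-zone miss
print(cache.get("user:42:recs"))                # served from the neighbour zone
```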

c. Hybrid Service

Mix and match of both.

Once, an EVCache layer went down, and the solution from the previous case became the problem here. This led to three enhancements to the system:

  • Workload partitioning: don't serve real-time calls and batch calls from the same cluster, or it will get overloaded.

  • Request-level caching

  • Secure-token fallback: embed enough data inside the token so that when a service dies, the app can keep running; for example, keep recommendation data in the token for a while so the client can still fetch available videos (served by a separate service that is up) instead of repeatedly hitting the dead service.

This strategy has uses beyond Netflix: in a social media app, for example, you could store a user's topics inside the token so a basic feed can still be generated straight from the DB. A minimal sketch of the token-fallback idea follows below.
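
This sketch shows only the fallback flow: the token carries a small snapshot of recommendation IDs, and the client falls back to it when the recommendation service is unavailable. The token format, fields, and service call are assumptions, and a real token would be signed or encrypted rather than a plain dict.

```python
def get_recommendations(user_token, recs_service):
    """Prefer live recommendations; fall back to the snapshot in the token."""
    try:
        return recs_service(user_token["user_id"])   # normal, live path
    except ConnectionError:
        # Service is down: use the stale-but-usable snapshot baked into
        # the token at login time, so the home screen still renders.
        return user_token.get("fallback_rec_ids", [])


# Hypothetical token issued at login; a real token would be signed, not a dict.
token = {"user_id": "u42", "fallback_rec_ids": ["m1", "m7", "m9"]}

def dead_service(user_id):
    raise ConnectionError("recommendation service unavailable")

print(get_recommendations(token, dead_service))  # -> ['m1', 'm7', 'm9']
```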

Variance

a. Operational Drift: Things change over time: response fallbacks, thresholds, best practices, and formats drift across the organization, unintentionally leaving parts of the system out of sync. They automated these kinds of mundane tasks to keep them uniform and built internal tools to monitor them.

For example, you can run a linting tool automatically on every PR, so everyone's code follows best practices without developers having to keep track of them.

b. Polyglot and Containers: Different services will use different languages, and you can't force every team onto one stack just to preserve the norms. Containers were the technical solution, but there are still non-technical things to address, such as explaining why this is the best solution and what its impact is, so people don't clash when someone else needs a change or when teams change hands.

Change

Deployments often cause outages, so they are usually done on weekday mornings, the opposite of Netflix's peak traffic time.

For continuous delivery they built Spinnaker, which lets you tweak deployments: canary analysis checks whether the new code's stats beat the current code's (did it improve latency or any other metric?), and a workflow can be built around your deployment checklist, which varies from org to org. This helps deployments go out without much chaos.

https://netflixtechblog.com/global-continuous-delivery-with-spinnaker-2a6896c23ba7
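
As a rough illustration of the canary idea (not Spinnaker's actual algorithm, which runs automated canary analysis over many metrics), this sketch compares a single latency metric between baseline and canary and only promotes the canary if it isn't meaningfully worse; the 5% threshold is an arbitrary assumption.

```python
from statistics import mean

def promote_canary(baseline_latencies_ms, canary_latencies_ms, tolerance=1.05):
    """Promote only if the canary's mean latency is within `tolerance`
    (here 5%, an arbitrary threshold) of the baseline's."""
    baseline_avg = mean(baseline_latencies_ms)
    canary_avg = mean(canary_latencies_ms)
    return canary_avg <= baseline_avg * tolerance


# Example: latency samples scraped from both clusters during the canary window
baseline = [102, 98, 110, 105, 99]
canary = [101, 97, 108, 103, 100]
print("promote" if promote_canary(baseline, canary) else "roll back")
```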

This is the main talk that I referenced; I encourage you to watch it, because this blog post is just a glimpse of it. You will learn a lot.

I have recently started writing about my system design learnings and will be sharing them along the way.

Let's connect over Twitter, Peerlist, or LinkedIn.