We're Using Kubernetes and Here's Why You Should Too
Introduction
"Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications". In a nutshell, Kubernetes helps to abstract away machine details so a cluster of machines can operate much like a single individual machine. This allows developers to simply focus on their apps. Things like failover, scaling, deployment, DNS, logging, etc are all managed by Kubernetes. Resources are automatically allocated to applications, DNS wires microservices together and external services are load balanced and exposed based on simple YAML or JSON config files.
This article provides some background on why we chose Kubernetes, along with some basic tips and concepts we use to set up and operate our services.
Monolith to microservices
Sajari is a distributed application made up of many features. A year ago it was a single monolithic application, which worked well, but was going to be very difficult to scale out and distribute. Along with the scaling issues came the disadvantage that CPU-heavy operations from one customer could impact the speed for other customers. Things like document conversion and processing were notorious for this, and model processing was intentionally slowed down for the same reason. It wasn't ideal for anyone.
At that point we realised the monolithic beast had to be broken up into microservices. That's an easy conclusion to reach, but delivering on it was a much bigger undertaking. Sajari was tightly coupled for a reason: document addition and queries both flowed through the same machine learning pipelines, and data co-location and shared memory have lots of advantages that are difficult to offset. The structure of a microservices architecture is also quite different in general, and much more so for applications with state.
We looked at various alternatives to Kubernetes and discussed the advantages and disadvantages with a bunch of very smart people, and then a talk at Google basically sealed our choice. Learning about Borg and Omega and Google's 10-year journey orchestrating containers was phenomenally interesting. The roots of Kubernetes come from over a decade of lessons learned building and running apps like Gmail at Google. The depth of knowledge is huge, and it shows.
Getting started with Kubernetes
The first thing was to break the app up into components and then define the rules to govern each of them: latency requirements, CPU and memory needs, whether they hold state, and how they need to scale.
That exercise was interesting: apart from wanting everything low latency, the requirements are really diverse, which in itself validates the separation. Kubernetes takes care of most of these differences, except for state, which we'll detail last.
CPU and memory management
Kubernetes allows you to set limits on resources such as CPU and memory. This is useful to stop processes hogging resources, to kill runaway processes, and to allow efficient scheduling of multiple diverse services (e.g. one uses a lot of memory, another uses a lot of CPU). That kind of efficient scheduling was also one of Borg and Omega's key goals; it's critical at Google scale, where even a few per cent of waste per machine adds up really quickly.
We have a few services where we like to limit CPU, such as document conversion and model crunching, mainly so search latency can remain low. In general we don't set many limits at the moment: in practice we've found limits can kill processes that burst rather than capping them, and these services now tend to run on different machines (node pools) anyway, though that approach may change in the future.
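To make this concrete, here is a minimal sketch of how requests and limits are expressed in a pod spec. The pod name, image and numbers are placeholders (our document conversion service is the kind of candidate we'd apply this to), not values we actually run with:

apiVersion: v1
kind: Pod
metadata:
  name: docconvert
  labels:
    name: docconvert
spec:
  containers:
    - name: docconvert
      image: gcr.io/example-project/docconvert:v1   # placeholder image
      resources:
        requests:            # what the scheduler reserves for this container
          cpu: 500m          # half a core
          memory: 512Mi
        limits:              # hard caps: CPU is throttled, exceeding the memory limit kills the container
          cpu: "1"
          memory: 1Gi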
More info on limiting resources is available in the Kubernetes documentation.
Co-locating containers
Kubernetes has a concept called "pods". A pod is a logical grouping of containers that will be deployed together on the same underlying instance. They are tightly coupled: they can share memory and disk, and communicate via localhost. These containers live and die together; their fate is tied.
One example for us is our search engine pod: we have a configuration controller which sits in its own container alongside search engine containers. It fetches config from a central controller on startup and runs a reconciliation loop to update the engine processes when changes are passed to it.
Pods can be annotated with "labels" - key-value pairs - which can be used to define pod groupings in a cluster. For instance, you might have a set of pods holding different versions of a particular service: 'Pod A: name=api,version=v1' and 'Pod B: name=api,version=v2'. The selector 'name=api' would match both 'Pod A' and 'Pod B', whereas 'name=api,version=v2' would only match 'Pod B'.
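As a minimal sketch (the pod name and image are placeholders), those labels simply sit in the pod's metadata:

apiVersion: v1
kind: Pod
metadata:
  name: api-v2                                  # placeholder pod name
  labels:
    name: api                                   # shared by both versions of the service
    version: v2                                 # distinguishes this pod from the v1 pods
spec:
  containers:
    - name: api
      image: gcr.io/example-project/api:v2      # placeholder image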
Services
Kubernetes manages service discovery and load balancing via "services", which can be limited to internal access (i.e. from within the cluster) or opened up as external-facing endpoints. A service selects its pods using labels, so it's very easy to move pods in and out of operation (e.g. when experimenting with a new version or running profiling/tracing). You don't need to think about any of the internal DNS or routing; Kubernetes takes care of it for you.
From an external perspective, services define how your application can be accessed, but you don't need to worry about the details of how traffic is being delivered to your containers, or even which containers. There are some subtle differences between cloud providers, but a service can be designated as a load balancer and even mapped to external IPs with very simple YAML definitions.
A sample YAML for our autocomplete service is shown below. It is set up as an internal service, mapping port 4321 on the targeted pods (selector 'name=autocomplete') to a cluster IP.
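A minimal sketch of that definition (the metadata name is illustrative; the port and selector are as described above):

apiVersion: v1
kind: Service
metadata:
  name: autocomplete
spec:
  type: ClusterIP             # internal only: reachable from within the cluster
  selector:
    name: autocomplete        # routes to any pod labelled name=autocomplete
  ports:
    - port: 4321              # port exposed on the cluster IP
      targetPort: 4321        # port the autocomplete containers listen on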
To make this an external-facing service, simply change type: ClusterIP to type: LoadBalancer. Depending on your platform, this should spin up an external load balancer (e.g. an ELB on AWS or a network load balancer on Google Cloud) bound to an external IP. You can also specify the IP address using the loadBalancerIP attribute.
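For example, the external-facing variant differs only in the service spec; the name, external port and IP below are placeholders (the IP would normally be a pre-reserved static address):

apiVersion: v1
kind: Service
metadata:
  name: autocomplete-external
spec:
  type: LoadBalancer                 # provisions a cloud load balancer (ELB, GCP network LB, etc.)
  loadBalancerIP: 203.0.113.10       # optional: placeholder for a reserved static IP
  selector:
    name: autocomplete
  ports:
    - port: 80                       # externally exposed port
      targetPort: 4321               # autocomplete container port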
Scaling and replication
Replication controllers were the original way of controlling how containers/pods are respawned and scaled, but they have more recently been superseded by replica sets, which are most commonly created and managed by "deployments". A deployment is defined using another simple YAML file and adds state to the deployment process: deployments can be updated and rolled back, and they retain useful information about the pods they are managing. They're very powerful for such a simple definition.
We use deployments to keep things up and running and to restart containers/pods as necessary. Rolling out new versions is also very straightforward using deployments. Rollouts are staged, so pods running the new version are brought up before the old ones are taken out of service, which makes a failed rollout much less likely to cause issues. Again the concept of "labels" is fantastic.
Below is a sample YAML definition file for our autocomplete deployment.
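This is a minimal sketch of that definition, assuming a current apps/v1 API; the image, replica count and deployment name are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: autocomplete
spec:
  replicas: 3                                           # placeholder replica count
  selector:
    matchLabels:
      name: autocomplete
  template:
    metadata:
      labels:
        name: autocomplete                              # matches the service selector above
    spec:
      containers:
        - name: autocomplete
          image: gcr.io/example-project/autocomplete:v1 # placeholder image
          ports:
            - containerPort: 4321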
Managing state
Our biggest challenge is handling state recovery (without data loss!) when pods are restarted or moved. Sajari is a realtime search engine so we have a large stream of updates flowing into the index at any given time. Although Kubernetes can recover pods and re-mount any associated disks, our application code is 100% responsible for avoiding dirty state and defining fallbacks and recovery strategies to avoid data loss.
However, if you set aside the more difficult search index components, the rest of our stateful microservices are much simpler to manage. For many of these the data changes less frequently, and state can be consolidated, kept somewhere more robust and fetched as needed. This concept has served us well, and in fact we use a common "storage" layer that handles fetching and caching files, hiding the underlying complexity from our application code. Information can be retrieved, saved to local disk and loaded into memory as needed by chaining several storage implementations. If an application requests a file that hasn't been used before, the request automatically bubbles its way up the chain and the data propagates back down. We have implementations for Google Cloud Storage and Amazon S3, as well as local and in-memory filesystems, and some more specialised extras like a content-addressable layer for certain use cases. This type of setup is used for a bunch of our microservices, e.g. synonyms, machine learning models, autocompletion, spell checking and more.
For the search index itself, we use persistent disks as the backing storage, with further redundancy on Google Cloud Storage or S3. If you were operating a database with only a few nodes, you could create the disks manually and refer directly to their IDs in the pod YAML file. This has the advantage that the persistent disks will be re-bound during a recovery or update, so the data is preserved. The downsides to this approach are that a) it doesn't scale and b) disks (understandably) have one-to-one relationships with pods, so during an update the disk must be unbound and re-bound, which can take a while (on GKE we've seen this take around a minute). An alternative to identifying disks by ID is to use "persistent volume claims", which provide a storage abstraction layer on top of whatever underlying storage you choose to use. This means a pod can simply ask for storage and, assuming there is availability in the cluster, it will be allocated and bound. You can also control what happens to the underlying storage once a claim is released (e.g. retain, recycle or delete).
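As a rough sketch of the claim-based approach (the claim name and size are illustrative), a pod asks for storage via a claim like this and then mounts it by claim name rather than by disk ID:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: index-data                 # placeholder claim name
spec:
  accessModes:
    - ReadWriteOnce                # a persistent disk can only be mounted read-write on one node
  resources:
    requests:
      storage: 100Gi               # placeholder size
# In the pod or deployment spec, the volume then refers to the claim rather than a disk ID:
#   volumes:
#     - name: index
#       persistentVolumeClaim:
#         claimName: index-data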
We utilise the extra redundant layer of Google Cloud Storage / S3 specifically so we can allow disks to disappear. Currently we're still referencing disks by ID, but we will soon switch to claims. In the rare event of multiple nodes (primary and replicas) failing simultaneously such that data is lost, the redundant storage can be used to rebuild the information. We already operate with this extra redundant layer for recovery, and moving to persistent volume claims will also give us much greater scalability.
Managing Kubernetes
The main takeaway here is that everything is just containers, organised and connected using simple YAML config files. We're used to working with concurrent languages, but for people using languages without the same concurrency story (Python, PHP, Ruby, etc.) this is gold dust: you might have 50+ containers on a single underlying machine, and that doesn't really matter if it suits your application. One-to-one machine-container relationships are wasteful, and more than that, it's a waste of time if you're even thinking at that level. It doesn't scale, and Kubernetes is designed specifically to abstract it away.
The command-line tool kubectl connects to Kubernetes clusters and can modify and query resources, pull logs, set up port forwarding, and much more. Once you've created a cluster you can do almost everything with a few command-line operations.
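A typical day-to-day session looks something like the sketch below; the file, label and pod names are placeholders:

kubectl apply -f autocomplete-deployment.yaml              # create or update resources from a YAML file
kubectl get pods -l name=autocomplete                      # query pods by label
kubectl logs -l name=autocomplete                          # pull logs from every pod matching a label
kubectl port-forward autocomplete-6f7d9c-x2x4k 4321:4321   # forward a local port to a specific pod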
Managing keys and secrets
Kubernetes ships with "secrets", which allow passwords, tokens and keys to be stored in the cluster and accessed from containers as necessary - kept out of container images and definition files and managed in a much more secure way. Super useful.
We create secrets from the command line, either from files or from literal strings. The secrets then become part of the secret store in the cluster, which can be referenced in YAML definitions so containers can access specific information without it being disclosed anywhere in those files. When creating a cluster we run a series of scripts to add SSL certs, secrets and tokens to the secret store.
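As a sketch, a secret created from the command line (e.g. kubectl create secret generic api-credentials --from-literal=token=...) can then be exposed to a container as an environment variable. The pod, secret and variable names here are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: gcr.io/example-project/api:v1       # placeholder image
      env:
        - name: API_TOKEN                        # placeholder environment variable
          valueFrom:
            secretKeyRef:
              name: api-credentials              # secret previously added to the cluster's secret store
              key: token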
Logging
Logging in Kubernetes is also unified, so it's possible to stream logs from specific deployments, pods, services - anything with a label - which is amazingly useful. A Fluentd agent runs on each node and collects the container logs. If you're using Google Cloud, logs are also automatically pushed into Google Cloud Logging, which can be used to set up alerts, even within an iOS app. Our developers get alerted on their iPhones when things go wrong.
Conclusion
Kubernetes is extremely well thought out. There are some deficiencies, but the project is moving very quickly, and most of what bothers us now is already on the roadmap to be fixed. Google has made a big bet releasing this technology as open source, and we think it will pay huge dividends. When we compared it to other projects, we kept concluding that this is a third-generation system (with 10+ years of hard lessons behind it) versus first-generation products trying to solve the same problem.
Give it a try and let us know how you go!