Load balancing gRPC in K8s without service mesh

Loadbalancing communication in gRPC is tricky! This is due to the fact that the selling point of gRPC is it’s ability to create sticky connections. In short, gRPC uses a single TCP connection and multiplexes requests on top of that connection.

This means that the layer 4 load balancer provided by K8s doesn’t work well for gRPC. Usually this problem is solved by using a service mesh, which will do the load balancing on layer 7 (see Linkerd, Istio). This is the ideal solution to the problem of load balancing gRPC connections in k8s.

But what happens if you don’t have the option of using a service mesh? Link to heading

At my current workplace, I was stuck in a scenario wherein we had control over both the client and server code, but infrastructure was managed by a dedicated team. The infra team was still beta-testing service mesh on k8s, but only on non-production environment’s.

Most teams (us included), continued to use gRPC in new services because of multiple reasons (future readiness, protobuf enforcement…so on), but there was one big problem. How do you circumvent the load balancing issue?

So the solution we were following was simple, use a new gRPC connection per request. Yes, it does kinda render the main feature of gRPC (sticky connections useless). But who doesn’t like creating tech debt?

Context Deadlines Link to heading

So with this solution in mind, we created a new service and used gRPC for communication. Slowly we started noticing a lot of rpc error: code = DeadlineExceeded desc = context deadline exceeded errors between our services. The problem spiked a lot of interest, There was no-clear sign of why this was happening.

This got me interested about the effectiveness of our solution. That’s when I decided to experiment and collect data:

Experiment Link to heading

For this experiment we’ll create a simple server and client which communicate in gRPC.

The server will simply echo the message sent by the client
The server will sleep for 100ms to emulate work being done for each request
Clients have a timeout of 300ms set
We’ll run the experiment 5 times over multiple connection sizes to average out the values
The entire code is in Go and we’ll use protoc to generate the go code from the protobuf
We’ll test this with 1 client and 1 server initially

Single connection with requests multiplexed Link to heading

Here we create a single connection and requests are multiplexed over it. This is the way gRPC was meant be used. This will act as our base case for comparison.

Notice that the error rate is 0 upto 10000 connections and slowly starts to peak soon after. All errors here are context deadlines due to lots of requests being made to the server and the server not being able to respond within the set timeout (300ms).

Code: https://github.com/KarthikNayak/grpc_test/tree/single_connection

New connection for each request Link to heading

This is the current solution we’re using, here for each request we create a new gRPC connection. The performance can be expected to be bad here since we’re recreating connections. In fact, is might be worse than HTTP/1.1 which uses persistent connections (remember gRPC uses HTTP/2).

From this we can see that using a new connection per request is much worse, the error rate spikes up much earlier (at around 1000 requests). Even the latency is much higher, the lower latency for higher number of connections is mostly due to the increase in error rate.

Code: https://github.com/KarthikNayak/grpc_test/tree/connection_per_request

Using Keepalive on the server side Link to heading

While searching for a solution which might be closer in performance to using a single connection, while also providing load-balancing, I stumbled upon the keepalive parameters for gRPC.

If you look at ServerParameters.MaxConnectionAge, it says that the server can end connections as per the set duration. So the question is, if the server sends a GOAWAY to the client, does the client reconnect to the same server, or does it try to resolve to a new server IP.

To test this, I extended our code to be run on 3 servers on minikube. What we can see is that the client does DNS resolution each time. This is amazing, since this means, we get load balancing as a side effect over time. To summarize, the client creates a sticky connection with one server, after the configured time, the server sends a GOAWAY signal to the client. The client then re-establishes the connection either with the same or a new server based on the DNS resolution. Over time the sticky connections should be balanced over all the servers available.

We can see this in the gRPC logs:

Also checking the logs of the three servers, we can see that the load is balanced over the lifespan:

Code: https://github.com/KarthikNayak/grpc_test/tree/server_timeout

The best part is that the error rate and latency is almost comparable to the ideal scenario (Note: we used 1s max timeout).

Summary Link to heading

When you want to loadbalance gRPC connections, use a service mesh. But if that’s not something you have setup, using max timeout settings on the server side is a really good alternative!