Istio - Breaking The Trace

In my previous blog post about Istio, I briefly described what the Istio service mesh is and which problems it solves. In this one, I’ll take a deeper look under the hood and identify some of its limitations.

In the olden days, application tracing and telemetry required instrumentation inside the app code. This was a tiresome and inefficient approach. The instrumentation libraries had to be ported to every new language or framework, and adding them to legacy code was almost impossible. Istio solved this problem by using Envoy proxies. The sidecar pattern lets Istio collect and process all the tracing and metrics data outside the application process, which decouples the application from the tracing tools. Istio does not require any instrumentation inside the application. Well, almost none. This is one of the shortcomings of Istio. To understand this limitation, we first have to understand how Istio works.

Istio divides the service mesh into two parts: the data plane and the control plane. The data plane consists of all the application containers and their Envoy sidecars. The control plane, on the other hand, has three main components:

Mixer, which is responsible for tracing, telemetry, etc.
Pilot, which is responsible for configuring the sidecars.
Citadel, which is responsible for pushing security certificates to the sidecars for mTLS.

The component of interest for us is Mixer. Mixer collects telemetry from every sidecar in the data plane, so it can keep track of all the requests to the application and its microservices. To keep tabs on all of this traffic and see which services talk to each other when the application receives a request, Istio assigns a unique ID to each incoming request. Just as each request trace has its own ID, each span within it is also identified by a unique ID. The complete list of these IDs, along with a few other flags that Istio keeps, is:

x-request-id
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
x-ot-span-context
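To make the relationship between the trace and span IDs a bit more concrete, here is a rough Python sketch of how they line up from one hop to the next. It follows the B3 convention as I understand it: the trace ID stays the same for the whole request, every hop gets a fresh span ID, and the parent span ID points back at the caller. The function names new_trace_ids and child_ids are made up for this illustration, and this is not how Envoy actually implements it.

```python
import secrets


def new_trace_ids():
    """IDs for a brand-new request entering the mesh (no parent span yet)."""
    return {
        "x-b3-traceid": secrets.token_hex(8),  # 64-bit ID as 16 hex characters
        "x-b3-spanid": secrets.token_hex(8),
        "x-b3-sampled": "1",                   # "1" means: record this trace
    }


def child_ids(parent):
    """IDs for a call made while handling `parent`: same trace, new span."""
    return {
        "x-b3-traceid": parent["x-b3-traceid"],      # shared by every span
        "x-b3-spanid": secrets.token_hex(8),         # unique to this hop
        "x-b3-parentspanid": parent["x-b3-spanid"],  # link back to the caller
        "x-b3-sampled": parent["x-b3-sampled"],      # sampling decision is inherited
    }


root = new_trace_ids()
child = child_ids(root)
print(child["x-b3-traceid"] == root["x-b3-traceid"])      # True
print(child["x-b3-parentspanid"] == root["x-b3-spanid"])  # True
```

The parent span ID is what lets a tracing backend stitch individual spans into a tree. If a hop loses these values, the next span has no trace to attach to.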

You must be wondering by now where Istio stores all of these IDs and how it associates them with each individual request. The naming of these flags is a big hint: all of these IDs and flags are stored in the request headers. Istio injects them into the headers of each new incoming request. These headers are then propagated through the service mesh for the lifetime of the request. Intercepting the IDs at each sidecar allows Istio to work out all the tracing and timing information. This is one (fairly) big limitation. Remember the bit at the start of this blog post about instrumentation? A sidecar cannot tell on its own which outbound call belongs to which inbound request, so Istio requires your application to copy these headers from each incoming request onto the outgoing requests it makes. If these headers are lost along the way, the trace for that request falls apart. So, it’s not as bad as implementing your own telemetry tooling and adding it to your application, but it’s not ‘no instrumentation required’ either. If there is a rogue third-party microservice, or even if some developer forgets to propagate these headers, it breaks the trace.
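In practice, the propagation your application has to do can be quite small. Here is a minimal Python sketch, assuming your framework hands you the incoming headers as a plain dictionary. The helper names extract_trace_headers and outgoing_headers are hypothetical, so adapt them to whatever HTTP client and framework you actually use.

```python
# The tracing headers listed earlier in this post.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def extract_trace_headers(incoming):
    """Return only the tracing headers found on the incoming request."""
    # HTTP header names are case-insensitive, so normalize before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


def outgoing_headers(incoming, extra=None):
    """Build headers for a downstream call, carrying the trace context along."""
    headers = dict(extra or {})
    headers.update(extract_trace_headers(incoming))
    return headers


# Example: headers as they might arrive at a service (values are made up).
incoming = {
    "Host": "frontend",
    "X-Request-Id": "abc-123",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-Sampled": "1",
}
print(outgoing_headers(incoming, {"Accept": "application/json"}))
```

Every outbound call made while handling a request would need to go through something like outgoing_headers. Skip it in a single code path and the trace breaks at exactly that hop.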

OpenTracing is an open-source project that aims to solve this problem by standardizing the headers and their propagation for tracing tools like Istio. You can learn more about it on opentracing.io.