Welcome

Welcome to our Rust telemetry course!
You'll learn how to build observable applications in Rust.

Observability is about being able to ask arbitrary questions about your environment without—and this is the key part—having to know ahead of time what you want to ask.

— Honeycomb

I'll take you on a journey through the Rust ecosystem, exploring the available telemetry solutions, learning how to combine them together to build a coherent and comprehensive toolkit.

Methodology

This course is based on the "learn by doing" principle.
It has been designed to be interactive and hands-on.

Mainmatter developed this course to be delivered in a classroom setting, in a single day: each attendee advances through the lessons at their own pace, with an experienced instructor providing guidance, answering questions and diving deeper into the topics as needed.
If you're interested in attending one of our training sessions, or if you'd like to bring this course to your company, please get in touch.

You can also follow the course on your own, but we recommend you find a friend or a mentor to help you along the way should you get stuck. Solutions to all exercises are available in the solutions branch of the GitHub repository.

Structure

On the left side of the screen, you can see that the course is divided into sections. Each section introduces a new concept or feature.
To verify your understanding, each section is paired with an exercise that you need to solve.

You can find the exercises in the companion GitHub repository.
Before starting the course, make sure to clone the repository to your local machine:

# If you have an SSH key set up with GitHub
git clone git@github.com:mainmatter/rust-telemetry-workshop.git
# Otherwise, use the HTTPS URL:
#
#   git clone https://github.com/mainmatter/rust-telemetry-workshop.git

We also recommend you work on a branch, so you can easily track your progress and pull in updates from the main repository, if needed:

cd rust-telemetry-workshop 
git checkout -b my-solutions

All exercises are located in the exercises folder. Each exercise is structured as a Rust package. The package contains the exercise itself, instructions on what to do (in src/lib.rs), and a test suite to automatically verify your solution.

wr, the workshop runner

To verify your solutions, we've provided a tool that will guide you through the course. It is the wr CLI (short for "workshop runner"). Install it with:

cargo install --locked workshop-runner

In a new terminal, navigate back to the top-level folder of the repository. Run the wr command to start the course:

wr

wr will verify the solution to the current exercise.
Don't move on to the next section until you've solved the exercise for the current one.

We recommend committing your solutions to Git as you progress through the course, so you can easily track your progress and "restart" from a known point if needed.

Enjoy the course!

Author

This course was written by Luca Palmieri, Principal Engineering Consultant at Mainmatter.
Luca has been working with Rust since 2018, initially at TrueLayer and then at AWS.
Luca is the author of "Zero to Production in Rust", the go-to resource for learning how to build backend applications in Rust.
He is also the author and maintainer of a variety of open-source Rust projects, including cargo-chef, Pavex and wiremock.

Exercise

The exercise for this section is located in 00_intro/00_welcome

Structured logging

We will start our observability journey with logs.

Logs are the most common kind of telemetry data.
Even developers who have never heard of observability have an intuitive understanding of the usefulness of logs: logs are what you look at when stuff goes south to understand what is happening, crossing your fingers extra hard hoping you captured enough information to troubleshoot effectively.

What are logs though? The format varies, depending on the epoch, the platform and the technologies you are using. The simplest starting point: a bunch of text pushed to stdout, with a line break to separate the current record from the next one. For example:

The application is starting on port 8080
Handling a request to /index
Returned a 200 OK

Three perfectly-valid log records for a web server. Enough to understand, from the outside, what is happening.

println!

The standard library contains the most basic logging framework you can think of: println!. println! is a macro that writes a formatted string to the standard output.

let name = "Alice";
// Prints: Hello, Alice!
println!("Hello, {name}!");

Exercise

The exercise for this section is located in 01_structured_logging/00_intro

The log crate

Coupling

println! and its sibling eprintln! can get you started, but they won't get you very far.

Our ultimate goal is to monitor our applications in production environments. Each production environment is its own little world, with its own constraints and requirements. In particular, expectations around log data vary wildly: some environments expect logs to be written to stdout, others to stderr, others to a file, others to a socket, and so on.

println! breaks down quickly in the face of such requirements: it couples together the information that we want to log (i.e. the message) with the way we want to log it (i.e. the destination, stdout).
To change the destination, we would have to change the code that produces the log message. That's a problem: that log message might be coming from a third-party library, or from a part of the codebase that we don't own.
Having to fork those modules to accommodate the logging requirements of the final application is a non-starter.

Facade

We need to decouple.
On one side, we will have an instrumentation API, used to emit log messages from the application and its dependencies.
On the other side, we will have the processing code, the logic in charge of deciding what to do with the log messages that have been produced.

If you are familiar with the Gang of Four's Design Patterns, you might have recognized this as an instance of the facade pattern.

The log crate

The facade pattern only works if the ecosystem, as a whole, standardizes around the same facade. You can then use a single pipeline to collect and process instrumentation data coming from both first party code and third party dependencies.

In the early days of the Rust ecosystem, the log crate arose as the de-facto standard logging facade for Rust applications.
Let's have a look at it!
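To give you a feel for both sides of the facade before you dive in, here is a minimal sketch (the logger below is illustrative, not part of the exercise): any piece of code, first or third party, emits records through the log macros, while the application decides how they are processed by installing a logger.

use log::{Level, LevelFilter, Log, Metadata, Record};

// The processing side: a logger that writes every record to stdout.
struct StdoutLogger;

static LOGGER: StdoutLogger = StdoutLogger;

impl Log for StdoutLogger {
    fn enabled(&self, metadata: &Metadata) -> bool {
        metadata.level() <= Level::Info
    }

    fn log(&self, record: &Record) {
        if self.enabled(record.metadata()) {
            println!("[{}] {}", record.level(), record.args());
        }
    }

    fn flush(&self) {}
}

fn main() {
    // The application wires the two sides together, exactly once.
    log::set_logger(&LOGGER).expect("logger already set");
    log::set_max_level(LevelFilter::Info);

    // The instrumentation side: this line could just as well live in a dependency.
    log::info!("The application is starting on port {}", 8080);
}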

Exercise

The exercise for this section is located in 01_structured_logging/01_log_crate

Log levels

The cost of logging

Logs are useful, but they come at a cost.
Formatting, serializing, and writing log messages to a file or a socket takes time. It is not uncommon to hear about applications that spend more time writing logs than doing actual work.

As a rule of thumb, we want each log record emitted by our application to be useful.

How do you define "useful"?

Easier said than done.

The usefulness of a log record can be situational: redundant under normal circumstances but vital when trying to troubleshoot a particularly vicious bug.

Usefulness is also subjective: depending on your intent, you might or might not care about a particular log record. This is a challenge for library authors: their software will be used in a variety of contexts, and it's hard to predict what information will be useful to the final application.

How do we reconcile these needs?

Log filtering

The common solution is to use log filters.
A log filter is a predicate that determines whether a log record should be emitted or not.

The filters are defined by the final binary, the application, therefore they are tuned based on the application's needs.
Library authors can instrument their code ~freely, knowing that the application owner will be in charge of deciding what they want or don't want to see.

Filters rely primarily on two pieces of information:

  • the log level of the record, a measure of its importance
  • the module that emitted the record, also known as the target.

Filters are evaluated at runtime for each log record, so they aren't entirely zero-cost, but they let us skip the most expensive parts of the logging process (formatting, serializing, emitting), amortizing the cost of "unnecessary" log records.
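As a hedged sketch, using env_logger as the processing side (the crate and the targets are illustrative choices, not a requirement of this course): the filter keeps WARN and above globally, but lets DEBUG records through for one specific target.

use env_logger::Builder;

// Normally this filter string would come from the RUST_LOG environment variable.
Builder::new()
    .parse_filters("warn,my_app::db=debug")
    .init();

// Emitted: its target matches the more permissive `my_app::db` directive.
log::debug!(target: "my_app::db", "opening a new connection");
// Filtered out: the global directive only allows WARN and above.
log::debug!("background tick");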

Compile-time filtering

There is an exception though: you can filter out log records based on their level at compile time, using some of the cargo features exposed by the log crate.
The filtered log statements disappear entirely: they don't exist in the compiled binary, therefore they don't incur any runtime cost.
It is a great option to remove the noisiest log statements (e.g. TRACE-level), although it's not a silver bullet: you can't use it to filter out log records based on their source.

Exercise

The exercise for this section is located in 01_structured_logging/02_log_levels

What to log

We have talked extensively about the mechanics of logging (how to log, how to implement a logger, how to filter), but we haven't really talked about what to log.

Why do we log?

To determine what to log, we need to stop for a second and think about the purpose of logging. We use logging as a way to determine, from the outside, what is going on inside our applications. In particular, as a way to diagnose if our applications are not behaving as expected.

The "easy" solution would be to log everything, a full-fidelity representation of the internal state of our applications.\ Unfortunately, that's usually not feasible: the cost of producing and storing that information would be prohibitive.

Logs, therefore, are necessarily a lossy representation. We need to choose carefully what to log in order to maximize our chances of spotting problems and being able to troubleshoot them.

Unit of work

A good rule of thumb is to view your application logic as a series of units of work.
Each unit of work has a start and an end, and may in turn contain other sub-units of work.

For each of those units of work, we'll surely want to know:

  • how long it took to complete, i.e. its duration
  • whether it completed successfully or not, i.e. its outcome

We'll get both of these pieces of information if we emit two log records for each unit of work: one at the start and one at the end.
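As a rough sketch of that pattern (process_payment and its argument are hypothetical stand-ins for your own logic):

use std::time::Instant;

log::info!("Payment processing started");
let started_at = Instant::now();

let outcome = process_payment(payment_id);

log::info!(
    "Payment processing finished in {}ms (success: {})",
    started_at.elapsed().as_millis(),
    outcome.is_ok()
);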

How fine-grained should the units of work be?

If you take this approach to the extreme, you could model each function call as a unit of work. While that may be useful in some scenarios, it's not a good default.

You should consider a unit of work as a meaningful piece of work, one that may occupy a significant amount of time with respect to the duration of the over-arching operation (e.g. processing an incoming request).

Let's make an example in the context of a web server.
Parsing a header value is probably not a good candidate for a unit of work—it's very fast and it's unlikely to vary much in duration. On the contrary, parsing the body of a request is a good candidate: it's likely to take a significant amount of time, and it's likely to vary in duration depending on the size of the body itself and the way it is being sent to the server.

Always take the over-arching context into account when determining what should or should not be treated as a unit of work.

Exercise

The exercise for this section is located in 01_structured_logging/03_what_to_log

The tracing crate

log is not the right abstraction

In the previous exercise, we laid out the key abstraction we need to model our applications: units of work.
We've also tried to leverage the log crate to properly track them, but you may have noticed that it's not a perfect fit.

The log crate is structured around the concept of log events.
An event has no duration. There is also no relationship between one event and another, while we have seen how our units of work are naturally organised in a hierarchical fashion: a unit of work may in turn contain multiple sub-units of work, and so on.

We need to change our instrumentation library to one that is more suited to our needs: the tracing crate.

tracing::Span

The tracing crate models units of work as spans.
A span is a unit of work that has a start and an end. A span can also have a parent, which naturally introduces that hierarchical relationship we were looking for.

The richer data model translates into a more complex interface, both on the instrumentation side (i.e. the code that emits spans) and on the consumer side (i.e. the code that processes spans).
This is one of the reasons I chose to talk about log first: it's a gentler introduction to the overall facade pattern and by probing its limitations you (hopefully) have a better understanding of the rationale behind tracing's additional complexity.
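Here is a minimal sketch of the instrumentation side (the span names are illustrative); the consumer side is covered in a later section.

use tracing::info_span;

// Creating a span does not, by itself, mark the unit of work as started...
let span = info_span!("handle_request");
// ...entering it does. The returned guard exits the span when it is dropped.
let _guard = span.enter();

// Any span created while the guard is alive becomes a child of `handle_request`.
let child_span = info_span!("parse_body");
let _child_guard = child_span.enter();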

Exercise

The exercise for this section is located in 01_structured_logging/04_tracing

Enriching spans with structured fields

The test assertions in the previous exercise have highlighted some of the tracing advantages we talked about—e.g. the hierarchical relationship between spans which our subscriber translated into a prefix-based structure for the log output.
We haven't fully replicated our log version though:

  1. we're not capturing the outcome of each unit of work
  2. we're not capturing the duration of each unit of work

2 is easily achieved by tweaking the configuration for our subscriber (.with_timer(Uptime)), so let's focus on 1., the outcome.
To pull it off in an idiomatic way, we need to learn about span fields.

Span fields

tracing allows us to attach key-value pairs to spans, which we refer to as span fields.
The syntax looks like this:

tracing::info_span!("my special task", foo = 42, bar = "hello");

You can also change the value of a field after the span has been created using .record:

let span = tracing::info_span!("my special task", foo = 42);
span.record("foo", 43);

The way fields are handled depends on the subscriber we're using, just like the span data itself.

Fields must be defined upfront

It's important to point out one key limitation of span fields: they must be known when the span is created. In other words:

// Don't do this in prod.
let span = tracing::info_span!("my special task");
span.record("foo", 43);

won't work because the field foo is not defined when the span is created. No error will be raised, but the field will be ignored at runtime—that record call will have no effect whatsoever.

You may be wondering: what if I don't know the value of a field upfront?
Good question! You can use tracing::field::Empty as its value when defining the span:

// This works!
let span = tracing::info_span!("my special task", foo = tracing::field::Empty);
span.record("foo", 43);

This limitation stems from tracing's commitment to low overhead: knowing the fields upfront allows tracing to avoid a few heap allocations in the hot path of your application.

Exercise

The exercise for this section is located in 01_structured_logging/05_fields

Ergonomics

Earlier, I stated that we shouldn't treat each function as its own unit of work: it's way too verbose and it leads to a lot of noise in the logs.
Nonetheless, it is often the case that a unit of work maps to a certain function. The work starts when that function is invoked and ends when it returns. That was indeed the case in the previous exercise.

When it happens, we can use the instrument macro to avoid having to manually create a span inside the function body.
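A minimal sketch of what that looks like (the function and its field are made up for illustration):

use tracing::instrument;

// The macro wraps the function body in a span named after the function,
// capturing the arguments as fields (unless you `skip` them).
#[instrument]
fn confirm_order(order_id: u64) {
    tracing::info!("order confirmed");
}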

Exercise

The exercise for this section is located in 01_structured_logging/06_instrument_macro

Structured output

In all our exercises (and tests), we've been looking at the emitted diagnostics for a single invocation of an instrumented function. In the real world, however, we can expect to have hundreds if not thousands of invocations per second, each of which will emit its own diagnostics and might result in a different outcome.

In order to make sense of all this data, we need to be able to analyze it.

We might want to know, for example, how many failures we have had in the last hour. Queries can also get more complex: how many failures have we had in the last hour, broken down by error type and by user?

So far we've always been looking at a single line of text which interpolates the target of the span, the values of our fields and other metadata (e.g. timing) into a human-readable string. That kind of representation is not very amenable to machine processing.

In order to answer those questions, we need to be able to structure our output.

The simplest data format we can use is JSON.
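With tracing_subscriber, switching to JSON output can be as simple as the sketch below (it assumes the json feature of tracing-subscriber is enabled):

// Emit one JSON object per line instead of an interpolated human-readable string.
tracing_subscriber::fmt()
    .json()
    .init();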

Exercise

The exercise for this section is located in 01_structured_logging/07_structured_output

OpenTelemetry

While looking at JSON data in a terminal might be a good start, it's not very practical.
Analyzing telemetry data using jq or scrolling through a terminal window is fairly tedious and error-prone.
In most real-world scenarios, we'll want to export our telemetry data to a centralized system that provides advanced analysis capabilities.

In the old dark days, you would have had to pick a vendor and use their SDK to export your data to their systems in whatever bespoke format they devised.
Thankfully, the industry has moved on and we now have a vendor-neutral format for exporting telemetry data: OpenTelemetry. It has gained a significant foothold and it's now supported by most observability vendors as a valid (if not preferred!) format for data ingestion.

Exercise

The exercise for this section is located in 01_structured_logging/08_opentelemetry

tracing subscribers

So far we've focused on the instrumentation side of tracing—spans, fields, etc. But we haven't talked about how to consume the data we're producing.

Subscriber

To see tracing in action we've already had to play with different processors: interpolated text to the console, JSON, and OpenTelemetry.

They are all examples of tracing subscribers—i.e. implementors of the Subscriber trait.

A subscriber is much more complex than what we have previously seen in this workshop, the Log trait from the log crate.
tracing offers a richer data model, therefore pushing more responsibilities to the subscriber. For example: each span has to be assigned a unique ID, which is used to link spans together in a hierarchical structure.
Complexity is further compounded by the fact that tracing is designed to be extremely low-overhead, to maximise its applicability in production environments.

Registry

It's rare to find yourself implementing the Subscriber trait from scratch.
More often than not, you'll rely on Registry as your foundation.
It takes care of all the complicated bits (e.g. span ID generation and management) and exposes a simple(r) interface for you to implement: the Layer trait.

Layer

Even better: you don't even have to implement Layer yourself (unless you want to).
You can combine multiple layers together using the with method exposed by the SubscriberExt trait.
You just stack multiple layers from the ecosystem on top of each other, downstream of a Registry, and you're good to go.
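For example, a subscriber combining an environment-based filter with a JSON formatting layer might be assembled like this (a sketch, assuming the env-filter and json features of tracing-subscriber are enabled):

use tracing_subscriber::{layer::SubscriberExt, EnvFilter, Registry};

let subscriber = Registry::default()
    // Decide which spans/events get through, e.g. based on RUST_LOG.
    .with(EnvFilter::from_default_env())
    // Format whatever survives the filter as JSON.
    .with(tracing_subscriber::fmt::layer().json());

tracing::subscriber::set_global_default(subscriber)
    .expect("failed to install the subscriber");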

Exercise

The exercise for this section is located in 01_structured_logging/09_subscriber

The lifecycle of a span

All the instrumentation work we have done so far has happened in a synchronous context.
We have only used async Rust on the processing side, to use non-blocking IO to ship telemetry data to a sink over the network (i.e. Honeycomb's API).

But what if we wanted to instrument asynchronous code?

It's trickier!
There are some subtleties to be aware of, at the intersection between the lifecycle of a tracing span, work-stealing async runtimes and tracing's internal state management.

It's a lot to cover in a single exercise, so we'll break it down into smaller chunks.
Let's start with the lifecycle of a span.

A span was born—what then?

You've seen in the first tracing exercises that there are two distinct phases in the lifecycle of a span:

  • creation, i.e. what happens when tracing::<level>_span! is called;
  • entering, i.e. what happens when you call .enter() on a span, getting back a RAII guard.

A span can only be created once, but it can be entered multiple times.

use tracing::info_span;

let span = info_span!("my span");
{
    let _guard = span.enter();
    // Do something
}
{
    // We re-enter the same span!
    let _guard2 = span.enter();
    // Do something else
}

Why would you do that, you may be wondering?
Well, your span may be tracking a piece of work that can be paused (async, remember?).

Creation and entering have two matching phases: closing and exiting.

A span is exited when the guard object returned by .enter() is dropped.
A span is closed, instead, when there are no more handles to it. It's the subscriber's responsibility to do this bookkeeping.

When a span is entered a single time, the closing and exiting phases usually happen one after the other. This is what has happened so far in our exercises.

Exercise

The exercise for this section is located in 01_structured_logging/10_span_lifecycle

Thread local state

How does tracing know which span is currently active?

That's necessary to implement a few of the features we've seen so far:

  • Span::current(), which returns a handle to the span that is currently active
  • Attaching a new span as a child of the currently active span

The answer is thread local state.

thread_local!

Rust's standard library exposes a thread_local! macro that allows you to define a variable that is local to the current thread.
It's a bit like static, but it's not shared across threads.

It comes with a caveat: if you move to a different thread, you won't be able to access the value you set on the previous thread.
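Here is a small sketch of the mechanism itself, unrelated to tracing (the variable name is illustrative):

use std::cell::Cell;

thread_local! {
    // Each thread gets its own, independent copy of this counter.
    static HANDLED_REQUESTS: Cell<u32> = Cell::new(0);
}

HANDLED_REQUESTS.with(|counter| counter.set(counter.get() + 1));

std::thread::spawn(|| {
    // On the new thread the counter starts from its initial value again:
    // this prints 0, regardless of what the spawning thread did.
    HANDLED_REQUESTS.with(|counter| println!("{}", counter.get()));
});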

Spawning threads breaks the hierarchy

tracing uses thread local state to keep track of the currently active span.
This has an interesting implication: if you spawn a thread to do some work, the spans created in that thread will not be linked to the spans created in the parent thread.

use tracing::{info_span, info};

let spawner_span = info_span!("spawner");
let _guard = spawner_span.enter();

std::thread::spawn(|| {
    // This is NOT a child of `spawner_span`!
    let spawned_span = info_span!("spawned");
    // [...]
});

This is something to be aware of when you're using tracing in a multithreaded environment. You have three options (the last two are sketched in code after the list):

  • Leave the spans unlinked. This is OK if the two units of work are actually unrelated.
  • Explicitly mark the spawned span as a child of the spawner span. This is desirable if the spawner span won't be closed until the spawned span is closed (e.g. if you are waiting for the new thread to finish).
  • Explicitly mark the spawned span as a "follower" of the spawner span. This maintains a connection between the two spans, but it doesn't require the spawner span to be kept open. This works pretty well when spawning background tasks that might complete after the spawner unit of work is done.

Exercise

The exercise for this section is located in 01_structured_logging/11_thread_local_state

Instrumenting asynchronous code

We're finally ready to tackle the big boss: instrumenting asynchronous code.

Let's start by brushing up on Rust's asynchronous programming model. If you're familiar with the details, feel free to skim through!

Futures

The cornerstone of Rust's asynchronous programming model is the Future trait.

pub trait Future {
    type Output;

    // Required method
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

We refer to an implementor of the Future trait as a future. A future represents a computation that will eventually produce a value (of type Output). Futures are designed to be lazy—if you want them to make progress on whatever job they're doing, you need to repeatedly call their poll method until they're done, i.e. Poll::Ready is returned.

Async runtimes

The component in charge of polling a future is known as an asynchronous runtime. Rust's standard library does not include a "default" asynchronous runtime. You need to bring your own, usually as a third-party dependency.

When it comes to tracing, there are two key things to keep in mind about asynchronous runtimes:

  • When a certain future can't make any more progress (i.e. it returns Poll::Pending) the runtime will usually try to run (on that very thread!) another future that has been queued up. This is what allows concurrency to be highly efficient!
  • Many popular runtimes (e.g. tokio or async-std) are built around the concept of multi-threaded work-stealing. The next time a future is polled, it may be on a different thread. This model comes with some advantages: you don't need to worry too much about balancing your work across threads, the runtime will transparently take care of it for you.
    It also introduces some challenges: you can't rely on thread-local state to keep track of values that the future cares about, because the next time it's polled it may be on a different thread where that state is not available (or it's set to a different value).

Instrumented

As much as possible, we want our instrumentation code to behave correctly no matter what runtime our code is being executed on.
Let's assume that a future is a unit of work we want to track. We want to know how much time it spent doing work (i.e. inside poll), as well as how long it takes in terms of wall clock time.
Based on what we discussed so far, it follows that:

  • Any tracing-specific state that is associated to our future (i.e. its span handle) must be stored in the future itself.
  • Whenever our future is polled, we need to re-enter its span.
  • Whenever our future returns from poll, we need to exit its span.

This is the exact strategy implemented in the tracing crate via the Instrumented future:

// I have spared you the Pin-related stuff here, for simplicity.
// You can see the full implementation with all the details in `tracing`'s source code.
pub struct Instrumented<T> {
    inner: T,
    span: Span,
}

impl<T: Future> Future for Instrumented<T> {
    type Output = T::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let _enter = self.span.enter();
        self.inner.poll(cx)
    }
}

It wraps around the future you want to instrument and bundles it together with its tracking span. Every time poll is called, we enter the span and then exit it when it returns. One of those cases where the actual implementation is very simple compared to the rationale that justifies it!

Exercise

The exercise for this section is located in 01_structured_logging/12_futures

Instrumenting asynchronous code - Part 2

Instrumented is neat, but ergonomics are a bit poor: in order to attach a span to an invocation of an async function we need to write a bunch of wrapping glue code.

Wrapping glue code... we've complained about it before, haven't we? Yes! When attaching a span to a synchronous function invocation.

#[tracing::instrument] is here to rescue us, again! The macro is smart enough to detect that the function it is applied to is asynchronous: it will automatically generate the required glue code and attach the span using .instrument.
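A sketch of both approaches (the functions are made up for illustration):

use tracing::{info_span, instrument, Instrument};

// With the macro: it detects that the function is `async` and attaches the
// span to the returned future instead of holding it across `.await` points.
#[instrument]
async fn fetch_user(user_id: u64) {
    // [...]
}

// The manual equivalent, using the `Instrument` extension trait.
async fn fetch_user_manually(user_id: u64) {
    async move {
        // [...]
    }
    .instrument(info_span!("fetch_user_manually", user_id))
    .await
}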

Exercise

The exercise for this section is located in 01_structured_logging/13_futures_2

Interoperability

You have seen first-hand how much tracing brings to the table. Nonetheless, migrating an entire ecosystem takes time: you can still find crates that rely on log for instrumentation.

We need a way to bridge the gap between the two worlds: you should be able to use a single pipeline to process telemetry coming from both sources.

tracing-log

The tracing-log crate can act as such a bridge: it provides you with a log-compatible processor which redirects all log events to tracing.

Be mindful: if you're using tracing_subscriber::fmt as your tracing subscriber, it'll automatically install this bridge for you unless you explicitly disable the tracing-log feature.
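If you're assembling your own subscriber instead, installing the bridge yourself is a one-liner (a sketch; LogTracer comes from the tracing-log crate):

// Re-emit every `log` record as a `tracing` event, so a single subscriber
// can process telemetry from both ecosystems.
tracing_log::LogTracer::init().expect("failed to install the log bridge");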

Exercise

The exercise for this section is located in 01_structured_logging/14_interop

Error handling

When monitoring a system, we are primarily interested in failures. We want to know when something goes wrong, and we want to know why.

This section of the course focuses on failures—in particular, how to capture their occurrence in our telemetry data.

Enjoy!

Exercise

The exercise for this section is located in 02_failures/00_intro

The Error trait

In Rust, errors are expected to implement the Error trait (there are some exceptions, but we'll ignore them for now).

The Error trait is defined in the standard library, and looks like this:

use std::fmt::{Debug, Display};

pub trait Error: Debug + Display {
    fn source(&self) -> Option<&(dyn Error + 'static)>;
}

Let's unpack the definition:

  • All errors must implement the Debug trait. This representation is primarily intended for operators. It is likely to expose internal details of the error and the system it was emitted from. In most cases, you can just use #[derive(Debug)] for your error type.
  • All errors must implement the Display trait. This representation is primarily designed for users. It should be understandable by a person who is not familiar with (nor has access to) the internals of the system.
  • An error may travel through multiple "layers" in your application. E.g. a failure to execute a query might arise from a network error, which in turn might be caused by a DNS resolution failure. Each additional semantic layer is often represented as a wrapper over the original error. The source method allows you to walk the chain of errors and inspect each layer.

What does this mean for us?

From a telemetry perspective, we should be careful and try to capture as much context as possible!
The Display representation is likely to omit details which are going to be necessary to troubleshoot. The Debug representation will be more verbose, but it might still miss some crucial details which are only available in the source chain—e.g. it might say "I failed to run SELECT * FROM TABLE" but it won't tell you "Timed out while trying to connect to XYZ.com".

My suggestion is to be slightly wasteful and capture all three representations. They're not going to be 100% orthogonal, but you'll be maximising your chances of capturing all the information you need.
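As a sketch of what "slightly wasteful" might look like (the field names are illustrative, not a standard):

use std::error::Error;

fn log_failure(e: &(dyn Error + 'static)) {
    // Walk the `source` chain and join each layer into a single string.
    let mut chain = e.to_string();
    let mut source = e.source();
    while let Some(cause) = source {
        chain.push_str(": ");
        chain.push_str(&cause.to_string());
        source = cause.source();
    }
    tracing::error!(
        error.display = %e,
        error.debug = ?e,
        error.chain = %chain,
        "operation failed"
    );
}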

Exercise

The exercise for this section is located in 02_failures/01_error_trait

Error report types

In the previous exercise I mentioned that not all error types actually implement the Error trait. The primary exceptions to this rule are error report types.

An error report type is a type which is used to report about an error, with the potential addition of some extra context. They're often used as opaque error types, i.e. a way to hide the actual type of the underlying error from the user when it is not relevant or it would be a breach of semantics to expose it.

Their common characteristic is that they want to implement a conversion from an arbitrary error type (E: Error) into themselves. This, in turn, means that they can't implement the Error trait themselves, as that would result in a conflicting From implementation (see https://github.com/dtolnay/anyhow/issues/25#issuecomment-544140480).

Examples in the ecosystem include:

  • Box<dyn Error>, the OG opaque error type
  • anyhow::Error, a more ergonomic version of Box<dyn Error>
  • eyre::Report, a fancier version of anyhow::Error

and many more.

It is fairly common to upcast all errors to one of these types at the boundary of your application, and then use that opaque error type for all your error handling (and reporting) needs. When that's the case, you need to carefully study the documentation of the error report type you're using: does it expose a source method, even if it doesn't implement the Error trait? Does its Debug implementation already expose the entire source chain (e.g. anyhow)?

By adapting your strategy to the specific type you're working with you can avoid emitting duplicate information, or missing out on important details.
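For example, with anyhow::Error (a sketch; check the documentation of whichever report type you actually use): its Debug output already includes the whole source chain, and .chain() lets you walk each layer explicitly.

fn log_report(e: &anyhow::Error) {
    // `{:?}` on `anyhow::Error` prints the top-level error plus its source chain.
    tracing::error!(error.debug = ?e, "operation failed");

    // Or walk the chain yourself, one layer at a time.
    for (depth, cause) in e.chain().enumerate() {
        tracing::error!(depth, cause = %cause, "error source");
    }
}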

Exercise

The exercise for this section is located in 02_failures/02_error_reporters

Panic handling

There are two types of failures in Rust:

  • recoverable errors, using the Result type and the Error trait
  • panics, (almost) unrecoverable failures that unwind the stack

It's time to talk about the latter!

Panics

You should trigger a panic when:

  • you encounter a situation that you can't recover from
  • it would be unsafe to continue
  • it's not worth the effort to perform any kind of more sophisticated error handling.

A few examples:

  • you're missing a resource that you absolutely need to do your job
  • an internal invariant of your application has been violated
  • you're in testing code and you don't want to continue if a specific condition is not met

Panics and telemetry

Regardless of the source, you should build a telemetry pipeline that knows how to handle panics.

Whenever a panic is triggered, Rust will start unwinding the stack and will eventually invoke the panic hook for your application.
The default hook will print the panic message (and, if enabled, a backtrace) to stderr before the thread is unwound or the process is aborted, depending on your panic strategy. This may be good enough if you're running your application in a terminal, but it's not going to be very useful in a production environment: you want the panic information to be captured by the same telemetry pipeline that you're using for everything else.

We need a custom panic hook!
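A minimal sketch of such a hook, routing panic information through tracing (the field names are illustrative; the exercise will guide you towards a more complete version):

std::panic::set_hook(Box::new(|panic_info| {
    // The payload is usually a `&str` or a `String`, but it can be any type.
    let payload = if let Some(s) = panic_info.payload().downcast_ref::<&str>() {
        *s
    } else if let Some(s) = panic_info.payload().downcast_ref::<String>() {
        s.as_str()
    } else {
        "<non-string panic payload>"
    };
    let location = panic_info
        .location()
        .map(ToString::to_string)
        .unwrap_or_default();
    tracing::error!(panic.payload = payload, panic.location = %location, "a panic occurred");
}));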

Exercise

The exercise for this section is located in 02_failures/03_panics

Metrics

Early detection is a key capability if you're operating a highly available service: you need to quickly spot if something is wrong in order for your first-response team to engage.

How does it work, in practice?

SLAs, SLOs and SLIs

It's a common industry practice to define a set of service level indicators (SLIs) for your service. They capture a set of measurable dimensions that highly correlate with your users having a good experience. For a payment platform, for example, you might monitor the success rate of payments, the settlement time (i.e. how long it takes for the counterpart to receive the funds), notification latency (i.e. how long it takes for the user to be notified of the outcome), etc.

Once you've identified a good set of SLIs, you need to set targets. If those targets are part of a contract with your users, they are usually called service level agreements (SLAs). If they are instead "self-imposed", they're called service level objectives (SLOs). It's common to have both in a mature business, with each SLO being stricter than its SLA counterpart.

When those targets are breached, an alert is triggered and one or more operators have to engage, investigate and/or try to bring the system back to a working state.

Measuring

Let's stress it again: your SLIs need to be measurable. But how does that actually work?

There are two common strategies:

  • You "metricise" your structured logs. E.g. you compute your API error rate by counting the number of log records where span.kind is set to server and status code is not in the 2xx/3xx range.
  • You pre-compute the relevant metrics in your application and then export the result to a dedicated system. E.g. every time you finish handling a request, you bump a counter based on the returned status code.

The first strategy is more flexible (e.g. you can define new alerts without having to touch the application, assuming your structured logs are rich enough), but it requires you to ship more data to your telemetry system where it will be processed further to output the number you're looking for. Depending on your scale, this may be either too slow, too wasteful or too expensive.

The opposite applies to pre-aggregated metrics: they are cheaper to ship and store, but they are rather inflexible.

My recommendation is to follow a hybrid strategy:

  • Collect structured logs (with a smart sampling strategy) to ask arbitrary questions about your systems and dive deep.
  • Pre-compute all SLIs and other relevant quantities that you know you need to monitor.

It's also worth pointing out that some quantities do not neatly map to logs and are instead well-suited to be collected as metrics with a time-based frequency. E.g. the balance of your company's operating account.

Next

This part of the workshop will focus on the metrics crate, our tool of choice for collecting and exporting pre-aggregated metrics in our Rust applications.

Get ready!

Exercise

The exercise for this section is located in 03_metrics/00_intro

The metrics crate

How does the metrics crate work? log used the facade pattern, tracing used the facade pattern... You guessed it, metrics uses the facade pattern as well!

It exposes a set of macros to create and manipulate metrics. The metric data is then forwarded to an implementor of the metrics::Recorder trait, which takes care of processing it.

It should all feel pretty familiar by now!

Exercise

The exercise for this section is located in 03_metrics/01_facade

Describe

Let's break down what happened in the last exercise. You created a counter: a monotonically increasing value.
Once it has been created, its value can only be incremented.

You can create a counter using the counter! macro. You can then call increment on the Counter returned by the macro to increment its value.

Counter exposes another method, absolute, which is a bit special: it's primarily designed to initialize a counter sequence, specifying a non-zero initial value. Beware: recorders will enforce the monotonicity property for counters, so you can't rely on absolute to artificially decrement an existing counter.

When using counter!, the Recorder implementation is expected to:

  • Create a new counter with the name you specified, if one doesn't exist;
  • Retrieve the counter, if one exists.

UPSERT behaviour, for the SQL-minded among us.

Metadata

It can be useful to add metadata to your counters (and metrics in general).
You can rely on the describe_* macros offered by the metrics crate: they let you add a unit (e.g. capture that the counter value represents bytes or seconds) and a description (which recorders can then choose to expose when metrics are exported/analyzed).
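A sketch of what that looks like (the metric name and unit are illustrative):

use metrics::{counter, describe_counter, Unit};

// Attach a unit and a description to the metric name, once, at startup.
describe_counter!(
    "ingested_bytes_total",
    Unit::Bytes,
    "Total number of bytes ingested since the application started"
);

// Then use the counter as usual, anywhere in the codebase.
counter!("ingested_bytes_total").increment(1024);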

Exercise

The exercise for this section is located in 03_metrics/02_describe

Labels

Not all metric samples are equal. You start tracking the number of requests received by your API. But knowing that number on its own is not enough: your API offers multiple endpoints, where is that traffic actually going?

You could solve the issue by using a separate metric for each endpoint. It would work! But it'd make it more difficult to manipulate those metrics in your tool of choice: it's much easier to look for a standard "request_count_total" than a bespoke "request_subscribe_count_total". You might also want to visualize them all in a single dashboard, using a separate colour for each endpoint: having to add each endpoint manually would be tedious.

The metric world came up with a solution: when you want to capture a certain dimension, but be able to break it down further, you add labels.

Labels are a set of key-value pairs that you specify when interacting with a metric. Under the hood, it behaves exactly like having separate metric series: recorders will track each combination of (metric name, [(label key, label value)]) as its own unique metric.
But we retain the semantic understanding that they are all "subsets" of an overall measurement.
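With the metrics macros, labels are passed as trailing key-value pairs (a sketch; the names are illustrative):

use metrics::counter;

// Same metric name, two different label sets: the recorder tracks them
// as two distinct series under one shared name.
counter!("http_requests_total", "endpoint" => "/subscribe", "method" => "POST").increment(1);
counter!("http_requests_total", "endpoint" => "/login", "method" => "POST").increment(1);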

Cardinality

Be extremely careful with labels!
You should only rely on labels for tracking dimensions that have a well-known and bounded cardinality (the set of possible values)—e.g. you don't want to add "user_id" as a label on your metrics!
Each unique label value creates its own metric series, significantly increasing the amount of memory and work necessary to keep track of those metrics in your application.

The number of metric series you are effectively creating/storing/exporting scales as the product of the cardinality of each label:

# metric series = { # unique values for label 1 } x .. x { # unique values for label N }

That grows fast. If you don't keep an eye on label cardinality you're likely to blow up whatever tool you are exporting metric data to. If it doesn't fail over, you can expect to be punished at the end of the next billing cycle.

You've been warned!

Exercise

The exercise for this section is located in 03_metrics/03_labels

Prometheus

Honeycomb is my go-to recommendation when working with structured logs. What about metrics?

The most common option, by far, is Prometheus. It's a timeseries database that has seen incredible adoption and it's now deployed in production environments at all levels of scale. It is often used in combination with:

  • Grafana, a dashboard tool to explore and visualize metric data
  • Alertmanager, a system to trigger alerts based on metric queries (e.g. alert me when the % of failed requests exceeds 5% for more than 15 minutes)

Explaining the ins and outs of these systems is outside the scope of this workshop—it'd have to be a whole workshop on its own!

But we can try to set it up once, to get a feel for what they look like.

Push vs Pull

When it comes to metrics, there are two mechanisms for exporting data: push-based or pull-based.

In the push-based model it's the application's job to send metric data to your storage system of choice. This can be done, for example, over the network or by appending the data to a file. The application is also in control of the exporting schedule.

In the pull-based model, instead, the application is passive: it exposes an endpoint for retrieving the current value of all recorded metrics. It is the job of an external system to call (or "scrape") that endpoint to collect the data, usually on a schedule (15s is a common choice for Prometheus).

Prometheus

The recommended Prometheus configuration is pull-based. You can also choose to push metrics though, using their push gateway.

In both cases, you can rely on the metrics-exporter-prometheus crate.
It provides a recorder implementation that can be configured to either listen to incoming HTTP requests or push data on a schedule.
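Setting up the pull-based flavour takes a few lines (a sketch, assuming you're inside a Tokio runtime; the port is an arbitrary choice):

use metrics_exporter_prometheus::PrometheusBuilder;
use std::net::SocketAddr;

// Install a global recorder and serve the scrape endpoint on port 9000.
let addr = SocketAddr::from(([0, 0, 0, 0], 9000));
PrometheusBuilder::new()
    .with_http_listener(addr)
    .install()
    .expect("failed to install the Prometheus exporter");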

What about OpenTelemetry?

The OpenTelemetry specification covers all kinds of telemetry data, including metrics. Nonetheless, the metric story (and the associated tooling) is relatively recent: you might want to do some due diligence before relying on it.

Exercise

The exercise for this section is located in 03_metrics/04_prometheus

Gauges

We have seen the end-to-end journey of a metric series. It's time to take a step back and expand our toolset beyond counters.

We'll start with gauges. Counters are expected to be positive and monotonically increasing: that's not the case for gauges. Gauges are designed to represent the current value of some property of interest—the number of concurrent requests being handled by your API, the number of idle connections in your database connection pool or the available disk space on your server. The value can increase or decrease over time, therefore counters are ill-suited for these kinds of measurements.

The value of a gauge is expected to be a float, rather than an integer. Apart from that, everything else we learned about counters applies to gauges as well (description, units, labels).
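The macro mirrors counter! (a sketch; the metric name is illustrative):

use metrics::gauge;

// Gauges can move in both directions and take floating point values.
gauge!("db_pool_idle_connections").set(10.0);
gauge!("db_pool_idle_connections").decrement(1.0);
gauge!("db_pool_idle_connections").increment(3.0);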

Exercise

The exercise for this section is located in 03_metrics/05_gauges

Histograms

Counters, gauges... just one more metric type to go: histograms!

Histograms are designed to capture (some) statistical properties of a set of observed values. E.g. you might want to know the 99th percentile of your API latency in order to fire an alert if it exceeds a certain threshold.

The precise representation of histograms (both in-memory and in the exported data) depends on the recorder you will be using. In Prometheus, for example, you have to specify a set of bins ("buckets", in Prometheus parlance) and the recorder will count how many values fall into each bin. Since the metrics crate is backend-agnostic, it doesn't expose APIs to configure backend-specific behaviour. You need to check your recorder of choice to perform those customizations (e.g. PrometheusBuilder exposes a few methods to customise the default buckets as well as per-metric buckets).

The observed value for a histogram is expected to be a float, rather than an integer. metrics automatically converts Durations into floats, since tracking how long something takes is the bread and butter of histograms!
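A sketch of recording a duration (the metric name is illustrative):

use metrics::histogram;
use std::time::Instant;

let started_at = Instant::now();
// ... the work being measured ...
// `Duration` is converted to a float automatically.
histogram!("request_duration_seconds").record(started_at.elapsed());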

Everything we learned about counters and gauges applies to histograms as well (description, units, labels).

Exercise

The exercise for this section is located in 03_metrics/06_histogram

Combining everything together

We've covered a lot of ground together: structured logging, failure handling, metrics. I've tried to break each topic down into small bites, empowering you to build up your knowledge incrementally.

It's time to put everything together! Pick your web framework of choice—I recommend either actix-web or axum.

You have to:

  • Configure a tracing subscriber that exports data to both Honeycomb and stdout, in JSON format
  • Configure a suitable panic hook
  • Configure a metric recorder that exposes metric data at /metrics, using a different port than your API endpoints
  • Add one or more middleware to:
    • Create a top-level INFO span for each incoming request
    • Track the number of concurrent requests using a gauge
    • Track request duration using a histogram
    • Track the number of handled requests

All metrics should include success/failure as a label.

Bonus points if you:

  • devise a mechanism for your request handlers to get a handle to the root span, in order to populate span fields with context from your domain layer
  • allow changing the set of filtered/included tracing spans at runtime

I don't have a suite of tests for you here, but please call me in when you're done—I want to see what you come up with!

If you're looking for inspiration, you can check out these solutions:

Exercise

The exercise for this section is located in 03_metrics/07_outro