Welcome
Welcome to our Rust telemetry course!
You'll learn how to build observable applications in Rust.
Observability is about being able to ask arbitrary questions about your environment without—and this is the key part—having to know ahead of time what you wanted to ask.
— Honeycomb
I'll take you on a journey through the Rust ecosystem, exploring the available telemetry solutions and learning how to combine them into a coherent and comprehensive toolkit.
Methodology
This course is based on the "learn by doing" principle.
It has been designed to be interactive and hands-on.
Mainmatter developed this course
to be delivered in a classroom setting, in a single day: each attendee advances
through the lessons at their own pace, with an experienced instructor providing
guidance, answering questions and diving deeper into the topics as needed.
If you're interested in attending one of our training sessions, or if you'd like to
bring this course to your company, please get in touch.
You can also follow the course on your own, but we recommend you find a friend or a mentor to help you along the way should you get stuck. You can also find solutions to all exercises in the solutions branch of the GitHub repository.
Structure
On the left side of the screen, you can see that the course is divided into sections.
Each section introduces a new concept or feature.
To verify your understanding, each section is paired with an exercise that you need to solve.
You can find the exercises in the
companion GitHub repository.
Before starting the course, make sure to clone the repository to your local machine:
# If you have an SSH key set up with GitHub
git clone git@github.com:mainmatter/rust-telemetry-workshop.git
# Otherwise, use the HTTPS URL:
#
# git clone https://github.com/mainmatter/rust-telemetry-workshop.git
We also recommend you work on a branch, so you can easily track your progress and pull in updates from the main repository, if needed:
cd rust-telemetry-workshop
git checkout -b my-solutions
All exercises are located in the exercises folder.
Each exercise is structured as a Rust package.
The package contains the exercise itself, instructions on what to do (in src/lib.rs), and a test suite to automatically verify your solution.
wr, the workshop runner
To verify your solutions, we've provided a tool that will guide you through the course. It is the wr CLI (short for "workshop runner").
Install it with:
cargo install --locked workshop-runner
In a new terminal, navigate back to the top-level folder of the repository.
Run the wr command to start the course:
wr
wr will verify the solution to the current exercise.
Don't move on to the next section until you've solved the exercise for the current one.
We recommend committing your solutions to Git as you progress through the course, so you can easily track your progress and "restart" from a known point if needed.
Enjoy the course!
Author
This course was written by Luca Palmieri, Principal Engineering
Consultant at Mainmatter.
Luca has been working with Rust since 2018, initially at TrueLayer and then at AWS.
Luca is the author of "Zero to Production in Rust",
the go-to resource for learning how to build backend applications in Rust.
He is also the author and maintainer of a variety of open-source Rust projects, including cargo-chef, Pavex and wiremock.
Exercise
The exercise for this section is located in 00_intro/00_welcome
Structured logging
We will start our observability journey with logs.
Logs are the most common kind of telemetry data.
Even developers who have never heard of observability have an intuitive understanding of
the usefulness of logs: logs are what you look at when stuff goes south to understand what is
happening, crossing your fingers extra hard hoping you captured enough information to
troubleshoot effectively.
What are logs though?
The format varies, depending on the epoch, the platform and the technologies you are using.
The simplest starting point: a bunch of text pushed to stdout, with a line break to separate the current record from the next one. For example:
The application is starting on port 8080
Handling a request to /index
Returned a 200 OK
Three perfectly-valid log records for a web server. Enough to understand, from the outside, what is happening.
println!
The standard library contains the most basic logging framework you can think of: println!.
println! is a macro that writes a formatted string to the standard output.
let name = "Alice";
// Prints: Hello, Alice!
println!("Hello, {name}!");
Exercise
The exercise for this section is located in 01_structured_logging/00_intro
The log crate
Coupling
println! and its sibling eprintln! can get you started, but they won't get you very far.
Our ultimate goal is to monitor our applications in production environments. Each production environment is its own little world, with its own constraints and requirements. In particular, expectations around log data vary wildly: some environments expect logs to be written to stdout, others to stderr, others to a file, others to a socket, and so on.
println! breaks down quickly in the face of such requirements: it couples together the information that we want to log (i.e. the message) with the way we want to log it (i.e. the destination, stdout).
To change the destination, we would have to change the code that produces the log message.
That's a problem: that log message might be coming from a third-party library,
or from a part of the codebase that we don't own.
Having to fork those modules to accommodate the logging requirements of the final application
is a non-starter.
Facade
We need to decouple.
On one side, we will have an instrumentation API, used to emit log messages from the
application and its dependencies.
On the other side, we will have the processing code, the logic in charge of deciding
what to do with the log messages that have been produced.
If you are familiar with the Gang of Four's Design Patterns, you might have recognized this as an instance of the facade pattern.
The log crate
The facade pattern only works if the ecosystem, as a whole, standardizes around the same facade. You can then use a single pipeline to collect and process instrumentation data coming from both first party code and third party dependencies.
In the early days of the Rust ecosystem, the log crate arose as the de-facto standard logging facade for Rust applications.
Let's have a look at it!
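To make the facade concrete, here is a minimal sketch of both sides, assuming a hand-rolled logger (the StdoutLogger name is just for illustration): the instrumentation side uses the log macros, while the processing side implements the Log trait and is installed once at startup.
use log::{Level, LevelFilter, Log, Metadata, Record};

// A bare-bones processor: it writes every enabled record to stdout.
struct StdoutLogger;

impl Log for StdoutLogger {
    fn enabled(&self, metadata: &Metadata) -> bool {
        metadata.level() <= Level::Info
    }

    fn log(&self, record: &Record) {
        if self.enabled(record.metadata()) {
            println!("[{}] {} - {}", record.level(), record.target(), record.args());
        }
    }

    fn flush(&self) {}
}

static LOGGER: StdoutLogger = StdoutLogger;

fn main() {
    // Install the processing side once, at startup...
    log::set_logger(&LOGGER).expect("failed to install the logger");
    log::set_max_level(LevelFilter::Info);
    // ...then emit records from anywhere (your code or your dependencies).
    log::info!("The application is starting on port {}", 8080);
}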
Exercise
The exercise for this section is located in 01_structured_logging/01_log_crate
Log levels
The cost of logging
Logs are useful, but they come at a cost.
Formatting, serializing, and writing log messages to a file or a socket takes time. It is
not uncommon to hear about applications that spend more time writing logs than doing actual
work.
As a rule of thumb, we want each log record emitted by our application to be useful.
How do you define "useful"?
Easier said than done.
The usefulness of a log record can be situational: redundant under normal circumstances but vital when trying to troubleshoot a particularly vicious bug.
Usefulness is also subjective: depending on your intent, you might or might not care about a particular log record. This is a challenge for library authors: their software will be used in a variety of contexts, and it's hard to predict what information will be useful to the final application.
How do we reconcile these needs?
Log filtering
The common solution is to use log filters.
A log filter is a predicate that determines whether a log record should be emitted or not.
The filters are defined by the final binary, the application, therefore they are tuned based
on the application's needs.
Library authors can instrument their code (more or less) freely, knowing that the application owner will be in charge of deciding what they want or don't want to see.
Filters rely primarily on two pieces of information:
- the log level of the record, a measure of its importance
- the module that emitted the record, also known as the target.
Filters are evaluated at runtime for each log record, therefore they aren't entirely zero-cost, but they let us skip the most expensive parts of the logging process (formatting, serializing, emitting), therefore amortizing the cost of "unnecessary" log records.
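As a minimal sketch of what runtime filtering looks like in practice, here is the env_logger crate acting as the processing side (noisy_dependency is a placeholder for one of your dependencies): the filter belongs to the application binary and is configured via the RUST_LOG environment variable.
use log::{debug, info};

fn main() {
    // The application decides the filter, e.g. by running with:
    //   RUST_LOG=info,noisy_dependency=warn
    env_logger::init();

    info!("this record passes an `info`-level filter");
    debug!("this one is discarded before any expensive formatting happens");
}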
Compile-time filtering
There is an exception though: you can filter out log records based on their level at compile time, using some of the cargo features exposed by the log crate.
The filtered log statements disappear entirely: they don't exist in the compiled binary, therefore they don't incur any runtime cost.
It is a great option to remove the noisiest log statements (e.g. TRACE-level), although it's not a silver bullet: you can't use it to filter out log records based on their source.
Exercise
The exercise for this section is located in 01_structured_logging/02_log_levels
What to log
We have talked extensively about the mechanics of logging (how to log, how to implement a logger, how to filter), but we haven't really talked about what to log.
Why do we log?
To determine what to log, we need to stop for a second and think about the purpose of logging. We use logging as a way to determine, from the outside, what is going on inside our applications. In particular, as a way to diagnose if our applications are not behaving as expected.
The "easy" solution would be to log everything, a full-fidelity representation of the internal state of our applications.\ Unfortunately, that's usually not feasible: the cost of producing and storing that information would be prohibitive.
Logs, therefore, are necessarily a lossy representation. We need to choose carefully what to log in order to maximize our chances of spotting* problems and being able to troubleshoot them.
Unit of work
A good rule of thumb is to view your application logic as a series of units of work.
Each unit of work has a start and an end, and may in turn contain other sub-units of work.
For each of those units of work, we'll surely want to know:
- how long it took to complete, i.e. its duration
- whether it completed successfully or not, i.e. its outcome
We'll get both of these pieces of information if we emit two log records for each unit of work: one at the start and one at the end.
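As a sketch of the idea with plain log records (process_payment is a stand-in for one of your own operations), the start record marks the beginning of the unit of work, while the end record carries both the outcome and the duration:
use std::time::Instant;
use log::info;

fn process_payment() -> Result<(), String> {
    info!("START - process payment");
    let started_at = Instant::now();

    // ...the actual work would happen here...
    let outcome: Result<(), String> = Ok(());

    let elapsed_ms = started_at.elapsed().as_millis();
    match &outcome {
        Ok(_) => info!("END - process payment - success - {elapsed_ms}ms"),
        Err(e) => info!("END - process payment - failure ({e}) - {elapsed_ms}ms"),
    }
    outcome
}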
How fine-grained should the units of work be?
If you take this approach to the extreme, you could model each function call as a unit of work. While that may be useful in some scenarios, it's not a good default.
You should consider a unit of work as a meaningful piece of work, one that may occupy a significant amount of time with respect to the duration of the over-arching operation (e.g. processing an incoming request).
Let's make an example in the context of a web server.
Parsing a header value is probably not a good candidate for a unit of work—it's very fast and
it's unlikely to vary much in duration.
On the contrary, parsing the body of a request is a good candidate: it's likely to take a
significant amount of time, and it's likely to vary in duration depending on the size of the
body itself and the way it is being sent to the server.
Always take the over-arching context into account when determining what should or should not be treated as a unit of work.
Exercise
The exercise for this section is located in 01_structured_logging/03_what_to_log
The tracing crate
log is not the right abstraction
In the previous exercise, we laid out the key abstraction we need to model our applications:
units of work.
We've also tried to leverage the log crate to properly track them, but you may have noticed that it's not a perfect fit.
The log crate is structured around the concept of log events. There is no duration. There is also no relationship between one event and another, while we have seen how our units of work are naturally organised in a hierarchical fashion—a unit of work may in turn contain multiple sub-units of work, and so on.
We need to change our instrumentation library to one that is more suited to our needs: the tracing crate.
tracing::Span
The tracing crate models units of work as spans.
A span is a unit of work that has a start and an end.
A span can also have a parent, which naturally introduces that hierarchical relationship we were looking for.
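Here is a minimal sketch of what that looks like (handle_request and parse body are hypothetical names): entering a span while another one is active records the new span as its child.
use tracing::info_span;

fn handle_request() {
    let request_span = info_span!("handle request");
    let _request_guard = request_span.enter();

    // Created (and entered) while `request_span` is active,
    // so it is recorded as a child of `request_span`.
    let parse_span = info_span!("parse body");
    let _parse_guard = parse_span.enter();

    // ...the actual work would happen here...
}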
The richer data model translates into a more complex interface, both on the instrumentation
side (i.e. the code that emits spans) and on the consumer side (i.e. the code that processes
spans).
This is one of the reasons I chose to talk about log first: it's a gentler introduction to the overall facade pattern, and by probing its limitations you (hopefully) have a better understanding of the rationale behind tracing's additional complexity.
Exercise
The exercise for this section is located in 01_structured_logging/04_tracing
Enriching spans with structured fields
The test assertions in the previous exercise have highlighted some of the tracing advantages we talked about—e.g. the hierarchical relationship between spans, which our subscriber translated into a prefix-based structure for the log output.
We haven't fully replicated our log version though:
1. we're not capturing the outcome of each unit of work
2. we're not capturing the duration of each unit of work
2. is easily achieved by tweaking the configuration for our subscriber (.with_timer(Uptime)), so let's focus on 1., the outcome.
To pull it off in an idiomatic way, we need to learn about span fields.
Span fields
tracing allows us to attach key-value pairs to spans, which we refer to as span fields.
The syntax looks like this:
tracing::info_span!("my special task", foo = 42, bar = "hello");
You can also change the value of a field after the span has been created using .record:
let span = tracing::info_span!("my special task", foo = 42);
span.record("foo", 43);
The way fields are handled depends on the subscriber we're using, just like the span data itself.
Fields must be defined upfront
It's important to point out one key limitation of span fields: they must be known when the span is created. In other words:
// Don't do this in prod.
let span = tracing::info_span!("my special task");
span.record("foo", 43);
won't work because the field foo is not defined when the span is created. No error will be raised, but the field will be ignored at runtime—that record call will have no effect whatsoever.
You may be wondering: what if I don't know the value of a field upfront?
Good question! You can use tracing::field::Empty as its value when defining the span:
// This works!
let span = tracing::info_span!("my special task", foo = tracing::field::Empty);
span.record("foo", 43);
This limitation stems from tracing's commitment to low overhead: knowing the fields upfront allows tracing to avoid a few heap allocations in the hot path of your application.
Exercise
The exercise for this section is located in 01_structured_logging/05_fields
Ergonomics
Earlier I stated that we shouldn't treat each function as its own unit of work—it's way too verbose and it leads to a lot of noise in the logs.
Nonetheless, it is often the case that a unit of work maps to a certain function. The work
starts when that function is invoked and ends when it returns. That was indeed the case in the
previous exercise.
When it happens, we can use the instrument macro to avoid having to manually create a span inside the function body.
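A minimal sketch of what that looks like (process_order and its argument are hypothetical): the macro wraps the function body in a span named after the function, with its arguments captured as fields.
use tracing::instrument;

// Roughly equivalent to creating and entering an
// `info_span!("process_order", order_id = ...)` at the top of the function body.
#[instrument]
fn process_order(order_id: u64) {
    // ...the unit of work...
}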
Exercise
The exercise for this section is located in 01_structured_logging/06_instrument_macro
Structured output
In all our exercises (and tests), we've been looking at the emitted diagnostics for a single invocation of an instrumented function. In the real world, however, we can expect to have hundreds if not thousands of invocations per second, each of which will emit its own diagnostics and might result in a different outcome.
In order to make sense of all this data, we need to be able to analyze it.
We might want to know, for example, how many failures we have had in the last hour. Queries can also get more complex: how many failures have we had in the last hour, broken down by error type and by user?
So far we've always been looking at a single line of text which interpolates the target of the span, the values of our fields and other metadata (e.g. timing) into a human-readable string. That kind of representation is not very amenable to machine processing.
In order to answer those questions, we need to be able to structure our output.
The simplest data format we can use is JSON.
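A minimal sketch of how that might look with tracing_subscriber, assuming its json feature is enabled:
fn main() {
    // Emit one JSON object per span/event instead of human-readable text.
    tracing_subscriber::fmt().json().init();

    tracing::info!(user_id = 42, "something happened");
}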
Exercise
The exercise for this section is located in 01_structured_logging/07_structured_output
OpenTelemetry
While looking at JSON data in a terminal might be a good start, it's not very practical.
Analyzing telemetry data using jq
or scrolling through a terminal window is fairly tedious
and error-prone.
In most real-world scenarios, we'll want to export our telemetry data to a centralized
system that provides advanced analysis capabilities.
In the bad old days, you would have had to pick a vendor and use their SDK to export your data to their systems in whatever bespoke format they devised.
Thankfully, the industry has moved on and we now have a vendor-neutral format for exporting
telemetry data: OpenTelemetry.
It has gained a significant foothold and it's now supported by most observability vendors as
a valid (if not preferred!) format for data ingestion.
Exercise
The exercise for this section is located in 01_structured_logging/08_opentelemetry
tracing subscribers
So far we've focused on the instrumentation side of tracing—spans, fields, etc. But we haven't talked about how to consume the data we're producing.
Subscriber
To see tracing in action we've already had to play with different processors: interpolated text to the console, JSON, and OpenTelemetry.
They are all examples of tracing subscribers—i.e. implementors of the Subscriber trait.
A subscriber is much more complex than what we have previously seen in this workshop, the Log trait from the log crate.
tracing offers a richer data model, therefore pushing more responsibilities to the subscriber. For example: each span has to be assigned a unique ID, which is used to link spans together in a hierarchical structure.
Complexity is further compounded by the fact that tracing is designed to be extremely low-overhead, to maximise its applicability in production environments.
Registry
It's rare to find yourself implementing the Subscriber trait from scratch.
More often than not, you'll rely on Registry as your foundation.
It takes care of all the complicated bits (e.g. span ID generation and management) and exposes a simple(r) interface for you to implement: the Layer trait.
Layer
Even better: you don't even have to implement Layer yourself (unless you want to).
You can combine multiple layers together using the with method exposed by the SubscriberExt trait.
You just stack multiple layers from the ecosystem on top of each other, downstream of a Registry, and you're good to go.
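A minimal sketch of that stacking, assuming tracing-subscriber is compiled with its env-filter feature:
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, Registry};

fn main() {
    // A `Registry` at the bottom, with two off-the-shelf layers stacked on top:
    // an environment-based filter and a console formatter.
    Registry::default()
        .with(tracing_subscriber::EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer())
        .init();
}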
Exercise
The exercise for this section is located in 01_structured_logging/09_subscriber
The lifecycle of a span
All the instrumentation work we have done so far has happened in a
synchronous context.
We have only used async Rust on the processing side, to use non-blocking IO to ship telemetry
data to a sink over the network (i.e. Honeycomb's API).
But what if we wanted to instrument asynchronous code?
It's trickier!
There are some subtleties to be aware of, at the intersection between the lifecycle of a tracing span, work-stealing async runtimes and tracing's internal state management.
It's a lot to cover in a single exercise, so we'll break it down into smaller chunks.
Let's start with the lifecycle of a span.
A span was born—what then?
You've seen in the first tracing exercises that there are two distinct phases in the lifecycle of a span:
- creation, i.e. what happens when tracing::<level>_span! is called;
- entering, i.e. what happens when you call .enter() on a span, getting back a RAII guard.
A span can only be created once, but it can be entered multiple times.
use tracing::info_span;

let span = info_span!("my span");
{
    let _guard = span.enter();
    // Do something
}
{
    // We re-enter the same span!
    let _guard2 = span.enter();
    // Do something else
}
Why would you do that, you may be wondering?
Well, your span may be tracking a piece of work that can be paused (async, remember?).
Creation and entering have two matching phases: closing and exiting.
A span is exited when the guard object returned by .enter() is dropped.
A span is closed, instead, when there are no more handles to it. It's the subscriber's responsibility to do this bookkeeping.
When a span is entered a single time, the closing and exiting phases usually happen one after the other. This is what has happened so far in our exercises.
Exercise
The exercise for this section is located in 01_structured_logging/10_span_lifecycle
Thread local state
How does tracing know which span is currently active?
That's necessary to implement a few of the features we've seen so far:
- Span::current(), which returns a handle to the span that is currently active
- Attaching a new span as a child of the currently active span
The answer is thread local state.
thread_local!
Rust's standard library exposes a thread_local! macro that allows you to define a variable that is local to the current thread.
It's a bit like static, but it's not shared across threads.
It comes with a caveat: if you move to a different thread, you won't be able to access the value you set on the previous thread.
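A minimal sketch of that caveat:
use std::cell::Cell;

thread_local! {
    // Every thread gets its own, independent copy of this value.
    static COUNTER: Cell<u32> = Cell::new(0);
}

fn main() {
    COUNTER.with(|c| c.set(c.get() + 1));

    std::thread::spawn(|| {
        // On the spawned thread the counter starts from 0 again:
        // the value set on the main thread is not visible here.
        assert_eq!(COUNTER.with(|c| c.get()), 0);
    })
    .join()
    .unwrap();

    assert_eq!(COUNTER.with(|c| c.get()), 1);
}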
Spawning threads breaks the hierarchy
tracing uses thread local state to keep track of the currently active span.
This has an interesting implication: if you spawn a thread to do some work, the spans
created in that thread will not be linked to the spans created in the parent thread.
use tracing::{info_span, info};

let spawner_span = info_span!("spawner");
let _guard = spawner_span.enter();

std::thread::spawn(|| {
    // This is NOT a child of `spawner_span`!
    let spawned_span = info_span!("spawned");
    // [...]
});
This is something to be aware of when you're using tracing in a multithreaded environment.
You have three options:
- Leave the spans unlinked. This is OK if the two units of work are actually unrelated.
- Explicitly mark the spawned span as a child of the spawner span. This is desirable if the spawner span won't be closed until the spawned span is closed (e.g. if you are waiting for the new thread to finish).
- Explicitly mark the spawned span as a "follower" of the spawner span. This maintains a connection between the two spans, but it doesn't require the spawner span to be kept open. This works pretty well when spawning background tasks that might complete after the spawner unit of work is done.
Exercise
The exercise for this section is located in 01_structured_logging/11_thread_local_state
Instrumenting asynchronous code
We're finally ready to tackle the big boss: instrumenting asynchronous code.
Let's start by brushing up on Rust's asynchronous programming model. If you're familiar with the details, feel free to skim through!
Futures
The cornerstone of Rust's asynchronous programming model is the Future trait.
pub trait Future {
type Output;
// Required method
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}
We refer to an implementor of the Future trait as a future.
A future represents a computation that will eventually produce a value (of type Output).
Futures are designed to be lazy—if you want them to make progress on whatever job they're doing, you need to repeatedly call their poll method until they're done, i.e. Poll::Ready is returned.
Async runtimes
The component in charge of polling a future is known as an asynchronous runtime. Rust's standard library does not include a "default" asynchronous runtime. You need to bring your own, usually as a third-party dependency.
When it comes to tracing, there are two key things to keep in mind about asynchronous runtimes:
- When a certain future can't make any more progress (i.e. it returns Poll::Pending) the runtime will usually try to run (on that very thread!) another future that has been queued up. This is what allows concurrency to be highly efficient!
- Many popular runtimes (e.g. tokio or async-std) are built around the concept of multi-threaded work-stealing. The next time a future is polled, it may be on a different thread. This model comes with some advantages: you don't need to worry too much about balancing your work across threads, the runtime will transparently take care of it for you.
It also introduces some challenges: you can't rely on thread-local state to keep track of values that the future cares about, because the next time it's polled it may be on a different thread where that state is not available (or it's set to a different value).
Instrumented
As much as possible, we want our instrumentation code to behave correctly no matter what runtime
our code is being executed on.
Let's assume that a future is a unit of work we want to track. We want to know how much time it spent doing work (i.e. inside poll), as well as how long it takes in terms of wall clock time.
Based on what we discussed so far, it follows that:
- Any tracing-specific state that is associated to our future (i.e. its span handle) must be stored in the future itself.
- Whenever our future is polled, we need to re-enter its span.
- Whenever our future returns from poll, we need to exit its span.
This is the exact strategy implemented in the tracing crate via the Instrumented future:
// I have spared you the Pin-related stuff here, for simplicity.
// You can see the full implementation with all the details in `tracing`'s source code.
pub struct Instrumented<T> {
inner: T,
span: Span,
}
impl<T: Future> Future for Instrumented<T> {
type Output = T::Output;
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
let _enter = self.span.enter();
self.inner.poll(cx)
}
}
It wraps around the future you want to instrument and bundles it together with its tracking span.
Every time poll is called, we enter the span and then exit it when it returns.
One of those cases where the actual implementation is very simple compared to the rationale
that justifies it!
Exercise
The exercise for this section is located in 01_structured_logging/12_futures
Instrumenting asynchronous code - Part 2
Instrumented is neat, but ergonomics are a bit poor: in order to attach a span to an invocation of an async function we need to write a bunch of wrapping glue code.
Wrapping glue code... we've complained about it before, haven't we? Yes! When attaching a span to a synchronous function invocation.
#[tracing::instrument] is here to rescue us, again!
The macro is smart enough to detect that the function it is applied to is asynchronous: it will automatically generate the required glue code and attach the span using .instrument.
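A minimal sketch (fetch_user is a hypothetical async function): the span is attached to the returned future, so it is entered and exited on every poll.
use tracing::instrument;

#[instrument]
async fn fetch_user(user_id: u64) {
    // ...await other futures here; the `fetch_user` span is re-entered
    // every time this future is polled...
}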
Exercise
The exercise for this section is located in 01_structured_logging/13_futures_2
Interoperability
You have seen first-hand how much tracing brings to the table.
Nonetheless, migrating an entire ecosystem takes time: you can still find crates that rely on log for instrumentation.
We need a way to bridge the gap between the two worlds: you should be able to use a single pipeline to process telemetry coming from both sources.
tracing-log
The tracing-log crate can act as such a bridge: it provides you with a log-compatible processor which redirects all log events to tracing.
Be mindful: if you're using tracing_subscriber::fmt as your tracing subscriber, it'll automatically install this bridge for you unless you explicitly disable the tracing-log feature.
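If you're not relying on that automatic setup, a minimal sketch of installing the bridge manually looks like this:
fn main() {
    // Route all `log` records into the `tracing` pipeline.
    tracing_log::LogTracer::init().expect("failed to install the log-to-tracing bridge");

    // A `log` record emitted by an old-school dependency (or your own code)...
    log::info!("legacy record");
    // ...is now delivered to whichever `tracing` subscriber is installed.
}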
Exercise
The exercise for this section is located in 01_structured_logging/14_interop
Error handling
When monitoring a system, we are primarily interested in failures. We want to know when something goes wrong, and we want to know why.
This section of the course focuses on failures—in particular, how to capture their occurrence in our telemetry data.
Enjoy!
Exercise
The exercise for this section is located in 02_failures/00_intro
The Error trait
In Rust, errors are expected to implement the Error trait (there are some exceptions, but we'll ignore them for now).
The Error trait is defined in the standard library, and looks like this:
use std::fmt::{Debug, Display};

pub trait Error: Debug + Display {
    fn source(&self) -> Option<&(dyn Error + 'static)>;
}
Let's unpack the definition:
- All errors must implement the Debug trait. This representation is primarily intended for operators. It is likely to expose internal details of the error and the system it was emitted from. In most cases, you can just use #[derive(Debug)] for your error type.
- All errors must implement the Display trait. This representation is primarily designed for users. It should be understandable by a person who is not familiar with (nor has access to) the internals of the system.
- An error may travel through multiple "layers" in your application. E.g. a failure to execute a query might arise from a network error, which in turn might be caused by a DNS resolution failure. Each additional semantic layer is often represented as a wrapper over the original error. The source method allows you to walk the chain of errors and inspect each layer.
What does this mean for us?
From a telemetry perspective, we should be careful and try to capture as much context
as possible!
The Display representation is likely to omit details which are going to be necessary to troubleshoot.
The Debug representation will be more verbose, but it might still miss some crucial details which are only available in the source chain—e.g. it might say "I failed to run SELECT * FROM TABLE" but it won't tell you "Timed out while trying to connect to XYZ.com".
My suggestion is to be slightly wasteful and capture all three representations. They're not going to be 100% orthogonal, but you'll be maximising your chances at capturing all the information you need.
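A minimal sketch of a helper that walks the source chain, to be captured alongside the Display and Debug representations (the function name is just for illustration):
use std::error::Error;
use std::fmt::Write;

fn error_chain(e: &dyn Error) -> String {
    let mut rendered = String::new();
    write!(&mut rendered, "{e}").unwrap();

    let mut source = e.source();
    while let Some(cause) = source {
        write!(&mut rendered, "\n  Caused by: {cause}").unwrap();
        source = cause.source();
    }
    rendered
}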
Exercise
The exercise for this section is located in 02_failures/01_error_trait
Error report types
In the previous exercise I mentioned that not all error types actually implement the Error trait.
The primary exceptions to this rule are error report types.
An error report type is a type which is used to report an error, with the potential addition of some extra context. They're often used as opaque error types, i.e. a way to hide the actual type of the underlying error from the user when it is not relevant or it would be a breach of semantics to expose it.
Their common characteristic is that they want to implement a conversion from an arbitrary error type (E: Error) into themselves.
This, in turn, means that they can't implement the Error trait themselves, as that would result in a conflicting From implementation (see https://github.com/dtolnay/anyhow/issues/25#issuecomment-544140480).
Examples in the ecosystem include:
- Box<dyn Error>, the OG opaque error type
- anyhow::Error, a more ergonomic version of Box<dyn Error>
- eyre::Report, a fancier version of anyhow::Error
and many more.
It is fairly common to upcast all errors to one of these types at the boundary of your application, and then use that opaque error type for all your error handling (and reporting) needs.
When that's the case, you need to carefully study the documentation of the error report type you're using: does it expose a source method, even if it doesn't implement the Error trait? Does its Debug implementation already expose the entire source chain (e.g. anyhow)?
By adapting your strategy to the specific type you're working with you can avoid emitting duplicate information, or missing out on important details.
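As a minimal sketch using anyhow (the load_config function and the file path are hypothetical): its Debug representation already renders the whole source chain, so logging {:?} once captures the context added at each layer.
use anyhow::Context;

fn load_config() -> anyhow::Result<String> {
    std::fs::read_to_string("config.toml").context("failed to load the configuration")
}

fn main() {
    if let Err(e) = load_config() {
        // Prints the top-level context followed by a "Caused by:" section
        // with the underlying I/O error.
        println!("{e:?}");
    }
}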
Exercise
The exercise for this section is located in 02_failures/02_error_reporters
Panic handling
There are two types of failures in Rust:
- recoverable errors, using the Result type and the Error trait
- panics, (almost) unrecoverable failures that unwind the stack
It's time to talk about the latter!
Panics
You should trigger a panic when:
- you encounter a situation that you can't recover from
- it would be unsafe to continue
- it's not worth the effort to perform any kind of more sophisticated error handling.
A few examples:
- you're missing a resource that you absolutely need to do your job
- an internal invariant of your application has been violated
- you're in testing code and you don't want to continue if a specific condition is not met
Panics and telemetry
Regardless of the source, you should build a telemetry pipeline that knows how to handle panics.
Whenever a panic is triggered, Rust will start unwinding the stack and will eventually invoke the panic hook for your application.
The default hook will print the panic message to stderr and then terminate the process. This
may be good enough if you're running your application in a terminal, but it's not going to be
very useful if you're running in a production environment: you want the panic information
to be captured by the same telemetry pipeline that you're using for everything else.
We need a custom panic hook!
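A minimal sketch of such a hook, routing the panic information through tracing (the field name is an arbitrary choice):
fn install_panic_hook() {
    std::panic::set_hook(Box::new(|panic_info| {
        // The panic information implements `Display`: it includes the
        // panic message and the source location.
        tracing::error!(panic = %panic_info, "a panic occurred");
    }));
}
You'd call a function like this once at startup, right after installing your subscriber.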
Exercise
The exercise for this section is located in 02_failures/03_panics
Metrics
Early detection is a key capability if you're operating a highly available service: you need to quickly spot if something is wrong in order for your first-response team to engage.
How does it work, in practice?
SLAs, SLOs and SLIs
It's a common industry practice to define a set of service level indicators (SLIs) for your service. They capture a set of measurable dimensions that highly correlate with your users having a good experience. For a payment platform, for example, you might monitor the success rate of payments, the settlement time (i.e. how long it takes for the counterpart to receive the funds), notification latency (i.e. how long it takes for the user to be notified of the outcome), etc.
Once you've identified a good set of SLIs, you need to set targets. If those targets are part of a contract with your users, they are usually called service level agreement (SLAs). If they are instead "self-imposed", they're called service level objectives (SLOs). It's common to have both in a mature business, with each SLO being stricter than its SLA counterpart.
When those targets are breached, an alert is triggered and one or more operators have to engage, investigate and/or try to bring the system back to a working state.
Measuring
Let's stress it again: your SLIs need to be measurable. But how does that actually work?
There are two common strategies:
- You "metricise" your structured logs. E.g. you compute your API error rate by counting the number
of log records where
span.kind
is set toserver
and status code is not in the 2xx/3xx range. - You pre-compute the relevant metrics in your application and then export the result to a dedicated system. E.g. every time you finish handling a request, you bump a counter based on the returned status code.
The first strategy is more flexible (e.g. you can define new alerts without having to touch the application, assuming your structured logs are rich enough), but it requires you to ship more data to your telemetry system where it will be processed further to output the number you're looking for. Depending on your scale, this may be either too slow, too wasteful or too expensive.
The opposite applies to pre-aggregated metrics: they are cheaper to ship and store, but they are rather inflexible.
My recommendation is to follow a hybrid strategy:
- Collect structured logs (with a smart sampling strategy) to ask arbitrary questions about your systems and dive deep.
- Pre-compute all SLIs and other relevant quantities that you know you need to monitor.
It's also worth pointing out that some quantities do not neatly map to logs and are instead well-suited to be collected as metrics with a time-based frequency. E.g. the balance of your company's operating account.
Next
This part of the workshop will focus on the metrics crate, our tool of choice for collecting and exporting pre-aggregated metrics in our Rust applications.
Get ready!
Exercise
The exercise for this section is located in 03_metrics/00_intro
The metrics crate
How does the metrics crate work?
log used the facade pattern, tracing used the facade pattern... You guessed it, metrics uses the facade pattern as well!
It exposes a set of macros to create and manipulate metrics.
The metric data is then forwarded to an implementor of the metrics::Recorder trait, which takes care of processing it.
It should all feel pretty familiar by now!
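A minimal sketch of the instrumentation side (the metric name is arbitrary): whichever Recorder has been installed decides what happens to the increment.
use metrics::counter;

fn handle_request() {
    // Create-or-retrieve the counter, then bump it by one.
    counter!("http_requests_total").increment(1);
}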
Exercise
The exercise for this section is located in 03_metrics/01_facade
Describe
Let's break down what happened in the last exercise.
You created a counter: a monotonically increasing value.
Once it has been created, its value can only be incremented.
You can create a counter using the counter! macro.
You can then call increment on the Counter returned by the macro to increment its value.
Counter exposes another method, absolute, which is a bit special: it's primarily designed to initialize a counter sequence, specifying a non-zero initial value.
Beware: recorders will enforce the monotonicity property for counters, so you can't rely on absolute to artificially decrement an existing counter.
When using counter!, the Recorder implementation is expected to:
- Create a new counter with the name you specified, if one doesn't exist;
- Retrieve the counter, if one exists.
UPSERT behaviour, for the SQL-minded among us.
Metadata
It can be useful to add metadata to your counters (and metrics in general).
You can rely on the describe_* macros offered by the metrics crate: they let you add a unit (e.g. capture that the counter value represents bytes or seconds) and a description (which recorders can then choose to expose when metrics are exported/analyzed).
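A minimal sketch (names and values are arbitrary):
use metrics::{counter, describe_counter, Unit};

fn register_metrics() {
    // Attach a unit and a human-readable description to the series.
    describe_counter!(
        "payload_bytes_total",
        Unit::Bytes,
        "Total number of bytes received in request payloads"
    );
    counter!("payload_bytes_total").increment(512);
}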
Exercise
The exercise for this section is located in 03_metrics/02_describe
Labels
Not all metric samples are equal. You start tracking the number of requests received by your API. But knowing that number on its own is not enough: your API offers multiple endpoints, where is that traffic actually going?
You could solve the issue by using a separate metric for each endpoint. It would work! But it'd make it more difficult to manipulate those metrics in your tool of choice: it's much easier to look for a standard "request_count_total" than a bespoke "request_subscribe_count_total". You might also want to visualize them all in a single dashboard, using a separate colour for each endpoint: having to add each endpoint manually would be tedious.
The metric world came up with a solution: when you want to capture a certain dimension, but be able to break it down further, you add labels.
Labels are a set of key-value pairs that you specify when interacting with a metric.
Under the hood, it behaves exactly as if you had separate metric series: recorders will track each combination of (metric name, [(label key, label value)]) as its own unique metric.
But we retain the semantic understanding that they are all "subsets" of an overall measurement.
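A minimal sketch of the syntax (label names and values are arbitrary):
use metrics::counter;

fn track_request(endpoint: &'static str, method: &'static str) {
    // Same metric name, broken down by endpoint and HTTP method.
    counter!("request_count_total", "endpoint" => endpoint, "method" => method).increment(1);
}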
Cardinality
Be extremely careful with labels!
You should only rely on labels for tracking dimensions that have a well-known and bounded
cardinality (the set of possible values)—e.g. you don't want to add "user_id" as a label on
your metrics!
Each unique label value creates its own metric series, significantly increasing the amount of
memory and work necessary to keep track of those metrics in your application.
The number of metric series you are effectively creating/storing/exporting scales as the product of the cardinality of each label:
# metrics series = { # unique values for label 1 } x .. x { # unique values for label N }
That grows fast. If you don't keep an eye on label cardinality you're likely to blow up whatever tool you are exporting metric data to. If it doesn't fail over, you can expect to be punished at the end of the next billing cycle.
You've been warned!
Exercise
The exercise for this section is located in 03_metrics/03_labels
Prometheus
Honeycomb is my go-to recommendation when working with structured logs. What about metrics?
The most common option, by far, is Prometheus. It's a timeseries database that has seen incredible adoption and it's now deployed in production environments at all levels of scale. It is often used in combination with:
- Grafana, a dashboard tool to explore and visualize metric data
- Alertmanager, a system to trigger alerts based on metric queries (e.g. alert me when the % of failed requests exceeds 5% for more than 15 minutes)
Explaining the ins and outs of these systems is outside the scope of this workshop—it'd have to be a whole workshop on its own!
But we can try to set it up once, to get a feel for what they look like.
Push vs Pull
When it comes to metrics, there are two mechanisms for exporting data: push-based or pull-based.
In the push-based model it's the application's job to send metric data to your storage system of choice. This can be done, for example, over the network or by appending the data to a file. The application is also in control of the exporting schedule.
In the pull-based model, instead, the application is passive: it exposes an endpoint for retrieving the current value of all recorded metrics. It is the job of an external system to call (or "scrape") that endpoint to collect the data, usually on a schedule (15s is a common choice for Prometheus).
Prometheus
The recommended Prometheus configuration is pull-based. You can also choose to push metrics though, using their push gateway.
In both cases, you can rely on the metrics-exporter-prometheus crate.
It provides a recorder implementation that can be configured to either listen to incoming
HTTP requests or push data on a schedule.
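A minimal sketch of the pull-based setup (the listener address and port are arbitrary choices; double-check the builder methods against the crate version you depend on):
use metrics_exporter_prometheus::PrometheusBuilder;

fn main() {
    // Install a global recorder that also serves a scrape endpoint
    // for Prometheus to pull from.
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9000))
        .install()
        .expect("failed to install the Prometheus recorder/exporter");
}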
What about OpenTelemetry?
The OpenTelemetry specification covers all kinds of telemetry data, including metrics. Nonetheless, the metric story (and the associated tooling) is relatively recent: you might want to do some due diligence before relying on it.
Exercise
The exercise for this section is located in 03_metrics/04_prometheus
Gauges
We have seen the end-to-end journey of a metric series. It's time to take a step back and expand our toolset beyond counters.
We'll start with gauges. Counters are expected to be positive and monotonically increasing: that's not the case for gauges. Gauges are designed to represent the current value of some property of interest—the number of concurrent requests being handled by your API, the number of idle connections in your database connection pool or the available disk space on your server. The value can increase or decrease over time, therefore counters are ill-suited for these kinds of measurements.
The value of a gauge is expected to be a float, rather than an integer. Apart from that, everything else we learned about counters applies to gauges as well (description, units, labels).
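A minimal sketch, tracking in-flight requests (the metric name is arbitrary):
use metrics::gauge;

fn on_request_started() {
    gauge!("http_requests_in_flight").increment(1.0);
}

fn on_request_finished() {
    gauge!("http_requests_in_flight").decrement(1.0);
}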
Exercise
The exercise for this section is located in 03_metrics/05_gauges
Histograms
Counters, gauges... just one more metric type to go: histograms!
Histograms are designed to capture (some) statistical properties of a set of observed values. E.g. you might want to know the 99th percentile of your API latency in order to fire an alert if it exceeds a certain threshold.
The precise representation of histograms (both in-memory and in the exported data) depends
on the recorder you will be using.
In Prometheus, for example, you have to specify a set of bins and the recorder will count
how many values fall into each bin.
Since the metrics crate is backend-agnostic, it doesn't expose APIs to configure backend-specific behaviour. You need to check your recorder of choice to perform those customizations (e.g. the PrometheusBuilder exposes a few methods to customise the set of default bins as well as per-metric bins).
The observed value for a histogram is expected to be a float, rather than an integer.
metrics automatically converts Durations into floats, since tracking how long something takes is the bread and butter of histograms!
Everything we learned about counters and gauges applies to histograms as well (description, units, labels).
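A minimal sketch, recording how long an operation took (the metric name is arbitrary):
use std::time::Instant;
use metrics::histogram;

fn timed_operation() {
    let started_at = Instant::now();
    // ...the work being measured...
    histogram!("operation_duration_seconds").record(started_at.elapsed());
}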
Exercise
The exercise for this section is located in 03_metrics/06_histogram
Combining everything together
We've covered a lot of ground together: structured logging, failure handling, metrics. I've tried to break each topic down into small bites, empowering you to build up your knowledge incrementally.
It's time to put everything together!
Pick your web framework of choice—I recommend either actix-web or axum.
You have to:
- Configure a tracing subscriber that exports data to both Honeycomb and stdout, in JSON format
- Configure a suitable panic hook
- Configure a metrics recorder that exposes metric data at /metrics, using a different port than your API endpoints
- Add one or more middleware to:
  - Create a top-level INFO span for each incoming request
  - Track the number of concurrent requests using a gauge
  - Track request duration using a histogram
  - Track the number of handled requests
All metrics should include success/failure as a label.
Bonus points if you:
- devise a mechanism for your request handlers to get a handle to the root span, in order to populate span fields with context from your domain layer
- allow changing the set of filtered/included tracing spans at runtime
I don't have a suite of tests for you here, but please call me in when you're done—I want to see what you come up with!
If you're looking for inspiration, you can check out these solutions:
Exercise
The exercise for this section is located in 03_metrics/07_outro