Microservices in Rust with actix

Marco Amann
Digital Frontiers — Das Blog
18 min read · Jun 30, 2020


Should you write your next microservice in Rust? In this post you will learn how to develop such an application with Rust, and you will be able to answer the question of whether you should.

This is the first part of a series of articles on how you can create, deploy, monitor and scale a microservice-based application. You can find the source code on GitHub. Stay tuned for the next parts to be published; they will be linked here.

Why even consider Rust?

To be honest, I think it would be faster to create the application we will develop using Java instead of Rust, at least if we used some of the goodies from the Spring ecosystem. I say this despite Rust being my favorite language, so why even bother trying to create it in Rust? Microservice applications bring, despite their intriguing benefits, a whole range of challenges, some of which I dare say are a perfect fit for Rust.

Due to the distributed nature of microservices, developers need to take care of latency, network partitioning, unavailable services and the fact that your local data store might not have all of the required information available. This calls for highly concurrent applications that can cope with possibly failing data transfers, without blocking up the whole application.

If you roll out your microservice app, you don’t just need to deploy one monolith but rather tens of individual applications, possibly written by different teams. The self-contained nature of Rust binaries might ease this process.

But don’t take my word for it; follow along for the rest of the post to see whether Rust might be worth considering for you.

What we will build

Since I want to evaluate the feasibility of efficiently building microservices with Rust, the following scenario is designed to cover many different aspects typical of such applications.

In this first part of the series we build a service that provides a REST API, consumes another REST API and keeps a bit of state in a database.

This scenario will accompany us for the next few blog posts:
We build software for an imaginary, quickly growing company that allows users to manage collections of images. The service will download and store the images for them, in addition to providing them with the means to print images and send the prints by mail.

Let’s have a closer look at what will be done with the application:

Our use case
  • The user passes a URL of an image to the service
  • The image is downloaded and saved alongside a thumbnail in an S3 bucket (this is the focus of the next post)
  • When the download is done, the user can preview the thumbnail and download the image
  • The user can print a bunch of images and will receive the prints by mail

Dividing the system into Services

To be able to build the system in multiple iterations, I chose the following setup:

  • A print service for accepting print jobs and sending them to the printer. Let’s just assume the printer itself understands HTTP requests; nobody wants to dive into the horrors of real printers.
  • A management service for the user to CRUD the images. We assume this one already exists, developed with Spring Boot, and we need to integrate with it.
  • The download service that will be discussed in the next post.
  • The S3 bucket the user will directly access. We neglect security concerns for now.

In this post we will focus on the print service.

We need a few more connections though: the print service needs the image it is supposed to print, after all. For a supplied user and image id, we request the file location from the management service. This means we need to communicate between services on behalf of the user.

We want to reply to the user’s print job request with an OK only if the user is allowed to print the file (we ask the management service) and the job has been persisted. This means we need to ‘hold’ the user’s request until the management service has confirmed that the user is allowed to do as they please.

The print service

Since we want to expose the print HTTP endpoint, it is advisable to use a web framework: in the last post I used rocket, so this time actix-web will be used. Actix does not try to hide the asynchronous nature of its foundation as much as rocket does and hence feels a bit more powerful to me. We will see some of its good and bad sides over the course of this article.

Service interfaces

The scope of our service is pretty small; I will nonetheless only discuss the interesting endpoints, which are:

GET  /print/jobs/{:id}   For the user to see the job status
POST /print/jobs         For the user to create a new print job

The GET request queries a single job by its id from the database. The POST request asks the management service for a bit of information, stores the job in the database and triggers the actual printing process.

Consumed interfaces

A key aspect of microservices is that they cooperate with other services. To cover this functionality in this evaluation, I designed the print service to consume three interfaces: the API of the printer itself, which we have no control over; the management API, written by a different team in a different language; and the S3 API, used to asynchronously download the image.

The following figure visualizes the interactions:

No swag for us

It would be cool to have the boilerplate code generated directly from its specification, for example by using swagger. Unfortunately, the swagger-codegen for Rust only supports the rather low-level HTTP library ‘hyper’ instead of the more abstract frameworks. ‘Paperclip’ aimed to build an actix-web code generator for OpenAPI specs, but there seems to be no active development. This leaves us with writing the actix code ourselves.

It is nonetheless important to ensure that the understanding of the syntax and semantics of our messages is consistent between the communicating services. This will be done by employing contract testing in one of the following posts.

Callbacks from the printer

To handle callbacks from the printer once it has finished printing, correlation IDs are used. They are stored in a database to keep track of active jobs across requests. This requires our Rust code to interact with the database. I chose to use tokio_postgres, a sub-project of rust-postgres, allowing for clean integration with asynchronous programming using tokio (which we will be using anyway because of actix). But since we want to keep the connection load on the database to a minimum, a connection pool is a sensible choice. The most prominent project, r2d2, unfortunately does not yet support the tokio variants of the rust-postgres crate. Therefore I opted for bb8, describing itself as “A generic connection pool, designed for asynchronous tokio-based connections”. I really like their choice of crate names.
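
Setting up such a pool takes only a few lines. Here is a minimal sketch using the bb8-postgres adapter; the connection parameters and pool size are placeholders of my choosing:

use bb8::Pool;
use bb8_postgres::PostgresConnectionManager;
use tokio_postgres::NoTls;

// A sketch of the pool setup; connection string and size are placeholders.
async fn make_pool() -> Result<Pool<PostgresConnectionManager<NoTls>>, tokio_postgres::Error> {
    let manager = PostgresConnectionManager::new_from_stringlike(
        "host=localhost user=print dbname=printjobs",
        NoTls,
    )?;
    // Connections are checked out on demand and returned to the pool when dropped.
    Pool::builder().max_size(15).build(manager).await
}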

Why use asynchronous programming at all here? We can assume that the database is not located on the same server as our app, so we need to expect a few milliseconds of latency. Since we want to use the async features for downloading images anyway, it is quite a natural choice for the database as well. In the following section I will give a really brief overview of the needed features.

Async-Await

When uploading 500 images to several printers in parallel (we assume the print hardware is slow), we do not want 500 threads running. To avoid this, Rust provides a pretty new and exciting feature: async-await. Since this feature is huge, the Rust team is writing a whole book about it, similar to the Rustonomicon. Despite being a bit tricky, this feature is what makes Rust such a great fit for parallel execution, so let me quickly introduce the concept.

One can mark functions and blocks as async. This means that part of the code can pause its execution and resume once there is progress to be made. I won’t attempt to explain Rust futures here; read the primer in their announcement blog post if you want a quick overview. In a function marked as async, you can hand off processing (for that code it feels like ‘pausing’ processing) by calling .await.

Let me show you an example:

// fetch the page and wait, without blocking the thread, for the body
let body = reqwest::get(url).await?.text().await?;
println!("Got {}", body);

This instructs the compiler to generate code that does the following: start the ‘get’ request and go do something else until the download has finished, then resume execution here, printing out the result. This allows us to write our code as if it would block the thread, but the compiler re-wires everything so that we can run thousands of async tasks simultaneously.

For this magic to happen, the application needs a runtime to drive progress on the individual tasks. The current de-facto standard is tokio. If you plan to do anything network- or filesystem-related with your Rust application, definitely have a look at it. Don’t get scared away by the currently incomplete guides: the team is restructuring everything, and the tokio-0.1.* versions had great guides, so I expect them to come back updated and even more polished.
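
If you have not used tokio yet, a minimal sketch shows the idea: a single runtime drives many tasks, with no dedicated thread per task (the task count and body are illustrative):

// A minimal sketch: 500 concurrent tasks on one runtime.
#[tokio::main]
async fn main() {
    let handles: Vec<_> = (0..500)
        .map(|i| tokio::spawn(async move {
            // each task could await network or file I/O here
            // without occupying a thread while waiting
            println!("task {} done", i);
        }))
        .collect();
    for handle in handles {
        handle.await.unwrap();
    }
}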

The interesting code samples that show how this feature greatly increases readability are located further down, where I go over the implementation details for each relevant service part.

The Actix-web implementation

Actix-web is based on actix, an actor framework, and organizes most of its functionality around (async) handler functions that create responses when acting on requests. The handlers are executed in a tokio runtime, allowing them to utilize multiple CPU cores. However, since there is only a finite number of CPU cores, the handler code must not block, e.g. by waiting for a database query.

Since the handlers already run in the runtime, waiting (it feels like blocking) for async tasks to finish is as easy as calling .await on them.

Similar to rocket, actix allows custom types to be extracted from requests, as well as used as the response body. These types can be annotated with basic validation rules to ensure only valid requests are processed.

#[derive(Validate, ...)]
struct PrintJob {
    //...
    #[validate(length(min = 1, max = 20))]
    name: String,
}

// in the handler:
job.validate().map_err(ErrorBadRequest)?;
// processing code

This code validates an incoming print job in the handler, and a validation error is mapped to something of our choice, in this case a bad request. If there was no error, the code simply continues with the next instruction; if the validation failed, the ? returns the error type from the function. This is a bit of a scary part of Rust: a simple ? is a conditional return statement that can easily be overlooked. However, compared with the implicit returns caused by exceptions in other languages, this is still more explicit.


Providing the API with actix

Of the above-mentioned API endpoints, the GET request asking for a specific job_id might be the simplest one, so let me quickly walk you through its implementation so you can get a feeling for actix.

GET  /print/jobs/{:id}

The method signature of our handler is as follows:

async fn get_job_by_id(id: web::Path<i32>) -> impl Responder { ...

There is a lot of syntax here, so let me explain:

  • async means that this is an async function that can be run directly in a runtime. We further promise that this function will not block.
  • web::Path<i32> is the type of our id parameter, since it is encoded in the request path. This parameter is defined in the app-routing directive: ...route("/print/jobs/{id}",...).
  • impl Responder defines our response type as anything that implements the Responder trait, e.g. HttpResponse, but also its Result type. This means we can directly return Err(HttpResponse::BadRequest).

If you have read the last post where I used the rocket framework, this should be quite familiar.
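
For completeness, here is a sketch of how this handler could be wired into an app; the bind address is a placeholder:

use actix_web::{web, App, HttpServer};

#[actix_rt::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        // the {id} segment is what ends up in our web::Path parameter
        App::new().route("/print/jobs/{id}", web::get().to(get_job_by_id))
    })
    .bind("127.0.0.1:8080")?
    .run()
    .await
}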

Since we want to load the print job from the database, we need a way to query it. Using the aforementioned bb8, we get an asynchronous connection pool we can execute our queries on. Accessing the pool is done similarly to rocket, by adding a data parameter:

async fn get_job_by_id(
    id: web::Path<i32>,
    storage: web::Data<Storage>,
) -> impl Responder {

That storage parameter is handled by the middleware, and we get a cloned reference to the pool for each and every request.

Now to the method body. In contrast to the web service developed last time, I chose to make the bodies as clean as possible:

storage.select_print_job_by_id(id.into_inner()).await

That’s all.
Let’s examine why this works: the whole expression needs to implement Responder (see the method declaration) to be handled by actix. As we can see from the await, the select_print_job_by_id function is an async function. Before we look at that function, a note on into_inner: it simply converts the path parameter into the contained value (i32). Note, however, that this cannot fail, since the Path struct was already validated before the method was called in the first place.

Back to the select_print_job_by_id function:

pub async fn select_print_job_by_id(&self, id: i32)
    -> Result<Option<PrintJob>, PrintJobError> {...}

Rust amazes me every time with how much syntax you can cram into one function…
So what’s the magic behind this: why does this implement Responder?

Responder is implemented for Option and Result types whose contents in turn implement Responder. That means only our PrintJob needs to implement Responder and PrintJobError needs to implement ResponseError. Since PrintJobError already implements std::error::Error (which in turn requires the Display and Debug traits), the “implementation” of ResponseError is simply stating that it does:

impl ResponseError for PrintJobError { }

Yes, an empty implementation.
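
The concrete error type lives in the repository; a minimal sketch of what satisfies these trait bounds could look like this:

// Debug is derived, Display is implemented by hand; together they
// satisfy std::error::Error, on which ResponseError builds.
#[derive(Debug)]
pub struct PrintJobError(String);

impl std::fmt::Display for PrintJobError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "print job error: {}", self.0)
    }
}

impl std::error::Error for PrintJobError {}

// The empty impl from above: the default methods answer with error 500.
impl actix_web::ResponseError for PrintJobError {}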

The Responder implementation is a bit more complex (~10 lines) due to its asynchronous nature, so I’ll spare you that, but you can have a look at the source code if you are interested.

So what does the return type of the select_print_job_by_id function mean for the caller? On the top level it is a Result, possibly containing a PrintJobError, which maps to error 500 with a description of the error. If it is not an error, it is either None or our PrintJob. If the database did not find anything (returning None), this naturally translates to error 404, which actix actually does for us. If a PrintJob is returned, we give it to the client encoded as JSON.
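
For illustration, here is a possible body for that function, assuming the bb8 pool from earlier, suitable From conversions into PrintJobError and a hypothetical PrintJob::from_row helper; the table name is a placeholder:

pub async fn select_print_job_by_id(&self, id: i32)
    -> Result<Option<PrintJob>, PrintJobError>
{
    let conn = self.pool.get().await.map_err(PrintJobError::from)?;
    // query_opt yields Ok(None) if no row matched, which actix turns into a 404
    let row = conn
        .query_opt("SELECT * FROM print_jobs WHERE id = $1", &[&id])
        .await
        .map_err(PrintJobError::from)?;
    Ok(row.map(|r| PrintJob::from_row(&r)))
}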

I think this is one of the greatest features of Rust: you are forced to think about error handling. This makes Rust such a great fit for microservices: you won’t accidentally crash another calling service by returning null, or nothing at all, because you encountered a NullPointerException yourself.


Consuming an API with reqwest

When the user wants to print something, they supply a user id and a file id. The print service asks the management service whether the requesting user is allowed to print the file and where to find it. There are several ways this request could go wrong: the file may have been deleted in the meantime, or the user was naughty and tried to print a file they are not allowed to. Perhaps the management service is overloaded or not reachable. Either way, we do not want to answer the user until we have processed the reply from the management service. But we promised actix not to block in the handler, so what to do? Use async, of course. Here’s how:

POST /print/jobs         For the user to create a new print job

First, we need to fire off a request with the reqwest crate:

let response = reqwest::Client::new()
    .get("http://127.0.0.1:8090/lookup")
    .json(&job)
    .send()
    .await
    .map_err(|e| ...)?;

This gives us a future that completes once the request completes. This does not necessarily require the body to have arrived yet; the headers are sufficient. We further map the error here to something we understand, depending on whether the request timed out or the target is not available.

Depending on the HTTP status code, we can now decide how to proceed: if we get OK, we continue. If we get 404, we let the user know about it. If anything else is received, an InternalServerError is returned. That way, the user sees relevant information (the file is not there, did you mess something up?) but we don’t publish sensitive information about which backends are available.
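
Sketched as a match on the status code; the error helpers exist in actix_web::error, and the messages are illustrative:

use actix_web::error::{ErrorInternalServerError, ErrorNotFound};

match response.status() {
    // fall through and parse the body below
    reqwest::StatusCode::OK => {}
    // tell the user the file is unknown
    reqwest::StatusCode::NOT_FOUND => return Err(ErrorNotFound("file not found")),
    // anything else stays opaque to the outside
    _ => return Err(ErrorInternalServerError("lookup failed")),
}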

So in case of a 200, we can wait for the body to arrive and be parsed:

let result = response.json::<ManagementLookupResponse>()
    .await
    .map_err(|e| ...)?;

The next steps of adding the job to the database and returning to the user are omitted here.

I want to point out that the await syntax allows us to write our code the same way as if it would block, but in fact it runs concurrently, in this case even in parallel. Easy access to strong concurrency primitives is extremely important when several programs communicate, possibly across the globe, with considerable latencies.
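
As a small illustration of how cheap this is to express: two independent requests can run concurrently with the join! macro from the futures crate (the URLs are placeholders):

use futures::join;

// both requests are in flight at the same time; the total wait is
// the slower of the two instead of their sum
let (job_info, printer_state) = join!(
    reqwest::get("http://127.0.0.1:8090/lookup"),
    reqwest::get("http://127.0.0.1:8091/status"),
);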

Slow backends

If you worry about slow backends, you might be right to worry. Although actix handles backpressure quite well, and a million waiting tokio tasks do not really pose a problem, two million open TCP connections do. We therefore need to prevent work from piling up and stalling our service. So let me use this challenge to show how you can incorporate low-level system primitives into high-level web frameworks.

We want to ensure that at any moment there are no more than a specified number of requests waiting. This should hold true for each instance of the service individually, but it requires synchronization across many tasks. If an arriving request would breach the set threshold, we do not process it but return HTTP status 429. To keep track of active requests we could use an AtomicUsize, but there are some nasty details we don’t want to get wrong; just have a look at the Ordering you need to get right. Instead, we use a semaphore.
Before you close this browser tab in fear of semaphore permits held for all eternity, have a look at the implementation provided by tokio. We do not need to remember to call release! In fact, there is no such function. This means we won’t block the service indefinitely because we forgot to call release on held locks. But how does this work then?
Before we look into the Rust way of solving this, let’s take a quick excursion to Java: Java provides a finally block (see the docs) that gets executed in most cases when the code in the corresponding try or catch blocks ends. That way, it is quite safe to put the release() call on the semaphore there. So back to Rust: given there is no try (mind the reference), wouldn’t it be nice to have something similar? Well, Rust has the Drop trait, something like a destructor, that is automatically called when a variable goes out of scope (meaning we can’t forget to call it). So our Semaphore returns a SemaphorePermit if it has capacity, and that permit will be ‘released’ once it goes out of scope, even if we return from a method with ? and even if the current thread panics.

So that way, we can try to acquire a permit from the semaphore, return status 429 if this fails, or continue processing if we get a permit. This simple check further increases the resilience of our code, since we don’t accumulate work and won’t hammer a backend service with a million concurrent requests.
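
In a handler, this check could look roughly like the following sketch; the handler name and responses are illustrative:

use std::sync::Arc;
use actix_web::{web, HttpResponse};
use tokio::sync::Semaphore;

async fn create_print_job(slots: web::Data<Arc<Semaphore>>) -> HttpResponse {
    let _permit = match slots.try_acquire() {
        Ok(permit) => permit,
        // no free slot: fail fast instead of queueing up more work
        Err(_) => return HttpResponse::TooManyRequests().finish(),
    };
    // ...talk to the management service here...
    // the permit is dropped, and the slot freed, on every exit path
    HttpResponse::Ok().finish()
}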

Testing the service

It obviously makes sense to test our service before we put it in production.

Before you get too excited about contract testing: For this topic to be covered, you need to wait for one of the future posts, concerned with the integration of Rust and Java.

Unit Tests

Rust allows you to write unit tests directly in the files that contain your functions. That way it is easy to test private functions without any hacks.
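
The usual pattern is a cfg(test) module at the bottom of the file:

#[cfg(test)]
mod tests {
    use super::*; // grants the tests access to private items

    #[test]
    fn adds_up() {
        assert_eq!(2 + 2, 4);
    }
}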

Assume we have a (dummy) health-probe with the following code:

async fn health_probe(_: HttpRequest) -> HttpResponse {
    HttpResponse::Ok().finish()
}

A test for this method could be the following:

#[actix_rt::test]
async fn test_status_ok() {
    let req = test::TestRequest::default().to_http_request();
    let resp = super::health_probe(req).await;
    assert_eq!(resp.status(), http::StatusCode::OK);
}

In this case we don’t actually start our HTTP server but test the function directly, without routing or managed data. Note the function annotation: this isn’t your default #[test] annotation, since we are testing an async function (with an async function).

That way, we can pass the desired parameters to the function, for example a mocked database backend or any sort of nasty request that would otherwise be hard to generate.

Integration tests

The slightly more complex integration tests of our service can test the “real” application: we can have an app with either the real routing and data attached or mocked ones. We could for example replace the real storage from above with a mock implementation.

#[actix_rt::test]
async fn test_xyz() {
    let storage = ...

    let mut app = test::init_service(
        App::new()
            .data(storage.clone())
            .route(...)
    ).await;

    let req = test::TestRequest::get()
        .uri("/print/jobs/10")
        .to_request();
    let resp = test::call_service(&mut app, req).await;
    assert!(resp.status().is_success());
}

Although this is pretty close to the real running application, it is worth noting that tokio tests, and hence actix tests, use the basic_scheduler for their test runtime, which schedules all tasks in a single thread. This might lead to different behavior compared to the default, thread-pool-based scheduler.

Metrics

If many instances of our microservice are run, we need to make sure they are healthy, and we need to detect broken ones in order to replace them. In this section we will have a look at how you can collect metrics from the app, aggregate them in a central place and process them further there.

There is a crate called actix-web-prom that allows prometheus metrics to be directly exposed on the /metrics endpoint for the scraper to collect. The crate automatically generates stats for your endpoints if you add it as middleware to your actix app. If you want custom metrics, like in our case the available slots for backend requests, the crate provides a registry to put the custom histograms and gauges into.
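
Wiring this up could look like the following sketch; the namespace, metric names and bind address are my choices, not the crate’s:

use actix_web::{App, HttpServer};
use actix_web_prom::PrometheusMetrics;
use prometheus::{Histogram, HistogramOpts};

#[actix_rt::main]
async fn main() -> std::io::Result<()> {
    let prometheus = PrometheusMetrics::new("print_service", Some("/metrics"), None);

    // a custom histogram, registered alongside the per-endpoint defaults;
    // it would be handed to the handlers, e.g. via app data
    let slots = Histogram::with_opts(HistogramOpts::new(
        "request_slots",
        "available backend request slots",
    ))
    .unwrap();
    prometheus.registry.register(Box::new(slots.clone())).unwrap();

    HttpServer::new(move || App::new().wrap(prometheus.clone()))
        .bind("127.0.0.1:8080")?
        .run()
        .await
}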

Once this is done, it is simply a matter of calling observe() when values change:

histogram.observe(semaphore.available_permits() as f64);

Of course measuring things isn’t free (although it is cheap): the maximum request rate at the health endpoint, which returns static “OK” responses, decreased slightly from 696K/s to 654K/s when metrics recording was activated. This is surprisingly cheap compared to the default logging: “only” 414K requests per second could be served when logging with env_logger was added, even with the output turned off. Please note that these are extreme conditions, and you probably should not run your application under such high loads.

The following grafana dashboard shows request rates, 95th percentiles of request duration and 95th percentiles of available request slots during a load test.

Grafana dashboard visualizing the prometheus metrics during a stress-test

Possible improvements

Although the service works in its current form, there are a lot of things not covered yet:

  • Readiness and liveness probes: for the moment we don’t need complicated readiness or liveness probes, but they will become necessary when we plan to scale out our service. This could be done either by providing them directly in the application itself or by obtaining this information from the prometheus metrics.
  • Reusing connections: it would greatly reduce the load on backend services if we reused the connections to them. However, this requires a somewhat more involved implementation, since we need to share the connection pool between requests, as we did with the bb8 database pool.
  • Authentication and authorization were completely neglected in this post. OAuth alone can easily fill a whole blog post.
  • Dynamic configuration: once the service runs, it should be able to adapt to a changing environment, even without restarting the service instances. This is especially important if we want to adopt dynamic service discovery in the deployment, where services can register themselves to be available for use.
  • More in-depth testing of the public APIs is needed. In one of the next posts we will come back to this and discuss how we can ensure different services understand each other.
  • Deployment: although it is easy to run Rust binaries just as they are, it might be advisable to also create a super lightweight docker container to be able to run the application on a Kubernetes cluster. Since these binaries have few to no dependencies, multi-stage builds are easily achievable, producing lightweight containers.

Lessons learned

It is possible to create a microservice in Rust and integrate it with an existing environment, across language boundaries. However, development with Rust was slower than it would have been in Java in my case, but arguably more fun: the type system of Rust forces developers to think about error handling in advance, saving you from frustrating hours of debugging problems that only occur at runtime.

The async-await APIs finally allow Rust to live up to its “Fearless Concurrency” promise, making it easy to write concurrent or parallel code with strong guarantees about the absence of data races and lost updates, and mostly even deadlocks.

While in Spring it might merely be bad practice to share state in ways that are not safe, the Rust compiler will simply not allow it. Such restrictions might increase overall code quality when working in teams with different backgrounds, allowing us to write safe and secure software at the cost of slightly slower development and a steeper learning curve.

Next Posts

In the next post we will build the download service using a different technology: Apache Kafka. Stay tuned to read about the different challenges and possibilities that come with loosely coupling the services.

Thanks for reading! If you have any questions, suggestions or critique regarding the topic, feel free to respond or contact me. You might be interested in the other posts published in the Digital Frontiers blog, announced on our Twitter account.
