My best microservices experience is running a dozen services on a single machine, and I have learnt a lot about eventing along the way.
In the past year I have been part of a big team engineering software for self-service checkouts.
Full disclosure: This article by no means represents the opinion or views of the company I consult for.
These things, not specifically this brand:
Nowadays every checkout in the world has to support:
- scanning and recognizing barcodes
- weighing fruits
- verifying weights on the bagging scale
- payment with cash or card
- UI: pick items, show the basket, remove items, pay
- … and many more
One of the biggest architectural decisions of this endeavor was to use a “microservice architecture” for the whole checkout. I am not a big fan of microservices, partly because it is a very overloaded term. But here it just made sense.
Microservice is an organizational pattern first, and an architectural pattern second.
The whole problem domain of a purchase is so complex that it is only natural to split it into subdomains following the Domain-Driven Design approach, and to assign a team to each subdomain. We can think of domains such as payment, customer basket, and weight verification, not to mention some auxiliary domains.
We understand Conway’s law, so we choose an architecture that is compatible with our organizational structure.
With the domains defined and teams assigned to them, a naturally emerging microservice architecture style means one service for each domain. More specifically, one running Docker container for each domain, on the checkout itself.
Microservices on a single machine?
Seasoned software engineers will burst out: “It is overkill to have microservices on one machine!” Technically speaking, it is possible to develop only modules that we link and bundle together into a single executable JAR (because we mostly use Java). OSGi and the Java module system (Project Jigsaw) can do that. However, also technically speaking, a Docker container is just a Linux process “jailed” by the Linux kernel, so why the fuss?
The real benefit of separate services is looser coupling:
- agility gained by independent deployment of services and teams
- more flexibility on interfaces by using REST, eventing and JSON instead of in-process method calls
- better availability due to isolation: one bad apple will not spoil the whole barrel
- programming language agnostic architecture
- better isolation of domain models, aka separation of concerns
There are challenges though:
- complex integration/system testing and deployment pipeline
- higher memory and disk consumption, depending on the language runtime (especially true for the JVM)
- duplicate development effort on vertical concerns, e.g. healthchecks, tracing, authentication, even though reuse through shared libraries is encouraged
- high integration effort
Eventing on a single machine
We use two communication methods in the system: REST and asynchronous eventing.
REST is dead simple to use. It’s at the fingertips of all frontend and backend developers. A REST call is a blocking operation and easy to design with. Everyone knows how to do proper error and timeout handling for REST calls.
Therefore REST is used mostly on the critical path, when a dependency is essential to fulfill the request (e.g. the purchase). Such actions are payment or basket operations (add item to basket).
Each service also emits messages on events; for example, the Basket service emits ItemAddedToBasket. We use a ZeroMQ-like solution for peer-to-peer messaging between services, with a retry mechanism and at-least-once delivery semantics.
We like eventing because, beyond a certain number of services, it greatly reduces coupling, especially in cases when functions are not essential to the purchase.
Introducing the item weight verification scenario
A good example is weight verification using the bagging scale. Suppose weight verification is done by the WeightCheck service. This service is not essential to the purchase scenario. Customers can use the checkout without this functionality, at a higher risk of missed scans. The baseline is: no one should go home hungry because of degraded functionality.
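The WeightCheck service has two inputs: the items added to the basket, which come from the Basket service, and the weight measured by the bagging scale.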
With REST only, we have two choices:
- A) Basket service calls the WeightCheck service after each item added to the basket
- B) WeightCheck polls the Basket service and the Bagging scale frequently
In case of A), we reverse the direction of the dependency between the WeightCheck service and the Basket service. We also don’t know how long the Basket service should wait until the customer places the potato chips on the scale.
B), on the other hand, is either wasteful because it polls too frequently, or it misses a state change in the Basket service.
Neither solution brings joy.
By introducing the WeightHasChanged event from the scale “driver”, I revealed another obvious reason in favor of eventing: it is the user who controls the flow, not the software. The user pokes the checkout with different inputs on the peripherals: scanner, bagging scale, touchscreen. Therefore a WeightHasChanged event is analogous to an onMouseMove event in a web application frontend. Our only option is to react instead of drive.
Take a look at the communication styles:
3 Challenges of eventing
In this architecture, microservices communicate over HTTP/REST and WebSocket. When you run your microservices on a single machine, you could easily fall for the fallacies of distributed computing because “there is no network”.
Let’s see what can go wrong on a single machine.
Yes, there are surely more challenges, but three make for a good article.
The order of otherwise causal events can change because of the preemptive scheduling done by the OS. When the OS decides to park a thread, it doesn’t care whether the process is trying to send or receive an event.
Consider the happy case scenario, where everything is in order and causal.
Now let’s see what happens when the OS decides to stop execution of the sender thread in favor of another process.
To remedy the situation, the receiver always sorts the events into the correct order, based on their creation timestamps, before applying its logic to them.
If the process concludes that some previously sent events should not have been sent, it sends compensating events so that subscribers can correct themselves.
The algorithm should cover:
- sort events in order
- generate new events
- compare new events with previously sent events
- send compensating events for events that should not have been sent
- send new events
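The steps above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not our production code: the `InEvent` and `OutEvent` shapes, the `WeightMismatch` and `WeightMismatchResolved` event names, and the rule "the scale must show the summed weight of all scanned items" are all hypothetical. Only `ItemAddedToBasket` and `WeightHasChanged` come from the scenario in this article.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical event shapes, for illustration only.
record InEvent(long createdAt, String type, int grams) {}
record OutEvent(String type, long causedAt) {}

class RetroactiveWeightCheck {
    private final List<InEvent> log = new ArrayList<>();   // event-sourced input history
    private final List<OutEvent> sent = new ArrayList<>(); // outputs currently in effect

    /** Handles one incoming event and returns the events to publish. */
    List<OutEvent> onEvent(InEvent incoming) {
        log.add(incoming);
        log.sort(Comparator.comparingLong(InEvent::createdAt)); // 1. sort by creation time
        List<OutEvent> fresh = derive(log);                     // 2. regenerate the outputs
        List<OutEvent> toPublish = new ArrayList<>();
        for (OutEvent old : sent) {                             // 3. compare with what was sent
            if (!fresh.contains(old)) {
                toPublish.add(compensate(old));                 // 4. compensate stale outputs
            }
        }
        for (OutEvent e : fresh) {
            if (!sent.contains(e)) {
                toPublish.add(e);                               // 5. send genuinely new outputs
            }
        }
        sent.clear();
        sent.addAll(fresh);
        return toPublish;
    }

    // Assumed domain rule: the scale must show the summed weight of all scanned items.
    private static List<OutEvent> derive(List<InEvent> ordered) {
        List<OutEvent> out = new ArrayList<>();
        int expected = 0;
        for (InEvent e : ordered) {
            if (e.type().equals("ItemAddedToBasket")) {
                expected += e.grams();
            } else if (e.type().equals("WeightHasChanged") && e.grams() != expected) {
                out.add(new OutEvent("WeightMismatch", e.createdAt()));
            }
        }
        return out;
    }

    // The compensating event is domain specific; this pairing is an assumption.
    private static OutEvent compensate(OutEvent old) {
        return new OutEvent("WeightMismatchResolved", old.causedAt());
    }
}
```

If a `WeightHasChanged` created at time 2 arrives before the `ItemAddedToBasket` created at time 1, the first call publishes a `WeightMismatch` and the second call publishes the compensating `WeightMismatchResolved`. The sketch rebuilds everything from the full log on each event, which is the simplest form of event sourcing and only reasonable for short-lived transactions.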
For more on retroactive events, read Martin Fowler’s article.
This mechanism inherently requires an event sourcing pattern because we have to be able to rebuild the “truth” upon every new incoming event.
The difficulty is that the definition and meaning of a compensating event are tightly coupled to the original event and are determined by the domain model.
For example, the compensating event for an ItemAddedToBasket event would be the ItemRemovedFromBasket event. The compensating event definitions should come naturally from the domain design, but they can be difficult to get right.
This approach serves the simplicity of the clients. From the perspective of clients a compensating event is just a regular domain event.
There is no generic protocol for retroactive events. (It would be nice if there were one, like 2PC for XA transactions.) There are also a lot of nuances; for example, this approach will not work if (given the last figure) the event creation timestamp itself falls behind.
We saw out-of-order events at a rate of about 1 in 10,000 events. Even though they are sporadic, when one happens it can leave a transaction in an error state, so better be prepared.
Concurrent processing of events
The problem of concurrent processing is well known from web applications. Think of the double submission problem and the Post/Redirect/Get pattern. When two related actions that work on the same data are triggered by the user within a short time period, they cause a race condition. Back in the day we solved it by maintaining locks on the user session object for the resources (URLs) we wanted to guard.
However, in the current case the ItemAddedToBasket and WeightHasChanged events can arrive at the WeightCheck service at nearly the same time, but still in the correct order. The service can then handle the events in parallel, which could theoretically have many different outcomes depending on luck.
I recommend serializing event processing. In our case, a single-threaded executor created with Java’s Executors.newSingleThreadExecutor() solved the problem.
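A minimal sketch of that idea, assuming events are plain strings and that "processing" just means appending to a list. The class and method names are made up for illustration; only `Executors.newSingleThreadExecutor()` is the real JDK call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SerializedEventHandler {
    // One worker thread: queued events are handled one at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Touched only by the worker thread, so it needs no locking.
    private final List<String> processed = new ArrayList<>();

    // May be called from any listener thread (REST, WebSocket, message queue, ...).
    void onEvent(String event) {
        worker.execute(() -> processed.add(event));
    }

    // Stops accepting events, waits for the queue to drain, and returns what was handled.
    List<String> drain() {
        worker.shutdown();
        try {
            if (!worker.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("worker did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return List.copyOf(processed);
    }
}
```

The listener threads only enqueue work, so the domain logic never runs concurrently with itself, and events submitted from one thread are handled in the order they were submitted.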
For a server application, or any process that takes a high volume of events, I don’t recommend making the application totally single-threaded. However, I do recommend serializing the processing that belongs to the same session or transaction. This way you can keep everything in order when events come from multiple topics and your logic builds on the causality of events.
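One possible way to serialize per session without going fully single-threaded is to route each session to one of several single-threaded "lanes". This is only a sketch; `SessionLanes` and its methods are hypothetical names, and hashing the session id is just one of several reasonable routing choices.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SessionLanes {
    private final ExecutorService[] lanes;

    SessionLanes(int laneCount) {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Events of the same session always land on the same lane and so keep their order;
    // different sessions may run in parallel on different lanes.
    void submit(String sessionId, Runnable handler) {
        int lane = Math.floorMod(sessionId.hashCode(), lanes.length);
        lanes[lane].execute(handler);
    }

    // Stops all lanes and waits until every queued handler has run.
    void shutdownAndWait() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            for (ExecutorService lane : lanes) {
                if (!lane.awaitTermination(5, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("lane did not finish in time");
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is that two unrelated sessions can share a lane and wait for each other, but ordering within a session is preserved without any explicit locks.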
Don’t forget to listen
When service A calls service B via an HTTP POST and service B is not reachable, service A gets an error or a timeout. The erroneous behavior is clear, it’s logged, and we can take action.
Depending on the messaging system you use, it can be easy to miss events and act as if everything is fine. When service A is not listening to service B or to the message queue because the connection has dropped, events are silently lost. This is particularly true when you use plain WebSocket communication. It can be painful when the user is waiting for the action to be completed.
Always take care of:
- robust reconnection strategy (most well-known message queue clients take care of it)
- delivery guarantees on the sender side, so that no events are missed on a connection loss
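As an illustration of the second point, here is one possible sender-side sketch: the sender numbers its events and keeps them until they are acknowledged, so a receiver that reconnects can ask for everything it missed. The `Envelope` and `ReplayingSender` names and the sequence-number scheme are assumptions made for this example, not how our messaging layer or any particular library works, and the network transport is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical envelope: a per-sender sequence number plus the payload.
record Envelope(long seq, String payload) {}

class ReplayingSender {
    private final List<Envelope> outbox = new ArrayList<>(); // unacknowledged events
    private long nextSeq = 1;

    // Records the event before any send attempt, so a dropped connection cannot lose it.
    synchronized Envelope publish(String payload) {
        Envelope e = new Envelope(nextSeq++, payload);
        outbox.add(e);
        return e;
    }

    // The receiver confirms everything up to and including seq; those events can be forgotten.
    synchronized void acknowledge(long seq) {
        outbox.removeIf(e -> e.seq() <= seq);
    }

    // After a reconnect, the receiver states the last sequence number it saw and gets the rest.
    synchronized List<Envelope> replayAfter(long lastSeenSeq) {
        List<Envelope> missed = new ArrayList<>();
        for (Envelope e : outbox) {
            if (e.seq() > lastSeenSeq) {
                missed.add(e);
            }
        }
        return missed;
    }
}
```

Because replay can deliver an event the receiver has already processed, this gives at-least-once delivery; the receiver would have to ignore sequence numbers it has already seen.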
In this article we have just scratched the surface of eventing. Using events and asynchronous messages as a communication pattern is 3.5 times more complicated than making good old synchronous requests via REST. Complex business logic and transaction flows require a very careful implementation.
I am not suggesting that a synchronous REST-based approach would always be simpler. A business transaction that involves many microservices would have to implement compensating actions using the saga pattern, which is also cumbersome to design well.