The Domain of Failure

Published in Site Reliability Engineering Leadership

7 min read · Aug 3, 2023

As a software engineer, I was always interested in resilience and how we could build better software, and I think that was a big driver of my attraction to the SRE and operational resilience world. I started building actor-based systems in 2008 using Scala, and I learned a lot about the way Erlang was designed to let engineers design and build resilient systems. Today, I want to share these concepts, along with other principles such as Domain-Driven Design, to show how software resilience isn’t just an SRE concern, but can be layered across all aspects of software architecture, design, engineering, and delivery to production.

Application and service resilience happens at multiple levels, not just the non-functional space where SREs typically operate. Understanding where resilience can be better managed within the software that we engineer should be an area of focus for senior SREs embedded in software development teams. As SREs, we tend to think about the non-functional issues first, such as hosts going down or network traffic management problems. But one of the SLIs we often track for applications and services is the Error Rate, and that error rate often hides insidious functional issues that may or may not show up as HTTP 500 server errors, such as not finding the customer record when you go to update it. By building a domain of failure into our applications, we can handle these functional errors as part of the software itself, and improve the design of the service.

This concept has been a part of software engineering for decades now, but it hasn’t become prevalent in the way people build software beyond exception handling. To learn more, let’s look at a pioneer in the space, the Erlang programming language.

What is Erlang?

Erlang is a programming language created by Joe Armstrong, Robert Virding, and others while working on telephony systems at Ericsson in the 1980s. It brought the concept of actors, as defined by Carl Hewitt, into a mainstream language. There are formal definitions of what actors can and cannot do, but actors have three main characteristics:

They can only communicate with each other by message passing
The behavior of the receiving actor can change based on a message received
They can be physically located on any host (location transparency)

Elastically Scaling Up/Down

The message-passing semantics had a couple of interesting effects on the way you programmed. As a sender, I didn’t have to know where the other actor was located, whether it was on the same host or elsewhere. But it also meant that I had non-determinism: I couldn’t be sure that the message would be received by the other actor. An actor can also have multiple message-receiving blocks, and can transition between them based on state machine-like rules. So, if an actor has a receive block that waits until it receives a START message, it can then transition to listening for a DO_SOMETHING message once START has been received, and ignore all START messages it receives after that.
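The START / DO_SOMETHING transition described above can be sketched in a few lines of Python. This is only an illustration, not Erlang: the class name, the `drain` method, and the log are hypothetical, and it simplifies by discarding messages the current behavior does not match, whereas an Erlang selective receive would leave them in the mailbox.

```python
from collections import deque


class StartThenWorkActor:
    """Toy actor: a mailbox plus a current behavior that a message can replace."""

    def __init__(self):
        self.mailbox = deque()        # messages wait here until processed
        self.behavior = self.waiting  # the currently active "receive block"
        self.log = []                 # records what the actor did, for inspection

    def send(self, message):
        # Senders only enqueue; they never call the actor's logic directly.
        self.mailbox.append(message)

    def waiting(self, message):
        # First receive block: only START is meaningful.
        if message == "START":
            self.log.append("started")
            self.behavior = self.running  # transition to the next receive block
        else:
            self.log.append(f"ignored {message}")

    def running(self, message):
        # Second receive block: START is now ignored, DO_SOMETHING is handled.
        if message == "DO_SOMETHING":
            self.log.append("did something")
        else:
            self.log.append(f"ignored {message}")

    def drain(self):
        # Process queued messages one at a time, in arrival order.
        while self.mailbox:
            self.behavior(self.mailbox.popleft())


actor = StartThenWorkActor()
for msg in ["DO_SOMETHING", "START", "START", "DO_SOMETHING"]:
    actor.send(msg)
actor.drain()
print(actor.log)
# ['ignored DO_SOMETHING', 'started', 'ignored START', 'did something']
```

The same message gets a different response depending on which behavior is active when it is processed, which is the state machine-like quality the paragraph describes.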

Since you do not know the physical location of the actor that will receive messages, you could have multiple instances of the same actor type scattered across multiple hosts. You could scale up and down the number of hosts that can receive specific types of messages, just like instances of message listeners/handlers for Kafka topics. You’d need to integrate with an autoscaling orchestration platform like Kubernetes with a KEDA-like integration, but the semantics would be the same and the system can scale up/down as needed.
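As a rough, in-process illustration of that idea, the sketch below puts a pool of identical workers behind a single logical address. `Worker`, `Router`, and `scale_to` are hypothetical names, everything runs in one Python process, and round-robin is just one possible delivery policy. The point is only that senders call `send` the same way no matter how many workers exist or where they would live.

```python
class Worker:
    """Stand-in for one actor instance; in a real system it could live on any host."""

    def __init__(self, name):
        self.name = name
        self.handled = []

    def receive(self, message):
        self.handled.append(message)


class Router:
    """Hypothetical logical address: senders talk to the router, never to a worker."""

    def __init__(self, worker_count):
        self.workers = []
        self._sent = 0
        self.scale_to(worker_count)

    def scale_to(self, worker_count):
        # Grow or shrink the pool; senders keep using the same send() call.
        if worker_count < 1:
            raise ValueError("at least one worker is required")
        while len(self.workers) < worker_count:
            self.workers.append(Worker(f"worker-{len(self.workers)}"))
        del self.workers[worker_count:]

    def send(self, message):
        # Round-robin delivery; the returned name is for inspection only.
        worker = self.workers[self._sent % len(self.workers)]
        self._sent += 1
        worker.receive(message)
        return worker.name


router = Router(worker_count=2)
first = [router.send(n) for n in range(4)]
print(first)         # ['worker-0', 'worker-1', 'worker-0', 'worker-1']
router.scale_to(3)   # scale up without changing any sender code
```

In a real deployment the decision to call something like `scale_to` would come from the orchestration platform, not from the application code.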

Supervisor Hierarchies

If you’ve ever heard of Domain-Driven Design, you might know that it is a set of approaches and artifacts for understanding what a system is supposed to do, so that you can create an application architecture that best supports those requirements. You come up with some core concepts, such as a Ubiquitous Language, so that everyone using domain-specific terms always knows what the other person means, and there are no verbal or written namespace collisions. You also define the behaviors of the system: the commands that flow into a component, and the events that occur as a result of each command being handled.

When you create an actor in Erlang, you always have a parent actor, all the way to the root of the tree. That parent actor is notified when a failure occurs in a child actor that it manages, and if the direct parent doesn’t handle the failure explicitly, the failure bubbles up to the next level of the hierarchy. And because these are actors, everything is passed as a message, so the parent actor builds a specific set of behaviors for what to do when a failure is received. Some of these failures could be non-functional, such as running out of memory and crashing, so you could automatically restart that actor. But some of these failures are due to the DOMAIN, such as an inability to increment the miles of an airline passenger after a flight because they bought the ticket with miles. For the functional domain of the application, there are many success case messages to pass, and there are also many messages that can be passed to represent domain failure. This is a much stronger and more flexible paradigm than try/catch blocks, especially in a distributed or multi-threaded environment.
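A minimal Python sketch of that bubbling behavior follows. It is not the OTP supervisor API: the `Supervisor` class, the two failure message types, and the `handles` parameter are all hypothetical, chosen only to show failures travelling as messages and each level either handling them or passing them to its parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crashed:
    """Non-functional failure: the child died unexpectedly."""
    child: str
    reason: str


@dataclass(frozen=True)
class DomainFailure:
    """Functional failure: a business rule says the work cannot be done."""
    child: str
    reason: str


class Supervisor:
    def __init__(self, name, parent=None, handles=()):
        self.name = name
        self.parent = parent
        self.handles = tuple(handles)  # failure types this level deals with
        self.actions = []              # what this level decided to do

    def notify(self, failure):
        # A failure arrives as a message; handle it here or pass it upward.
        if isinstance(failure, self.handles):
            if isinstance(failure, Crashed):
                self.actions.append(f"restart {failure.child}")
            else:
                self.actions.append(f"record '{failure.reason}' for {failure.child}")
            return self.name  # name of the level that handled the failure
        if self.parent is None:
            raise RuntimeError(f"unhandled failure: {failure}")
        return self.parent.notify(failure)


root = Supervisor("root", handles=(DomainFailure,))
accrual = Supervisor("accrual", parent=root, handles=(Crashed,))

print(accrual.notify(Crashed("worker-1", "out of memory")))                 # accrual
print(accrual.notify(DomainFailure("worker-1", "ticket bought with miles")))  # root
```

Here the direct parent restarts a crashed child itself, while the domain failure bubbles up to the root, which has an explicit behavior for it.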

BEAM Runtime

As a final note about why Erlang has such excellent resilience design, consider the runtime of the language itself. BEAM is a virtual machine that executes Erlang applications, much like the Java Virtual Machine or the Python interpreter. However, where the JVM and Python have one “heap” for shared memory, BEAM isolates memory per actor: each Erlang process has its own heap. This means that an actor that uses too many resources does not have to take down the entire BEAM VM; it can crash in isolation.

Erlang Adoption

You may be asking, if Erlang is such an amazing solution, why isn’t it in use more? It has a loyal and dedicated group of advocates and fans, but there have traditionally been a few aspects that people haven’t liked (or at least the FUD people bring up):

It’s a dynamic language, without compiler-enforced type checking
It can be slower than native executables
The syntax is considered wonky by some (though Elixir offers a Ruby-like syntax on the same runtime)
The enterprise toolchain isn’t as strong as other runtimes, such as the JVM or CLR

Erlang Solutions, headed by Francesco Cesarini, has long held a leadership position in providing consulting services around Erlang, and has been a major supporter of the community. They will rightly point out that some major use cases have been enabled by Erlang, and WhatsApp was developed almost entirely with it. Enterprise customers listed on their site include PepsiCo, Cisco, and AWS. It is a worthwhile language and approach to study if you want to better understand how application resilience can become a first-class citizen of software development.

Why Is This Relevant to SREs?

As SREs, our teams are not just responsible for monitoring and toil reduction. We also need to be part of the application architecture discussions to help identify critical dependencies and build mitigation plans, as well as help the software engineers identify potential failures that could bubble up through the system. When you think about how Erlang systems are built, the handling of those failures becomes much more explicit. Just as you think about the ways a system can fail for non-functional reasons (Linux crash, network split, database corruption, power outage, etc.), you can also work with the product owners and software engineers to define all of the functional ways the application or service can fail. You can then layer behaviors into the system to auto-remediate those issues, and build tests to validate those behaviors.

Let’s use the earlier example of an airline mile accrual service as a domain to decompose. Here are some of the ways accrual in the domain might fail:

The customer’s miles account was not found
The flight was cancelled
The passenger did not board the flight
The flight was on an airline that is not a mileage partner
The flight was purchased with miles
etc

Most of these represent business logic that would be layered into a service via conditional statements (if statements, pattern matching, etc.). We can now build behaviors into the system that make it more resilient to these domain errors, so that we respond in specific ways that create new customer experiences. For example, we could send an email or notification to the customer letting them know why their miles weren’t accrued, hopefully preventing a call to customer service later when they realize they didn’t get them. We can also flag their miles record in the database to note that the accrual didn’t happen on this date and why, so someone can understand the issue later without having to dig into a bunch of data to figure it out.
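To make that concrete, here is a small Python sketch of the accrual domain with its failures modeled as first-class values. Every name in it (`AccrualFailure`, `Flight`, `accrue`, `react`, the in-memory `balances` dict, the `outbox` and `audit_log` lists) is a hypothetical stand-in for illustration, not a real airline API, and the order of the checks is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class AccrualFailure(Enum):
    ACCOUNT_NOT_FOUND = "miles account was not found"
    FLIGHT_CANCELLED = "flight was cancelled"
    NOT_BOARDED = "passenger did not board the flight"
    NON_PARTNER_AIRLINE = "flight was on an airline that is not a mileage partner"
    PURCHASED_WITH_MILES = "flight was purchased with miles"


@dataclass(frozen=True)
class Flight:
    account_id: str
    miles: int
    cancelled: bool = False
    boarded: bool = True
    partner_airline: bool = True
    purchased_with_miles: bool = False


@dataclass(frozen=True)
class MilesAccrued:       # success event
    account_id: str
    new_balance: int


@dataclass(frozen=True)
class AccrualRejected:    # domain failure event
    account_id: str
    reason: AccrualFailure


def accrue(balances, flight):
    """Handle the accrual command and return an event instead of raising."""
    if flight.account_id not in balances:
        return AccrualRejected(flight.account_id, AccrualFailure.ACCOUNT_NOT_FOUND)
    if flight.cancelled:
        return AccrualRejected(flight.account_id, AccrualFailure.FLIGHT_CANCELLED)
    if not flight.boarded:
        return AccrualRejected(flight.account_id, AccrualFailure.NOT_BOARDED)
    if not flight.partner_airline:
        return AccrualRejected(flight.account_id, AccrualFailure.NON_PARTNER_AIRLINE)
    if flight.purchased_with_miles:
        return AccrualRejected(flight.account_id, AccrualFailure.PURCHASED_WITH_MILES)
    balances[flight.account_id] += flight.miles
    return MilesAccrued(flight.account_id, balances[flight.account_id])


def react(event, on, outbox, audit_log):
    """Mitigating behavior: notify the customer and flag the record with the reason."""
    if isinstance(event, AccrualRejected):
        outbox.append(f"Miles not accrued for {event.account_id}: {event.reason.value}")
        audit_log.append((event.account_id, on.isoformat(), event.reason.name))


balances = {"A1": 1000}
outbox, audit_log = [], []
event = accrue(balances, Flight("A1", 500, purchased_with_miles=True))
react(event, date(2023, 8, 3), outbox, audit_log)
print(outbox[0])  # Miles not accrued for A1: flight was purchased with miles
```

Because each failure is an explicit value, it can be counted, tested, and routed to a specific behavior, instead of disappearing into a generic error rate.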

What Next?

SREs embedded in software engineering teams are expected to become domain experts in the space where these teams build solutions. These SREs need to partner closely with the product leadership and software engineers to ensure that the domain being defined in Epics and User Stories also reflects the areas where failure can occur, and to define the mitigating behavior to be built into the service or application.