Testing Distributed Systems for Linearizability

Distributed systems are challenging to implement correctly because they must handle concurrency and failure. Networks can delay, duplicate, reorder, and drop packets, and machines can fail at any time. Even when designs are proven correct on paper, it is difficult to avoid subtle bugs in implementations.

Unless we want to use formal methods¹, we have to test systems if we want assurance that implementations are correct. Testing distributed systems is challenging, too. Concurrency and nondeterminism make it difficult to catch bugs in tests, especially when the most subtle bugs surface only under scenarios that are uncommon in regular operation, such as simultaneous machine failure or extreme network delays.

Correctness

Before we can discuss testing distributed systems for correctness, we need to define what we mean by “correct”. Even for seemingly simple systems, specifying exactly how the system is supposed to behave is an involved process².

Consider a simple key-value store, similar to etcd, that maps strings to strings and supports two operations: Put(key, value) and Get(key). First, we consider how it behaves in the sequential case.

Sequential Specifications

We probably have a good intuitive understanding of how a key-value store is supposed to behave under sequential operation: Get operations must reflect the result of applying all previous Put operations. For example, if we run Put("x", "y"), then a subsequent Get("x") should return "y". If the operation instead returned, say, "z", that would be incorrect.

To be more precise than an English-language description, we can write a specification for our key-value store as executable code:

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key, "")

The code is short, but it nails down all the important details: the start state, how the internal state is modified as a result of operations, and what values are returned as a result of calls on the key-value store. The spec pins down some details, such as what happens when Get() is called on a nonexistent key, but in general, it lines up with our intuitive definition of a key-value store.

Linearizability

Next, we consider how our key-value store can behave under concurrent operation. Note that the sequential specification does not tell us what happens under concurrent operation. For example, the sequential spec doesn’t say how our key-value store is allowed to behave in this scenario:

[Figure: a concurrent history where Get("x") overlaps Put("x", "y") and Put("x", "z")]

It’s not immediately obvious what value the Get("x") operation should be allowed to return. Intuitively, we might say that because the Get("x") is concurrent with the Put("x", "y") and Put("x", "z"), it can return either value or even "". If we had a situation where another client executed a Get("x") much later, we might say that the operation must return "z", because that was the value written by the last write, and the last write operation was not concurrent with any other writes.

We formally specify correctness for concurrent operations based on a sequential specification using a consistency model known as linearizability. In a linearizable system, every operation appears to execute atomically and instantaneously at some point between the invocation and response. There are other consistency models besides linearizability, but many distributed systems provide linearizable behavior: linearizability is a strong consistency model, so it’s relatively easy to build other systems on top of linearizable systems.

Consider an example history with invocations and return values of operations on a key-value store:

[Figure: an example history of operations on a key-value store]

This history is linearizable. We can show this by explicitly finding linearization points for all operations (drawn in blue below). The induced sequential history, Put("x", "0"), Get("x") -> "0", Put("x", "1"), Get("x") -> "1", is a correct history with respect to the sequential specification.

[Figure: the history above, with linearization points drawn in blue]
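We can mechanize this check: replaying the induced sequential history against the KVStore spec (repeated here so the sketch is self-contained) confirms that every operation returns the right value.

```python
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key, "")

# the induced sequential history from the linearization above:
# (operation, key, argument, expected return value)
history = [("put", "x", "0", None), ("get", "x", None, "0"),
           ("put", "x", "1", None), ("get", "x", None, "1")]

store = KVStore()
for op, key, arg, expected in history:
    if op == "put":
        store.put(key, arg)
    else:
        assert store.get(key) == expected  # every Get sees the right value
```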

In contrast, this history is not linearizable:

[Figure: a non-linearizable history involving four clients]

There is no linearization of this history with respect to the sequential specification: there is no way to assign linearization points to operations in this history. We could start assigning linearization points to the operations from clients 1, 2, and 3, but then there would be no way to assign a linearization point for client 4: it would be observing a stale value. Similarly, we could start assigning linearization points to the operations from clients 1, 2, and 4, but then the linearization point of client 2’s operation would be after the start of client 4’s operation, and then we wouldn’t be able to assign a linearization point for client 3: it could legally only read a value of "" or "0".

Testing

With a solid definition of correctness, we can think about how to test distributed systems. The general approach is to test for correct operation while randomly injecting faults such as machine failures and network partitions. We could even simulate the entire network so it’s possible to do things like cause extremely long network delays. Because tests are randomized, we would want to run them a bunch of times to gain assurance that a system implementation is correct.

Ad-hoc testing

How do we actually test for correct operation? The simplest software can be tested with input-output cases like assert(expected_output == f(input)). We could use a similar approach with distributed systems. For example, with our key-value store, we could have the following test, where multiple clients execute operations on the key-value store in parallel:

import random
import threading

def client(client_id):
    # each client uses its own key, so the expected value is predictable
    for i in range(1000):
        value = str(random.random())
        kvstore.put(str(client_id), value)
        assert kvstore.get(str(client_id)) == value

# kvstore is a client for the key-value store under test
threads = [threading.Thread(target=client, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

It is certainly the case that if the above test fails, then the key-value store is not linearizable. However, this test is not that thorough: there are non-linearizable key-value stores that would always pass this test.

Linearizability

A better test would be to have parallel clients run completely random operations: e.g. repeatedly calling kvstore.put(rand(), rand()) and kvstore.get(rand()), perhaps limited to a small set of keys to increase contention. But in this case, how would we determine what is “correct” operation? With the simpler test, we had each client operating on a separate key, so we could always predict exactly what the output had to be.

When clients are operating concurrently on the same set of keys, things get more complicated: we can’t predict what the output of every operation has to be because there isn’t only one right answer. So we have to take an alternative approach: we can test for correctness by recording an entire history of operations on the system and then checking if the history is linearizable with respect to the sequential specification.
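As a sketch of what recording a history might look like, the test below logs a timestamped call/return event for every operation while clients run random operations on a small key space. (KVClient here is an illustrative stand-in for a client of the real distributed store.)

```python
import random
import threading
import time

class KVClient:
    """Stand-in for a client of the real distributed key-value store."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key, "")

history = []
history_lock = threading.Lock()

def record(client_id, kind, op, key, value):
    with history_lock:
        history.append((time.monotonic(), client_id, kind, op, key, value))

def client(client_id, store, num_ops=100):
    for _ in range(num_ops):
        key = str(random.randrange(3))  # small key space to increase contention
        if random.random() < 0.5:
            value = str(random.randrange(10))
            record(client_id, "call", "put", key, value)
            store.put(key, value)
            record(client_id, "return", "put", key, value)
        else:
            record(client_id, "call", "get", key, None)
            value = store.get(key)
            record(client_id, "return", "get", key, value)

store = KVClient()
threads = [threading.Thread(target=client, args=(i, store)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# history now holds interleaved call/return events to feed to a checker
```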

Linearizability Checking

A linearizability checker takes as input a sequential specification and a concurrent history, and it runs a decision procedure to check whether the history is linearizable with respect to the spec.

NP-Completeness

Unfortunately, linearizability checking is NP-complete. The proof is actually quite simple: we can show that linearizability checking is in NP, and we can show that an NP-hard problem can be reduced to linearizability checking. Clearly, linearizability checking is in NP: given a linearization, i.e. the linearization points of all operations, we can check in polynomial time if it is a valid linearization with respect to the sequential spec.

To show that linearizability checking is NP-hard, we can reduce the subset sum problem to linearizability checking. Recall that in the subset sum problem, we are given a set of non-negative integers S = {s_1, s_2, …, s_n} and a target value t, and we have to determine whether there exists a subset of S that sums to t. We can reduce this problem to linearizability checking as follows. Consider the sequential spec:

class Adder:
    def __init__(self):
        self._total = 0

    def add(self, value):
        self._total += value

    def get(self):
        return self._total

And consider this history:

[Figure: a history with concurrent Add(s_1), …, Add(s_n) operations, all overlapping a Get() that returns t]

This history is linearizable if and only if the answer to the subset sum problem is “yes”. If the history is linearizable, then we can take all the operations Add(s_i) that have linearization points before that of the Get() operation, and those correspond to elements in a subset whose sum is t. Conversely, if the set does have a subset that sums to t, then we can construct a linearization by having the operations Add(s_i) corresponding to the elements in the subset take place before the Get() operation and having the rest of the operations take place after the Get() operation.

Implementation

Even though linearizability checking is NP-complete, in practice, it can work pretty well on small histories. Implementations of linearizability checkers take an executable specification along with a history, and they run a search procedure to try to construct a linearization, using tricks to constrain the size of the search space.
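To make the decision procedure concrete, here is a minimal brute-force checker (a naive sketch, nothing like an optimized implementation): an operation may be linearized first only if no other operation returned before it was invoked; we try each such candidate, apply it to the sequential spec, and recurse.

```python
def kv_apply(state, op):
    """Sequential KV spec: state is a dict, op is ("put"/"get", key, value)."""
    kind, key, value = op
    if kind == "put":
        new_state = dict(state)
        new_state[key] = value
        return True, new_state
    return state.get(key, "") == value, state

def linearizable(history, apply_op, init_state):
    """history: list of (start, end, op). Exponential-time search."""
    def search(remaining, state):
        if not remaining:
            return True
        first_return = min(end for _, end, _ in remaining)
        for i, (start, end, op) in enumerate(remaining):
            if start <= first_return:  # nothing finished before this started
                ok, new_state = apply_op(state, op)
                if ok and search(remaining[:i] + remaining[i + 1:], new_state):
                    return True
        return False
    return search(history, init_state)

good = [(0, 1, ("put", "x", "0")), (2, 3, ("get", "x", "0")),
        (4, 5, ("put", "x", "1")), (6, 7, ("get", "x", "1"))]
bad = [(0, 1, ("put", "x", "0")), (2, 3, ("put", "x", "1")),
       (4, 5, ("get", "x", "0"))]  # a stale read
# linearizable(good, kv_apply, {}) is True; linearizable(bad, kv_apply, {}) is False
```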

There are existing linearizability checkers like Knossos, which is used in the Jepsen test system. Unfortunately, when trying to test an implementation of a distributed key-value store that I had written, I couldn’t get Knossos to check my histories. It seemed to work okay on histories with a couple concurrent clients, with about a hundred history events in total, but in my tests, I had tens of clients generating histories of thousands of events.

To be able to test my key-value store, I wrote Porcupine, a fast linearizability checker implemented in Go. Porcupine checks if histories are linearizable with respect to executable specifications written in Go. Empirically, Porcupine is thousands of times faster than Knossos. I was able to use it to test my key-value store because it is capable of checking histories of thousands of events in a couple seconds.

Effectiveness

Testing linearizable distributed systems using fault injection along with linearizability checking is an effective approach.

To compare ad-hoc testing with linearizability checking using Porcupine, I tried testing my distributed key-value store using the two approaches. I tried introducing different kinds of design bugs into the implementation of the key-value store, such as modifications that would result in stale reads, and I checked to see which tests failed. The ad-hoc tests caught some of the most egregious bugs, but the tests were incapable of catching the more subtle bugs. In contrast, I couldn’t introduce a single correctness bug that the linearizability test couldn’t catch.

  1. Formal methods can provide strong guarantees about the correctness of distributed systems. For example, the UW PLSE research group has recently verified an implementation of the Raft consensus protocol using the Coq proof assistant. Unfortunately, verification requires specialized knowledge, and verifying realistic systems involves huge effort. Perhaps one day systems used in the real world will be proven correct, but for now, production systems are tested but not verified. 

  2. Ideally, all production systems would have formal specifications. Some systems that are being used in the real world today do have formal specs: for example, Raft has a formal spec written in TLA+. But unfortunately, the majority of real-world systems do not have formal specs. 

μWWVB: A Tiny WWVB Station

μWWVB is a watch stand that automatically sets the time on atomic wristwatches where regular WWVB signal isn’t available. The system acquires the correct time via GPS and sets radio-controlled clocks by emulating the amplitude-modulated WWVB time signal.

Watch stand with watch

Background

Atomic Clocks

Most so-called atomic clocks aren’t true atomic clocks; rather, they are radio-controlled clocks that are synchronized to true atomic clocks. Radio clocks maintain time by using an internal quartz crystal oscillator and periodically synchronizing with an atomic clock radio signal. Quartz clocks have a fractional inaccuracy of about 6 × 10⁻⁶, which means that they can gain or lose about 15 seconds every month. Official NIST US time is kept by an ensemble of cesium fountain atomic clocks — their newest clock, NIST-F2, has a fractional inaccuracy of less than 1 × 10⁻¹⁶, meaning that the clock would neither gain nor lose one second in about 300 million years.

Most radio-controlled clocks in the United States are synchronized to the WWVB radio station, which continuously broadcasts official NIST US time. WWVB broadcasts from Fort Collins, Colorado, using a two-transmitter system with an effective radiated power of 70 kW. Theoretically, during good atmospheric conditions, the signal should cover the continental United States. Unfortunately, I can’t get my wristwatch to receive the 60 kHz amplitude-modulated time signal in my dorm room in Cambridge, Massachusetts.

Getting Accurate Time

Taking into account frequency uncertainty, WWVB can provide time with an accuracy of about 100 microseconds. In the absence of WWVB, there are other sources that can provide reasonably accurate time. The Network Time Protocol (NTP), which operates over the Internet, can provide time with an accuracy of about 1 millisecond. GPS can theoretically provide time with an accuracy of tens of nanoseconds. I decided to use GPS, mostly because I didn’t want to make my WWVB emulator dependent on an Internet connection.

Legality

Building a WWVB emulator involves transmitting on 60 kHz. In general, it’s not legal to broadcast on arbitrary frequencies at an arbitrary transmit power, because transmissions cause interference. Many parts of the radio spectrum are already in use, as allocated by the Federal Communications Commission (FCC).

Luckily, the FCC grants exemptions for certain unlicensed transmissions, as specified by 47 CFR 15. This is explained in some detail in “Understanding the FCC Regulations for Low-Power Non-Licensed Transmitters”.

Transmitters in the 60 kHz band are allowed, and the emission limit at that frequency is given in 47 CFR 15.209. As long as the field strength is under 2400/f µV/m (with f in kHz — that is, 40 µV/m at 60 kHz) as measured at 300 meters, it’s fine. In my use case, I have the transmitter within a couple inches of the receiver in my wristwatch, so I don’t need to transmit at a high power.

Electronics

Board

I designed and fabricated a tiny custom board designed to interface with a GPS and an antenna:

Circuit board

The board is powered by a $1 ATtiny44A microcontroller. I used a 20 MHz external crystal oscillator for the microcontroller so I’d have a more accurate clock than I would with the internal RC oscillator. The board has a Mini-USB connector for power, an AVR ISP header for programming the microcontroller, and a JST-SH 6 pin connector for the GPS. I included pin headers for the antenna, making sure to connect them to a port that works with fast PWM. I also included 3 LEDs as status indicators — a red LED for power, a green LED to indicate a GPS lock, and a blue LED to show the unmodulated WWVB signal.

I designed the board using the EAGLE PCB design software and milled the board from a single-sided FR-1 circuit board blank on an Othermill v2.

Once the board was finished, I used solder paste and a hot air gun to solder my components. Hand soldering surface-mount components is pretty painful, but using solder paste, the entire soldering process took only ten minutes.

GPS

For my GPS module, I used a USGlobalSat EM-506, a high-sensitivity GPS powered by the SiRFstarIV chipset.

Antenna

The 60 kHz WWVB signal has a very long wavelength: λ = c/f = (3 × 10⁸ m/s) / (60 × 10³ Hz), so the wavelength is approximately 5 kilometers. It’s challenging to design good antennas for such long wavelengths — a quarter-wavelength antenna would be about 1250 meters long! WWVB uses a sophisticated antenna setup that’s automatically tuned using a computer to achieve an efficiency of about 70%. Luckily, for my use case, I didn’t need to worry about designing a really efficient antenna and doing careful impedance matching — I was transmitting over such a small distance that efficiency didn’t matter too much.

I didn’t want to build my own antenna, so I gutted a radio clock and repurposed its ferrite core loopstick antenna. Thanks to antenna reciprocity, which says that the receive and transmit properties of an antenna are identical, I knew that this should work.

Clock disassembly

Software

I wrote software to periodically get accurate time via GPS and continuously rebroadcast the time following the WWVB protocol. The software is written in plain C and doesn’t depend on any external libraries. I used the CrossPack development environment on macOS for compiling my code and flashing my microcontroller.

Getting the software to work just right took a good amount of effort. To make it easier, I initially designed each component separately, and still, I ended up spending a lot of time debugging:

Debugging using an oscilloscope

NMEA GPS Interface

According to the datasheet, the EM-506 has a UART interface and supports both the SiRF Binary protocol and the NMEA protocol. NMEA 0183 is a standardized ASCII-based protocol, so I opted to use that over SiRF Binary.

After implementing software UART on the ATtiny44A, getting time data from the GPS was as simple as sending over a command to query for the ZDA (date and time) NMEA message:

$PSRF103,08,01,00,01*2D

In response, I’d get back a message with the current date and time (in UTC). For example, for 26 December 2016, 18:00:00, I’d get the following NMEA message¹:

$GPZDA,180000,26,12,2016,,*43
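A sketch of parsing such a ZDA sentence (field layout taken from the message above; checksum validation and error handling omitted — the firmware’s actual C parser differs):

```python
def parse_zda(sentence):
    """Parse $GPZDA,hhmmss,dd,mm,yyyy,,*CC into (year, month, day, h, m, s).

    float() on the time field also tolerates the hhmmss.sss variant that
    some receivers emit.
    """
    body = sentence[1:sentence.index("*")]  # strip leading $ and *checksum
    fields = body.split(",")
    hms = fields[1]
    day, month, year = int(fields[2]), int(fields[3]), int(fields[4])
    hours = int(hms[0:2])
    minutes = int(hms[2:4])
    seconds = int(float(hms[4:]))
    return year, month, day, hours, minutes, seconds
```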

Date and Time Calculations

It was easy to parse the ZDA information to get the current date and time. However, the WWVB protocol required some extra date/time information not directly available in the ZDA data, so I had to write some date/time conversion utilities.

Leap year calculation was simple, and calculating the day of year was also straightforward.
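For illustration, those two calculations might look something like this (a sketch, not the firmware’s C code):

```python
def is_leap_year(year):
    # divisible by 4, except century years must be divisible by 400
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def day_of_year(day, month, year):
    """1-based day of year, e.g. Jan 1 -> 1, Dec 31 -> 365 or 366."""
    days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    if is_leap_year(year):
        days_in_month[1] = 29
    return sum(days_in_month[:month - 1]) + day
```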

Calculating whether daylight savings time was in effect took a little bit more effort. In the process of implementing it, I learned of a neat way to calculate the day of the week given the month, day, and year:

int day_of_week(long day, long month, long year) {
    // via https://en.wikipedia.org/wiki/Julian_day
    long a = (14 - month) / 12;
    long y = year + 4800 - a;
    long m = month + 12 * a - 3;
    long jdn = day + (153 * m + 2) / 5 + 365 * y +
        (y / 4) - (y / 100) + (y / 400) - 32045;

    return (jdn + 1) % 7;
}

int is_daylight_savings_time(int day, int month, int year) {
    // according to NIST
    // begins at 2:00 a.m. on the second Sunday of March
    // ends at 2:00 a.m. on the first Sunday of November

    if (month <= 2 || 12 <= month) return 0;
    if (4 <= month && month <= 10) return 1;

    // only march and november left
    int dow = day_of_week(day, month, year);
    if (month == 3) {
        return (day - dow > 7);
    } else {
        // month == 11
        return (day - dow <= 0);
    }
}

WWVB-format Time Signal

WWVB uses amplitude modulation of a 60 kHz carrier to transmit data at a rate of 1 bit per second, sending a full frame every minute. Every second, WWVB transmits a marker, a zero bit, or a one bit. A marker is sent by reducing the power of the carrier for 0.8 seconds and then restoring the power of the carrier for the remaining 0.2 seconds. A zero is sent by reducing the power of the carrier for 0.2 seconds, and a one is sent by reducing power for 0.5 seconds.
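In code form, the per-second symbol timings described above are (a trivial sketch; milliseconds of reduced-power carrier, then milliseconds of full-power carrier):

```python
def wwvb_symbol_timing(symbol):
    """Milliseconds of (reduced power, full power) for each WWVB symbol."""
    return {
        "zero":   (200, 800),
        "one":    (500, 500),
        "marker": (800, 200),
    }[symbol]
```

Each symbol occupies exactly one second.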

Here is the format of the WWVB time code, as documented by NIST:

WWVB time code format

I made use of the hardware PWM built into the ATtiny44A to generate and modulate the 60 kHz carrier for emulating WWVB. Working out exactly how to configure the microcontroller required careful reading of the section in the datasheet on fast PWM.

I used the following code to set up the 16-bit timer/counter:

// set system clock prescaler to /1
CLKPR = (1 << CLKPCE);
CLKPR = (0 << CLKPS3) | (0 << CLKPS2) | (0 << CLKPS1) | (0 << CLKPS0);

// initialize non-inverting fast PWM on OC1B (PA5)
// count from BOTTOM to ICR1 (mode 14), using /1 prescaler
TCCR1A = (1 << COM1B1) | (0 << COM1B0) | (1 << WGM11) | (0 << WGM10);
TCCR1B = (1 << WGM13) | (1 << WGM12) | (0 << CS12) | (0 << CS11) | (1 << CS10);
// fast PWM:
// f = f_clk / (N * (1 + TOP)), where N is the prescaler divider
// we have f_clk = 20 MHz
// for f = 60 kHz, we want N * (1 + TOP) = 333.3
// we're using a prescaler of 1, so we want ICR1 = TOP = 332
// this gives an f = 60.06 kHz
// we can use OCR1B to set duty cycle (a fraction of ICR1)
ICR1 = 332;
OCR1B = 0; // by default, have a low output
DDRA |= (1 << PA5); // set PA5 to an output port
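The timer arithmetic in the comments can be double-checked with a few lines (a quick sketch of the fast PWM frequency formula):

```python
f_clk = 20_000_000   # 20 MHz crystal
prescaler = 1        # N
target = 60_000      # 60 kHz carrier

# f = f_clk / (N * (1 + TOP))  =>  TOP = round(f_clk / (N * f)) - 1
top = round(f_clk / (prescaler * target)) - 1
actual = f_clk / (prescaler * (1 + top))
# top == 332, and actual is about 60.06 kHz, matching the comments above
```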

After this setup, I could modulate the carrier by setting OCR1B. Setting OCR1B = 166 made a 50% duty cycle 60 kHz square wave, and setting OCR1B = 0 resulted in a reduction in power of the carrier. With this setup, for example, I could generate a zero bit as follows:

void gen_zero() {
    OCR1B = 0;
    _delay_ms(200);
    OCR1B = 166;
    _delay_ms(800);
}

After I had this set up, I implemented functionality to broadcast WWVB-format data by repeatedly broadcasting the appropriate data for the current second and then incrementing the current time.

Physical Design

I wanted to keep the physical design simple, so I opted to go with a press-fit design consisting of a 3D-printed top and bottom with laser-cut sides to form a box.

3D Parts

I used OpenSCAD, a programming-based 3D modeler, to design my 3D parts:

OpenSCAD

I used a Stratasys uPrint SE to print my parts out of ABS thermoplastic:

Print

2D Parts

I used Adobe Illustrator to design my 2D parts, and I cut them out of acrylic on a 75-watt Universal PLS 6.75:

Cut

Assembly

Because it was a press-fit design, assembly took about two minutes! Here’s the final product:

Watch stand

Evaluation

μWWVB works really well for me, consistently synchronizing my watch in about three minutes. My watch is set up to automatically receive the WWVB signal every night, so by leaving my watch on its stand overnight, it’s automatically synchronized every day!

In the current implementation, μWWVB syncs my watch to an accuracy of about 500 milliseconds of UTC. By putting a little more effort into making the timing in my software more precise, doing things like using the milliseconds value from the ZDA NMEA message instead of ignoring it, I could probably get the error down to about 100 milliseconds. There would still be some error, mostly due to the ZDA NMEA message being sent over UART, which is an asynchronous connection.

If I wanted the system to be much more accurate, I’d probably need to switch to a pulse per second (1PPS) GPS. A 1PPS GPS outputs a signal that has a sharp edge every second precisely at the start of the second — such a signal could be used to clock the WWVB time code such that each bit starts precisely at the start of the second.

But for now, for my purposes, μWWVB works really well!

  1. Actually, for my device, I was getting data in the format $GPZDA,hhmmss.sss,dd,mm,yyyy,,*CC, contrary to the SiRF NMEA reference manual. So for 26 December 2016, 18:00:00.000, I’d get the NMEA message $GPZDA,180000.000,26,12,2016,,*5D

Algorithms in the Real World: Committee Assignment

I recently had another chance to use a fancy algorithm to solve a real-world problem. These opportunities don’t come up all that often, but when they do, it’s pretty exciting!

Every year, MIT HKN has a bunch of eligible students who need to be matched to committees. As part of the assignment process, the officers decide how many spots are available on each committee, and then we have every eligible student rank the committees. In the past, officers matched people manually, looking at the data and trying to give each person one of their 1st or 2nd choices. Unfortunately, this is time-consuming and unlikely to result in an optimal assignment if we’re trying to maximize overall happiness.

Example

To see how assignments can be suboptimal, we can go through an example.

Committee   Capacity
Tutoring    2
Outreach    1
Social      2

Person      Tutoring   Outreach   Social
Alice       1          3          2
Bob         1          3          2
Charlie     1          2          3
Dave        2          1          3
Eve         2          1          3

In the above table, 1 means first choice, and 3 means third choice. We could imagine assigning people by going down the list and assigning each person to their highest-ranked committee that has available slots. This would result in assigning Alice to Tutoring (1st choice), Bob to Tutoring (1st choice), Charlie to Outreach (2nd choice), Dave to Social (3rd choice), and Eve to Social (3rd choice).

Not only is this algorithm unfair to the people at the bottom of the list, but it’s also suboptimal. If we want to minimize the sum of the rankings for committees we placed people on, we could go with Alice–Social, Bob–Social, Charlie–Tutoring, Dave–Outreach, Eve–Tutoring. This results in a “cost” of 8, which is optimal, rather than the cost of 10 we got with the first assignment, which was constructed greedily.

In the actual data set, there were 8 committees and 57 eligible members, so it wouldn’t have been feasible to manually find an optimal assignment.

Problem Statement

More formally: we have n committees and m people. Each committee c_i has a capacity of k_i people, where we’re guaranteed that k_1 + k_2 + ⋯ + k_n = m. We know people’s preferences, which for any given person p_j is a permutation σ_j of the committees, mapping the highest-ranked committee to 1 and the lowest-ranked committee to n. Our goal is to find an assignment f of people to committees that solves the following optimization problem:

    minimize    Σ_j σ_j(f(p_j))
    subject to  Σ_j δ(f(p_j), c_i) = k_i   for every committee c_i

Above, δ is the Kronecker delta, which is 1 when its arguments are equal and 0 otherwise. Essentially, we want to find the assignment that minimizes cost while satisfying our constraints of having a specific number of people on each committee.

Algorithm

It turns out that the committee assignment problem can be transformed into an instance of the assignment problem (which can also be thought of as finding a minimum weight matching in a bipartite graph).

In the assignment problem, you have n people, n jobs, and an n × n cost matrix C, where C_ij is the cost of having person i do job j, and you want to find the minimum-cost assignment such that each person is assigned to a unique job.

The committee assignment problem can be trivially transformed into an instance of the assignment problem, simply by making k_i copies of each committee c_i, counting each copy as a separate “job”, and constructing an appropriate cost matrix from the σ_j s.

Luckily, the assignment problem is well-studied in computer science, and there’s a known solution — the Hungarian algorithm solves this problem. There’s even an implementation of the algorithm built into SciPy. This makes solving the committee assignment problem really easy — it only requires a little bit of code to implement the transformation described above.
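To illustrate, here is the transformation applied to the example from earlier. A tiny brute-force solver stands in for the Hungarian algorithm so the sketch is self-contained; for real inputs you’d call SciPy’s linear_sum_assignment on the cost matrix instead.

```python
from itertools import permutations

capacities = {"Tutoring": 2, "Outreach": 1, "Social": 2}
prefs = {
    "Alice":   {"Tutoring": 1, "Outreach": 3, "Social": 2},
    "Bob":     {"Tutoring": 1, "Outreach": 3, "Social": 2},
    "Charlie": {"Tutoring": 1, "Outreach": 2, "Social": 3},
    "Dave":    {"Tutoring": 2, "Outreach": 1, "Social": 3},
    "Eve":     {"Tutoring": 2, "Outreach": 1, "Social": 3},
}

# expand each committee into capacity-many "jobs"
jobs = [c for c, k in capacities.items() for _ in range(k)]
people = list(prefs)
cost = [[prefs[p][j] for j in jobs] for p in people]

# brute force over all assignments (fine at this size; use the Hungarian
# algorithm, e.g. scipy.optimize.linear_sum_assignment, for real inputs)
best = min(permutations(range(len(jobs))),
           key=lambda perm: sum(cost[i][perm[i]] for i in range(len(people))))
total = sum(cost[i][best[i]] for i in range(len(people)))
# total == 8, the optimal cost computed by hand in the example above
```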

Results

Using this algorithmic approach to solve the committee assignment problem worked really well for us! We made some slight modifications to the process — one of the committees hand-picked their members, and then we used the algorithm on the remaining 52 members and 7 committees. When running the program, we decided to minimize the sum of the squares of the costs rather than minimizing just the sum.

With 52 people and 7 committees, our implementation ran in less than half a second and gave us an assignment with 37 people getting their first choice, 13 getting their second choice, and 2 getting their third choice.