From Hero to Zero
You’ve spent months developing features for your distributed, fault-tolerant system. Your system dynamically balances load, handles new components entering the cluster, and chugs along when parts of the cluster fail. Production deployment is just a couple of weeks away. And then, it happens. You realize the system works most of the time, instead of all of the time.
Maybe it’s a Bad Build
“How is this possible?” I asked myself while staring at the output of my failed integration tests, which were running in a clustered environment for the first time. The tests had been passing consistently in a single-instance environment, so why would they fail in a clustered one? I realized I had probably uncovered a large problem, but rather than assume the worst, I began with denial. I told my team lead, “Maybe it’s a bad build.” We all knew it probably wasn’t a bad build, but remember, this was the denial stage. I re-ran the integration tests and they passed. Whew! It was a bad build. Another two builds passed all the integration tests, and then the same set of tests failed again. Tests failing once might be a coincidence, but not twice. How did we get here?
The Cost of Waiting
Like most teams, my team has a Hudson build server running unit tests and zero tolerance for breaking the build. While the unit tests have run continuously since day one (over a year ago), my team did not have a set of automated integration tests until just a few months ago. At first, the integration tests ran against a non-clustered environment. Although this setup did not mimic a true production environment (i.e. multiple instances of each component), a minimally complete system is a valuable first step toward ensuring correctness. Many days of debugging later, the integration tests consistently passed against the single-instance environment.
Only in the last few weeks did my team have a story to continuously run integration tests against an environment that emulates production. Unsurprisingly, a distributed system behaves differently in a distributed environment. Suddenly, tests that once passed now failed. When were the breaking changes introduced? Was the logic always broken? Because so much code was written before we continuously tested against a production-like environment, identifying the source of each failure became dramatically more difficult. The lack of proper integration testing is a form of technical debt. Like all debt, it must be paid off, and the longer you wait, the more expensive it becomes.
Staying out of Debt
My recent experiences have shown me the true cost of debugging a distributed system: it is complex and extremely expensive. Below are my thoughts on how to approach integration testing without sinking deep into technical debt. These thoughts come from working on a distributed system, but they should be considered general principles of system design.
Include Integration Testing in the Cost of a Story
Pricing out integration testing makes it clear to stakeholders how long it will take to deliver a working feature. If developers don’t include the cost of integration testing a feature, the debt still exists; it’s just temporarily hidden from view. There is no way to avoid proper integration testing, so don’t fudge the numbers.
In my case, since my team lacked the tools to run integration tests repeatedly in an automated manner, integration testing was effectively brushed aside because it was practically impossible to do (simple scenarios took approximately an hour to set up and execute). If integration testing had been included in the price of each story, every story would have become exorbitantly expensive, which leads to the next item.
Automate Integration Test Execution
It must be a priority early in the development of a project to have some form of automated integration testing. Just as it is a top priority to immediately set up a continuous integration server to run unit tests, the same should be true for integration tests. If end-to-end tests are not run automatically, they won’t be run at all. And if end-to-end tests are not running, how do you know your system is working?
My team has configured Hudson to run integration tests against a non-clustered environment after each successful project build (i.e. whenever the unit tests pass). This first pass ensures that the most recent commits have not broken end-to-end functionality. Integration tests then run against a replica of the production environment each night, so at most one day’s worth of changes can be committed before an error is discovered.
Simplify Writing Integration Tests
Now that tests run automatically, tests must be written to exercise new functionality. As with most things, the more complex something is, the less likely anyone is to do it. Consider the pain of writing unit tests without a great mocking framework like Mockito. If writing integration tests is complicated, new tests will be hard to write, existing tests will be harder to maintain, and your stories will remain expensive.
Drawing on Mockito’s architecture for inspiration, my team developed an integration test framework with a straightforward DSL for defining tests (a great Scala use case!). The framework loads and creates all of its data based on a user-specified environment host, which provides the flexibility to run every integration test locally or remotely purely through configuration. Now that my team is equipped with a powerful, simple-to-use framework for end-to-end testing, integration testing is easy enough that it is part of estimating every user story.
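To make that concrete, here is a minimal sketch of what such a fluent DSL might look like. Every name below is invented for illustration; it shows the style, not the actual framework.

```scala
// Hypothetical sketch of a fluent integration-test DSL; all names are
// illustrative and do not come from the real framework.
object DslSketch {

  // The target environment is resolved from configuration, so the same
  // test can run locally or against a remote cluster.
  final case class TradingEnv(host: String)

  final case class Scenario(env: TradingEnv, steps: Vector[String] = Vector.empty) {
    def sendMarketData(pair: String, bid: BigDecimal, ask: BigDecimal): Scenario =
      copy(steps = steps :+ s"send market data $pair $bid/$ask")

    def sendTrade(pair: String, amount: Long): Scenario =
      copy(steps = steps :+ s"send trade $pair x $amount")

    def expectOrderState(state: String): Scenario =
      copy(steps = steps :+ s"assert order state == $state")

    // A real framework would drive the deployed system here; this sketch
    // just prints the plan so the example stays self-contained.
    def run(): Unit = steps.foreach(step => println(s"[${env.host}] $step"))
  }

  def main(args: Array[String]): Unit =
    Scenario(TradingEnv(sys.env.getOrElse("TEST_ENV_HOST", "localhost")))
      .sendMarketData("EUR/USD", bid = BigDecimal("1.3050"), ask = BigDecimal("1.3052"))
      .sendTrade("EUR/USD", amount = 1000000L)
      .expectOrderState("FILLED")
      .run()
}
```

Resolving the host from configuration is the key design choice: it lets the exact same test target a local environment or a remote cluster without any code changes.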
Keep Integration Test Data Independent
It is very alluring to share data among tests: if multiple tests need the same data, why bother recreating it? Resist the urge! Sure, two tests need the same data today, but what about two weeks from now? Once data are shared, tests become fragile because they depend on one another. It will not be a nightmare on day one, but once there are enough tests, it will be nearly impossible to ensure that the shared data works across all of them. Believe me, I’ve tried. As a bonus, keeping data independent makes it much easier to apply the final two items below.
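One lightweight way to keep data independent is for each test to mint its own uniquely identified records instead of reaching for a shared fixture. A sketch (the Order type and freshOrder helper are invented for illustration):

```scala
import java.util.UUID

// Sketch: every test creates its own data with a unique id, so tests
// cannot collide, even when they run concurrently. Names are illustrative.
object IndependentTestData {
  final case class Order(id: String, pair: String, amount: Long)

  // Each test calls this instead of reusing a shared fixture.
  def freshOrder(pair: String, amount: Long): Order =
    Order(id = UUID.randomUUID().toString, pair = pair, amount = amount)

  def main(args: Array[String]): Unit = {
    val a = freshOrder("EUR/USD", 1000000L) // used only by test A
    val b = freshOrder("EUR/USD", 1000000L) // used only by test B
    assert(a.id != b.id) // the two tests cannot interfere through shared data
    println(s"test A order: ${a.id}, test B order: ${b.id}")
  }
}
```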
Run Integration Tests on Multiple Environments
Inevitably, integration tests will not always pass. Running them against multiple setups makes it easier to identify the source of a problem. The system my team is building is horizontally scalable, meaning additional instances of a component can be added to handle load. With this type of system, consider executing integration tests against a minimally complete environment (i.e. one instance of each type of component) as well as environments where there are multiple instances of each component.
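Here is a minimal sketch of the idea; the Topology type and instance counts are assumptions made for illustration, and a real runner would point the test framework at each environment in turn:

```scala
// Sketch: run the same suite against a minimal topology and a clustered one.
object MultiEnvironmentRun {
  final case class Topology(name: String, instancesPerComponent: Int)

  // One instance of each component vs. several of each.
  val minimal   = Topology("minimal", instancesPerComponent = 1)
  val clustered = Topology("clustered", instancesPerComponent = 3)

  // A real implementation would configure and execute the full suite here.
  def runSuite(t: Topology): Unit =
    println(s"running suite against '${t.name}' (${t.instancesPerComponent} of each component)")

  def main(args: Array[String]): Unit =
    Seq(minimal, clustered).foreach(runSuite)
}
```

A test that fails on the clustered topology but passes on the minimal one immediately points toward coordination or load-balancing logic rather than core business rules.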
Make Integration Tests as Fast as Possible
A short feedback loop is essential to tracking down the source of errors. Two ways to attack the speed problem are to speed up the execution of each individual test and to speed up the execution of the suite as a whole.
First, attempt to replace all Thread.sleep() calls with either a condition-based wait (e.g. a CountDownLatch) or an event-driven abstraction (e.g. Futures). Not only will tests execute faster, because the full sleep duration is no longer exhausted on every run, but you will also have removed a code smell.
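A minimal sketch of the swap, assuming the system exposes some event hook the test can latch onto (the listener wiring below is simulated with a thread):

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Sketch: wait on an event instead of sleeping for a worst-case duration.
object LatchInsteadOfSleep {
  def main(args: Array[String]): Unit = {
    val processed = new CountDownLatch(1)

    // Stand-in for the system finishing work on another thread; in a real
    // test, countDown() would be invoked from an event listener or callback.
    new Thread(() => { Thread.sleep(200); processed.countDown() }).start()

    // Before: Thread.sleep(10000) followed by assertions.
    // After: the test resumes the instant the event fires, with a bounded timeout.
    assert(processed.await(5, TimeUnit.SECONDS), "timed out waiting for the order")
    println("event observed; assertions can run now")
  }
}
```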
Once you have replaced as many Thread.sleep() calls as possible, consider executing tests concurrently. If you have maintained independent data for each test, it should be trivial to parallelize test execution with an Executor (or Actors if you’re an Akka fan).
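As a sketch, each Callable below stands in for one self-contained integration test running on a plain thread pool:

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}
import scala.jdk.CollectionConverters._

// Sketch: independent tests can run concurrently on a fixed thread pool.
object ParallelSuite {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(4)

    // Each callable simulates one self-contained integration test.
    val tests = (1 to 8).map { i =>
      new Callable[String] {
        def call(): String = { Thread.sleep(100); s"test-$i passed" }
      }
    }

    // invokeAll blocks until every test finishes and preserves result order.
    pool.invokeAll(tests.asJava).asScala.foreach(result => println(result.get()))
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```

Note that this is only safe because each test owns its own data; shared fixtures would reintroduce the fragility described earlier.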
Verify Internal Side-Effects
When writing integration tests, it is often simpler to treat the system under test as a black box. But while it is important to verify that end users see correct data, it is equally important to verify internal side-effects throughout test execution.
To find side-effects that need to be verified, consider a highly simplified representation of my team’s financial trading system: market data (e.g. the price of the currency pair EUR/USD) and trades (requests to buy or sell currency pairs) enter the system at one end and exit at the other.
What side-effects might be present?
- Market data (or a lack thereof) may trigger alerts in the system to notify end users of a bad system state.
- Added market data volume (i.e. load) might cause resource redistribution in a cluster.
- Orders have a lifecycle consisting of multiple states that need to be properly represented in the database.
- Completed orders are published to other internal systems (e.g. reporting).
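A sketch of verifying these side-effects explicitly; the three query helpers are invented stand-ins for real hooks into the database, the alerting service, and the reporting bus:

```scala
// Sketch: assert on internal side-effects, not just end-user output.
object SideEffectChecks {
  // Hypothetical stubs; a real test would query the actual systems.
  def orderStateInDb(orderId: String): String = "FILLED"
  def alertsRaised(): Seq[String] = Seq.empty
  def publishedToReporting(orderId: String): Boolean = true

  def verifyCompletedTrade(orderId: String): Unit = {
    assert(orderStateInDb(orderId) == "FILLED", "order never reached FILLED")
    assert(alertsRaised().isEmpty, "unexpected alerts during a healthy run")
    assert(publishedToReporting(orderId), "completed order was not published")
  }

  def main(args: Array[String]): Unit = {
    verifyCompletedTrade("order-123")
    println("all side-effect checks passed")
  }
}
```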
Verifying the state of an order in the database, knowing alerts are raised at the proper times, and ensuring the cluster behaves as designed give you the confidence to say, “Yes, my system is always working.”
How have you dealt with exercising complex systems end-to-end? I’m always interested in learning more to make my life easier! In an upcoming post, I will share the lessons I’ve learned while debugging a distributed system.