How to Get More for Less When Testing Your Solutions

This article was written by Kristo Koert, a Technical Lead at Twilio.

It is probably fair to say that in the vast majority of cases, testing isn’t about “how much to test” but rather about “how to strike the right balance between effort spent and value gained”. Being serious about striking that balance is likely to lead engineers down the path of discovering wholly new patterns of testing. Here are some patterns of testing that came up in everyday work at Twilio as we navigated complex requirements in high-availability Tier 1 systems.

Snapshot testing

One of Twilio's core systems includes a Tier 1 service that calls into ~15 other services. Each of these services houses a significant amount of logic and has sizeable API data structures. As one can imagine, it would be fairly easy to change logic surrounding some data and inadvertently alter one of those 15 integrations. This is where snapshot testing comes into play.

Applying snapshot testing to outgoing requests in functional tests means that today this service always keeps track of what its outgoing requests look like, in their entirety, for all 15 downstream services as “snapshot files”.

In effect, this means a very large amount of logic is automatically covered with “change detection”. If an outgoing request changes, the corresponding snapshot file changes too, which shows up as a VCS diff.

  • If the change was intended, the new snapshot can simply be committed. It also acts as an easy-to-reason-about example in the pull request of how the logic is changing.
  • If the change was unintended, the diff acts as a failing test that flags it.

Specific technology used: https://skuzzle.github.io/snapshot-tests/reference/latest/
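To make this concrete, here is a minimal sketch of what such a test could look like with the library above. The buildRequestForDownstream helper is a hypothetical stand-in for the real request-building logic, and the exact package names may vary between library versions:

```java
import de.skuzzle.test.snapshots.Snapshot;
import de.skuzzle.test.snapshots.junit5.EnableSnapshotTests;

import org.junit.jupiter.api.Test;

@EnableSnapshotTests
class OutgoingRequestSnapshotTest {

    @Test
    void outgoingRequestForDownstreamAIsStable(Snapshot snapshot) {
        // Build the outgoing request exactly as the functional test would.
        String requestBody = buildRequestForDownstream("downstream-a");

        // First run: writes a snapshot file next to the test sources.
        // Later runs: fails if the serialized request no longer matches,
        // producing a diff that shows exactly what changed.
        snapshot.assertThat(requestBody).asText().matchesSnapshotText();
    }

    // Hypothetical stand-in for the real request-building logic under test.
    private String buildRequestForDownstream(String serviceName) {
        return "{\"service\":\"" + serviceName + "\",\"payload\":\"...\"}";
    }
}
```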

Test data generation

Twilio Messaging has a centralized, cross-service core data structure that describes a message's life throughout processing in fairly fine detail. Because of the wide breadth of Twilio features, this data structure is quite large, yet it also needs to stay versatile for the various operations performed on it. To achieve that versatility, reflection was used. However, as one can imagine, with large data structures and very high traffic loads, reflection can become a significant bottleneck.

Difficulties surrounding splitting up the data structure itself led to a code generation-driven approach: instead of reflection, the common operations were implemented with generated code.

This is where testing came in. Obviously, replacing complicated reflection-based logic with complicated code generation logic on a critical cross-service data structure is a very dangerous rewrite of functionality. However, with test data generation via Instancio, Twilio engineers were able to easily create massive permutations of data states to test with. Passing this test data into the old reflection-based logic and comparing the results with the new code-generated results made it fairly easy to establish a high baseline of confidence in the solution's compatibility. Plus, easy 100% code coverage, for what that is worth.

Specific technology used: https://www.instancio.org/
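As a rough sketch of the comparison described above, assuming a simplified MessageRecord and hypothetical legacyReflectionCopy/generatedCopy stand-ins for the old and new implementations:

```java
import org.instancio.Instancio;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

class ReflectionVsCodegenCompatibilityTest {

    // Simplified stand-in for the real cross-service data structure.
    record MessageRecord(String sid, String status, int deliveryAttempts) {}

    @Test
    void generatedCodeMatchesReflectionForRandomizedData() {
        // Instancio populates every field (including nested objects) with
        // random but valid values, so each instance covers a different
        // permutation of the data structure's state.
        Instancio.stream(MessageRecord.class)
                .limit(10_000)
                .forEach(record -> assertEquals(
                        legacyReflectionCopy(record),
                        generatedCopy(record),
                        "implementations diverged for: " + record));
    }

    // Hypothetical stand-ins for the old reflection-based logic and the
    // new generated code being validated against each other.
    private MessageRecord legacyReflectionCopy(MessageRecord r) { return r; }
    private MessageRecord generatedCopy(MessageRecord r) { return r; }
}
```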

“Millibenchmarking”

One day, after a deployment of a core Twilio service, CPU usage increased by approximately 20% over previous versions. The team investigated metrics and experimented with code changes but wasn’t able to identify what might have caused the slowdown. Basic local profiling also failed to surface any concrete bottlenecks.

The next approach was:

  • using a simple pre-existing functional test that exercised logic much like production;
  • running this functional test in a loop for 5 minutes to create some load;
  • turning on Java Flight Recorder for the jar being run in the local functional tests.

After this, a .jfr recording file was produced that could simply be opened in IntelliJ to attach the profiling information to the code. This finally exercised the code in a pattern similar enough to surface the issue that production was seeing.

This looping functional test was left in the project as a “millibenchmark”: something between a microbenchmark and a load test.

Specific technology used: Java Flight Recorder
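A minimal sketch of such a millibenchmark follows. It starts a recording programmatically through the jdk.jfr API rather than via JVM flags (launching the jar with -XX:StartFlightRecording=duration=5m,filename=millibenchmark.jfr achieves the same effect), and runFunctionalScenario is a hypothetical stand-in for the pre-existing functional test logic:

```java
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

public class Millibenchmark {

    public static void main(String[] args) throws Exception {
        // Start a Flight Recorder recording with the built-in "profile" settings.
        try (Recording recording = new Recording(Configuration.getConfiguration("profile"))) {
            recording.start();

            // Loop the functional-test logic for 5 minutes to create enough
            // sustained load for hot spots to show up in the profile.
            Instant deadline = Instant.now().plus(Duration.ofMinutes(5));
            while (Instant.now().isBefore(deadline)) {
                runFunctionalScenario();
            }

            recording.stop();
            // The resulting .jfr file can be opened in IntelliJ (or JDK
            // Mission Control) to attach the profile to the code.
            recording.dump(Path.of("millibenchmark.jfr"));
        }
    }

    // Hypothetical stand-in for the pre-existing, production-like scenario.
    private static void runFunctionalScenario() {
        // ...exercise the service logic here...
    }
}
```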

Conclusion

Using these patterns of testing allowed Twilio engineers to:

  • Vastly simplify code review and increase practical code coverage nearly for free (snapshot testing).
  • Vastly increase capacity to validate complex compatibility edge cases between old and new code (test data generation).
  • Ergonomically find a tricky performance bottleneck, with the ability to easily do the same again in the future (“millibenchmarking”).

There are nuances surrounding these patterns, as there are no silver bullets. But all in all, introducing these tools to our toolbox helps us move faster and with higher confidence at Twilio.

Kristo Koert will go into more detail on these topics at the Digit Conference on May 9, where he will give the workshop "Smarter Testing with Snapshot Testing, Instancio and “Millibenchmarking”".