Be Careful Relying on Third Party Library Internals

You’re working hard on a project, making progress, but still not getting the final performance numbers you want. Then it strikes you: a third party library providing your utilities can be extended to open up its internals. Because the library is open source, you can see its implementation and take advantage of how it does its job to increase the performance of your own application.

This is where you should start seeing red flags. Making assumptions about an entity’s internals (whether it’s the data itself, the data organization, or even how its methods are implemented) violates safe coding practices. In the best case, depending on a third party library’s internal implementation pins your project to a particular library version, preventing you from incorporating important fixes, security updates, and performance enhancements. In the worst case, it causes mysterious instabilities, random crashes, and hours of maintenance time.

While this performance increase might seem like low hanging fruit ripe for the picking, you probably shouldn’t be considering it. If you are, understand the risks that accompany the decision.

A Cautionary Tale

Over the course of the last year, our group upgraded our version of the Boost libraries. This was a much needed upgrade: we were falling further behind the latest release and accumulating more technical debt with each release that passed.

Upon completion of the Boost uplift we encountered some of the expected problems with a job of this nature: API changes, compilation errors, and slight behavioral differences. But one very worrisome crash kept occurring under load. Our core file analysis showed a limited number of back traces, the crash was reproducible under a specific set of load profiles, existing unit tests did not reproduce the problem, and every branch with the new Boost libraries exhibited the condition.

Two engineers dedicated their time to discovery, with multiple others pulled in for consultation. Unit test coverage was increased, core files were pored over, countless lines of code were inspected, and myriad hypotheses were created and subsequently proved incorrect.

In the end we found the problem was caused by a previous performance enhancement built around an extension to boost::unordered_map.

To achieve more parallelism in a concurrent environment, data structures can be designed to allow localized locking instead of a global lock around the structure itself. If discrete areas can be identified, only those specific locations need to be locked, leaving the rest of the structure available to other threads for reads and writes. For instance, a hash table is often implemented as an array of buckets, where each bucket stores multiple items in a list. In this example, each bucket can be locked without denying access to any other bucket.
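As a rough illustration of per-bucket locking, here is a minimal sketch of a hash table that owns its bucket layout. The names `BucketLockedMap`, `put`, and `get` are hypothetical and chosen only for this example; this is neither Boost's implementation nor the wrapper described in this article, and it assumes a fixed bucket count with no rehashing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch: each bucket carries its own mutex, so threads working
// on different buckets never contend with each other.
template <typename K, typename V>
class BucketLockedMap {
public:
    explicit BucketLockedMap(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);  // a table needs at least one bucket
    }

    // Insert or overwrite a value while holding only the target bucket's lock.
    void put(const K& key, V value) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.lock);
        for (auto& kv : b.items) {
            if (kv.first == key) {
                kv.second = std::move(value);
                return;
            }
        }
        b.items.emplace_back(key, std::move(value));
    }

    // Copy the value into `out` and return true if the key is present.
    bool get(const K& key, V& out) const {
        const Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.lock);
        for (const auto& kv : b.items) {
            if (kv.first == key) {
                out = kv.second;
                return true;
            }
        }
        return false;
    }

private:
    struct Bucket {
        mutable std::mutex lock;           // guards only this bucket's list
        std::list<std::pair<K, V>> items;  // separate list per bucket
    };

    Bucket& bucket_for(const K& key) {
        return buckets_[std::hash<K>{}(key) % buckets_.size()];
    }
    const Bucket& bucket_for(const K& key) const {
        return buckets_[std::hash<K>{}(key) % buckets_.size()];
    }

    std::vector<Bucket> buckets_;  // sized once, never resized
};
```

Bucket locking is safe here only because the class controls its own layout: every bucket's list is independent, and nothing ever links one bucket to another. That guarantee is exactly what is lost when the layout belongs to someone else's library.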

It happens that this is precisely how boost::unordered_map was implemented, with a structure analogous to this:

[Diagram: an array of buckets, each pointing to its own separate list of items]

With some clever extensions and a wrapper layer for locking, individual bucket locking was achieved and a robust concurrent data structure created. However, the newest Boost libraries changed their internal implementation, rendering bucket locking unsafe and turning it into a source of instability. The new implementation is a structure analogous to this second diagram:

[Diagram: an array of buckets whose items are threaded onto one shared linked list, with the tail item of each bucket linking to the head item of the next]

Notice that each tail item actually links to what should be the head of the next bucket. The implementation is now a single linked list. This change allows a traversal to walk from one bucket into the next, invalidating the bucket locking the wrapper assumed. The traversal from Item3 to Item4 in the diagram now means one process or thread can access data that is “locked” by a completely different process or thread.

Making this discovery took multiple engineers writing code to reproduce the issue, tying up resources for reproduction, analyzing code and core files, and spending days that would otherwise have gone to fruitful development. The issue was fixed, but it was introduced in the first place because of an assumption that a library’s internal implementation would stay the same.
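The hazard can be reproduced in a toy model. The sketch below is not Boost's code; `Item`, `SingleListTable`, and both walk functions are hypothetical names, and the six-item, two-bucket layout is an assumption made for illustration. It shows that a walk written for separate per-bucket lists ("follow the links until the list ends") runs straight from Item3 into Item4 once all items share one list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model of a single shared list threaded through all buckets.
struct Item {
    int id;              // 1 stands for "Item1", 2 for "Item2", and so on
    std::size_t bucket;  // bucket this item belongs to
    int next;            // index of the next item in the shared list, -1 at the end
};

struct SingleListTable {
    std::vector<Item> items;
    std::vector<int> bucket_head;  // index of the first item of each bucket
};

// Assumed layout: bucket 0 holds Item1..Item3, bucket 1 holds Item4..Item6.
SingleListTable make_single_list_table() {
    SingleListTable t;
    t.items = {
        {1, 0, 1}, {2, 0, 2}, {3, 0, 3},   // Item3 links on to Item4
        {4, 1, 4}, {5, 1, 5}, {6, 1, -1},
    };
    t.bucket_head = {0, 3};
    return t;
}

// The walk a wrapper would use if it assumed one separate list per bucket:
// start at the bucket head and follow links until the list ends.
std::vector<int> walk_assuming_separate_lists(const SingleListTable& t,
                                              std::size_t bucket) {
    assert(bucket < t.bucket_head.size());
    std::vector<int> seen;
    for (int i = t.bucket_head[bucket]; i != -1; i = t.items[i].next) {
        seen.push_back(t.items[i].id);
    }
    return seen;
}

// A walk that stops as soon as it reaches an item owned by another bucket.
std::vector<int> walk_within_bucket(const SingleListTable& t,
                                    std::size_t bucket) {
    assert(bucket < t.bucket_head.size());
    std::vector<int> seen;
    for (int i = t.bucket_head[bucket];
         i != -1 && t.items[i].bucket == bucket;
         i = t.items[i].next) {
        seen.push_back(t.items[i].id);
    }
    return seen;
}
```

In this model, walking bucket 0 with the first function visits all six items, including the three that a different thread holding the lock for bucket 1 believes it owns exclusively. The second function stays inside the bucket, but only because it knows about the shared-list layout, which is again an internal detail that could change.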

Finally...TL;DR

Using external libraries in your project is a welcome practice that helps you create robust software without reinventing widely known utilities. As an engineer, you should remember that your libraries change and evolve, especially under the hood. Just as you follow the principle of encapsulation, you should follow the principle of abstraction in your own designs and honor it in the software you consume. Don’t make assumptions about how your dependencies handle, organize, or manage their internals. Doing so can paint you into a corner by locking you to a specific library, and even a specific version of that library. Worse yet, it can consume your evenings and weekends as you search for subtle and unpredictable errors.

Before picking that low hanging fruit, consider other options. Is it possible to find a more appropriate library that does provide what you’re looking for? For instance, does a concurrent data structure already exist, such as libcds? Can you patch an open source library specifically for your needs? And if you are not contributing the patch back upstream, can you use a patch management system to help you upgrade later?

Whatever you decide, try the other options first, and be cognizant of the risks that remain in the end.

Updated Jun 06, 2023

Version 2.0
