“I build processes that never fail.”
As I was chatting with a peer who was pitching me on the robustness of the systems they had developed, I was struck by the boldness of those words. As we talked about data in general and data pipelines in particular, this person claimed to pride themselves on building processes that simply did not fail, for any reason. “Tell me more…”, said the curious technologist in me, as I wondered whether there was some elusive design magic I had been missing out on all these years.
As the conversation continued, I quickly concluded that this bold prediction was a recipe for disaster: one part wishful thinking, one part foolish overconfidence, with a side of short-sightedness. I’ve been a data professional for seventeen-odd years now, and every data process I have ever seen has one thing in common: it has failed at some point. Just like every application, every batch file, every operating system that has ever been written.
Any time I build a new data architecture, or modify an existing one, one of my principal goals is to make it as robust as possible: minimize downtime, prevent errors, avoid logical flaws in the processing of data. But experience has taught me never to assume that any such process is immune to failure. There are simply too many things that can go wrong, many of which are out of the control of the person or team building the process: internet connections go down, data types change unexpectedly, service account passwords expire, software updates break previously working functionality. It’s going to happen at some point.
Failing gracefully
Rather than predicting a failure-proof outcome, architects and developers can build a far more resilient system by first asking, “What are the possible ways in which this could fail?” and then building contingencies to minimize the impact of a failure. With data architectures, this means anticipating delays or failure in the underlying hardware and software, coding for changes to the data structures, and identifying potential points of user error. Some such failures can be corrected as part of the data process; in other cases, there should be a soft landing to limit the damage.
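To make the idea concrete, here is a minimal sketch in Python of what a “soft landing” around a single pipeline step might look like. The names (run_step, flaky_extract) and the retry and backoff values are illustrative assumptions, not part of any particular framework or of the original process described above.

```python
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_step(step: Callable[[], list[dict]], name: str,
             retries: int = 3, backoff_seconds: float = 5.0) -> list[dict]:
    """Run one pipeline step, retrying transient errors and landing softly on failure."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except (ConnectionError, TimeoutError) as exc:
            log.warning("%s: attempt %d/%d failed (%s)", name, attempt, retries, exc)
            if attempt < retries:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    # Soft landing: return an empty batch so downstream steps can skip this run
    # instead of the whole pipeline crashing.
    log.error("%s: giving up after %d attempts; continuing with an empty batch.", name, retries)
    return []


if __name__ == "__main__":
    # Hypothetical extract step that always fails, standing in for a flaky source.
    def flaky_extract() -> list[dict]:
        raise ConnectionError("source unreachable")

    rows = run_step(flaky_extract, "daily_extract", retries=2, backoff_seconds=0.1)
    print(f"Loaded {len(rows)} rows")
```

The point of the sketch is not the retry logic itself but the posture: the failure mode is anticipated, logged, and contained, so one bad source degrades a single run rather than taking down the entire process.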
Data processes, and applications in general, should be built to fail. More specifically, they should be built to be as resilient as possible, but with enough smarts to address the inevitable failure or anomaly.
[This post first appeared in the Data Geek Newsletter.]