A Data Blunder
This is the first in a series of updates exploring data and how we can better use it to support our digital transformation.
Little did the technology community know what it was letting itself in for in the 1970s, when batch computing on microprocessors looked like a good idea. Batch computing may well have racked up the biggest technical debt the industry has ever seen, running into billions.
Before we dive in, let’s take a step back in time. You may not have heard of Herman Hollerith, but for the 1890 U.S. Census he built a punch card tabulating system to process the data. The cards were fed by hand into his electric tabulating machine. His business, the Tabulating Machine Company, later merged into the Computing-Tabulating-Recording Company, which was renamed IBM in 1924.
At a fundamental level, data processing systems gather information, structure that information into data, then use the data to create new information, usually presented in a different context from the one in which it was received. The information flow runs from left to right: source information, then data, then derived information.
Information sources are gathered and stored as data. That data can then be used to create new, derived information. The information at both ends of this chain means something different depending on who or what is using it.
For example, your name is important to you: it is your identifier, and it is valuable to you. A collection of many names in a list with a total at the bottom means something very different, such as how many customers an organisation has, and that is valuable to the organisation. It is the value on the right-hand side of the flow that batch computing exploits.
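The flow from source information to data to derived information can be sketched in a few lines of Python. This is only an illustration: the names, the `capture` and `derive_total` functions and the record layout are all hypothetical and not taken from any real system.

```python
# Hypothetical source information: names supplied by individual customers.
source_names = ["Ada", "Grace", "Herman"]


def capture(names):
    """Structure raw information into data records (the middle of the flow)."""
    return [{"customer_id": i, "name": n} for i, n in enumerate(names, start=1)]


def derive_total(records):
    """Derive new information for the organisation: how many customers it has."""
    return len(records)


records = capture(source_names)
total_customers = derive_total(records)
```

Each name matters to the person it belongs to, while `total_customers` matters only to the organisation. The same data serves two very different audiences.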
Information can be turned into data, which can be turned into information with more value. However, there is a big problem. Entropy.
Loosely speaking, entropy means that everything changes and the change isn’t reversible. Take a coffee with a layer of cream on top: when you stir it, the coffee and cream mix and never go back to how they were before. Entropy gives change a direction and increases disorder.
This means that as soon as we capture some information and turn it into data, it is potentially out of date, possibly immediately. The source could have changed and may never return to the state recorded in the data. The data is a snapshot at a point in time, not a reflection of the true current position of the source.
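A minimal sketch of the snapshot problem, assuming a hypothetical customer whose address changes after it has been recorded. The `Source` and `Snapshot` classes and the `capture_snapshot` and `is_stale` helpers are invented for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Source:
    """The real-world source of the information; it can change at any time."""
    address: str


@dataclass(frozen=True)
class Snapshot:
    """The data we hold: a copy of the source taken at one point in time."""
    address: str
    captured_at: datetime


def capture_snapshot(source: Source) -> Snapshot:
    """Copy the current state of the source into an immutable record."""
    return Snapshot(source.address, datetime.now(timezone.utc))


def is_stale(snapshot: Snapshot, source: Source) -> bool:
    """True when the recorded data no longer matches the source."""
    return snapshot.address != source.address


customer = Source(address="1 Old Street")
record = capture_snapshot(customer)

# The source moves on; the snapshot does not.
customer.address = "2 New Road"
stale = is_stale(record, customer)
```

In a real system there is rarely a live `Source` object to compare against, which is exactly the difficulty: the record cannot tell on its own that it has gone out of date.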
Why does this matter?
Well, in some cases it doesn’t matter materially: less-than-perfect data is used to represent a broader trend or set of numbers, where the margin of error can go unnoticed with no effect. But in many cases the misrepresentation does get noticed, and often only by the source (left-hand side) of the information flow. In the real world, a customer whose address differs from the one the organisation has on file can find it extremely frustrating when their mail goes missing.
So, let’s get back to batch. Batch was a system that focused on the value to the consumer of the processed or derived data, and not the source. Access to our information has improved significantly over the past 20 years, to the point where people carry devices that give them access to their data immediately, anytime, anywhere. People rely on that information being correct and up to date. It is now clear that the source of the information is just as valuable as the consumer of the derived data. Regulation of data has also increased the demand for source data to be correct, up to date and secure.
Companies are now actively looking at ways to address this imbalance. They have lived with batch-based systems for so long that it is expensive, complex and difficult to transition to near real-time data. There are patterns, systems and technologies now available to make this a reality. Converting to these approaches is the price we are now paying for exploiting source data whilst failing to consider the future needs of those who create this source data.
We cannot go on transforming, deriving and changing this source data without considering the downstream impacts. As further channels open up through distributed computing, there will potentially be more options to access this information throughout its lifecycle, greater transparency and less tolerance for incorrect data.
The batch computing data blunder is a mistake we are only now really learning from.
Next time we will be exploring ways to use data to support digital transformation whilst learning from the mistakes of the past.