The Internet is a giant edifice teetering on a foundation of dirty data
Your houses are built upon the sand
Perfecting Equilibrium Volume Two, Issue 48
But Daniel the prophet, a man of God
He saw the writing on the wall in blood
Belshazzar asked him what it said
And Daniel turned to the wall and read
"My friend you're weighed in the balance and found wanting
Your kingdom is divided, it can't stand
You're weighed in the balance and found wanting
Your houses are built upon the sand"
The Sunday Reader, September 17, 2023
It sounded impressive: “Our Sales AI will generate more and better leads for your company!”
It would have been more impressive if it hadn’t been addressed to a company that was shut down and delisted half a decade ago.
No, this isn’t another column dunking on the shortcomings of Large Language Models – so-called “AIs.”
AIs are not the problem; they are the proverbial canary in the coal mine telling everyone the dirty secret tech pros have always known:
All our shiny, glittering tech is built on a foundation of data, and that foundation is a mountain of garbage.
There were concerns in the early days of the World Wide Web about trust and identity, famously captured as “On the Internet, nobody knows you’re a dog.”
Google largely solved this problem with its system of backlink analysis: if Stephen Hawking’s physics page is linked to by CERN and NASA, and Fido’s physics page is linked to by Fido and Spot, chances are Hawking’s page is the authoritative one. (We’ll leave for another day discussing appeals to authority, “the science is settled” as an oxymoron, and the fact that “scientific consensus” includes things like the decades-long consensus that Gregor Mendel’s work was the boring-squared merger of tedious pea plants and even more tedious statistics.)
But once you’re past that, and you’re pulling data from carefully maintained IT systems such as government databases, it’s all good, yes?
No.
Computer-Assisted Reporting and Research largely focuses on analysis of government data. And the first thing that’s taught is the Iron Law of Data: All data is dirty.
Here’s a story that has been told repeatedly at CARR conferences: A reporter had pulled down the entire line-item state budget from the government mainframe computer. He soon thought he had an amazing story - there was a field labeled “Employee Raises,” and it added up to an egregious amount of money.
Too egregious, in fact. Fortunately he was able to check it with his Mark-1 Eyeball; the total raises were more than the entire state budget. That...is not possible. So he called around the government and tracked down the tech who maintained the database.
“Oh yeah,” the tech said. “We didn’t have the budget to redo the table structure, so we repurposed the ‘Employee Raises’ field as ‘Employee ID Number.’”
Is this an egregious example? Yes. But here is the reality: the second any database is up and running, techs start maintaining and updating it. This work is largely undocumented and often driven by conflicting needs from different users and departments.
So databases are a mess. And then the real trouble starts.
Even something that seems simple, such as reordering batteries from Amazon, runs across multiple systems. Consider: what’s the address?
There needs to be a shipping address to deliver the batteries; that’s in the fulfillment system. There needs to be a billing address; that’s in the billing system. Then the Customer Relationship Management system needs to be updated with all this data from this new order.
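Here’s a rough sketch of what that looks like in practice (the field names below are invented, but the shape of the problem is not): the moment you click Buy, one “address” becomes three different records in three different systems.

```python
# Hypothetical records: one battery order, three systems, three shapes of "address".

# Fulfillment system: cares about where the box physically goes.
fulfillment_record = {
    "order_id": "A-1001",
    "ship_to": {"street": "12 Oak St", "city": "Springfield", "zip": "62704"},
}

# Billing system: cares about where the card is registered.
billing_record = {
    "invoice_id": "INV-88213",
    "bill_to": {"street": "PO Box 41", "city": "Springfield", "zip": "62705"},
}

# CRM: cares about the customer as a person, with its own idea of an address.
crm_record = {
    "customer_id": "C-314",
    "primary_address": "12 Oak St, Springfield, IL 62704",
}

# Nothing in the data itself says which of these is authoritative for which
# purpose; that knowledge lives in the code and in people's heads.
```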
Data needs context to have value. Think of it this way: Is $6 million good or bad?
Well, it’s good if it is the balance in your checking account. It’s bad if it’s an unexpected charge on your credit card. And it’s a disaster if it is Microsoft’s market cap.
Corporations and governments try to maintain context across databases like this by merging all the data into one big database, called variously a data warehouse, a data mart, and lately, a data lake. This is done by Extracting, Transforming and Loading (ETL) the data into one big new database.
This...is problematic. How do you maintain proper context while transforming the data? Even better, all those guys updating and maintaining the primary databases are still updating and maintaining them. So even if the transforms are set up correctly, they will soon be incorrect.
Even worse, these new databases have to send data back to the old primary ones. Let’s say you decide to update your address on that battery order. Is that your home address? Shipping address? Billing address? Billing address for your personal card? Your business card?
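Here’s a toy version of that transform, with invented schemas, to show where the context goes: the merged row has to pick one thing to call “address,” and once it does, an update to that row can’t be routed back to the right source system.

```python
# Toy ETL step (hypothetical schemas): merge per-system records into one row.

fulfillment = {"order_id": "A-1001", "ship_to": "12 Oak St, Springfield 62704"}
billing     = {"invoice_id": "INV-88213", "bill_to": "PO Box 41, Springfield 62705"}
crm         = {"customer_id": "C-314", "home_address": "12 Oak St, Springfield 62704"}

def etl_merge(fulfillment, billing, crm):
    # The transform must pick one value to call "address"; the billing address
    # is simply dropped, and the ship-to/bill-to distinction -- the context --
    # is gone from the merged data.
    return {
        "customer_id": crm["customer_id"],
        "address": fulfillment["ship_to"],
    }

merged = etl_merge(fulfillment, billing, crm)

# A downstream app now "updates the address" on the merged row...
merged["address"] = "99 Elm Ave, Springfield 62701"

# ...but nothing in the data says which source system should receive the
# change: shipping, billing, or the CRM profile.
```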
How big is this problem? Corporations spent $16.7 billion last year on Master Data Management, which is a set of technologies that does nothing but attempt to preserve context across databases.
How is this possible? Why haven’t all these smart technologists just fixed this?
Because the fundamental architecture that underlies all of our existing systems is fatally flawed.
Programmers are, unsurprisingly, focused on the programming. Programmers think of data the way butchers think of a side of beef: something to be hacked into shape and then cooked. Consequently, they put all the logic and intelligence into the code.
You can see this easily with any spreadsheet. Take a spreadsheet filled with formulas, and save it as a Comma Separated Value file. When you reopen that file it’s a mess. Sure, all the data is there, but all the formulas are gone and the references are broken. The logic is all in the software, not in the data.
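You can reproduce the whole thing in a dozen lines of Python (the numbers are made up, the pattern is not): the formula lives in the program, so the exported file carries nothing but naked values.

```python
# Minimal sketch: the "spreadsheet formula" exists only in code, so the CSV
# that gets written out holds values with no logic attached.
import csv

rows = [
    {"item": "batteries", "qty": 4, "unit_price": 5.25},
    {"item": "cables",    "qty": 2, "unit_price": 9.00},
]

# The equivalent of the spreadsheet formula =qty*unit_price lives here,
# in the program, not in the data.
for row in rows:
    row["total"] = row["qty"] * row["unit_price"]

with open("order.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["item", "qty", "unit_price", "total"])
    writer.writeheader()
    writer.writerows(rows)

# order.csv now contains 21.0 and 18.0 as bare numbers. Whoever opens it later
# can't tell that "total" was derived, let alone how; the logic stayed behind.
```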
As with so many current computer problems, this is another vestige of the Age of Mainframes. And it wasn’t a bad architecture for a mainframe, where the software and data lived together on a single computer, often in the equivalent of one big data table.
It was less good as an architecture for large database applications. Once you start ETLing that data into data lakes, it simply doesn’t work.
And now that the Internet has inverted and there are more edge devices with more processing power and storage than all the central data centers combined, it’s a disaster.
One reason is that the Mark-1 eyeball cannot be used. That CARR reporter could compare the total of the “Raises” column with the total state budget. When data has been ETL’d across multiple databases into something new, there’s no way to compare that result with the original component data.
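In code, the eyeball check is a single comparison (the figures below are invented); the catch is that it only works while the original component data is still within reach.

```python
# Sanity check, Mark-1 Eyeball style: hypothetical figures.
state_budget_total = 42_000_000_000   # the whole state budget
raises_field_total = 97_000_000_000   # the "Employee Raises" column, summed

if raises_field_total > state_budget_total:
    print("Impossible: 'raises' exceed the entire budget -- the field is dirty.")

# After a few rounds of ETL, the raises figure is just another number in a
# merged table, and the budget total it should be checked against may not
# even live in the same database anymore.
```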
While IT pros have known this all along, most of the public didn’t because, let’s face it, who looks that deeply into the data? When was the last time you went to, say, the second page of search results?
But now that these AI packages are scooping up millions of documents during their training process, all that dirty data is getting exposed. And it’s difficult to unsee the mountains of dirty data once they’ve been exposed.
But surely these are problems with commercial systems. Surely the cutting-edge top-secret intelligence systems run by government agencies aren’t built on mountains of garbage data. It’s not like the Pentagon announced that a Predator drone had killed a “senior Al Qaeda leader” in Syria, then days later admitted it had blown up a sheep herder.
Actually, that’s exactly what happened. Repeatedly.
Here’s how The New York Times described it: “The promise was a war waged by all-seeing drones and precision bombs. The documents show flawed intelligence, faulty targeting, years of civilian deaths — and scant accountability.”
(Perfecting Equilibrium covers Web3 technology and the creator economy, so this article is focused on the data and databases. The legality and morality of drone strikes are outside our purview.)
The solution, of course, is to finally give up on the mainframe data architecture and centralized systems, and move to something that fits our edge-processing world.
It’s time for Distributed Data Management Systems. That’s why we invented Privacy Chain.