By Siddharth Pai
In 2016, I wrote in another publication where I also have a regular column that the internet was facing “data inundation”. Firms, too, are awash with data on their own operations and plough through it incessantly to find information that will allow them to ‘transform’ those operations, in a never-ending attempt to get more by spending less. This relentless pursuit, like dogs chasing their own tails, is a stark reminder of how inefficient our current approach to data management is.
We see the same cycle today with generative artificial intelligence (GenAI). The amount of data available on the internet has grown exponentially, and most of it is rubbish, especially for a corporation trying to focus on a specific problem that is germane only to its own business. No amount of generative AI-produced images, prose, or poetry speaks to that specific situation, and such firms are therefore stuck, unable to build a business case for using GenAI.
I am not gainsaying the need for big data analytics or focused AI solutions to deal with specific corporate issues. However, the explosion of digital data in the wake of the internet boom has caused a justifiable fear of “data inundation”. And a series of Noahs have sprung up to build arks to keep companies afloat during the data deluge, peddling ever-larger database storage software and machines to handle the flood. We now even have the concept of “data lakes” in addition to databases, referring to “cold” data that is not required for immediate modelling purposes but can presumably be called upon someday to make AI learning algorithms more efficient.
Speaking of Noah’s arks, the serendipitously named investment advisory outfit Ark Invest notes that in 2006 the internet produced about 0.16 zettabytes of data while available storage capacity was only 0.09 zettabytes, meaning there was already roughly 1.7 times more data than there was room to store it. Data then grew at a compound annual growth rate (CAGR) of 25% for the next decade, leaving a storage shortfall of about 500%, roughly five times more data than capacity. Fast forward to now: a recent prediction from Aston University says, “The next three years will be crucial. The global datasphere is predicted to increase to 175 zettabytes, with one zettabyte being approximately equal to one billion terabytes.” A terabyte, in turn, is 1,000 gigabytes (GB). Do the math.
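To make that arithmetic concrete, here is a quick back-of-the-envelope calculation. It simply replays the figures quoted above; the ten-year 25% extrapolation is my own illustration rather than Ark Invest’s model, and it assumes decimal units throughout.

```python
# Back-of-the-envelope arithmetic for the figures quoted above.
# Assumes decimal units: 1 ZB = 1e9 TB and 1 TB = 1,000 GB.

data_2006_zb = 0.16     # data produced in 2006, in zettabytes (Ark Invest figure)
storage_2006_zb = 0.09  # storage capacity available in 2006, in zettabytes
cagr = 0.25             # assumed compound annual growth rate of data

# In 2006, data already outstripped storage by a factor of roughly 1.8
print(round(data_2006_zb / storage_2006_zb, 2))    # 1.78

# Ten years of 25% compound growth multiplies the annual data volume ~9.3x
data_2016_zb = data_2006_zb * (1 + cagr) ** 10
print(round(data_2016_zb, 2))                      # 1.49 zettabytes

# The projected 175 ZB global datasphere, restated in terabytes and gigabytes
ZB_TO_TB = 1_000_000_000       # one zettabyte is about one billion terabytes
print(175 * ZB_TO_TB)          # 175000000000 terabytes
print(175 * ZB_TO_TB * 1_000)  # 175000000000000 gigabytes
```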
Companies then add an “analytics” team to try to make sense of all this data. These teams usually have one or two expert statisticians or “data scientists” and a bunch of young, eager offshore number-crunchers willing to pore over reams of data like gaunt prospectors panning for gold. This data surplus has spawned an entire genre of AI analytics firms. The problem they blithely ignore is that they are often sifting through a load of junk in the first place, and more junk falls onto the pile every day at the internet’s warp speed.
Purveyors of AI are quick to point out that they cleanse the data before manipulating it so that they can make sense of it. But as Ark Invest says, there is simply too much of this data floating around, and some of it has been kept alive on company servers for years without ever being looked at. The problem isn’t that the data is dirty; it’s that it is old. And data ages alarmingly quickly.
Building new mega data centres around the world to address the storage shortage is neither a solution nor sustainable. Apart from the high costs, their massive energy consumption is well documented, and generative AI players such as Google and Microsoft have conceded that they are nowhere near reaching their net-zero carbon targets any time in the next several years, after having first loudly predicted that they would. It is a hefty price to pay.
My late mother, a successful doctor, managed the house with the same precision with which she handled her surgeries and obstetric procedures. When she got down to spring-cleaning the house, she threw out everything that hadn’t been used for a year. It didn’t matter if it was still in its original packaging; if it hadn’t been used for a year, out it went, despite howls of protest from the rest of us.
Well, it seems as if someone was listening. Around the time I wrote that article in 2016, MySpace, the first large-scale social network, deleted every photo, video, and audio file uploaded to it before 2016, seemingly inadvertently. Entire tranches of Usenet newsgroups, home to some of the internet’s earliest conversations, have gone offline and vanished from history. The problem for many users (or hoarders, depending on your perspective) is that data they consider invaluable may also have been purged.
The real need of the hour for many firms is a data purge, not more data science or GenAI. What I am suggesting does take corporate courage. Instead of spending huge sums on buying or renting yet more space to store data, and then more money on generative AI to find ever more ways to analyse fast-decaying data, some of that money may be better spent on a smart set of young people offshore, trained exclusively to look for anything that is too old to be useful or was “dead on arrival”, and on having that data purged. This would limit the growth of useless computing and storage capacity, leaving behind more relevant, real-time data on which the conclusions reached by data analytics can readily be acted.
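By way of illustration only, a minimal sketch of such a purge rule, modelled loosely on my mother’s one-year test, might look like the snippet below. The directory path, the 365-day cut-off based on last access time, and the dry-run behaviour are all assumptions made for the sketch, not a prescription.

```python
import time
from pathlib import Path

# Illustrative sketch only: flag files that have not been accessed for a year.
# ROOT and the 365-day cut-off are assumptions, not recommendations.
ROOT = Path("/data/archive")
MAX_AGE_SECONDS = 365 * 24 * 60 * 60

def stale_files(root: Path, max_age_seconds: float):
    """Yield files whose last access time is older than the cut-off."""
    cutoff = time.time() - max_age_seconds
    for path in root.rglob("*"):
        if path.is_file() and path.stat().st_atime < cutoff:
            yield path

if __name__ == "__main__":
    # Dry run: list purge candidates rather than deleting them outright.
    for path in stale_files(ROOT, MAX_AGE_SECONDS):
        print(f"candidate for purge: {path}")
        # path.unlink()  # uncomment only after human review
```

The specific rule matters less than the discipline of applying one mechanically and on a schedule, the way the spring-cleaning rule was applied at home.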
The author is a technology consultant and venture capitalist.
Disclaimer: Views expressed are personal and do not reflect the official position or policy of Financial Express Online. Reproducing this content without permission is prohibited.