Quick Thoughts on AI and the 80-20 Rule
One of the first things I learned in AI, many decades ago, was that the 80-20 rule, also known as the Pareto Principle, generally holds. Those early AI systems, called Expert Systems, were based on “rules” and “facts” programmed into a knowledge base. These rules, when selected by input data (e.g., patient symptoms), could draw inferences (e.g., a medical diagnosis). A few rules worked well, but many additional rules were needed to handle the edge cases. An intriguing question is whether today’s data-driven generative AI (gAI) systems follow the same rule with data. Do we get reasonable performance with small data sets, and does training the AI on more and more data have diminishing returns?
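To make that mechanism concrete, here is a minimal sketch, in Python, of how such a rule-based system selects rules from input facts and draws inferences. The symptom-to-diagnosis rules are hypothetical and purely illustrative, not taken from any actual expert system.

```python
# Minimal sketch of rule-based inference in the style of an early expert system.
# The rules and facts below are hypothetical, for illustration only.

# Each rule maps a set of required "facts" (symptoms) to an inferred conclusion.
RULES = [
    ({"fever", "cough"}, "possible flu"),
    ({"fever", "rash"}, "possible measles"),
    ({"sneezing", "itchy eyes"}, "possible allergy"),
]

def infer(facts):
    """Fire every rule whose conditions are satisfied by the input facts."""
    conclusions = []
    for conditions, conclusion in RULES:
        if conditions <= facts:          # rule is selected by the input data
            conclusions.append(conclusion)
    return conclusions

# A handful of rules covers the common cases...
print(infer({"fever", "cough"}))         # ['possible flu']
# ...but edge cases fall through until many more rules are added (the 80-20 effect).
print(infer({"fatigue", "headache"}))    # []
```

The point of the sketch is the Pareto pattern: a few rules handle most inputs, while covering the long tail of edge cases requires adding many more.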
If we follow the AI hype, there is a lot of discussion about the big foundational models like ChatGPT and Bard that are trained by large tech companies on massive amounts of public data (and some not-so-public data that come with a host of legal questions). From the discourse, it seems that data network effects are in play and that more data and bigger models improve performance. However, it also seems that these companies are running out of data. OpenAI, for instance, reportedly resorted to transcribing audio from YouTube videos to get training data for GPT-4. Some predictions indicate that quality data on the Internet will run out in a year or two. The improved capabilities of these models, like passing medical and bar exams, seem to indicate that more data pays off in spades. But does it? The large models are difficult to handle and very expensive to train (think electricity costs), and there may well be diminishing returns on data.
So, what is the alternative? One option is to synthetically generate data. But if an AI generates the data it is trained on, then regardless of the “smartness” of that AI, a reinforcing loop emerges that can exacerbate problems in the data. Or we could have one AI check another AI’s data, perhaps mitigating the reinforcement problem somewhat. Further, buttressing sparse data with human expertise or the laws of physics (e.g., to better understand physical environments) can help.
Or we could modify the Pareto Principle to include tradeoffs between size and quality: high data volume can be traded off against data quality. It is true that large models hallucinate, particularly in areas where data is sparse or noisy. Not all gAIs can be experts in everything. So, smaller models trained on dense data within a specialized domain can perform very well. Moreover, they are cheaper and easier to manage. They might also be easier to trace and understand, making the AI more explainable.
So where does that leave us? We have large foundational models, with their own products, pushed by Big Tech. These can be accessed via API and fine-tuned to individual needs, but the models, their access, and their fine-tuning are largely controlled by their developers. We have small models trained on local, proprietary data. These models are often open source and affordable for smaller and mid-sized companies, but here too, the more specialized the domain, the tougher it is to get adequate data for training. We could also have models trained on synthetic data, which will be particularly useful in fields where observational data is limited (e.g., astrophysics) or experimental data is expensive (e.g., materials science). More recently, we have seen the emergence of decentralized AI, boosted by blockchain technology, in which data is secured across a network of nodes. This opens the door to a more accountable, scalable, and cooperative approach to AI applications.
In conclusion, the Pareto Principle is somewhat fuzzy in the AI context. It is not only about the size of datasets; the key question is which characteristics of the 20% of the data will provide 80% of the impact on AI outputs.
Varun Grover
George and Boyce Billingsley Endowed Chair and Distinguished Professor, Walton College of Business at the University of Arkansas