Data Focus and Control of Digital Infrastructures
Data as an Upstream Resource
Training AI requires massive amounts of data, drawn mainly from four sources: commercial data (high-quality, access-restricted content such as copyrighted books), user data (interaction and preference data from platforms), open data (freely accessible but currently insufficient in scale and quality), and synthetic data. While data is ubiquitous and created by users themselves, the largest tech companies are uniquely positioned to centralize it, thanks to their vast user bases and dominant market shares (e.g., Google in Web search, Meta in social media, Amazon in e-commerce). Data is therefore a particularly concentrated resource in the context of AI, with a high barrier to entry: while new entrants can acquire or scrape data from online databases and repositories such as Common Crawl, such data is mostly useful for pretraining base models and less so for approaches such as reinforcement learning from human feedback (RLHF) or fine-tuning.
High-priority TODO: Explain why we start with compute even though data is a similarly important upstream resource. If AI is broadly defined as data + compute, then data, and the centralization of the digital infrastructure that stores and transmits it, already underpinned the first big tech concentration wave; compute is the new element in many ways. Additionally, compute advantages now enable companies to gather more data, both by giving them dominant positions through AI and by creating justifications for further data collection. TODO: better articulate this primary/secondary resource distinction.