Compute Focus: Resources and Diversity of Approaches
"Artificial Intelligence" has become a basic commodity for most companies. We need to cover background on:
- The role of scale and pre-training in the generative AI "revolution"
- Externalities - who's paying for energy, data, impact on the digital and natural environment
- Give some background on (the crisis of) benchmarks and performance as it applies to general-purpose vs customized models
- crisis of evaluation
- tiny recursive model beats frontier on ARC...
- "general-purpose" models vs specific enterprise use cases
- What it means to customize a model for a domain (Qwen-2.5-math, biomedical models, etc.) or task, including growing commercial offering for fine-tune-as-a-service
- Examples of power at play in AI - bargaining in WGA and SAG strikes - licensing deals with media
High Priority TODO: Explain why we start with compute even though data is a similarly important upstream resource. If Artificial Intelligence is broadly defined as data + compute, then data, along with the centralization of the digital infrastructure that stores and transmits it, already underlay the first big tech concentration wave; compute is the new part in many ways. Additionally, compute arguments now enable companies to gather more data, either by giving them dominant positions through AI or by creating excuses to gather more data. TODO: better articulate this primary/secondary resource situation.
The Commercial Value of Artificial Intelligence
While the term "Artificial Intelligence" has become commonplace, its definition is diffuse and is used to refer to many algorithmic approaches, ranging from simple decision trees to trillion-parameter language models. In recent years, it has predominantly been used to refer to so-called "frontier models" or "large language models" (LLMs), increasingly deployed in user-facing contexts such as chatbots and Web search(1). While these models are built upon the Transformer architecture(2), increased size and scale have become core contributors to their success. This is operationalized via extensive pre-training phases, often requiring millions of hours of GPU compute and billions of tokens of training data(3). This comes with high costs, not only in terms of the environmental impacts of training(4) (5), but also in terms of industry influence on the field of AI as a whole, given the rising dependence of researchers on the compute and funding provided by for-profit corporations, without which they could not afford model development(6) (7).
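The scale of these pre-training costs can be made concrete with a common back-of-envelope estimate, which approximates total training compute as 6 × parameters × tokens floating-point operations. The concrete figures below (model size, corpus size, GPU throughput) are illustrative assumptions for the sake of the sketch, not measurements of any particular model:

```python
# Back-of-envelope estimate of pre-training compute, using the common
# approximation: total FLOPs ≈ 6 * N_parameters * N_tokens.
# All concrete numbers below are illustrative assumptions, not measurements.

params = 70e9        # hypothetical 70-billion-parameter model
tokens = 500e9       # hypothetical 500-billion-token pre-training corpus
flops_total = 6 * params * tokens

# Assume a GPU sustaining 400 TFLOP/s of effective training throughput.
gpu_flops_per_s = 400e12
gpu_seconds = flops_total / gpu_flops_per_s
gpu_hours = gpu_seconds / 3600

print(f"total FLOPs: {flops_total:.2e}")
print(f"GPU-hours:   {gpu_hours:,.0f}")
```

Even under these modest assumptions, the estimate lands in the hundreds of thousands of GPU-hours for a single training run, before accounting for failed runs, hyperparameter sweeps, or post-training, which is what puts frontier-scale development out of reach for most actors.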
Editorial Note: Q from Sasha: should we be clearer about externalities here?
A core characteristic of machine learning-based AI models is their inherent stochasticity, given their reliance on factors such as random seeds and stochastic gradient descent to search for optimal values for model weights. This has resulted in what is referred to as a reproducibility crisis in the field, since neither model training nor deployment can be reliably replicated by peers, or even by model developers themselves(8). This translates into a crisis of model benchmarking and performance evaluation writ large, since it is increasingly difficult to compare models. This is particularly the case for LLM-based systems, since their generations can vary widely based on factors such as temperature and can include any token from the model vocabulary(9). Furthermore, the evaluation benchmarks that have been created to assess LLM performance are often mismatched with regard to real-life applications of models. Most benchmarks, such as Big Bench(10) and MMLU(11), and initiatives such as the Open LLM Leaderboard(12), are explicitly meant to be extensive, covering a variety of tasks and topics; however, given the very task-specific nature of most real-world deployments, such broad coverage offers limited signal about a model's suitability for any particular use case.
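The role of temperature in generation variability can be illustrated with a minimal sketch of temperature-scaled softmax sampling over a handful of candidate next tokens. The logit values are hypothetical, chosen only to show the effect:

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into a probability distribution, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens.
logits = [2.0, 1.0, 0.5, 0.1]

# Low temperature: nearly all probability mass on the top token (near-greedy).
low_t = softmax_with_temperature(logits, 0.2)

# High temperature: a much flatter distribution, so any token is plausible.
high_t = softmax_with_temperature(logits, 2.0)

# Because generation samples from this distribution rather than always
# taking the argmax, repeated runs can diverge, especially at high temperature.
random.seed(0)
samples = random.choices(range(len(logits)), weights=high_t, k=5)
```

At low temperature the model behaves almost deterministically, while at high temperature every token in the vocabulary retains non-negligible probability, which is one concrete reason why benchmark scores for the same model can vary between evaluation runs unless sampling settings are fixed and reported.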
Editorial Note: Yacine: all relevant, I was hoping to say more as well about the mismatch between benchmark and utility, and especially given the effort large companies put in fitting benchmarks, their training on user inputs that reflect the benchmarks, and investment in data annotation at huge scales
Editorial Note: Focus the discussions of eval questions on applicability to commercial uses, the hacking through billions in data annotations, and the relationship between general benchmarks, companies that have their own evals, and fine-tuning
High Priority TODO: Focus on the narrative act of overstating the role of computation, which can skew competition by abusing position of authority to discourage potential customers from investing in their own tech stack, as an anti-competitive act. Outline which actors are culpable (large tech companies with access to centralized computation) and knowingly or unwittingly complicit (journalists acting as advertisers for those systems, researchers who buy into "bigger is better" unquestioningly).
Market Structure of AI
The AI value chain is long and complex and spans both hardware and software. Each layer of the value chain constitutes a distinct market, yet is closely connected to the others. There are three main layers, each comprising certain components: (i) the infrastructure needed for AI, comprising chips, data centers, and networks, both on-premise and cloud-based; (ii) data; and (iii) products. Products can be further segmented into foundation models and applications, where applications consist of foundation models that are used or fine-tuned and deployed through a user interface for a defined user base.
Different components together constitute the economic environment of AI, and they map onto the layers of the AI value chain described above. The main components are: data (for training, evaluation and customization), compute (for model development, training and deployment) and products (i.e. the mechanisms by which AI models are provided to end users). Each portion of this environment can be described in terms of the number of actors involved, the barriers to entry, and the pricing power it affords.
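One way to make the layered structure above concrete, pending the flow graph suggested below, is a small data sketch mapping each layer to its components and to the axes used to compare them. This is a summary aid mirroring the text, not an exhaustive taxonomy:

```python
# A minimal sketch of the AI value chain described above: three layers,
# each mapped to its components as named in the text.
value_chain = {
    "infrastructure": ["chips", "data centers", "networks"],
    "data": ["training data", "evaluation data", "customization data"],
    "products": ["foundation models", "applications"],
}

# Each layer can then be characterized along the same market axes the
# text uses to compare them.
market_axes = ["number of actors", "barriers to entry", "pricing power"]

for layer, components in value_chain.items():
    print(f"{layer}: {', '.join(components)}")
```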
Editorial Note: Yacine: the data/compute/access to products-customers upstream resources could make for a good flow graph if we can find a good way to show concentration dynamics.
Data
Training AI models requires massive amounts of data, drawn mainly from commercial data (high-quality, access-restricted content, like copyrighted books), user data (interaction and preference data from platforms), open data (freely accessible but currently insufficient in scale and quality), and synthetic data. While data is ubiquitous and created by users themselves, the largest tech companies are uniquely positioned to centralize it, given their large user bases and market shares (e.g. Google for Web search, Meta for social media, Amazon for e-commerce). This makes data a particularly concentrated resource in the context of AI, with a high barrier to entry -- while new companies can acquire or scrape data from online databases and repositories such as Common Crawl, this data is mostly useful for pretraining base models and less so for approaches such as reinforcement learning from human feedback (RLHF) or fine-tuning.
Compute
As increasing quantities of specialized compute become necessary to train and deploy AI systems, access to this compute becomes more concentrated. Compute constitutes a core part of the infrastructure layer of the AI value chain, comprising physical chips, data centers (firms use cloud, on-premise and hybrid infrastructure), and networking capabilities. When it was still feasible to train state-of-the-art AI models on local, general-purpose compute (i.e. CPUs), the field at large was more accessible to individuals and smaller organizations with access to laptops and personal computers. However, now that large quantities of distributed, specialized compute (i.e. GPUs) are needed, costs are far higher and the barrier to entry has risen.
Products (and their customer base)
The product layer includes both foundation models themselves and downstream applications that integrate or fine-tune these models and make them accessible to users through interfaces and workflows. Given the resources needed to train foundation models, there are far more providers of applications than of foundation models. Given the saturation of the current AI market and the new services that continue to be created, access to a user base is a crucial part of remaining competitive. Firms with existing user bases that can integrate (generative) AI into their products therefore have a competitive advantage over those starting from scratch. This includes both consumer-focused companies such as Alphabet and Apple, and business-to-business companies such as Salesforce and ServiceNow, which are increasingly integrating AI capabilities into their products and services.
Above and beyond AI products that can be integrated across a variety of domains and contexts, a handful of domains have had either very early or very visible integrations with AI. For instance, healthcare has seen extensive adoption of AI for analyzing electronic health records, as well as integration into insurance billing and appointment transcription. OpenAI's recent announcement of a health-oriented product illustrates the continued expansion of generative AI offerings into the health space(13). Other fields, like journalism, have been at the heart of questions of intellectual property and data, as well as of the role of AI in the media at large(14). AI was a key issue in the acting union strikes that took place in 2023 and 2024(15), and has been at the center of extensive legal battles around copyright(16) and of data sharing deals between major AI companies and media providers such as Disney(17). Other domains, such as finance, were early adopters of AI for market prediction and analysis applications but, given the highly regulated nature of the field, have been somewhat slower to adopt newer generative tools(18).
Editorial Note: TODO cover:
- The main components: data for training and evaluation, types of compute including private data centers, cloud, and personal, data pipelines, and product integration
- Ownership and production structures for all of the above
- Focus on dynamics for specific domains that have had very visible or early integrations with AI:
- Healthcare - adoption of AI for EHR, starting with prediction, billing, transcription, etc. - nurses organizing and communicating on AI
- Journalism and media - at the heart of questions of intellectual property and data - particular dynamics - bargaining in WGA and SAG strikes - expensive copyright fines for Anthropic - IP deals from OpenAI with newspapers + Disney etc.
- Financial services - early adopters of AI technology in general - more regulated setup
- (Optional) Education - also large deals and horizontal and vertical integration - data about education and students should be particularly sensitive