What Open Technology Tells us about the Costs of AI

Now that we have a better sense of what to ask, we need to quantify how much of an inherent barrier to entry compute actually represents. Thankfully, while we only have limited information about the largest AI providers, there are still several actors training models from smol to frontier who disclose their training expenses. This section extracts findings from the published information so that we have concrete numbers in hand for different scenarios. One important but tricky point is that open models and GPAI providers have different technical models: the former include task adaptation as part of their technical setup, whereas the latter sell general-purpose capability as the product itself.

Methodological approach for this section:

  • Define desired outcome: automation of a task that is relevant to the commercial user
  • Define two paths to outcome:
    • Use vended "general AI" API from large provider
    • Develop in-house tech stack
  • Analyze costs and benefits of each option, for providers and for adopters

We broadly identify vended solutions with compute-intensive options (technical aspect) and market concentration dynamics (market structure) because:

  • Providers share most of the market between each other: OpenAI(-Microsoft, NVIDIA, Oracle), Anthropic(-Amazon, Google, Salesforce), "local champions" like Mistral (-Microsoft, NVIDIA, Salesforce), rising stars like Cursor (Google, NVIDIA), and the models developed directly by Meta, Google, Microsoft.
  • TODO: the relationship between Tesla and xAI is more complex. Direct investment was declined by stakeholders, but contracts and redirection of resources (large GPU orders) point to a common direction and interests
  • Providers tend to focus on "generalist models" that need to top benchmarks, which is what is driving most of the computational intensity
  • Providers also benefit from vertical integration, capital-intensive data access, and ability to act in "legally gray" areas

We broadly identify development of AI technology by the actors who want to leverage it commercially with less compute-intensive options (technical aspects) and more active competition (market structure) because:

  • Companies with defined use cases can adapt smaller and less compute-intensive models
  • The data, model, inference, and software layers are all modular, giving rise to price and efficiency competition
  • Whether organizations act as their own providers or use intermediaries who develop models for a domain (health? media?), the risk of horizontal integration is significantly lowered

Comparing Product Structures and Performance

Everything else we need to consider when comparing self-development and deployment to vended solutions:

  • Commercial provider costs include other resources, such as access to training data and use cases, and sometimes licensing deals
  • Costs include training data, which is harder to quantify. Open models are typically trained only on publicly available data and synthetic data, and we have information about the synthetic data; commercial models rely on licensing agreements
  • Self-development requires employees with the right skills. Open organizations have significant resources, but salaries still need to be paid
  • Commercial APIs are not just models, they are products: a lot of engineering work goes into e.g. memory management, plus liability coverage - although recent terms move away from liability shielding

Comparing Costs

In recent years, especially since the advent of "foundation models"(1) which are meant to be as general-purpose as possible, the paradigm for evaluating AI models has shifted towards equally generic benchmark evaluations. Initiatives such as BIG-Bench(2) and MMLU(3) are intentionally designed to be as broad and far-reaching as possible. When new models are trained, these kinds of benchmarks are used to evaluate them and to track progress over time. However, they are not necessarily representative of enterprise use cases of AI, which are much more context-specific and often call upon structured data formats such as tabular data. Sectors from health care to manufacturing and finance require models that are specifically tailored for their use cases, not only in terms of domain-specific knowledge, but also with respect to issues such as data privacy and regulatory constraints, which may preclude the usage of models that are, for instance, hosted in another country or jurisdiction.

How Much Does it Cost to Train a Top Model?

Depending on whether organizations have to train from scratch or just fine-tune, compute costs range from the order of $10M down to $10k, which can be either significant or negligible compared to other setup costs (e.g. finding a use case, setting up a data pipeline, building evaluation datasets). While exact information about the compute cost of training and fine-tuning models up to "frontier" capability is difficult to obtain, there is some third-party data on the subject. Notably, the average cost of training a "notable model"1 between 2022 and 2025 is around 18 million dollars(4). This number is indicative of the cost of the final training run, and does not necessarily account for the additional costs of experiments outside of the main training run, or of steps such as data generation, which can double total compute expenses(5).
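To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch. The GPU-hour count and hourly rental rate in the example are illustrative assumptions (the rate roughly matches market H800 rental prices), and the 2x overhead factor reflects the doubling from experiments and data generation cited above:

```python
# Back-of-the-envelope training cost estimate. All inputs are
# illustrative assumptions, not disclosed provider figures.
def training_cost(gpu_hours: float, rate_per_gpu_hour: float,
                  overhead_factor: float = 2.0) -> dict:
    """Estimate total compute spend from the final training run.

    overhead_factor folds in experiments outside the main run and
    synthetic data generation, which can double total expenses.
    """
    final_run = gpu_hours * rate_per_gpu_hour
    return {"final_run_usd": final_run,
            "total_with_overhead_usd": final_run * overhead_factor}

# Example: 3M GPU-hours at an assumed $4/GPU-hour market rate
est = training_cost(3_000_000, 4.0)
print(est)  # {'final_run_usd': 12000000.0, 'total_with_overhead_usd': 24000000.0}
```

The $12M final-run figure is in the same range as the disclosed costs for the largest open models discussed below; the point of the sketch is that the headline "training cost" roughly doubles once off-run compute is included.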

How Much Does it Cost to Run a Top Model?

For open models, compute costs are highly dependent on the deployment setup and the required inference load. Notably, compared to vended solutions, self-deployed models hosted on cloud instances offer some flexibility with respect to variable loads, as machines can be turned on and off depending on the number of requests, but economies of scale for specific models do not work in favor of users in the same way.

For open-source models that are most directly comparable to commercial APIs, OpenRouter offers one point of comparison (with significant caveats) -- for instance, comparing GPT-5.2 vs DeepSeek V3.2 Speciale, GPT-5.2 is found to have an estimated cost of $14 for 1M tokens of output, whereas DeepSeek costs significantly less, at $0.41 for 1M tokens generated(6).

Open models deployed on shared infrastructure by "inference providers" are typically significantly cheaper. For example, the most expensive provider for DeepSeek V3.2 is about 10 times cheaper than ChatGPT or Claude on the basis of per-token pricing.

For self-deployment on cloud, the price of instances on commercial clouds provides an upper bound, but does not account for peak-load use. Deploying full-precision Kimi-K2, for example, would cost about $29k/month on GCP. However, deployments typically use lower-precision versions, which can bring this down to $6.5k/month, and the model is unlikely to run 24 hours a day. Conversely, models that fit on a single L4 GPU, which includes most mid-sized models up to 32B in their quantized versions, have a base running cost of $500/month on GCP. This, however, does not include the overhead of adapting to increased request volume, which would require powering up additional instances.
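A simple way to compare the two hosting routes is to compute the monthly token volume at which a dedicated instance breaks even against per-token hosted pricing. The sketch below reuses the GCP instance figures quoted above and, purely as an illustration, the $0.41/Mtok DeepSeek rate as the hosted per-token price; real break-even points also depend on throughput, utilization, and input-token pricing, which are ignored here:

```python
# Break-even token volume: dedicated cloud instance vs. per-token
# hosted API. Simplified: ignores utilization, throughput limits,
# and input-token pricing.
def breakeven_mtok_per_month(instance_usd_per_month: float,
                             hosted_usd_per_mtok: float) -> float:
    """Millions of tokens per month above which self-hosting is cheaper."""
    return instance_usd_per_month / hosted_usd_per_mtok

full_precision = breakeven_mtok_per_month(29_000, 0.41)  # ~70,700 Mtok/month
quantized = breakeven_mtok_per_month(6_500, 0.41)        # ~15,900 Mtok/month
print(f"{full_precision:,.0f} Mtok/mo (full precision), "
      f"{quantized:,.0f} Mtok/mo (quantized)")
```

Even under these rough assumptions, the break-even volumes are high, which is consistent with the observation that economies of scale favor shared inference infrastructure for all but the heaviest users.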

Note

Editorial Note: Sasha: I wonder if we should talk about our compute blog post here? https://huggingface.co/blog/sasha/energy-cost-compute Yacine: Yes!

There could be a market for better-optimized compute on demand for arbitrary models, especially in exchange for higher latency, but that market does not exist yet.

From Neuralwatt: "We're getting ready to launch a hosted inference service built on our energy-optimized AI infrastructure (including demand-response participation and full energy reporting). We'll support standard token-based pricing, but we're also considering offering an optional kWh-based billing model to better align incentives around efficiency."

For commercial models, on the other hand, the main payment structure remains subscriptions, which are typically sold at about $20-30 per seat per month, or $200 for "pro" users.
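The subscription and metered models can be compared directly for a given usage profile. In the sketch below, the per-seat monthly token volumes are hypothetical, and the per-Mtok prices reuse the OpenAI "Chat" tier rates listed in the table below ($1.75 input / $14 output per million tokens) purely as an illustration:

```python
# Flat subscription seat vs. metered API usage for one user.
# Token volumes per seat are hypothetical assumptions.
def monthly_api_cost(input_mtok: float, output_mtok: float,
                     in_usd_per_mtok: float, out_usd_per_mtok: float) -> float:
    """Monthly metered cost, with prices in USD per million tokens."""
    return input_mtok * in_usd_per_mtok + output_mtok * out_usd_per_mtok

seat = 20.0  # typical per-seat subscription, USD/month
api = monthly_api_cost(input_mtok=2.0, output_mtok=0.5,
                       in_usd_per_mtok=1.75, out_usd_per_mtok=14.0)
print(f"API: ${api:.2f}/mo vs subscription: ${seat:.2f}/mo")
# prints "API: $10.50/mo vs subscription: $20.00/mo"
```

At this (assumed) usage level the metered cost sits below the subscription price, which supports the argument made below that per-token API prices are closer to the marginal cost of serving a user than subscription prices are.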

PLACEHOLDER: Table Comparing Different Options

We provide a large table showing the range of possible prices for different approaches to AI. The table includes recent models that are top of their class, and the last column explains what that means.

The table provides a Language Model focus because this is where we see the worst excesses of scaling, but companies also work on other modalities - see Pinterest paper on PinCLIP

Note

Editorial Note: Yacine: my vision here is to use the open-weight models as a bridge between self-developed smaller models and proprietary model APIs to show that we gain an order of magnitude of cost at each stage from own model to using open-weight to using proprietary models

  • Purpose-specific models remain competitive and used
    • While attention to general-purpose models tends to eclipse other approaches, "general intelligence" is not a commercial concept, and large companies are training more useful models for the tasks they need
      • Tiny Recursion Models - 7M parameters - extreme example to show the limitation of "general" benchmarks - performance higher than the frontier models of the time
        • Task: "general intelligence" benchmarks
        • Training costs: negligible
        • Inference costs: negligible
        • Performance: shows the limitations of metrics
      • PinCLIP - a Pinterest-trained replacement for OpenAI CLIP
        • Task: image-text alignment
        • training cost: low-ish, trained on 1B image-pairs
        • inference cost: low - TODO
        • Performance: better for the company's particular domain than generic alternatives
      • Tencent HY-MT1.5-7B - 7B dense (1.5B option) - continued from v1 trained from scratch
        • Task: Machine Translation
        • Training tokens - 1.2T tokens for pre-training - <1M$ (TODO: better back-of-the-envelope, probably <100K$)
        • Inference cloud full precision - 0.8$/hour on L4
        • Performance - 7B version on par with or better than Gemini 3 Pro for translation tasks
  • Tailored, and fine-tuned models at more reasonable sizes
    • We provide examples of models that have been tailored to or trained for a specific purpose to see that, given an application (or a business case), smaller models can usually do the job
    • We need to stress the different model of sharing resources on pre-training, with 2-3 actors or more regularly involved along the development chain
      • Qwen-2.5-Math and Qwen-2.5-Coder - 1.5B dense and 7B dense - important examples of training to domains under a budget with selected data - mid point to support varied domain applications - we have information about training cost to go from Qwen-2.5 to Qwen-2.5 Coder - and can infer Qwen-2.5 from other similar models (can take OLMO 3.1 as upper bound)
      • Qwen3 4B Instruct and Qwen3 8B - TODO - smallest general-purpose out-of-the-box, backbone for other applications
      • SmolLM3 - TODO - performance approaches similar-size Qwen3 with full disclosure of training costs, and SmolLM2 has been purpose-finetuned
      • Weibo VibeThinker - 1.5B dense - Qwen2.5Math base
        • Inference cloud full precision - 0.5$/hour on T4 (cheapest available option)
        • Performance - Beats Gemini 2.5 Flash and Claude Opus 4 on selected math and coding
      • OlympicCoder - 7B dense (1.5B option) - Qwen2.5-Coder-7B-Instruct base
        • Training costs - we have data size and I think GPU hours in blog posts
        • Performance - 7B version matched the best Claude at time of release on Math Olympiads
      • MOLMo2 - 8B dense - Qwen3 8B base - how to add a new modality to your model - close to Gemini, beats 4.5 Sonnet vision capabilities
  • Open-weight models from medium to large:
    • Generic open-weight models have reached benchmark performance that matches or approaches the current "frontier", especially for software engineering. This makes them a priori viable options, especially considering other benefits
    • Open weight models give us significantly more information about the computational requirements of hosting a model. We provide two sources of information here. First, we provide the cost of self-hosting full-precision and quantized version based on market price of virtual instances on cloud providers. Second, as many of these are hosted by commercial inference providers, we also provide the rates to have a direct comparison to proprietary models
    • Several of the selected models also disclose sufficient information about their training process to additionally assess the cost of developing them from scratch
    • Model training ranges from under $1 million up to about $12 million for the largest open options
      • AI2 OLMo 3.1 - dense 32B - include because we have the most extensive information on training costs (tokens, size, flops AND GPU hours), it's a US model, and performance approaches equivalent-sized latest Qwen dense on Benchmarks
      • NVIDIA Nemotron Nano - MoE 3/30B - TODO include because it's a US model by a commercial company that discloses training information and approaches Qwen MoE on several benchmarks - one of the companies we're looking at
      • Qwen3 Next - MoE 3B/80B - standard multi-purpose (see AirBnB comment), needs some work to trace training information
        • Inference Cloud full precision - 8.3$/hour
        • Inference Cloud quantized Q8 - 3.8$/hour; Q6K - 2.5$/hour
        • Purchase cost for Q6 inference server: 15-20k$
        • API cost (Vertex) - 0.15/1.2
        • Performance: beat Gemini 2.5-Flash on baselines, most liked Qwen on HF, advertised by AirBnB
      • K-EXAONE-236B-A23B - MoE 23B/236B - Include because we have training information, inference is much cheaper given total number of parameters but performance matches DeepSeek on some common benchmarks. Also important because it was trained with public funding as part of South Korea's sovereign technology strategy2
      • DeepSeek v3.2 - MoE 37B/685B parameters
        • Training tokens - 16T tokens adding up v3 to v3.2
        • Training GPU - 3M H800 GPU-hours, 12M dollars at market rate
        • Inference Cloud full precision - 8xB200 - 74$/hour
        • Inference Cloud quantized Q8 - 8xH100 - 36$/hour; Q6K - 8xA100 - 20$/hour
        • Purchase cost for Q6 inference server: 150-200k$
        • API cost Vertex - 0.56/1.68
        • Performance: Similar to GPT5-Chat, Claude4.5-Sonnet, slightly below Gemini-3.0 pro on four "reasoning" and four "agentic" benchmarks
      • Kimi-K2 - MoE 32B/1T parameters
        • Training tokens - TODO
        • Inference Cloud quantized Q8 40$/hour;
        • API cost Vertex - 0.6/2.5
        • Performance: first open-weight to surpass frontier at the time of release
  • Grounding in API prices for commercial models:
    • For commercial APIs, we use their per-token API cost as an indication of their cost. Since LM usage has a nonzero marginal cost, this is likely more representative of equilibrium prices than subscriptions. API prices are given in dollars per million tokens, separately for input and output
    • For all models under consideration, the cheapest option matches the most expensive hosted solution for open-weight models
    • Training costs are not known so we don't speculate in the table but we provide context about estimation and what we know about overall research spending in the text
      • Anthropic: Haiku 1/5, Sonnet 3/5, Opus 5/25
      • Google: Flash 0.5/3, Pro 2/12
      • OpenAI: Mini 0.25/2, Chat 1.75/14, Pro 21/168
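To show how the input/output price split plays out in practice, the sketch below computes the per-request cost for each price pair listed above, under a hypothetical workload of 3,000 input tokens and 800 output tokens per request (the workload numbers are assumptions for illustration):

```python
# Per-request cost under split input/output per-Mtok pricing,
# using the (input, output) USD-per-million-token pairs listed above.
PRICES = {
    "Anthropic Haiku": (1, 5), "Anthropic Sonnet": (3, 5), "Anthropic Opus": (5, 25),
    "Google Flash": (0.5, 3), "Google Pro": (2, 12),
    "OpenAI Mini": (0.25, 2), "OpenAI Chat": (1.75, 14), "OpenAI Pro": (21, 168),
}

def request_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in USD for a single request; prices are USD per million tokens."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Hypothetical workload: 3k input tokens, 800 output tokens per request
for name, (inp, outp) in sorted(PRICES.items(),
                                key=lambda kv: request_cost(3000, 800, *kv[1])):
    print(f"{name:17s} ${request_cost(3000, 800, inp, outp):.5f}/request")
```

Because output tokens are priced several times higher than input tokens across all three providers, output-heavy workloads shift the ranking more than the headline input prices suggest.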
Note

Editorial Note: Brief description of categories of artificial intelligence for discussion of costs:

  • domain | traditional/personal
  • smol | easy to adapt, traditional NLP task, "pro" sub of traditional software, hobbyist ish
  • medium | good at a bunch of things with work and some things eg code, "industrial" infra and cost mostly
  • large | super industrial, brute force most things
1. Bommasani R, et al. On the Opportunities and Risks of Foundation Models [Internet]. 2022. Available from: https://arxiv.org/abs/2108.07258
2. Srivastava A, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models [Internet]. 2023. Available from: https://arxiv.org/abs/2206.04615
3. Wang Y, Ma X, Zhang G, Ni Y, Chandra A, Guo S, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems. 2024;37:95266–90.
4. Epoch AI. Data on AI Models [Internet]. 2025. Available from: https://epoch.ai/data/ai-models
5. You J. Most of OpenAI's 2024 compute went to experiments [Internet]. 2025. Available from: https://epoch.ai/data-insights/openai-compute-spend
6. OpenRouter AI. GPT-5.2 vs DeepSeek V3.2 Speciale [Internet]. 2025. Available from: https://openrouter.ai/compare/openai/gpt-5.2/deepseek/deepseek-v3.2-speciale

Footnotes

  1. Epoch defines "notable models" as those that reach state-of-the-art improvement on a recognized benchmark, are highly cited (over 1000 citations), are of historical relevance, and have shown significant use.

  2. https://www.koreaherald.com/article/10546363