Updating Compute Thresholds to Include Inference-Time Compute
Austin R. Ellis-Mohr
January 29, 2025
Introduction
AI governance discussions often focus on the computational power used to train large models. While the use of compute as a metric for governance is debated—due to questions of validity and the challenges of implementation—policymakers and regulators worldwide have set thresholds that categorize models as “frontier” or “high risk” based on how many computational operations (i.e., integer or floating-point operations (I/FLOPs)) they require during training. However, this approach overlooks a critical factor in modern AI: inference compute (IC)—the computational work done after a model is trained.
This post explores why inference compute deserves attention, how scaling it can boost AI performance, and why considering both training and inference compute could improve AI policy.
1. Training Compute vs. Inference Compute
Training Compute (TC): The learning phase, in which a model processes large volumes of data and adjusts its parameters to optimize performance. For frontier models this is computationally intensive and is best thought of as the capital expenditure.
Inference Compute (IC): The thinking phase, in which a trained model is put to use, generating responses from inputs. This is the operational expenditure, accumulated across the model's deployment lifetime and total utilization. Also referred to as search or test-time compute.
For many years, training dominated discussions about AI capabilities and thus policy. But advances in models like OpenAI’s o-series and DeepSeek’s R1 highlight that spending more compute on inference (e.g., by generating longer “chains of thought” or taking the consensus among multiple queries) can significantly enhance a model’s performance with limited or even no additional training.
2. Why Inference Compute Matters
Spending more compute—and therefore more time, energy, and money—on thinking is far from a new idea in AI; numerous technical approaches exist, first for games and now for large language models.
Scaling Performance Through Inference
Traditionally, if you wanted higher model accuracy, you paid for more training. Now, industry is showing a willingness to devote significant additional compute to inference—allowing the model to think longer—to boost accuracy and complexity of reasoning. OpenAI's technical blog post, Learning to reason with LLMs, illustrates how performance scales with both training and inference compute, stating, "o1 performance smoothly improves with both train-time and test-time compute."
For instance, DeepSeek R1 uses a previous model, V3, as the base generator and trains it further to use chain-of-thought reasoning, which can balloon to many times the length of the final output and significantly increases performance. One can also generate many responses and use a grading criterion to choose among them, or simply ask a model to iterate on its previous response. DeepSeek shows that generating multiple answers and taking a consensus still yields improvements even on their chain-of-thought model.
There are a few likely reasons why this scaling did not occur sooner and why it is now poised to continue at a dramatic pace. To understand this, it helps to consider the incentive structure surrounding frontier models.
Training compute—the capital expenditure—is a one-time, upfront investment. Inference compute—the operational expenditure—is an ongoing cost. For widely used applications (e.g., chatbots), it’s often more cost-effective to invest more heavily in training so each inference query is cheaper and faster than it would otherwise need to be to achieve the same performance. But for specialized or low-volume tasks—consider complex scientific simulations or planning—scaling up inference might make more sense economically and technically. A model with moderate training might still achieve high performance by reasoning longer.
Increasingly, major industry players are scaling inference rather than focusing solely on training, signaling a shift in priorities. This is partly because a model needs some baseline skills before it makes sense to scale inference: if a model almost never outputs anything of value, it is unlikely to find value in further outputs. There were also technical difficulties, as DeepSeek writes in their paper that they "encountered failures and setbacks along the way," describing challenges with existing approaches (i.e., PRM and MCTS). Furthermore, although inference has already been the major cost, curating high-quality base training sets at the necessary scale may be rising in comparative cost.
With the new and powerful capabilities in science, coding, and planning, there will be a market for extended reasoning. DeepSeek's paper showed that, with their training technique, the model continues to increase its reasoning time: "DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time."
3. Policy Implications
Previous research by EpochAI and others has documented direct trade-offs between training and inference compute. Inference scaling can also be done at the edge, which alters the previously centralized energy considerations.
Gap in Current Frameworks
Existing regulations, such as certain U.S. outbound investment rules and the EU AI Act, explicitly use training compute thresholds (e.g., 10^25 FLOPs). The assumption is that high I/FLOPs for training equate to high capabilities and risks. However, by not explicitly including inference compute, policymakers risk underestimating a model’s real-world impact. For example, a model like DeepSeek R1 (estimated to use less than 10^25 I/FLOPs of TC) would neither be prohibited by outbound investment rules nor be presumed to "have high impact capabilities". Models trained below current thresholds could still exceed capability cut-offs through heavy inference usage. This need not be malicious—some applications naturally require more inference compute or may be more tractable for developers with less total compute capacity.
4. A Unified Metric
A straightforward (albeit nuanced) way to factor inference compute into policy is to combine the logs of both training and inference compute:
COOM = log(Training Compute) + log(Inference Compute)
This formula builds on 2023 research by EpochAI, which found that the trade-off between inference and training compute operates at the level of orders of magnitude. That is, to achieve the same performance, saving one order of magnitude on one typically costs one to two orders of magnitude on the other. Assuming a one-for-one trade-off uses the 'safer' bound.
This treats training compute and inference compute as equally important on a logarithmic scale, reflecting how both contribute to a model’s overall capacity to perform complex tasks. By summing their orders of magnitude, we capture a measure of the total computational ability from both creating and operating the model. A multiplicative scaling factor can be included if new research suggests a different, but still logarithmic, trade-off for frontier models. If a model is trained at 10^24 I/FLOPs but requires 10^15 I/FLOPs per query, the combined value of their logarithmic sum would act as an effective measure of the total compute order of magnitude—39 in this case.
Key Challenges
Defining IC: Determining inference I/FLOPs is not straightforward. Models directly fine-tuned for chain-of-thought reasoning, or those using in-house reward models to select from multiple outputs, are relatively easy to account for, but a user can still query multiple times or ask the model to iterate. Although more research should be done to understand the marginal risk of such external inference scaling, the models may already be well optimized across most tasks of interest without requiring significant further external resources. IC can be defined as the I/FLOPs used to produce a single token, accumulated over the number of tokens used. If a chain-of-thought model uses 10^4 tokens and each token requires 10^11 I/FLOPs, the IC would be 10^15 I/FLOPs.
Trade-Off Nuances: In some contexts, more training can offset less inference, and vice versa. Future research should explore how trade-offs between training and inference compute influence model performance, especially in frontier-level models using chain-of-thought reasoning. This understanding can help regulators set thresholds that reflect real-world capabilities.
Regardless, adopting a metric that includes inference compute provides a fuller view of a model’s potential. It can help policymakers avoid underestimating capabilities and ensure regulations remain effective as the technology evolves. Therefore, we should attempt to adopt a rigorous definition for IC.
Conclusion
Focusing solely on training compute made sense when most AI systems had limited, straightforward inference phases. However, new developments in chain-of-thought reasoning and other advanced techniques show that how much a model thinks at inference time can significantly change its abilities and potential risks.
To maintain robust oversight, policymakers should consider:
Incorporating Inference Compute into regulatory thresholds.
Refining Reporting Requirements so developers disclose both training and high-volume inference scenarios.
Supporting Further Research into the trade-offs between training and inference as well as studying marginal risks, guiding evidence-based regulation.
Inference compute is already transforming how today’s AI systems operate, enabling advances in reasoning, problem-solving, and planning. By updating regulations to account for both training and inference compute, policymakers can create frameworks that reflect the realities of modern AI—ensuring governance keeps pace with innovation while safeguarding against potential risks.
Thank you for reading! Feel free to reach out with any thoughts, critiques, or additional data points on how inference compute is shaping the future of AI. Additional details and references are below!
Additional Details
I. Compute, What Is It Good For?
Ideally, regulatory frameworks would directly assess AI capabilities, but capabilities are inherently defined by an AI model’s ability to perform tasks within a given application space. For instance, one might seek to evaluate frontier models across multiple potential threat vectors—such as chemical, biological, radiological, nuclear, and cyber risks. However, such testing is highly resource-intensive and typically conducted only after a model has been classified as "frontier." As defined in the U.S. National Security Memorandum on AI, "the term "frontier model" means a general-purpose AI system near the cutting-edge of performance, as measured by widely accepted publicly available benchmarks, or similar assessments of reasoning, science, and overall capabilities" [1]. Given the breadth of both known and likely unknown application spaces, the transferability of AI models, and the possibility of adversarial misuse, relying on such benchmarks alone may be insufficient—especially as the number of frontier models grows. To better categorize and understand risks, it is essential to develop a deeper understanding of fundamental capabilities and the regulatory tools available to manage them.
It is well established that TC strongly correlates with model performance, as demonstrated by scaling laws across a variety of models [2,3,4]. Historically, TC has served as a reasonable proxy for AI capabilities due to its quantifiability, detectability, excludability, and concentrated supply chain (i.e., chip manufacturing; cf. uranium enrichment for nuclear risk or synthetic nucleic acids for biological risk) [5,6].
However, there are concerns about the longevity of TC as a regulatory metric. Emerging factors, such as decentralized training techniques, model stacking, and a broader supply chain, complicate oversight. Additionally, advancements in data efficiency and algorithmic improvements could shift scaling laws, and lower-compute, narrow models may achieve higher capabilities in specialized domains. TC is often estimated to be proportional to the product of model size and training-data size [4]. Open-source models can be incrementally fine-tuned ("stacked") to exceed their initial capabilities. But if a regulation were to limit which models could be released based on compute, then increasing a released model's compute by an order of magnitude would require a tenfold increase in training compute—something effectively intractable for most users.
There is no universal agreement on how compute should be empirically measured or regulated for a given model. Companies do not necessarily disclose or even track the exact compute used in training their models. Moreover, indirect proxies—such as hardware availability, training duration, or total energy consumption—may be insufficient or misleading. Developers could claim their compute was spread across multiple models or that large portions of compute were used inefficiently or did not contribute meaningfully to the final trained system. This could lead either to a genuine lack of knowledge that an investment, development, or use is impermissible, or to apprehension even when it is permissible. Future regulatory advancements may address some of these concerns through a combination of AI governance mechanisms (e.g., hardware-enabled governance [7]). However, many open questions remain about the feasibility and enforcement of such approaches [8].
II. Inference Compute Scaling
As described above, model performance has been shown to improve not only by increasing TC but also by leveraging IC [9,10,11]. In fact, IC played a central role in early AI systems, which often did not rely on TC in the modern sense. A notable example is IBM’s Deep Blue, which used a vast search-based inference strategy (i.e., IC) to evaluate possible chess moves and their downstream effects, relying heavily on a hard-coded human understanding of the game [12]. Google’s AlphaGo likewise emphasizes IC—specifically Monte Carlo Tree Search (MCTS)—but, in contrast, uses reinforcement learning (RL) to determine the values of moves [10].
These inference-based scaling techniques are no longer limited to traditional game-playing AI. They are now actively being used in large language models (LLMs) from OpenAI, DeepSeek, and others to improve reasoning, accuracy, and overall model performance [13,14,15]. DeepSeek’s recent research highlights some of the ongoing technical challenges associated with scaling IC through some of these techniques [14].
IC performance scaling has been shown to follow a power law [16]. This is similar to TC; however, TC and IC scaling are asymmetric (e.g., some fundamental capabilities must be established during training before inference-based improvements become effective) and not completely separable (e.g., many IC-scaling techniques still require some degree of additional training, meaning IC and TC are not fully independent). Nonetheless, direct trade-offs between TC and IC have been observed across a broad compute range [17]. The relationship between TC and IC depends on task complexity, the current levels of TC and IC, and—critically—their ratio [18]. For large models, IC per single prediction is typically estimated based on network size and is often approximately the square root of TC in magnitude [17]. Below, we provide some example numbers for models illustrating this relationship.
III. Regulatory Considerations
OpenAI’s o1 suggests that inference compute (IC) scales smoothly across orders of magnitude [9]. This scaling has led to substantial performance improvements, particularly in science, coding, and mathematics. However, it has also increased pre-mitigation capabilities in domains with potential cybersecurity and biological risks. OpenAI further reports that o1's IC implementation and post-mitigation techniques contribute to safety and alignment improvements, but goals need not be aligned with capabilities in all models. Additionally, research has identified signs of deception in o1’s hidden reasoning processes [13,19], raising questions about the potential risks of scaling inference.
Energy and monetary expenditures increase with greater IC, which is significant since inference already consumes a considerable amount of energy. For example, an estimated 60% of Google’s total machine learning (ML) energy usage across 2019, 2020, and 2021 was consumed by inference [20]. Additionally, increasing IC may facilitate the proliferation of highly capable smaller models. Given a fixed computational budget, smaller models can outperform larger models when inference techniques are used effectively [21]. This could lead to a greater distribution of high-performance models without the need for extremely large-scale training, further complicating governance and control efforts.
Performance scaling with inference compute (IC) primarily involves assembling multiple predictions to generate an improved output. For example, one can search over many strings of predictions and choose the best, or allow the model to iteratively refine its answer. However, precisely defining IC remains an open challenge. For instance, Google’s Gemini 1.5 has a maximum token output length of ~10^4, but it has processed data with a context length of up to 10^7 tokens [22]—which is likely to increase with targeted search. IC can be increased by continually feeding generated outputs back into the model, effectively extending the reasoning chain. In practice, however, performance gains from IC saturate at some point and may even degrade beyond a certain threshold due to overfitting. Further training may be required to improve verification or grading mechanisms that allow models to assess and refine their own reasoning [23]. Internal scaling (e.g., training for chain-of-thought) may be more effective at improving fundamental capabilities than external inference techniques, but this does not mean additional gains are unavailable. Given these considerations, it seems prudent to at least include the maximum reasoning tokens, as described in DeepSeek's paper. However, defining IC more rigorously remains an important area for further research. For a separate discussion on integrating inference compute into AI governance, see Lennart Heim’s blog post on the topic [24].
The examples described above that use TC thresholds are given in more detail here:
The U.S. Department of Treasury released a Final Rule which sets the TC thresholds for certain outbound investment transactions as notifiable at 10^23 computational operations and prohibited at 10^25 computational operations or 10^24 computational operations using primarily biological sequence data [25].
The EU AI Act distinguishes model risks by TC, stating that general-purpose AI models trained with more than 10^25 FLOPs "shall be presumed to have high impact capabilities" [26].
IV. Some Light Technical Details
Many techniques exist to increase inference compute (IC) and improve model performance. These approaches often involve either generating multiple independent answers and selecting the best through a grading criterion (e.g., majority voting, best-of-N) or broadly searching across possible actions, such as Monte Carlo Tree Search (MCTS), which applies Monte Carlo (random sampling) techniques to a structured search tree.
Since verifying a solution is often easier than generating one (e.g., solving Sudoku puzzles), a separate, significantly smaller model can be trained to evaluate the quality of model outputs. This outcome-supervised reward model (ORM) is used to select the best answer among multiple generated outputs—a technique commonly known as Best-of-N selection. The ORM can be trained with human feedback or hard-coded verification rules (e.g., checking whether a math problem has a correct numerical answer). The quality of the reward model imposes a bottleneck on overall performance—meaning IC scaling cannot be increased indefinitely (e.g., by generating more samples) without also improving the reward model [23].
Rather than only considering the final output, another approach involves breaking a problem down into sequential steps—a method widely used in human reasoning. Instead of producing multiple independent answers, a model can sequentially refine a single answer toward a goal. Prior work has shown that simply prompting with "Let's think step by step" can significantly improve reasoning performance in LLMs [27]. However, further improvements can be achieved by training a step-by-step grader known as a process-supervised reward model (PRM). PRMs evaluate and assign grades to each intermediate step rather than just the final output. The cumulative grade assigned by the PRM can lead to substantial improvements over ORMs, which only evaluate the final result [28]. PRMs can be used to fine-tune the base model via reinforcement learning (RL), or techniques such as beam search and lookahead search can leverage them by generating multiple step-by-step reasoning chains and keeping only the highest-graded sequences.
Stringing together step-by-step reasoning into a chain-of-thought aligns with the intuition that sequential reasoning outperforms multiple independent attempts by enabling a strategic refinement of solutions and correction of prior mistakes. Additionally, these methods allow models to dynamically allocate thinking time based on task complexity. Task difficulty can be estimated explicitly (e.g., through complexity estimation techniques) or emerge naturally as the model learns to adjust its reasoning length based on past success. DeepSeek experimented with PRMs and MCTS but ultimately found greater success with Group Relative Policy Optimization. Notably, their model naturally increased its reasoning length as it underwent further chain-of-thought training [14].
These techniques are not mutually exclusive—models can combine different IC scaling methods for greater performance in specific domains. External users may also employ these techniques outside of the original model training process, some even without fine-tuning. However, reward model quality remains the bottleneck for many IC-based performance gains. While IC can be scaled up significantly, it is possible that fundamental capability limits are still determined by TC. Yet, this is not obvious based on current findings. Regardless, carefully considering inference scaling is critical for any regulatory framework aiming to assess AI capabilities accurately.
To better understand 'mathematical operations' (i.e., I/FLOPs), see this description from the technical notes of the Framework for Artificial Intelligence Diffusion [29]:
"'Operations' refers to mathematical operations used for pre-training and any subsequent training, such as fine-tuning the pre-trained model, but does not include the collection and curation of the input training data. Training `operations' should account for operations required to perform all steps in the pre-training and subsequent training process, including those for forward and backward propagation for all relevant layers, pooling, and convolutions, regardless of the implementation and hardware limitations, and applied to all relevant operations. For example, consider a model composed of a single densely connected layer with I input neurons, O output neurons, and no biases being trained with backpropagation. Such a model would have a total of N = I * O learned parameters. Each forward pass would require N multiply accumulate operations, or (assuming floating point arithmetic) 2N FLOP. Each backward pass would require 2N multiply accumulate operations, or 4N FLOP. Then, in total, each training data point would require 6N FLOP. Training on a data set of size D would require 6ND total FLOP."
V. Some Model Numbers
| Model | Approximate Training Compute | Approximate Inference Compute per Token | Approximate Max 'Reasoning Tokens' per Query |
|-------|------------------------------|------------------------------------------|----------------------------------------------|
| GPT-1 | 10^19 | 10^8  | 0 |
| GPT-2 | 10^21 | 10^9  | 0 |
| GPT-3 | 10^23 | 10^11 | 0 |
| GPT-4 | 10^25 | 10^12 | 0 |
| DS-V3 | 10^24 | 10^11 | 0 |
| DS-R1 | 10^24 | 10^11 | 10^4 |
* All data are in FLOPs and are rough estimates, mostly from EpochAI (the reasoning-token figure is taken from DeepSeek's paper) [30]. Inference compute per token is estimated as twice the number of parameters (as described in Section IV for the forward pass). Note that R1 undergoes additional training beyond V3, but I am assuming it does not alter the OOM, which seems most likely given the updated RL data size of 800k examples versus the original 14.8 trillion training tokens. This, of course, may be incorrect.
* Some reasoning also occurs directly within the answers themselves, which can be quite long depending on the max output size; this may be considered when defining total IC. For now, I propose IC = IC per Token x Max 'Reasoning Tokens' per Query.
References
Framework To Advance AI Governance and Risk Management in National Security
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Computing Power and the Governance of Artificial Intelligence
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context