For years, the most powerful artificial intelligence systems have been trained behind closed doors: inside massive data centers owned by a select few technology giants. These facilities concentrate thousands of GPUs in a single location, connected by ultra-fast internal networks that allow models to be trained as one tightly synchronized system.
This setup has long been treated as a technical necessity. However, it is increasingly clear that centralized data centers are not only expensive and risky but also approaching physical limits. Large language models are growing rapidly, and systems trained only months earlier quickly become outdated, pushing each new training cycle toward significantly larger scales. The question is no longer only about the concentration of power, but about whether centralized infrastructure can physically scale fast enough to keep up.
Today's frontier models already consume the full capacity of top-tier data centers. Training a meaningfully larger model increasingly requires building an entirely new facility or fundamentally upgrading an existing one, at a time when co-located data centers are approaching limits on how much energy can be concentrated in a single location. Much of that energy is spent not only on the raw silicon but on the cooling systems required to keep it operational. As a result, the ability to train frontier AI models remains concentrated among a handful of companies, primarily in the United States and China.
This concentration has consequences far beyond engineering. Access to AI capabilities is shaped by geopolitics, export controls, energy constraints, and corporate priorities. As AI becomes foundational to economic productivity, scientific research, and national competitiveness, reliance on a small number of centralized hubs turns infrastructure decisions into strategic vulnerabilities.
What if this concentration is not inevitable, but instead a side effect of the algorithms we use to train AI?
Modern AI models are simply too large to be trained on a single machine. Foundation models with billions of parameters require many GPUs working in parallel, synchronizing their work after extremely small increments of progress, often every few seconds, millions of times over the course of training.
The industry's default solution has been centralized, co-located training: thousands of GPUs placed together in purpose-built data centers, connected by specialized networking hardware capable of transferring data at extreme speeds. These networks allow every processor to constantly synchronize with the others, ensuring that each copy of the model remains perfectly aligned during training.
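To make that synchronization pattern concrete, here is a minimal sketch in Python of synchronous data-parallel training. It is illustrative only: the worker count, the toy quadratic loss, and the plain-Python all-reduce loop are assumptions made for the sketch, standing in for the collective-communication libraries (such as NCCL) that real clusters use.

    import numpy as np

    # Minimal sketch of synchronous data-parallel training (illustrative only).
    # Each "worker" holds an identical copy of the weights and a different
    # shard of the data; after every step, gradients are all-reduced (averaged)
    # so every copy of the model stays exactly aligned.

    rng = np.random.default_rng(0)
    n_workers, dim, lr = 4, 8, 0.1

    weights = rng.normal(size=dim)                   # one shared starting point
    replicas = [weights.copy() for _ in range(n_workers)]
    shards = [rng.normal(size=(16, dim)) for _ in range(n_workers)]

    for step in range(100):
        # 1. Each worker computes a local gradient on its own data shard
        #    (here: gradient of a toy quadratic loss ||X w||^2 / 2).
        grads = [shard.T @ (shard @ w) / len(shard)
                 for shard, w in zip(shards, replicas)]

        # 2. All-reduce: average the gradients across all workers. Inside a
        #    data center, this is the step the ultra-fast interconnect serves.
        avg_grad = sum(grads) / n_workers

        # 3. Every replica applies the identical averaged update, which is
        #    why all copies of the model remain perfectly in sync.
        replicas = [w - lr * avg_grad for w in replicas]

    # All replicas are still identical after training.
    assert all(np.allclose(replicas[0], w) for w in replicas)

The averaging step in the middle repeats at every training step, which is why the speed of the link between workers, not the speed of the workers themselves, sets the pace of the whole system.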
This approach works well, but only under very specific conditions. It assumes ultra-fast internal networks, physical proximity between machines, reliable energy supply, and centralized operational control. Once training needs to scale beyond a single facility, across cities, countries, or continents, the system begins to break down.
Standard internet connections are orders of magnitude slower than the specialized links inside data centers. As a result, powerful GPUs spend most of their time stalled, waiting for synchronization rather than doing useful work. In practice, this doesn't merely make training slower; it makes it infeasible. Estimates suggest that attempting to train modern large language models over standard internet links would stretch training timelines from months into centuries, which is why such setups are rarely even attempted.
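A rough back-of-envelope calculation shows the scale of the gap. Every number below is an illustrative assumption, not a measurement: a hypothetical 70-billion-parameter model with 16-bit gradients, a 400 GB/s-class data center fabric, and a good 1 Gbit/s internet link.

    # Back-of-envelope comparison; all figures are illustrative assumptions.
    PARAMS = 70e9               # hypothetical 70B-parameter model
    BYTES_PER_GRAD = 2          # 16-bit gradients
    payload_gbits = PARAMS * BYTES_PER_GRAD * 8 / 1e9   # ~1,120 gigabits per full sync

    datacenter_gbps = 400 * 8   # ~400 GB/s-class fabric, in gigabits per second
    internet_gbps = 1           # typical fast consumer internet link

    print(f"data center: {payload_gbits / datacenter_gbps:.2f} s per sync")      # ~0.35 s
    print(f"internet:    {payload_gbits / internet_gbps / 60:.0f} min per sync") # ~19 min

    # A roughly 3,000x slowdown on a step that repeats millions of times is
    # what stretches months of training into centuries.

Under these assumptions, a synchronization that takes a fraction of a second inside a data center takes on the order of twenty minutes over the internet, and that step has to happen millions of times.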
Over time, this technical constraint has shaped the entire AI ecosystem. Only organizations with access to massive capital and privileged infrastructure can afford to train large-scale models. Researchers, startups, and institutions outside these hubs are effectively locked out, not for lack of expertise, but because the training process itself is designed for centralization.