The Dangers of Centralized Compute for AI Development
By Daniel Keller, CEO & Co-founder of InFlux Technologies
Introduction
AI development is powered by compute supplied through massive data centers that house circuit boards with millions of graphics processing units (GPUs). These GPUs execute extensive AI workloads for numerous models across countless applications.
A single entity typically centralizes the computing resources supplied through these data centers, physically locating the data center in one place and controlling all the processing power it provides.
Massive corporations such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, collectively referred to as the “big three” in cloud computing, have oligopolized the compute market for AI workloads through centralized data centers, leaving consumers hard-pressed to find a cloud solution outside of these three major providers.
Exceptionally High Demand for AI
Considering how expansive and well-established the big three’s infrastructure is, it is no surprise that the result is vendor lock-in: customers rely on a single vendor and find it too costly to switch.
At the same time, the exceptionally high demand for AI, and the even greater demand for the resources that power its development, has turned compute provisioning into a landscape dominated by hyperscalers: cloud providers that operate massive data centers and market their capacity as “scalable and flexible.”
An Inherent Lack of Scalability and Flexibility
Scalability: Demand Changes
Centralized data center ownership is convenient and produces reliable enough compute for most applications. In reality, though, the resources supplied by massive data centers are not optimized to scale or to be flexible.
Scalability should be measured against real-time demand fluctuations: if an app’s traffic rises, compute resources should scale dynamically to meet the increased demand. Centralized compute is instead provisioned against estimated app needs, leaving resources static and unable to track real-time demand changes.
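For intuition, here is a minimal Python sketch of that difference: static provisioning fixes capacity to an up-front estimate, while demand-driven provisioning tracks real-time load. The numbers and the scaling rule are illustrative assumptions, not any provider’s actual autoscaling policy.

```python
import math

# Illustrative assumptions, not any provider's real policy.
ESTIMATED_PEAK_GPUS = 100   # capacity reserved up front (static model)
REQUESTS_PER_GPU = 10       # requests/sec one GPU is assumed to serve

def static_allocation(demand_rps: float) -> int:
    """Capacity stays fixed at the original estimate, whatever demand does."""
    return ESTIMATED_PEAK_GPUS

def dynamic_allocation(demand_rps: float) -> int:
    """Capacity tracks real-time demand, rounded up to whole GPUs."""
    return math.ceil(demand_rps / REQUESTS_PER_GPU)

for demand in (200, 800, 1500):   # requests/sec at different points in the day
    print(f"{demand:>5} rps -> static: {static_allocation(demand)} GPUs, "
          f"dynamic: {dynamic_allocation(demand)} GPUs")
```

At low traffic the static allocation wastes capacity; during a spike it falls short. Demand-driven provisioning avoids both failure modes.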
Connections Slow Over Long Distances
Furthermore, the centralized location of data centers limits scalability because users in regions far from the data center may experience latency as connections slow over long distances and encounter geographical barriers.
Data centers concentrate large numbers of GPUs, which creates surplus compute and drives operators to over-provision resources; as a result, apps receive more processing power than they need, and providers waste the excess compute.
Flexibility: Pricing & Data Transmission
Despite the marketing, centralized compute provision is not especially flexible either. Flexibility is closely tied to two critical areas of cloud computing: pricing and data transmission.
Pricing needs to be proportional to the computing resources provided. For example, hyperscalers claim they provide dedicated compute (stable, fast, and resilient connections), which justifies the high price points of their cloud services. In practice, though, their static provisioning, sized to estimated app needs rather than scaled to real-time demand, results in delayed resource scaling and increased network congestion.
Customers End Up Paying Exorbitant Prices
As a result, customers end up paying exorbitant prices for unpredictable resources. It’s important to note that pricing for centralized compute provision is purposefully structured to deter customers from switching providers.
Data transmission, meanwhile, is essential for AI development; high-performance models with billions of training parameters require seamless data access to generate outputs efficiently. Centralized data centers controlled by a single company create storage silos where app data lives in isolated environments.
The big three charge customers aggressive fees to export their app data and migrate it to another environment. These fees, often larger than a monthly cloud bill, are unaffordable for many customers, which further entrenches vendor lock-in and leads to data stagnation by restricting seamless transmission. This is disastrous for AI development, as training AI solutions such as Large Language Models (LLMs) relies on the mass collection, transmission, and inference of input data.
AI Models Require Constant Uptime
Volatile Uptime
AI models also require constant uptime, something that centralized compute provision through massive data centers cannot achieve. The big three claim to meet Service Level Agreements (SLAs) with 99.9% uptime, the average industry benchmark for cloud networks.
Uptime benchmarks budget for some annual downtime because network outages can still occur regardless of infrastructure performance; uptime simply measures when computing resources are live versus when they are not.
The 99.9% uptime SLA benchmark allows for roughly a minute and a half of downtime per day, equating to an annual downtime budget of just under 9 hours. If a provider’s network exceeds that budget, the provider fails to meet the uptime requirements outlined in its SLA.
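The arithmetic behind that budget is easy to check; the short Python sketch below assumes a 365-day year.

```python
# Downtime budget implied by a 99.9% uptime SLA (assuming a 365-day year).
ALLOWED_DOWNTIME = 1 - 0.999                     # 0.1% of the time

daily_seconds = 24 * 60 * 60 * ALLOWED_DOWNTIME  # ~86.4 seconds per day
annual_hours = 365 * 24 * ALLOWED_DOWNTIME       # ~8.76 hours per year

print(f"Daily budget:  {daily_seconds:.0f} s (about 1 min 26 s)")
print(f"Annual budget: {annual_hours:.2f} h (about 8 h 46 min)")
```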
The recent AWS outage lasted nearly 15 hours, highlighting the vulnerability of centralized compute provision through data centers. A single random outage caused the AWS network to experience more downtime in less than a day than the 99.9% uptime SLA benchmark allows in one year.
Centralized Data Centers Catalyze Network Outages
Additionally, the fixed locations of centralized data centers catalyze network outages because they are literal single points of failure. There is an inherent lack of redundancy in a network that utilizes centralized servers to route traffic and maintain uptime, because if the central server fails, the entire network collapses.
The big three claim that fault tolerance is a core design principle of their data centers, and that redundancy is built in through strategies such as physically distributing data center locations. However, distributed hardware is not the same as decentralized hardware: it remains under the control and ownership of a single company, which still constitutes a single point of failure.
Therefore, even with increased redundancy from data center distribution, network outages can still occur due to the company’s negligence or mismanagement.
AI Training and Uptime
To extend an AI model beyond what it learned in training, it can be connected to an external database outside its training environment, which it references to gain additional context. This is achieved through a Retrieval-Augmented Generation (RAG) framework that combines two components: a retrieval system that accesses externally hosted search indexes, and a generative LLM.
Consistent Uptime and High Availability
When a user submits a query, the retrieval system fetches the referenced data, and the LLM distills and processes it to generate an output. Consistent uptime and high availability are operationally critical for both the retriever and the LLM components of a RAG framework because, without them, the entire system fails.
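As a rough illustration, the flow looks something like the Python sketch below. The `retrieve` and `generate` callables are hypothetical stand-ins for a search-index client and an LLM call, not a specific vendor API.

```python
from typing import Callable, List

def rag_answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # client for an externally hosted search index
    generate: Callable[[str], str],             # wrapper around an LLM completion call
    top_k: int = 3,
) -> str:
    # 1. Retrieval: fetch the documents most relevant to the query.
    #    If the external index is unreachable (e.g. during a provider outage),
    #    this step fails and the whole pipeline fails with it.
    documents = retrieve(query, top_k)

    # 2. Generation: the LLM distills the retrieved context into an output.
    context = "\n".join(documents)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```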
A network outage for a RAG framework powered by centralized compute means the retrieval system cannot access the search index to fetch relevant, referenced data. Even if the index has a local cache, queries cannot refresh it to reach the most up-to-date information.
Additionally, a network outage prevents the LLM from training on new external datasets, leaving its parameters stale and reducing output accuracy. The longer the downtime, the greater the impact on model training.
Solution: Federated Learning
So, what can we do? AI models need to stop relying on centralized computing provided by hyperscalers to power their development and instead adopt federated learning.
Federated learning is a training structure for AI, particularly well suited to IoT networks, in which a shared model is distributed across multiple decentralized hardware devices. An independent party operates each device, which serves as its own local training environment, with model input and output data stored exclusively on that device.
A central consortium server, set up collaboratively by the independent device operators, acts as an orchestrator between the devices. It periodically collects model updates from the devices, aggregates them into an improved shared model, and sends that update back to the devices. The orchestrator never shares raw training data across devices and centralizes device communication only to exchange model updates.
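As a rough sketch of that orchestration pattern, here is a toy federated-averaging (FedAvg-style) round in Python. The devices’ data, the “model,” and the local training step are illustrative assumptions, not any particular production implementation.

```python
import numpy as np

def local_update(global_weights: np.ndarray, local_data: np.ndarray,
                 lr: float = 0.1) -> np.ndarray:
    """Each device trains on data that never leaves the device.
    The 'training step' here is a toy gradient step toward the local data mean."""
    gradient = global_weights - local_data.mean(axis=0)
    return global_weights - lr * gradient

def federated_round(global_weights, device_datasets):
    """The orchestrator aggregates only model updates, never raw data."""
    updates = [local_update(global_weights, data) for data in device_datasets]
    sizes = [len(data) for data in device_datasets]
    # Weighted average of device updates, proportional to local dataset size.
    return np.average(updates, axis=0, weights=sizes)

# Three independent devices, each holding private local data.
rng = np.random.default_rng(0)
devices = [rng.normal(loc=i, size=(50, 4)) for i in range(3)]

weights = np.zeros(4)
for _ in range(10):          # ten communication rounds
    weights = federated_round(weights, devices)
print(weights)
```

Only the aggregated weights move between the orchestrator and the devices; the raw `devices` arrays never leave the place they were generated.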
Federated Learning Can Mitigate the Dangers
By spreading training across multiple devices, federated learning can mitigate the dangers that centralized computing poses to AI development by relocating resources to the edge of the network, closer to the source of a model or application, rather than routing all activity through a single, centralized cluster. Federated learning removes the single point of failure by utilizing distributed hardware with decentralized control, improving both uptime and redundancy.
Conclusion
Centralized compute supplied through hyperscale data centers powers most AI workloads today, but it carries an inherent fragility that makes long-term development unsustainable. When the big three control most of the infrastructure and computing resources needed to run AI models, pricing becomes unaffordable and data transmission stagnates. And because centralized providers are single points of failure, the uptime needed for consistent AI development is unpredictable: network outages, although rare, can severely disrupt innovation.
Federated learning provides a clear path forward: shifting away from centralized compute provision toward distributed hardware with decentralized control. With centrally owned and operated infrastructure routing and storing most contemporary internet traffic and application data, computing networks must adapt, cutting wasted compute and offering more cost-effective pricing, if AI development is to keep growing at its current rate.

Author’s Bio: Daniel Keller, CEO & Co-founder of InFlux Technologies
Daniel pairs charismatic leadership with a forward-thinking approach to world-changing innovations, and he passionately advocates for disruptive technology. With over 25 years of experience spanning the tech, healthcare, and non-profit sectors, he brings a deep understanding of Web3 and decentralized technologies and their potential to transform industries and empower individuals.
As an entrepreneur, investor, and visionary in transformative technology, Daniel holds a degree in Information Technology from Capella University, alongside certifications from Penn State University and Harvard Business School. His commitment to decentralization fuels his mission to build a truly free internet that is “for the people, by the people.”
Daniel co-founded the company in 2018, originally under the name Zelcore Technologies. Flux is powered by a globally distributed network of user-operated, scalable computational nodes that redefines how the internet operates.
