Clustering for AI: Innovating AI Training with Decentralized Compute Networks

io.net
4 min read · Nov 1, 2024

In traditional cloud computing, scaling is typically achieved by centralizing compute power in large data centers. While effective, this model has limitations, such as high costs, limited access to the latest GPUs, and centralized control, which can create bottlenecks for AI development.

@ionet takes a different approach by decentralizing AI training. Rather than relying solely on expensive, centralized infrastructure, io.net taps into decentralized clusters of idle and lower-performance GPUs. This enables a more cost-effective and scalable solution to run high-performance AI workloads. Using underused hardware distributed across a global network, io.net enables more affordable AI compute without sacrificing performance.

Unlike traditional setups that rely on centralized data centers, io.net dynamically allocates resources from a global pool of nodes, offering greater flexibility, faster scaling, and reduced latency. By integrating edge computing, io.net can process data closer to where it’s generated, further improving performance.

Managed Hosting and Legacy Data Centers

Legacy data centers often face fluctuating demand, leading to underutilized hardware. Managed hosting facilities rent out servers and bandwidth, absorbing operational burdens such as cooling and energy costs. However, many large data centers can struggle to scale efficiently.

Additionally, crypto mining facilities — especially those focused on Ethereum — have experienced dramatic shifts since Ethereum transitioned away from proof-of-work consensus in 2022. This left many mining data centers with surplus GPUs, particularly general-purpose GPUs that can be repurposed for AI workloads.

Hive, one of the largest Ethereum mining operations, redirected its resources after the Merge to focus on high-performance computing through its HIVE Performance Cloud. Based on Ethereum’s pre-Merge hash rate, an estimated 9.3 million GPUs were made redundant by the transition, a surplus that has opened new opportunities for AI training.

Unlocking Idle GPU Power

io.net helps unlock the potential of idle GPUs in legacy data centers and mining facilities by making this underutilized hardware available for AI compute. While these GPUs may not always be the latest generation, they still deliver sufficient performance for many use cases, particularly for startups, researchers, and institutions that don’t require top-tier speeds.

io.net manages a pool of nodes sourced from independent suppliers and data centers, all visible through the io.net Explorer. To ensure high-quality service, io.net uses a sophisticated resource allocation system to optimize GPU utilization based on workload demand. Additionally, reliability is built into the platform with continuous monitoring, redundancy, and a distributed failover system that reroutes tasks when a node goes offline.
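The failover behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not io.net’s actual implementation: the `Node` class, node names, and health-check logic are all hypothetical stand-ins for the real monitoring and rerouting system.

```python
class Node:
    """A hypothetical compute node in a decentralized GPU pool."""
    def __init__(self, name, online=True):
        self.name = name
        self.online = online

    def execute(self, task):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return f"{task} completed on {self.name}"

def run_with_failover(task, nodes):
    """Reroute the task to the next healthy node whenever one fails."""
    for node in nodes:
        try:
            return node.execute(task)
        except ConnectionError:
            continue  # node unreachable: fail over to the next node
    raise RuntimeError("no healthy nodes available")

pool = [Node("gpu-a", online=False), Node("gpu-b"), Node("gpu-c")]
print(run_with_failover("train-job-42", pool))  # → train-job-42 completed on gpu-b
```

The key design choice is that failure handling lives in the scheduler, not the client: a caller submits a task once, and the pool transparently absorbs individual node outages.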

By clustering lower-performance GPUs, io.net offers two significant benefits:

  • For customers, the ability to access more affordable compute resources.
  • For suppliers, the opportunity to monetize idle hardware rather than leaving it unused.

However, distributing workloads across a decentralized network of GPUs presents significant technical challenges that io.net has specifically addressed.

The Technical Challenge of Clustering GPUs

Clustering GPUs across decentralized networks requires advanced hardware and software solutions to ensure efficient task distribution and resource management. Key technologies include:

  • Interconnect Solutions: High-speed networking technologies like Ethernet and InfiniBand facilitate fast data transfer between nodes.
  • Distributed Training Protocols: Frameworks like PyTorch and TensorFlow use techniques such as data parallelism to distribute tasks across multiple GPUs. PyTorch’s DistributedDataParallel (DDP) allows each GPU to process a portion of data and sync gradients across the cluster to update the model’s weights.
  • Clustering APIs: Tools like Message Passing Interface (MPI) and Ray.io provide the orchestration, scheduling, and auto-scaling needed to manage distributed workloads. Ray, created at UC Berkeley’s RISELab and now maintained by Anyscale, is a powerful tool for parallelizing Python code across multiple GPUs. io.net supports all major machine learning frameworks, including PyTorch, TensorFlow, and Predibase, making it a versatile platform for a wide range of AI projects.
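The gradient synchronization that DDP performs can be illustrated with a toy example in plain Python. Each simulated “GPU” computes a gradient on its own data shard, and an all-reduce averages those gradients so every replica applies the identical update. The one-parameter model and numbers here are illustrative, not PyTorch’s actual API:

```python
# Toy data parallelism: each "GPU" computes a gradient on its shard of the
# batch, then an all-reduce averages gradients so every replica applies the
# same update -- the core idea behind PyTorch's DistributedDataParallel.

def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Average gradients across workers (what NCCL/Gloo do over the network)."""
    return sum(grads) / len(grads)

w = 0.0                                   # identical initial weight on every replica
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # data from y = 2x
shards = [batch[:2], batch[2:]]           # split the batch across 2 "GPUs"

for step in range(50):
    grads = [local_gradient(w, s) for s in shards]  # runs in parallel in real DDP
    w -= 0.05 * all_reduce_mean(grads)              # synchronized SGD step

print(round(w, 2))  # converges toward 2.0
```

Because every replica starts from the same weights and applies the same averaged gradient, the replicas never drift apart, which is exactly why DDP only needs to communicate gradients, not full model states.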

io.net’s Infrastructure

io.net is built on Ray.io, the distributed computing framework OpenAI used to train GPT-3 on roughly 2,000 CPUs and 10,000 GPUs. This infrastructure allows io.net to distribute AI workloads — including reinforcement learning, deep learning, hyperparameter tuning, and model serving — across a global grid of GPUs.

To maintain high service levels, io.net employs advanced resource allocation that maximizes GPU usage across its decentralized pool. Continuous monitoring and built-in redundancy ensure that tasks are seamlessly rerouted in the event of node failure, minimizing downtime. Through the io.net Explorer, users can track and explore available resources, giving them full transparency and control over the decentralized infrastructure.

Written by io.net

io.net is a decentralized computing network that aggregates GPUs from various sources, creating a decentralized physical infrastructure network.
