A decentralized cross-device model training system with model and tensor parallelism to reduce compute needed to train large models.
Arceus is a decentralized, cross-device model training system that combines model and tensor parallelism to cut the compute required to train large-scale machine learning models.
Training large language models and other large neural networks requires massive computational resources, often costing millions of dollars and putting it out of reach of all but the largest tech companies. Existing distributed training solutions are complex, expensive, and typically require homogeneous hardware clusters.
Unlike traditional centralized training systems, Arceus takes a decentralized approach:
```python
class HybridParallelTrainer:
    def __init__(self, model, devices):
        self.devices = devices
        # Split the model into sequential partitions (model parallelism) and
        # its tensors into shards (tensor parallelism), then map partitions
        # onto the discovered device topology.
        self.model_partitions = self.partition_model(model)
        self.tensor_partitions = self.partition_tensors()
        self.device_topology = self.analyze_topology(devices)

    def distributed_forward(self, batch):
        # Model parallelism: each partition runs on its assigned device and
        # hands its output to the next partition in the pipeline.
        for partition_id, partition in enumerate(self.model_partitions):
            device = self.device_topology[partition_id]
            output = device.forward(partition, batch)
            batch = self.communicate_gradients(output, partition_id)
        return self.aggregate_outputs()

    def tensor_parallel_backward(self, gradients):
        # Tensor parallelism: split the gradients across the available devices,
        # compute each device's update locally, then merge the results.
        split_grads = self.split_tensors(gradients)
        parallel_updates = []
        for device, grad_chunk in zip(self.devices, split_grads):
            update = device.compute_update(grad_chunk)
            parallel_updates.append(update)
        return self.merge_updates(parallel_updates)
```
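To make the tensor-parallel update path concrete, here is a minimal standalone sketch of the `split_tensors` → `compute_update` → `merge_updates` flow in plain PyTorch. It is illustrative only: the function names mirror the methods above, but their bodies are assumptions, and in Arceus each shard would live on a separate peer rather than in a single process.

```python
# Minimal sketch of a tensor-parallel update, assuming plain PyTorch on one machine.
import torch

def split_tensors(tensor: torch.Tensor, num_shards: int) -> list[torch.Tensor]:
    # Shard the tensor along its first dimension, one chunk per device.
    return list(torch.chunk(tensor, num_shards, dim=0))

def compute_update(param_shard: torch.Tensor, grad_shard: torch.Tensor, lr: float = 0.01) -> torch.Tensor:
    # Each device applies a local SGD step to its own shard.
    return param_shard - lr * grad_shard

def merge_updates(updates: list[torch.Tensor]) -> torch.Tensor:
    # Reassemble the sharded parameter after the parallel updates.
    return torch.cat(updates, dim=0)

weight = torch.randn(8, 4)
grad = torch.randn(8, 4)
weight_shards = split_tensors(weight, num_shards=2)
grad_shards = split_tensors(grad, num_shards=2)
new_weight = merge_updates([compute_update(p, g) for p, g in zip(weight_shards, grad_shards)])
assert new_weight.shape == weight.shape
```

In the real system the per-shard updates would run concurrently on different peers and the merge would happen over the network rather than in one process.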
Device Network Topology
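The topology diagram is not reproduced here. As a rough, hypothetical illustration of the kind of information `analyze_topology` could work from, the sketch below represents peers by their memory and link bandwidth and orders them for partition placement; the `Peer` fields and values are made up for illustration and are not part of the Arceus API.

```python
# Hypothetical topology map: peers annotated with capacity and link speed,
# ranked so that heavier model partitions land on the most capable devices.
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    memory_gb: float
    link_gbps: float  # bandwidth to the coordinating node (illustrative)

peers = [
    Peer("workstation-gpu", memory_gb=24.0, link_gbps=10.0),
    Peer("laptop-gpu", memory_gb=8.0, link_gbps=1.0),
    Peer("edge-box", memory_gb=4.0, link_gbps=0.5),
]

placement_order = sorted(peers, key=lambda p: (p.memory_gb, p.link_gbps), reverse=True)
print([p.name for p in placement_order])
```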
| Model Size | Traditional Hardware | Arceus Hardware | Speedup |
|---|---|---|---|
| 1B params | 8x V100 GPUs | 16x consumer GPUs | 2.3x faster |
| 7B params | 64x A100 GPUs | 32x mixed hardware | 1.8x faster |
| 13B params | 128x H100 GPUs | 64x edge devices | 1.5x faster |
Cost Comparison
The system consists of several key components, which are exercised through a simple high-level API:
```python
# Example usage
from arceus import DistributedTrainer

trainer = DistributedTrainer(
    model=my_transformer_model,
    discovery_method="auto",
    partition_strategy="hybrid",
    fault_tolerance=True,
)

# Automatically discovers and coordinates with nearby devices
trainer.discover_peers()

# Starts distributed training
trainer.train(
    dataset=my_dataset,
    epochs=10,
    checkpoint_interval=100,
)
```
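As a point of reference for what `checkpoint_interval=100` and `fault_tolerance=True` imply, below is a generic checkpoint save/restore pattern in plain PyTorch. It is not the Arceus API; the function names and the `checkpoint.pt` path are illustrative assumptions.

```python
# Generic checkpointing sketch (plain PyTorch, not Arceus internals).
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist enough state to resume training if a peer drops out mid-run.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Restore model/optimizer state and return the step to resume from.
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```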
Get started with the codebase and join our community of distributed ML researchers.