Abstract
Multi-Agent Reinforcement Learning (MARL) has emerged as a key
paradigm for solving complex real-world problems involving multiple
agents interacting in dynamic environments. However, training MARL
models, especially for cooperative reasoning tasks, remains
computationally intensive and sample-inefficient due to nonstationarity, credit assignment, and policy coupling issues.
Conventional policy gradient methods struggle with convergence and
scalability in multi-agent settings. Centralized training frameworks
suffer from bottlenecks and synchronization overheads. Evolutionary
algorithms, while more robust to non-differentiable objectives, are
often too slow when applied in single-node environments. To address
these challenges, we propose Distributed Co-evolutionary Policy
Optimization (DCPO), a hybrid learning framework that distributes
evolutionary computation across multiple nodes. DCPO decomposes the global policy search into parallel, sub-population-based explorations, with each node evolving a subset of agent policies through fitness-driven mutation, crossover, and local policy-gradient updates. A global coordinator periodically aggregates the top-performing policies to ensure cooperative learning convergence. DCPO was evaluated on
standard cooperative MARL benchmarks such as StarCraft II
Micromanagement and Multi-Agent Particle Environments (MPE).
Compared to traditional baselines such as MADDPG, QMIX, MAPPO,
COMA, and EPOpt, DCPO showed up to 37% faster convergence, 25% higher final cumulative rewards, and improved generalization to unseen environments.
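Since the abstract describes the training loop only at a high level, the following is a minimal, single-process Python sketch of how such a distributed co-evolutionary cycle could be organized: per-node sub-populations evolved with fitness-driven selection, crossover, mutation, and a local gradient refinement, plus a periodic elite exchange standing in for the global coordinator. All names (evolve_subpopulation, fitness_fn, local_gradient_step), the toy fitness function, and the hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal single-process sketch of the co-evolutionary loop described above.
# All function names, the toy fitness function, and the hyperparameters are
# illustrative assumptions, not the DCPO implementation from the paper.
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS = 4     # agents whose policies are co-evolved
POLICY_DIM = 8   # parameters per agent policy (toy linear policy)
POP_SIZE = 16    # individuals per node's sub-population
N_NODES = 3      # "nodes": simulated here as sequential loops
ELITE_K = 2      # top policies shared through the global coordinator

def fitness_fn(joint_policy):
    """Toy cooperative objective: higher when agents' parameters agree and
    stay close to a hidden target vector (stands in for episode return)."""
    target = np.ones(POLICY_DIM)
    closeness = -np.mean((joint_policy - target) ** 2)
    agreement = -np.var(joint_policy, axis=0).mean()
    return closeness + agreement

def mutate(policy, sigma=0.1):
    return policy + sigma * rng.standard_normal(policy.shape)

def crossover(p1, p2):
    mask = rng.random(p1.shape) < 0.5
    return np.where(mask, p1, p2)

def local_gradient_step(policy, lr=0.05, eps=1e-3):
    """Finite-difference surrogate for the local policy-gradient update."""
    grad = np.zeros_like(policy)
    base = fitness_fn(policy)
    flat, g = policy.ravel(), grad.ravel()
    for i in range(flat.size):
        bumped = flat.copy()
        bumped[i] += eps
        g[i] = (fitness_fn(bumped.reshape(policy.shape)) - base) / eps
    return policy + lr * grad

def evolve_subpopulation(pop):
    """One generation on one node: fitness-driven selection, crossover,
    mutation, and a local gradient refinement of each offspring."""
    scored = sorted(pop, key=fitness_fn, reverse=True)
    elites = scored[:ELITE_K]
    offspring = list(elites)
    while len(offspring) < POP_SIZE:
        i, j = rng.choice(len(elites), 2)
        child = mutate(crossover(elites[i], elites[j]))
        offspring.append(local_gradient_step(child))
    return offspring, elites

# Each node holds a sub-population of joint policies (N_AGENTS x POLICY_DIM).
nodes = [[rng.standard_normal((N_AGENTS, POLICY_DIM)) for _ in range(POP_SIZE)]
         for _ in range(N_NODES)]

for generation in range(20):
    all_elites = []
    for n in range(N_NODES):
        nodes[n], elites = evolve_subpopulation(nodes[n])
        all_elites.extend(elites)
    # Global coordinator: broadcast the overall best policies back to every node.
    all_elites.sort(key=fitness_fn, reverse=True)
    for n in range(N_NODES):
        nodes[n][-ELITE_K:] = [e.copy() for e in all_elites[:ELITE_K]]

best = max((p for pop in nodes for p in pop), key=fitness_fn)
print("best fitness after 20 generations:", round(fitness_fn(best), 4))
```

In this sketch the "nodes" run sequentially for simplicity; in a distributed deployment each sub-population loop would run on its own worker, with only the elite policies exchanged through the coordinator.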
Authors
A. Rajavel
Kamaraj College of Engineering and Technology, India
Keywords
Multi-Agent Reinforcement Learning, Evolutionary Algorithms, Distributed Learning, Policy Optimization, Cooperative Reasoning