In large, distributed systems composed of adaptive and interactive components (agents), ensuring coordination among agents so that the system achieves certain performance objectives is a challenging proposition. The key difficulty to overcome is one of credit assignment: how to apportion credit (or blame) to a particular agent based on the performance of the entire system. This problem is prevalent in many domains, including air- or ground-traffic control, multi-robot coordination, sensor networks and smart power grids.1,2 Here, we provide a general approach to coordinating learning agents and present examples from the multi-robot-coordination domain.3–5
Many complex exploration domains (e.g., planetary exploration, search and rescue) require the use of autonomous robots. In addition, multi-robot teams offer distinct advantages in efficiency and robustness over a single robot. However, these potential gains come at a cost. One must ensure that the robots do not work at cross purposes and that their efforts support a common, system-level objective.
Directly extending single-robot approaches to multi-robot systems presents difficulties, because the learning problem is no longer the same. The robots must learn not only ‘good’ actions but also tasks that complement one another in a constantly changing environment. Approaches that are particularly well suited to multi-robot systems include Markov decision processes for online mechanism design,6 development of new reinforcement-learning-based algorithms,7–10 and domain-based evolution.11 In addition, forming coalitions to reduce search costs,12 employing multilevel learning architectures for coalition formation13 and market-based approaches14 have been examined. Finally, in problems with limited or no communication, devising agent-specific objective functions that implicitly include coordination components has proved very successful.3,4
Here, we summarise recent advances in developing such agent-specific objective functions. Given some system-level objective function (e.g., number of areas explored), we aim to derive an individual objective function for each agent such that when the agents achieve their own objectives, the system objective is also achieved. For some system-level objective G(z), given as a function of the full system state z, consider the agent-specific objective function for agent i:
$D_i(z) = G(z) - G(z_{-i})$ (1)
where $z_{-i}$ is the counterfactual state that does not depend on agent i's state. (In some systems it may not be practical to entirely remove an agent, in which case agent i's part of the counterfactual state is set to an ‘expected’ action.)
This agent objective provides two benefits. First, each agent can ascertain the impact its actions have on the system as a whole, because the difference between the actual and counterfactual worlds removes many terms that do not depend on agent i. Accordingly, these agent objectives have been called ‘difference objectives.’2,4 Second, because the counterfactual term does not depend on the state of agent i, D_i and G have the same derivative with respect to changes in agent i's state. Intuitively, this means that an action that is beneficial for agent i is also beneficial for the system, although the agent does not explicitly need to know this.
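The second benefit can be written out in one line. The following is a sketch that takes the difference objective to be D_i(z) = G(z) - G(z_{-i}) and assumes G is differentiable; the symbol z_i, denoting the part of the system state controlled by agent i, is notation introduced here for illustration.

```latex
\frac{\partial D_i(z)}{\partial z_i}
  = \frac{\partial G(z)}{\partial z_i}
  - \frac{\partial G(z_{-i})}{\partial z_i}
  = \frac{\partial G(z)}{\partial z_i},
\qquad \text{since } \frac{\partial G(z_{-i})}{\partial z_i} = 0 .
```

The counterfactual term vanishes because, by construction, z_{-i} does not change when agent i changes its state.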
This approach has been successfully applied to the multi-robot-coordination domain. In this formulation, multiple robots are required to explore an environment where different points of interest have different values. The system objective is to maximize aggregate information collection. The robots can observe the points of interest and each other, but they do not communicate. Instead, coordination is promoted through the use of the difference objective of Equation 1.
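To make the formulation concrete, the following minimal sketch computes a system objective and the corresponding difference objective for a toy version of this domain. It assumes the difference objective takes the form D_i(z) = G(z) - G(z_{-i}), with the counterfactual formed by removing robot i. The point-of-interest values, the robot names and the rule that a point counts once if at least one robot observes it are all illustrative assumptions, not the model used in the experiments.

```python
from typing import Dict, Optional

# Hypothetical toy model: each robot observes one point of interest (POI) or none.
State = Dict[str, Optional[str]]              # robot name -> POI name (None = idle)
POI_VALUES = {"a": 5.0, "b": 3.0, "c": 1.0}   # illustrative values (assumed)


def system_objective(z: State) -> float:
    """G(z): total value of the POIs observed by at least one robot."""
    observed = {poi for poi in z.values() if poi is not None}
    return sum(POI_VALUES[poi] for poi in observed)


def difference_objective(z: State, agent: str) -> float:
    """D_i(z) = G(z) - G(z_-i), where z_-i is the state with the agent removed."""
    z_minus_i = {name: poi for name, poi in z.items() if name != agent}
    return system_objective(z) - system_objective(z_minus_i)


# Two robots crowd POI "a" while a third observes POI "b".
z = {"r1": "a", "r2": "a", "r3": "b"}
scores = {name: difference_objective(z, name) for name in z}
print(system_objective(z), scores)
```

In this toy state, the two robots sharing a point each receive a difference objective of zero, since removing either one leaves the system objective unchanged, while the third robot is credited with the full value of the point it alone observes.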
When robots use the system-level objective directly, learning is extremely slow (see Figure 1). This is because all agents receive the same information, making it difficult for them to determine which of their actions is beneficial. Using only local information, on the other hand, leads to agents competing for the points of interest, rather than cooperating. The difference objective, however, provides a signal that is both aligned with the system objective and sensitive to the agent's actions. Therefore, it leads to the agents quickly learning the correct actions and coordinating successfully.
Figure 1. System objective, G(z), as a function of the number of training episodes for an environment with 10 robots and 40 points of interest. The robots are trained with the system-level and difference objectives, as well as with a local or selfish goal.
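The contrast between the three learning signals can be illustrated with a small worked example. The sketch below is not the experimental setup behind the figure: the point values, the robot names and the definition of the selfish objective (the value of whichever point the robot itself observes) are assumptions chosen for illustration, and the difference objective is taken to be G(z) - G(z_{-i}) with robot i removed.

```python
POI_VALUES = {"a": 5.0, "b": 3.0, "c": 1.0}  # illustrative values (assumed)


def G(z):
    """System objective: each observed POI counts once, however many robots see it."""
    return sum(POI_VALUES[poi] for poi in set(z.values()))


def local(z, i):
    """Selfish objective (assumed form): value of the POI robot i observes."""
    return POI_VALUES[z[i]]


def D(z, i):
    """Difference objective: G with robot i present minus G with robot i removed."""
    z_minus_i = {k: v for k, v in z.items() if k != i}
    return G(z) - G(z_minus_i)


# Robot r1 leaves the crowded POI "a" for the unobserved, low-value POI "c".
before = {"r1": "a", "r2": "a", "r3": "b"}
after = dict(before, r1="c")

delta_G = G(after) - G(before)                        # change in the system objective
delta_local = local(after, "r1") - local(before, "r1")  # change in r1's selfish signal
delta_D = D(after, "r1") - D(before, "r1")            # change in r1's difference signal
print(delta_G, delta_local, delta_D)
```

Here the move raises the system objective by 1, the selfish signal falls by 4 and so discourages the move, and the difference signal rises by exactly the same amount as the system objective. The system objective itself also rises, but every robot receives that same value regardless of which one acted.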
Figure 2 shows a domain where tighter coordination is required. Points of interest provide higher values when observed by exactly two robots of different types (observations by one or more than two robots yield lower values), and the points of interest appear and disappear during the exploration stage. This domain severely tests the robots' coordination, since ‘incidental’ coordination is not sufficient to achieve good behaviour. The results show that the benefits of the difference objective are significantly more pronounced here and that the agents form stable partnerships, even though they do not communicate. Using the system-level objective, the agents struggle to do better than if they made random decisions. As expected, using a selfish objective produces entirely inappropriate behaviour, resulting in almost no system benefit at all (a behaviour similar to the tragedy of the commons15).
Figure 2. System objective as a function of the number of training episodes for an environment with 40 robots and 50 points of interest. Two robots of different types must partner to observe a point of interest.
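A toy version of this coupled domain shows how the difference objective can reward a partnership and penalise a robot that disrupts it. The sketch below is an illustration under stated assumptions: the point values, the robot types and the 0.25 penalty factor applied when a point is not observed by exactly two robots of different types are all hypothetical, and the difference objective is taken to be G(z) - G(z_{-i}) with robot i removed.

```python
from collections import defaultdict

POI_VALUE = {"p": 8.0, "q": 4.0}                 # illustrative values (assumed)
ROBOT_TYPE = {"r1": "A", "r2": "B", "r3": "A"}   # hypothetical robot types
PENALTY = 0.25  # assumed fraction of value earned without a proper pairing


def G(z):
    """System objective: full value only for exactly two observers of different types."""
    observers = defaultdict(list)
    for robot, poi in z.items():
        observers[poi].append(robot)
    total = 0.0
    for poi, robots in observers.items():
        types = {ROBOT_TYPE[r] for r in robots}
        paired = len(robots) == 2 and len(types) == 2
        total += POI_VALUE[poi] * (1.0 if paired else PENALTY)
    return total


def D(z, i):
    """Difference objective: G with robot i present minus G with robot i removed."""
    return G(z) - G({k: v for k, v in z.items() if k != i})


pair = {"r1": "p", "r2": "p"}                  # a valid partnership at POI "p"
crowd = {"r1": "p", "r2": "p", "r3": "p"}      # r3 joins and breaks the pairing
split = {"r1": "p", "r2": "p", "r3": "q"}      # r3 observes POI "q" alone instead

print(D(pair, "r1"), D(crowd, "r3"), D(split, "r3"))
```

Under these assumptions, a robot in a valid partnership receives a positive signal (6), a third robot that joins the pair receives a negative one (-6), and the same robot is better off observing a different point alone (1). Each robot can therefore tell from its own signal whether it is helping or harming the partnership, without any communication.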
Whether the application is multi-robot exploration, distributed-sensor networks, traffic management, or a host of other options, many agents are tasked with coordinating to achieve a system-level goal. As the system grows more complex, becoming dynamic and containing hundreds or thousands of agents, the structural problem of assigning credit to individuals such that they can learn what benefits the system as a whole can quickly become intractable. Through use of difference objectives, a system designer can give each agent operating in the system a specific objective that balances two needs: providing important global information and maintaining a high signal-to-noise ratio for that agent's own actions. In addition to the benefits to robot coordination, the difference objective has proved successful in air-traffic management. We (and others working in this field) continue development in the areas of distributed-sensor networks, multi-objective optimization and complex-system decomposition.