Grandmaster level in StarCraft II using multi-agent reinforcement learning
Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments. As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence research, owing to its iconic and enduring status among the most difficult professional esports and its relevance to the real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous competitions1–3, the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or employed hand-crafted sub-systems4. Despite these advantages, no previous agent has come close to matching the overall skill of top StarCraft players. We chose to address the challenge of StarCraft using general-purpose learning methods that are in principle applicable to other complex domains: a multi-agent reinforcement learning algorithm that uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies, each represented by deep neural networks5,6. We evaluated our agent, AlphaStar, in the full game of StarCraft II, through a series of online games against human players. AlphaStar was rated at Grandmaster level for all three StarCraft races and above 99.8% of officially ranked human players.

StarCraft is a real-time strategy game in which players balance high-level economic decisions with individual control of hundreds of units. This domain raises important game-theoretic challenges: it features a vast space of cyclic, non-transitive strategies and counter-strategies; discovering novel strategies is intractable with naive self-play exploration methods; and those strategies may not be effective when deployed in real-world play with humans. Furthermore, StarCraft has a combinatorial action space, a planning horizon that extends over thousands of real-time decisions, and imperfect information7.

Each game consists of tens of thousands of time-steps and thousands of actions, selected in real-time throughout approximately ten minutes of gameplay. At each step t, our agent AlphaStar receives an observation o_t that includes a list of all observable units and their attributes. This information is imperfect; the game includes only opponent units seen by the player's own units, and excludes some opponent unit attributes outside the camera view.
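To make this observation interface concrete, here is a minimal Python sketch of an observation o_t as a variable-length list of observable units and their attributes; the class and field names below are illustrative assumptions, not the actual game API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Unit:
    """One observable unit; field names are illustrative only."""
    tag: int                      # unique identifier for the unit
    unit_type: int                # which kind of unit this is
    owner: int                    # 0 = the agent's own unit, 1 = opponent
    x: float                      # map position
    y: float
    health: Optional[int] = None  # some opponent attributes are hidden outside the camera view

@dataclass
class Observation:
    """Observation o_t: only opponent units seen by the player's own units appear here."""
    game_loop: int
    units: List[Unit] = field(default_factory=list)
```

A real interface (such as the published StarCraft II learning environment) exposes many more per-unit attributes, but the essential point is the same: the observation is a structured set of entities rather than a fixed-size vector.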
Each action a_t is highly structured: it selects what action type, out of several hundred (for example, move or build worker); who to issue that action to, for any subset of the agent's units; where to target, among locations on the map or units within the camera view; and when to observe and act next (Fig. 1a). This representation of actions results in approximately 10^26 possible choices at each step. Similar to human players, a special action is available to move the camera view, so as to gather more information.

Humans play StarCraft under physical constraints that limit their reaction time and the rate of their actions. The game was designed with those limitations in mind, and removing those constraints changes the nature of the game. We therefore chose to impose constraints upon AlphaStar: it suffers from delays due to network latency and computation time; and its actions per minute (APM) are limited, with peak statistics substantially lower than those of humans (Figs. 2c, 3g for performance analysis). AlphaStar's play with this interface and these constraints was approved by a professional player (see 'Professional player statement' in Methods).

https://doi.org/10.1038/s41586-019-1724-z
Received: 30 August 2019; Accepted: 10 October 2019; Published online: 30 October 2019
1DeepMind, London, UK. 2Team Liquid, Utrecht, Netherlands. 3These authors contributed equally: Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Chris Apps, David Silver. *e-mail: vinyals@google.com; davidsilver@google.com

Learning algorithm

To address the complexity and game-theoretic challenges of StarCraft, AlphaStar uses a combination of new and existing general-purpose techniques for neural network architectures, imitation learning, reinforcement learning, and multi-agent learning. Further details about these techniques are given in the Methods.

Central to AlphaStar is a policy π_θ(a_t | s_t, z) = ℙ[a_t | s_t, z], represented by a neural network with parameters θ that receives all observations s_t = (o_{1:t}, a_{1:t−1}) from the start of the game as inputs, and selects actions as outputs. The policy is also conditioned on a statistic z that summarizes a strategy sampled from human data (for example, a build order).
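As a rough illustration of this interface, the sketch below shows a structured action a_t with the components described above (an action type, the subset of the agent's units to issue it to, a target, and a delay until the agent next observes and acts), together with a toy stand-in for the policy π_θ(a_t | s_t, z) that conditions on the observation-action history and on the human-strategy statistic z. All names are assumptions made for illustration; the real policy is a deep neural network, and the roughly 10^26 choices per step arise from the product of these action components.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

@dataclass
class Action:
    """Structured action a_t; field names are illustrative."""
    action_type: int                       # one of several hundred types (e.g. move, build worker)
    unit_tags: List[int]                   # which subset of the agent's own units to command
    target: Optional[Tuple[float, float]]  # a map location (unit targets omitted for brevity)
    delay: int                             # game loops until the agent observes and acts again

@dataclass
class ToyPolicy:
    """Stand-in for pi_theta(a_t | s_t, z); here it simply samples a random action."""
    z: Sequence[int]                             # statistic summarizing a sampled human strategy
    history: list = field(default_factory=list)  # s_t = (o_{1:t}, a_{1:t-1})

    def act(self, observation, own_unit_tags: Sequence[int]) -> Action:
        self.history.append(observation)         # the real policy conditions on the full history
        action = Action(
            action_type=random.randrange(100),
            unit_tags=list(own_unit_tags)[:1],
            target=(random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)),
            delay=random.randint(1, 8),
        )
        self.history.append(action)
        return action

policy = ToyPolicy(z=[18, 21, 24])               # e.g. an encoded build order (hypothetical)
print(policy.act(observation="o_1", own_unit_tags=[101, 102, 103]))
```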
Our agent architecture consists of general-purpose neural network components that handle