Nature paper: Grandmaster level in StarCraft II using multi-agent reinforcement learning

https://doi.org/10.1038/s41586-019-1724-z
Received: 30 August 2019; Accepted: 10 October 2019; Published online: 30 October 2019
Nature, Vol. 575, 14 November 2019, p. 351

1 DeepMind, London, UK. 2 Team Liquid, Utrecht, Netherlands. 3 These authors contributed equally: Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Chris Apps, David Silver. *e-mail: vinyals@google.com; davidsilver@google.com

Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments. As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence research, owing to its iconic and enduring status among the most difficult professional esports and its relevance to the real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous competitions1–3, the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or employed hand-crafted sub-systems4. Despite these advantages, no previous agent has come close to matching the overall skill of top StarCraft players. We chose to address the challenge of StarCraft using general-purpose learning methods that are in principle applicable to other complex domains: a multi-agent reinforcement learning algorithm that uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies, each represented by deep neural networks5,6. We evaluated our agent, AlphaStar, in the full game of StarCraft II, through a series of online games against human players. AlphaStar was rated at Grandmaster level for all three StarCraft races and above 99.8% of officially ranked human players.

StarCraft is a real-time strategy game in which players balance high-level economic decisions with individual control of hundreds of units. This domain raises important game-theoretic challenges: it features a vast space of cyclic, non-transitive strategies and counter-strategies; discovering novel strategies is intractable with naive self-play exploration methods; and those strategies may not be effective when deployed in real-world play with humans. Furthermore, StarCraft has a combinatorial action space, a planning horizon that extends over thousands of real-time decisions, and imperfect information7.

Each game consists of tens of thousands of time-steps and thousands of actions, selected in real time throughout approximately ten minutes of gameplay. At each step t, our agent AlphaStar receives an observation o_t that includes a list of all observable units and their attributes. This information is imperfect; the game includes only opponent units seen by the player's own units, and excludes some opponent unit attributes outside the camera view.
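To make this concrete, the per-step observation could be represented roughly as follows. This is only an illustrative Python sketch; the field names (unit_type, position, and so on) are assumptions and do not reproduce the actual AlphaStar interface.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class UnitObservation:
    """One observable unit and a small, illustrative subset of its attributes."""
    unit_type: int                  # e.g. worker, barracks, ...
    owner: int                      # id of the player that owns the unit
    position: Tuple[float, float]   # map coordinates
    health: float
    energy: Optional[float] = None  # some opponent attributes are hidden outside the camera view

@dataclass
class Observation:
    """Observation o_t: only units currently seen by the player's own units are listed."""
    step: int
    visible_units: List[UnitObservation] = field(default_factory=list)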
Each action a_t is highly structured: it selects what action type, out of several hundred (for example, move or build worker); who to issue that action to, for any subset of the agent's units; where to target, among locations on the map or units within the camera view; and when to observe and act next (Fig. 1a). This representation of actions results in approximately 10^26 possible choices at each step. Similar to human players, a special action is available to move the camera view, so as to gather more information.
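A composite action of this kind might be sketched as the following Python structure; the names and the camera-move example are hypothetical and serve only to illustrate the four components listed above.

from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple, Union

@dataclass(frozen=True)
class Action:
    """A single structured action a_t, drawn from roughly 10^26 combinations per step."""
    action_type: int                # one of several hundred types (move, build worker, ...)
    selected_units: FrozenSet[int]  # any subset of the agent's own unit ids
    target: Optional[Union[int, Tuple[float, float]]] = None  # a unit id or a map location
    delay: int = 1                  # when to observe and act next, in game steps

# Moving the camera is itself an action type, used to gather more information.
CAMERA_MOVE = 0
look_north = Action(action_type=CAMERA_MOVE, selected_units=frozenset(), target=(32.0, 96.0))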
Humans play StarCraft under physical constraints that limit their reaction time and the rate of their actions. The game was designed with those limitations in mind, and removing those constraints changes the nature of the game. We therefore chose to impose constraints upon AlphaStar: it suffers from delays due to network latency and computation time; and its actions per minute (APM) are limited, with peak statistics substantially lower than those of humans (see Figs. 2c, 3g for performance analysis). AlphaStar's play with this interface and these constraints was approved by a professional player (see 'Professional player statement' in Methods).

Learning algorithm

To address the complexity and game-theoretic challenges of StarCraft, AlphaStar uses a combination of new and existing general-purpose techniques for neural network architectures, imitation learning, reinforcement learning, and multi-agent learning. Further details about these techniques are given in the Methods.

Central to AlphaStar is a policy π_θ(a_t | s_t, z) = ℙ[a_t | s_t, z], represented by a neural network with parameters θ that receives all observations s_t = (o_1:t, a_1:t−1) from the start of the game as inputs, and selects actions as outputs. The policy is also conditioned on a statistic z that summarizes a strategy sampled from human data (for example, a build order).

Our agent architecture consists of general-purpose neural network components that handle …
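The policy's signature, π_θ(a_t | s_t, z), can be sketched as below. This is a hypothetical stub reusing the Observation and Action sketches above; a real implementation would run the deep neural network described in the Methods rather than returning a placeholder distribution.

import random
from typing import Dict, List, Tuple

class Policy:
    """Stub for pi_theta(a_t | s_t, z): maps the game history s_t = (o_1:t, a_1:t-1)
    and a human-derived statistic z (for example, a build order) to a distribution
    over structured actions."""

    def __init__(self, theta):
        self.theta = theta  # network parameters (unused in this stub)

    def action_probabilities(self, history: List[Tuple[Observation, Action]],
                             z) -> Dict[Action, float]:
        # Placeholder: a uniform distribution over a few dummy actions.
        candidates = [Action(action_type=k, selected_units=frozenset()) for k in range(3)]
        return {a: 1.0 / len(candidates) for a in candidates}

    def sample_action(self, history: List[Tuple[Observation, Action]], z) -> Action:
        probs = self.action_probabilities(history, z)
        actions, weights = zip(*probs.items())
        return random.choices(actions, weights=weights, k=1)[0]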
