In contrast, it is possible to have a single actor generating data into a local replay memory, and then have multiple learners process this data in parallel to learn as effectively as possible from this experience.
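This pattern can be sketched as a single actor appending transitions to a shared replay memory while several learner threads sample minibatches from it concurrently. The class and variable names below are illustrative assumptions, not from the paper:

```python
import random
import threading
from collections import deque

class ReplayMemory:
    """A thread-safe replay memory (illustrative sketch)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
        self.lock = threading.Lock()

    def add(self, transition):
        with self.lock:
            self.buffer.append(transition)

    def sample(self, batch_size):
        with self.lock:
            return random.sample(list(self.buffer), batch_size)

# A single actor fills the memory with (s, a, r, s') transitions...
memory = ReplayMemory(capacity=1000)
for t in range(100):
    memory.add((f"s{t}", f"a{t}", 0.0, f"s{t+1}"))

# ...while several learners draw minibatches from it in parallel.
batches = []
def learner():
    batches.append(memory.sample(32))

threads = [threading.Thread(target=learner) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Each learner sees an independent sample of the shared experience, so adding learners increases the rate at which the stored data is consumed without requiring more environment interaction.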
The learner applies an off-policy RL algorithm such as DQN (Mnih et al., 2013) to this minibatch of experience, in order to generate a gradient vector g_i.[1] The gradients g_i are communicated to the parameter server, and the parameters are updated periodically from the parameter server.

[1] The experience in the replay memory is generated by old behaviour policies which are most likely different to the current behaviour of the agent; therefore all updates must be performed off-policy (Sutton & Barto, 1998).
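The learner-to-parameter-server interaction can be sketched as follows. For simplicity the sketch uses a linear value function in place of the paper's deep Q-network, and the `ParameterServer` interface is an assumption for illustration:

```python
import numpy as np

class ParameterServer:
    """Central parameter store; interface is an illustrative assumption."""
    def __init__(self, dim, lr=0.1):
        self.theta = np.zeros(dim)
        self.lr = lr

    def apply_gradient(self, g):
        # Descend along the received gradient g_i.
        self.theta -= self.lr * g

    def get_parameters(self):
        return self.theta.copy()

def td_gradient(theta, batch, gamma=0.99):
    """Gradient of the squared TD error for a linear Q-function
    Q(s) = theta . s, standing in for the deep network."""
    g = np.zeros_like(theta)
    for s, r, s_next, done in batch:
        target = r + (0.0 if done else gamma * (theta @ s_next))
        td_error = (theta @ s) - target
        g += td_error * s
    return g / len(batch)

# A learner samples a minibatch, computes g_i, and ships it to the server.
server = ParameterServer(dim=2)
batch = [(np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0]), True)]
g_i = td_gradient(server.get_parameters(), batch)
server.apply_gradient(g_i)
```

Because the minibatch comes from old behaviour policies stored in replay memory, the update is off-policy by construction, which is why an off-policy algorithm such as DQN is required.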
We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm.
This architecture consists of four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed experience replay memory.
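A minimal configuration object makes the four components concrete. The field names and default values below are illustrative assumptions, not settings taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class GorilaConfig:
    """Illustrative configuration mirroring the four components above."""
    num_actors: int = 32              # parallel actors generating new behaviour
    num_learners: int = 32            # parallel learners training on stored experience
    num_param_shards: int = 4         # shards of the distributed network parameters
    replay_capacity: int = 1_000_000  # size of the distributed replay memory

cfg = GorilaConfig()
```

Each component can then be scaled independently: more actors to generate experience faster, more learners to consume it faster, more parameter shards to spread the network update load.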
In order to exploit this scalability, deep learning algorithms have made extensive use of hardware advances such as GPUs. However, recent approaches have focused on massively distributed architectures that can learn from more data in parallel and therefore outperform training on a single machine (Coates et al., 2013; Dean et al., 2012).