THE BIG SQUEEZE: COMPRESSION IN REINFORCEMENT LEARNING

HUGH ZHANG

 
 
 

THE SEARCH FOR SCALABILITY


Commentary on “Near-Optimal Representation Learning for Hierarchical Reinforcement Learning” by Nachum, Gu, Lee, and Levine

  • Submitted to ICLR 2019.

  • Reinforcement Learning

    • Standard reinforcement learning feeds raw environment observations to an agent and rewards it for good behavior.

    • Huge recent success.

    • However, as tasks become more complex, we need better methods.

      • Go already has an enormous state space of size 3 ^ 361 ~= 10 ^ 172, which required Google-level compute to crack.

      • Even “simple” games like Atari have a raw pixel state space of size 256 ^ (84 * 84) ~= 10 ^ 17000 per time step (using the standard 84 * 84 grayscale preprocessing).

      • A single second of 24 fps 1080p video has 256 ^ (1920 * 1080 * 3 * 24) ~= 10 ^ 360,000,000 possible states (the back-of-the-envelope sketch after these bullets reproduces the arithmetic).

      • Current deep learning techniques cannot scale directly to state spaces of this size.

      • However, most of this information is redundant, so cleverer techniques can succeed where naive reinforcement learning cannot.
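
To get a feel for these magnitudes, here is a quick back-of-the-envelope sketch that recomputes the exponents above using base-10 logarithms; the specific frame sizes (84 * 84 grayscale Atari frames, 24 fps 1080p RGB video) are just the conventional assumptions used in the bullets.

    import math

    def log10_states(values_per_entry, num_entries):
        # log10 of values_per_entry ** num_entries, without building the huge number itself
        return num_entries * math.log10(values_per_entry)

    # Go: each of 361 board points is empty, black, or white (legality ignored).
    print(f"Go:    ~10^{log10_states(3, 361):.0f}")                      # ~10^172

    # One preprocessed Atari frame: 84 x 84 pixels, 256 gray levels each.
    print(f"Atari: ~10^{log10_states(256, 84 * 84):,.0f}")               # ~10^17,000

    # One second of 24 fps 1080p RGB video, 256 levels per channel.
    print(f"Video: ~10^{log10_states(256, 1920 * 1080 * 3 * 24):,.0f}")  # ~10^360,000,000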

  • Prior Work

    • Techniques like Principal Component Analysis (PCA) have long been used to compress high-dimensional data (a toy example appears below).

    • More recently, people have started to use transfer learning from ImageNet networks like Inception to extract the relevant features from image data.

    • However, an ideal “compression” method would be tied to the agent’s behavior so that it could learn to discard information irrelevant to the reward.
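
As a toy illustration of the kind of reward-agnostic compression mentioned above, the sketch below uses scikit-learn's PCA to project fake, flattened 84 * 84 frames onto a few principal components; the frame size, the component count, and the random data are purely illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    # 1,000 fake 84 x 84 grayscale "frames", flattened to 7,056-dimensional vectors.
    frames = np.random.rand(1000, 84 * 84)

    # Project onto the top 32 principal components: 7,056 dims -> 32 dims.
    pca = PCA(n_components=32)
    compressed = pca.fit_transform(frames)

    print(compressed.shape)                     # (1000, 32)
    print(pca.explained_variance_ratio_.sum())  # fraction of variance kept by the 32 components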

    • Early papers such as DeepMind’s FeUdal Networks propose hierarchical reinforcement learning, splitting the learning into two roles (a toy code sketch of this split appears after the company analogy below).

      • A “manager” who compresses the state space into high level “goals”.

        • The manager receives rewards directly from the environment but has only coarse control of the agent.

      • An “employee” who handles the low-level details.

        • The employee is only loosely connected to the reward and is instead motivated to hit the manager’s goals.

      • Analogy of a company.

        • Imagine Coca-Cola decides to produce a new watermelon flavor.

        • The product flops if the employees can’t replicate the flavor.

        • The product also flops if consumers don’t like the new flavor, even if it is successfully created (managerial direction failure).

        • The employees don’t need to worry about maximizing revenue (directly) and can just focus on synthesizing flavor molecules.

        • The manager doesn’t worry about replicating watermelon flavor compounds and instead worries about the company’s direction.

        • Each can do their job better with division of labor.
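
Here is a minimal sketch of the manager/employee split in the spirit of feudal-style hierarchical RL. The class names, the ten-step goal horizon, the tiny two-dimensional latent space, and the negative-distance intrinsic reward are illustrative assumptions, not the paper's actual architecture.

    import numpy as np

    class Manager:
        """Compresses the problem by choosing a high-level goal in a small latent space.
        In a full system it would be trained on the environment reward (omitted here)."""
        def __init__(self, goal_dim=2, horizon=10):
            self.goal_dim = goal_dim
            self.horizon = horizon  # how many low-level steps each goal stays active

        def propose_goal(self, latent_state):
            # Placeholder high-level policy: a random offset in the latent space.
            return latent_state + np.random.randn(self.goal_dim)

    class Employee:
        """Takes the low-level actions; rewarded for pushing the latent state toward the goal."""
        def act(self, latent_state, goal, action_dim=8):
            # Placeholder low-level policy: random torques.
            return np.random.uniform(-1.0, 1.0, size=action_dim)

        def intrinsic_reward(self, latent_state, goal):
            # Illustrative intrinsic reward: negative distance to the manager's goal.
            return -np.linalg.norm(latent_state - goal)

    def compress(raw_observation):
        # Stand-in for the learned representation: keep only the first two coordinates.
        return raw_observation[:2]

    # Toy rollout: the manager sets a goal every `horizon` steps; the employee chases it.
    manager, employee = Manager(), Employee()
    obs = np.zeros(29)                                   # observation size is illustrative
    goal = manager.propose_goal(compress(obs))
    for t in range(30):
        if t % manager.horizon == 0:
            goal = manager.propose_goal(compress(obs))
        action = employee.act(compress(obs), goal)
        obs = obs + 0.01 * np.random.randn(29)           # stand-in for an environment step
        print(t, round(employee.intrinsic_reward(compress(obs), goal), 3))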

  • Paper

    • The question asked is “How much can you compress the state without severely affecting performance?”

      • If every action in your original space has an analogous action in your compressed space, no information is lost.

      • This turns out to be too strict a requirement, since many actions/states are irrelevant and can be safely discarded.

    • The mathematical details are slightly more involved, but the key insight is that you only need to be able to reach states similar to those the optimal policy would reach.

      • If your compression excludes irrelevant states, who cares?

        • States further away in time are discounted in importance.

          • This is a standard trick in RL: future rewards are discounted so that models converge and returns stay finite (a short numeric example appears below).

        • Humans behave similarly: would you rather have a million dollars now or in twenty years?
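
To make the discounting concrete, here is a tiny numeric example; the discount factor gamma = 0.99 is just a common choice, not something taken from the paper.

    # Discounted return: a reward received t steps in the future is weighted by gamma ** t.
    def discounted_return(rewards, gamma=0.99):
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # "A million dollars now" vs. the same reward twenty steps later.
    print(discounted_return([1.0]))               # 1.0
    print(discounted_return([0.0] * 20 + [1.0]))  # 0.99 ** 20, about 0.82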

  • Experiments

    • Tested on the MuJoCo tasks Ant Maze and Ant Push, in which a simulated ant navigates a simple maze, against naive representation-learning baselines.

    • Although the method certainly works on these toy examples, it is difficult to guess how effective it will be when scaled up to harder problems.


