To understand what all the buzz around DeepMind’s reinforcement learning papers is about, I decided to implement Deep Reinforcement Learning with:
- Double Q-learning
- Experience Replay
The neural network was then trained on the OpenAI Lunar Lander environment.
I did my best to implement the above in TensorFlow with just the paper published by DeepMind as a reference. However, I succumbed to referencing code snippets from multiple sources on the internet.
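For readers who have not met the two ingredients listed above, here is a minimal sketch in plain Python rather than TensorFlow. It is not the code I trained with: the names `ReplayBuffer` and `double_q_target`, the discount `gamma=0.99` and the plain lists standing in for network outputs are all assumptions made for illustration. Experience replay stores past transitions and samples them at random, and Double Q-learning lets the online network pick the next action while a separate target network scores it.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive frames.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def double_q_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double Q-learning target for a single transition.

    next_q_online / next_q_target are the Q-values of the next state as
    estimated by the online and target networks (plain lists here).
    """
    if done:
        return reward  # no future value after a terminal state
    # The online network chooses the action...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...and the target network evaluates it.
    return reward + gamma * next_q_target[best_action]
```

In a real training loop the two lists would come from forward passes of the two networks on a sampled batch, and the returned target would feed the loss for the action that was actually taken.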
The artificial brain figures out how to land consistently on target after the equivalent of 52 hours of trying:
So what’s the big deal?
I know it looks like a simple game, but trust me, it’s not that simple.
- It simulates real-ish physics – meaning there is gravity, momentum, friction and the landing legs have spring in them!
- The engines do not fire in exactly the same direction every time. If you look closely, you’ll notice that the particles shoot out at randomly varying angles.
- The artificial brain knows nothing about gravity or what the ground means. In fact, it does not even know that it’s trying to land something!
- All we give the brain is a score to work on. If it gets closer to the landing target, the score goes up; if it moves further away, the score goes down.
- The score also goes down whenever an engine is fired. Apparently, we want the brain to be eco-friendly and not use fuel unnecessarily.
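To make the scoring idea concrete, here is a toy version of such a score. It is a simplified sketch, not the environment's actual formula: the function `step_score`, the action labels and every constant in it are hypothetical values picked for illustration. It rewards progress toward the target and charges a small fee whenever an engine fires.

```python
import math

# Hypothetical constants, chosen only for illustration.
DISTANCE_WEIGHT = 100.0
FUEL_COST = {"main": 0.3, "left": 0.03, "right": 0.03, "none": 0.0}


def step_score(prev_pos, new_pos, action, target=(0.0, 0.0)):
    """Score for one step: progress toward the target minus fuel used."""
    prev_dist = math.dist(prev_pos, target)
    new_dist = math.dist(new_pos, target)
    # Positive when the lander moved closer, negative when it moved away.
    progress = DISTANCE_WEIGHT * (prev_dist - new_dist)
    return progress - FUEL_COST[action]
```

Note that the brain never sees this function. It only sees the number that comes out and has to work out for itself what makes that number go up.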
In the beginning…
It starts by just randomly choosing between:
- Fire Main Engine (the one below the lander)
- Fire Left Engine (to rotate clockwise and nudge it a little in the opposite direction)
- Fire Right Engine (does the opposite of Left Engine)
- Do nothing
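A common way to get this kind of random start is an epsilon-greedy policy, sketched below. I am assuming that scheme here for illustration. The helper names `choose_action` and `decayed_epsilon`, the action labels and the decay schedule numbers are all hypothetical. With epsilon at 1.0 every choice is random, and as epsilon shrinks the brain increasingly picks the action it currently rates highest.

```python
import random

# Illustrative labels only; the index order is an assumption.
ACTIONS = ["main", "left", "right", "none"]


def choose_action(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10000):
    """Linearly anneal epsilon from start to end over decay_steps steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```

For example, `choose_action([0.1, 0.5, 0.2, 0.0], epsilon=0.0)` always returns index 1, while `epsilon=1.0` ignores the values entirely.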
Unsurprisingly the brain pilots the lander like a pro at this stage:
It copes with diverse situations
Every time the game starts, the lander gets tossed in a random direction with varying force. As a result, it learns to cope with less-than-ideal situations, like getting tossed hard to the left at the start:
Learns to stay alive
When it moves off the screen or hits the ground on anything other than its legs, the score goes down by a hundred. Hence, very quickly (at just over 1 hour of trying) it learns to stay alive by just hovering and staying away from the edges:
Finding Terra Firma
At around 11 hours it starts to figure out that the ground is kind of nice, but it drifts off to the right in the process (also notice that its piloting skills are a little shaky at this point):