The BEST Q-Learning example! | The Mountain Car Problem
By Marcus Koseck
Summary
## Key takeaways

- **Mountain Car: Not Enough Power**: The mountain car needs to build momentum by rocking back and forth because it lacks the inherent power to directly climb the hill. [00:10]
- **RL Framework: Agent, Environment, Reward, State**: In reinforcement learning, an agent takes actions in an environment, which then provides a reward and a new state back to the agent. [00:30]
- **Reward Only for Goal Achievement**: To ensure the agent accomplishes the ultimate task, it should only be rewarded for reaching the final goal, not for intermediate actions like moving left or right. [01:02]
- **State Space Visualization**: The state space graph, with position on the x-axis and velocity on the y-axis, visually represents the car's back-and-forth motion as it attempts to climb. [01:41]
- **Tiling for Q-Value Parameterization**: Tiling creates a grid over the state space, allowing each grid square to be assigned a Q-value, which helps in solving the problem more easily. [01:56]
- **Q-Value: Predicting Future Returns**: The Q-value represents the predicted future return, calculated by considering current rewards and future states, with the gamma parameter determining the importance of future rewards. [02:50]
Topics Covered
- Reward only the final goal in reinforcement learning.
- Tiling discretizes the state space for Q-value assignment.
- Q-values predict future returns in reinforcement learning.
- Gamma weighs the importance of future rewards.
- Alpha controls the learning rate of Q-values.
Full Transcript
This is the mountain car problem. The car at the bottom of the hill is controlled by a reinforcement learning algorithm. The goal is to get the car to the top of the hill, but the caveat is that the car does not have enough energy to climb the hill by itself; it needs to roll back and forth to gain some momentum to eventually reach the flag in the top-right corner. A successful case looks something like this. But how do we develop an algorithm to do the same thing?
Before we can develop an algorithm, we need to understand the reinforcement learning framework. First we need to introduce the agent: the agent is the algorithm, and to start the reinforcement learning process, the agent takes an action. This action is taken within an environment, which results in the mountain car moving back and forth on the hill. Then the environment gives the agent a reward for the action and a new state. The overall pattern is that each state corresponds to an action chosen by the agent; the environment then gives the agent a reward for said action and produces the next state.
In this case, we only reward the agent for reaching the flag in the top-right corner. That's because if we were to reward the agent for, say, going right or going left at certain times, it would exploit that in the algorithm, and so instead of actually reaching the flag in the top-right corner, it would just keep exploiting going left or going right and never actually accomplish the task. So a very fundamental rule of reinforcement learning is: reward the agent for the end result; otherwise it will just get stuck in the middle and won't do what you want it to do.
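To make that reward structure concrete, here is a minimal sketch of a flag-only reward, assuming the goal sits at position 0.5 (the flag position used by the classic Mountain Car environment); the threshold and the reward value of 1.0 are illustrative choices, not values stated in the video:

```python
def sparse_reward(position, goal_position=0.5):
    """Reward the agent only when the car reaches the flag.

    Rewarding intermediate behavior (e.g. 'moving right') would let the
    agent exploit that signal without ever reaching the goal.
    """
    return 1.0 if position >= goal_position else 0.0
```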
Now that we understand the reinforcement learning framework, we should collect some data to get an understanding of how the agent interacts with the environment. You can see how the data points move in a circle; this reflects the back-and-forth motion of the car as it attempts to climb the hill. This specific graph is called a state space, where the x-axis is the position of the car and the y-axis is the velocity of the car.
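As a rough sketch of that data collection step, the snippet below runs a random policy in the Gymnasium MountainCar-v0 environment (the video does not name a library, so Gymnasium is an assumption here) and records the (position, velocity) observations that make up the state-space plot:

```python
import gymnasium as gym

env = gym.make("MountainCar-v0")
states = []  # list of (position, velocity) pairs

obs, info = env.reset(seed=0)
for _ in range(1000):
    action = env.action_space.sample()  # random action: push left, no push, push right
    obs, reward, terminated, truncated, info = env.step(action)
    states.append((obs[0], obs[1]))     # obs = [position, velocity]
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```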
Now we're going to parameterize the state space using a concept called tiling. Tiling means laying a grid over the space so that each square can be assigned a Q-value. I can place this grid wherever I'd like to make the problem easier to solve, but for now we'll just place the grid here.
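Here is a minimal sketch of that tiling idea (a single uniform grid rather than the multiple offset tilings the term sometimes refers to): each continuous (position, velocity) state is mapped to a grid cell. The 20x20 resolution is an assumption; the bounds match MountainCar-v0's observation limits.

```python
import numpy as np

# MountainCar-v0 observation bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])
N_TILES = (20, 20)  # assumed grid resolution

def tile(state):
    """Map a continuous (position, velocity) state to a discrete grid cell."""
    ratios = (np.asarray(state) - LOW) / (HIGH - LOW)
    idx = (ratios * N_TILES).astype(int)
    return tuple(np.clip(idx, 0, np.array(N_TILES) - 1))

# One Q-value per grid square and action (MountainCar-v0 has 3 actions).
Q = np.zeros(N_TILES + (3,))
```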
I do understand that I said the magic word, "Q-value," so how about we go through a quick derivation to understand what's going on. In reinforcement learning, a sequence of states, actions, and rewards that ends with the agent reaching the end goal is called an episode. We know that the number of states and actions is finite; this means that the rewards are also finite and can be written like this. The rewards are collected sequentially, and G is called the return. We can group the future rewards like this thanks to algebra, but we notice that the reward in the future is the same as the future return, so we need to predict the future return. This is the Q-value that everyone talks about.
about the funny looking character in
front of Q is called gamma that allows
us to determine how important future
returns are in our calculation higher
gamma values mean the Q value has more
weight while smaller gamma values means
the Q value has less weight let's take
Let's take the next Q-value and subtract the current Q-value from it. This generalization allows us to examine every case instead of looking at each case individually. Subtracting the current Q-value from the next Q-value gives us a gradient, or rate of change, for how well the agent is performing. To update our Q-value, we take our current Q-value plus our gradient, with the gradient scaled by the alpha parameter. You can think of alpha as a parameter that decides how large the gradient step should be: if you want the Q-value to change in very large chunks, alpha should be very large, whereas if you want very small micro-changes in the Q-value, alpha should be very small.
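In symbols, this update (a reconstruction in the standard temporal-difference form, with alpha as the learning rate) looks like:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Big[ \underbrace{r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)}_{\text{the ``gradient''}} \Big]
```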
To turn this into Q-learning, we replace the next Q-value with a maximum. This forces the algorithm to pick the largest Q-value, which corresponds to the best action.
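Replacing the next Q-value with a maximum over actions gives the Q-learning update (again a standard reconstruction of the on-screen formula):

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big]
```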
We incorporate the Q-values by using the tiling method: each of the squares you see on the screen gets assigned a Q-value. Once we do that and allow the agent to learn over a long enough period of time, we get the solution to the mountain car problem.
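Pulling the pieces together, here is a minimal sketch of tabular Q-learning on the tiled state space, using Gymnasium's MountainCar-v0 and the `tile` grid from the earlier sketch. The hyperparameters (alpha = 0.1, gamma = 0.99, epsilon-greedy exploration) are illustrative assumptions, not values given in the video, and note that MountainCar-v0's built-in reward is -1 per step rather than the flag-only reward described above:

```python
import numpy as np
import gymnasium as gym

# Discretization (tiling) of the (position, velocity) state space, as above.
LOW, HIGH = np.array([-1.2, -0.07]), np.array([0.6, 0.07])
N_TILES = (20, 20)

def tile(state):
    ratios = (np.asarray(state) - LOW) / (HIGH - LOW)
    idx = (ratios * N_TILES).astype(int)
    return tuple(np.clip(idx, 0, np.array(N_TILES) - 1))

env = gym.make("MountainCar-v0")
n_actions = int(env.action_space.n)
Q = np.zeros(N_TILES + (n_actions,))    # one Q-value per grid square and action
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # assumed hyperparameters

for episode in range(1000):
    obs, info = env.reset()
    s = tile(obs)
    done = False
    while not done:
        # Epsilon-greedy: mostly pick the action with the largest Q-value.
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        obs, reward, terminated, truncated, info = env.step(a)
        s_next = tile(obs)
        # Q-learning update: current Q plus alpha times the "gradient".
        Q[s + (a,)] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s + (a,)])
        s = s_next
        done = terminated or truncated

env.close()
```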
I do understand that people like compilation videos of the agent learning over time, so I will release a video of the agent first starting out by taking random actions and then, later on, by episode 1,000, doing just what you saw. That will give you good continuity of how the agent learns over time, and that video will be coming out right after this one. Thanks for watching.