
The BEST Q-Learning example! | The Mountain Car Problem

By Marcus Koseck

Summary

## Key takeaways

- **Mountain Car: Not Enough Power**: The mountain car needs to build momentum by rocking back and forth because it lacks the inherent power to directly climb the hill. [00:10]
- **RL Framework: Agent, Environment, Reward, State**: In reinforcement learning, an agent takes actions in an environment, which then provides a reward and a new state back to the agent. [00:30]
- **Reward Only for Goal Achievement**: To ensure the agent accomplishes the ultimate task, it should only be rewarded for reaching the final goal, not for intermediate actions like moving left or right. [01:02]
- **State Space Visualization**: The state space graph, with position on the x-axis and velocity on the y-axis, visually represents the car's back-and-forth motion as it attempts to climb. [01:41]
- **Tiling for Q-Value Parameterization**: Tiling creates a grid over the state space, allowing each grid square to be assigned a Q-value, which helps in solving the problem more easily. [01:56]
- **Q-Value: Predicting Future Returns**: The Q-value represents the predicted future return, calculated by considering current rewards and future states, with the gamma parameter determining the importance of future rewards. [02:50]

Topics Covered

  • Reward only the final goal in reinforcement learning.
  • Tiling discretizes the state space for Q-value assignment.
  • Q-values predict future returns in reinforcement learning.
  • Gamma weighs the importance of future rewards.
  • Alpha controls the learning rate of Q-values.

Full Transcript

This is the mountain car problem. The car at the bottom of the hill is controlled by a reinforcement learning algorithm. The goal is to get the car to the top of the hill, but the caveat is that the car does not have enough energy to climb the hill by itself: it needs to roll back and forth to gain some momentum to eventually get to the flag that's in the top right corner. A successful case looks something like this. But how do we develop an algorithm to do the same thing?

Before we can develop an algorithm, we need to understand the reinforcement learning framework. First we need to introduce the agent: the agent is the algorithm. To start the reinforcement learning process, the agent takes an action. This action is taken within an environment, which results in the mountain car moving back and forth on the hill. Then the environment gives the agent a reward for the action and a new state. The overall pattern is that each state corresponds to an action that is chosen by the agent; then the environment gives the agent a reward for said action and produces the next state.
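That state-action-reward loop maps directly onto code. Here is a minimal sketch of the interaction, assuming Gymnasium's `MountainCar-v0` environment and a placeholder random agent (the video does not name the implementation it uses):

```python
# Minimal agent-environment interaction loop (assumes Gymnasium's MountainCar-v0).
import gymnasium as gym

env = gym.make("MountainCar-v0")   # actions: 0 = push left, 1 = no push, 2 = push right
state, info = env.reset(seed=0)    # state is [position, velocity]

for step in range(200):
    action = env.action_space.sample()   # placeholder agent: pick a random action
    next_state, reward, terminated, truncated, info = env.step(action)
    # the environment answers with a reward and the next state
    state = next_state
    if terminated or truncated:           # terminated means the flag was reached
        break

env.close()
```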

In this case, we only reward the agent for reaching the flag in the top right corner. That's because if we were to reward the agent for, say, going right or going left at certain times, it would exploit that in the algorithm: instead of actually reaching the flag in the top right corner, it would exploit going left or going right and never actually accomplish the task. So a very fundamental rule of reinforcement learning is to reward the agent for the end result; otherwise it will just get stuck in the middle and won't do what you want it to do.
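In code, a goal-only reward is just a check on the car's position. This is a hypothetical sketch: the 0.5 goal position is taken from Gymnasium's MountainCar-v0, whose built-in reward instead penalizes every time step, which has the same effect of only paying off once the flag is reached:

```python
GOAL_POSITION = 0.5   # flag position in Gymnasium's MountainCar-v0 (assumed value)

def sparse_reward(position: float) -> float:
    """Reward only the end result: 1.0 for reaching the flag, 0.0 everywhere else."""
    return 1.0 if position >= GOAL_POSITION else 0.0
```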

Now that we understand the reinforcement learning framework, we should collect some data to get an understanding of how the agent interacts with the environment. You can see how the data points move in a circle; this reflects the back-and-forth motion of the car as it attempts to climb the hill. This specific graph is called a state space, where the x-axis is the position of the car and the y-axis is the velocity of the car.
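To reproduce that circular pattern, one can roll the car out for a while and record each (position, velocity) pair. A sketch, again assuming Gymnasium's MountainCar-v0 plus matplotlib for the scatter plot:

```python
# Collect (position, velocity) pairs and plot the state space.
import gymnasium as gym
import matplotlib.pyplot as plt

env = gym.make("MountainCar-v0")
state, _ = env.reset(seed=0)

positions, velocities = [], []
for _ in range(1000):
    state, _, terminated, truncated, _ = env.step(env.action_space.sample())
    positions.append(state[0])    # x-axis: position of the car
    velocities.append(state[1])   # y-axis: velocity of the car
    if terminated or truncated:
        state, _ = env.reset()

plt.scatter(positions, velocities, s=2)
plt.xlabel("position")
plt.ylabel("velocity")
plt.show()
```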

Now we're going to parameterize the space using a concept called tiling. Tiling is making a grid that allows each square to be a Q-value. I can place this grid wherever I'd like to make the problem easier to solve, but for now we'll just place the grid here.
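One simple way to realize this tiling is to split each state dimension into a fixed number of bins and map a continuous (position, velocity) pair to a grid square. The bounds and the 20x20 resolution below are my own assumptions, not values given in the video:

```python
import numpy as np

# State bounds for the mountain car (position, velocity); these values are
# taken from Gymnasium's MountainCar-v0 and are an assumption here.
STATE_LOW  = np.array([-1.2, -0.07])
STATE_HIGH = np.array([ 0.6,  0.07])
N_TILES    = np.array([20, 20])   # a 20x20 grid; the resolution is an arbitrary choice

def to_tile(state):
    """Map a continuous (position, velocity) state to a discrete grid square."""
    ratio = (np.asarray(state) - STATE_LOW) / (STATE_HIGH - STATE_LOW)
    idx = (ratio * N_TILES).astype(int)
    return tuple(np.clip(idx, 0, N_TILES - 1))   # clip so boundary states stay on the grid
```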

I do understand that I said the magic word, Q-value, so how about we go through a quick derivation to understand what's going on. In reinforcement learning, a sequence of states, actions, and rewards that ends with the agent reaching the end goal is called an episode. We know that the number of states and actions is finite; this means that the rewards are also finite and can be written like this. The rewards are collected sequentially, and G is called the return. We can group the future rewards like this thanks to algebra, but we notice that the grouped future rewards are the same as the future return, so we need to predict the future return. This is the Q-value that everyone talks about.
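Written out (the notation here is mine; the video shows the equations on screen rather than reading them aloud), the derivation is roughly:

```latex
% An episode's rewards are finite and are collected sequentially into the return G:
G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T

% Grouping the future rewards shows that they are themselves a return:
G_t = r_{t+1} + \underbrace{\left( r_{t+2} + r_{t+3} + \cdots + r_T \right)}_{G_{t+1}}

% The future return is unknown, so it is predicted with the Q-value,
% weighted by the discount factor gamma:
G_t \approx r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1})
```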

The funny-looking character in front of Q is called gamma. It allows us to determine how important future returns are in our calculation: higher gamma values give the future Q-value more weight, while smaller gamma values give it less weight. Let's take the current Q-value and subtract it from the next Q-value. This generalization allows us to examine every case instead of looking at each case individually. Subtracting the current Q-value from the next Q-value gives us a gradient, or rate of change, for how well the agent is performing. To update our Q-value, we take our current Q-value plus our gradient, with everything multiplied by the alpha parameter. You can think of alpha as a parameter that decides how large the gradient step should be: if you want the Q-value to change in very large chunks, alpha should be very large, whereas if you want very small micro-changes in the Q-value, alpha should be very small.
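In symbols, this update (current Q-value plus alpha times the next-minus-current difference) can be written as follows; the notation is mine:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Big[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \Big]
```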

To turn this into Q-learning, we replace the next Q-value with a maximum. This will force the algorithm to pick the largest Q-value, which corresponds to the best action.
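Replacing the next Q-value with a maximum over the available actions gives the Q-learning update:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big]
```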

We incorporate the Q-values by using the tiling method: each of the squares you see on the screen gets assigned a Q-value. Once we do that and allow the agent to learn over a long enough period of time, we get the solution to the mountain car problem.
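To make the whole pipeline concrete, here is a rough end-to-end sketch that combines the tiled Q-table with the update above. The Gymnasium environment, the epsilon-greedy exploration, and the hyperparameter values (alpha, gamma, tile counts, 1,000 episodes) are my assumptions rather than details stated in the video:

```python
# Tabular Q-learning on MountainCar-v0 over a tiled state space (sketch, assumed details).
import numpy as np
import gymnasium as gym

env = gym.make("MountainCar-v0")
n_actions = env.action_space.n

STATE_LOW, STATE_HIGH = env.observation_space.low, env.observation_space.high
N_TILES = np.array([20, 20])
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1          # learning rate, discount, exploration

def to_tile(state):
    ratio = (state - STATE_LOW) / (STATE_HIGH - STATE_LOW)
    return tuple(np.clip((ratio * N_TILES).astype(int), 0, N_TILES - 1))

Q = np.zeros((*N_TILES, n_actions))             # one Q-value per grid square per action

for episode in range(1000):
    state, _ = env.reset()
    tile = to_tile(state)
    done = False
    while not done:
        # Epsilon-greedy: mostly pick the best-known action, sometimes explore.
        if np.random.rand() < EPSILON:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[tile]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        next_tile = to_tile(next_state)

        # Q-learning update: current Q plus alpha times the TD "gradient".
        target = reward + GAMMA * np.max(Q[next_tile]) * (not terminated)
        Q[tile][action] += ALPHA * (target - Q[tile][action])

        tile, done = next_tile, terminated or truncated

env.close()
```

Because the payoff only comes from reaching the flag, early episodes look like random rocking; only after many episodes does the Q-table steer the car up the hill, which is the "long enough period of time" mentioned above.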

I do understand that people like compilation videos of the agent learning over time, so I will release a video of the agent first taking random actions and then, later on, around episode 1,000, doing just what you saw here. That will give you good continuity of how the agent learns over time, and that video will be coming out right after this one. Thanks for watching.

