In Q-learning and related algorithms, an agent tries to learn the optimal policy from its history of interaction with the environment. The history of an agent is a sequence of state-action-rewards:
⟨s0,a0,r1,s1,a1,r2,s2,a2,r3,…⟩,
which means that the agent was in state s0 and did action a0, which resulted in it receiving reward r1 and being in state s1; then it did action a1, received reward r2, and ended up in state s2; then it did action a2, received reward r3, and ended up in state s3; and so on.
Q-Learning is a basic form of Reinforcement Learning which uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.
We treat this history of interaction as a sequence of experiences, where an experience is a tuple
⟨s,a,r,s’⟩,
which means that the agent was in state s, it did action a, it received reward r, and it went into state s’.
In Q-learning, the agent maintains a table of Q[S,A], where S is the set of states and A is the set of actions. Q[s,a] represents its current estimate of Q*(s,a).
Q*(s,a), where s is a state and a is an action, is the expected value (cumulative discounted reward) of doing a in state s and then following the optimal policy.
An experience ⟨s,a,r,s’⟩ provides one data point for the value of Q(s,a). The data point is that the agent received the future value r + γV(s’), where V(s’) = max_a’ Q(s’,a’); this is the actual current reward plus the discounted estimated future value. This new data point is called a return. The agent can use the temporal difference equation to update its estimate for Q(s,a):
Q[s,a] ← Q[s,a] + α(r + γ max_a’ Q[s’,a’] − Q[s,a])
or, equivalently,
Q[s,a] ← (1 − α) Q[s,a] + α(r + γ max_a’ Q[s’,a’]).
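As a rough illustration, here is a minimal sketch of this update for a tabular agent in Python; the dictionary-based Q-table, the function name td_update, and the default values of α and γ are illustrative assumptions, not part of the algorithm's definition.

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated Q-values, defaulting to 0.
Q = defaultdict(float)

def td_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one temporal difference update for the experience ⟨s, a, r, s'⟩.

    alpha is the step size, gamma the discount factor, and `actions` lists the
    actions available in s_next, used to compute max_a' Q[s', a'].
    """
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    target = r + gamma * best_next              # the return: reward plus discounted future value
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # move the old estimate toward the target
    return Q[(s, a)]
```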
Q Function
The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).

When we start, all the values in the Q-table are zeros.
There is an iterative process of updating the values. As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.
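For example, when states and actions can be indexed by small integers, the Q-table can simply start as an array of zeros; the sizes below are arbitrary placeholders, not values from the text:

```python
import numpy as np

n_states, n_actions = 16, 4                 # placeholder sizes for a small task
Q_table = np.zeros((n_states, n_actions))   # every Q-value starts at zero

# Q_table[s, a] holds the current estimate of Q(s, a); it is refined by the
# temporal difference update each time a new experience ⟨s, a, r, s'⟩ arrives.
```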
Q-learning algorithm process

- Q-Learning is a value-based reinforcement learning algorithm which is used to find the optimal action-selection policy using a Q function.
- Our goal is to maximize the value function Q.
- The Q table helps us to find the best action for each state.
- It helps to maximize the expected reward by selecting the best of all possible actions.
- Q(state, action) returns the expected future reward of that action at that state.
- This function can be estimated using Q-Learning, which iteratively updates Q(s,a) using the Bellman equation.
- Initially we explore the environment and update the Q-table. Once the Q-table is ready, the agent starts to exploit the environment and takes better actions; a minimal sketch of this loop is given below.
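To make the explore-then-exploit process concrete, here is a minimal sketch of a complete tabular Q-learning loop with ε-greedy action selection; the toy chain environment, the hyperparameters, and the episode and step limits are all illustrative assumptions rather than anything prescribed above.

```python
import random

# Toy chain environment (illustrative): states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward 1 and ends the episode; every other step yields 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s_next, (1.0 if s_next == GOAL else 0.0), s_next == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # Q-table initialized to zeros
alpha, gamma, epsilon = 0.1, 0.9, 0.1              # step size, discount factor, exploration rate

for episode in range(500):
    s = 0
    for t in range(200):                            # cap episode length so the loop always ends
        # ε-greedy: explore with probability epsilon, otherwise exploit the Q-table.
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            best = max(Q[s])
            a = random.choice([i for i in range(N_ACTIONS) if Q[s][i] == best])
        s_next, r, done = step(s, a)
        # Temporal difference update toward r + γ max_a' Q[s', a'].
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
        if done:
            break

# After training, the greedy policy (argmax_a Q[s][a] in each state) moves right toward the goal.
```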
Linear Value Function Approximation
Each action gets its own vector of n weights, and its Q-value is the dot product of that weight vector with the state's feature vector.
The weights give the importance of each feature in contributing to an action's value.
Approximation accuracy is fundamentally limited by the information provided by the features.
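As a rough sketch of this idea (the feature dimension, the semi-gradient update, and the use of numpy are illustrative assumptions), each action's weight vector is combined with the state's features by a dot product:

```python
import numpy as np

n_features, n_actions = 8, 4               # illustrative sizes
W = np.zeros((n_actions, n_features))      # one weight vector per action

def q_value(features, a):
    """Approximate Q(s, a) as the dot product of action a's weights with the state features."""
    return W[a] @ features

def update(features, a, r, next_features, alpha=0.01, gamma=0.9):
    """Q-learning-style update for the linear approximator (semi-gradient form)."""
    target = r + gamma * max(q_value(next_features, b) for b in range(n_actions))
    td_error = target - q_value(features, a)
    W[a] += alpha * td_error * features    # for a linear model, the gradient w.r.t. W[a] is the feature vector
```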

Some terminology used in Q Learning equations
- s : Current state of the agent.
- a : Current action, picked according to some policy.
- s’ : Next state where the agent ends up.
- a’ : Next best action, picked using the current Q-value estimate, i.e. the action with the maximum Q-value in the next state.
- r : Current reward observed from the environment in response to the current action.
- γ (>0 and ≤1) : Discount factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since a Q-value is an estimate of the expected reward from a state, the discounting rule applies here as well.
- α : Step size (learning rate) used to update the estimate of Q(s,a).
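As a quick worked example with arbitrary numbers: with α = 0.1, γ = 0.9, r = 5, Q[s,a] = 2, and max_a’ Q[s’,a’] = 10, the update gives Q[s,a] ← 2 + 0.1 × (5 + 0.9 × 10 − 2) = 3.2.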