In Q-learning and related algorithms, an agent tries to learn the optimal policy from its history of interaction with the environment. The history of an agent is a sequence of state-action-rewards:
⟨s0,a0,r1,s1,a1,r2,s2,a2,r3,…⟩,
which means that the agent was in state s0 and did action a0, which resulted in it receiving reward r1 and being in state s1; then it did action a1, received reward r2, and ended up in state s2; then it did action a2, received reward r3, and ended up in state s3; and so on.
Q-Learning is a basic form of Reinforcement Learning which uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.
We treat this history of interaction as a sequence of experiences, where an experience is a tuple
⟨s,a,r,s’⟩,
which means that the agent was in state s, it did action a, it received reward r, and it went into state s’.
In Q-learning, the agent maintains a table of Q[S,A], where S is the set of states and A is the set of actions. Q[s,a] represents its current estimate of Q*(s,a).
Q*(s,a), where s is a state and a is an action, is the expected value (cumulative discounted reward) of doing a in state s and then following the optimal policy.
An experience ⟨s,a,r,s’⟩ provides one data point for the value of Q(s,a). The data point is that the agent received the future value r + γV(s’), where V(s’) = max_a’ Q(s’,a’); this is the actual current reward plus the discounted estimated future value. This new data point is called a return. The agent can use the temporal difference equation to update its estimate for Q(s,a):
Q[s,a] ← Q[s,a] + α(r + γ max_a’ Q[s’,a’] − Q[s,a])
or, equivalently,
Q[s,a] ← (1 − α) Q[s,a] + α(r + γ max_a’ Q[s’,a’]).
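As a rough illustration, here is a minimal sketch of this update for a tabular agent in Python; the dictionary-based Q-table, the function name td_update, and the default values of α and γ are illustrative assumptions, not part of the algorithm's definition.

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated Q-values, defaulting to 0.
Q = defaultdict(float)

def td_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one temporal difference update for the experience ⟨s, a, r, s'⟩.

    alpha is the step size, gamma the discount factor, and `actions` lists the
    actions available in s_next, used to compute max_a' Q[s', a'].
    """
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    target = r + gamma * best_next              # the return: reward plus discounted future value
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # move the old estimate toward the target
    return Q[(s, a)]
```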
Q Function
The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).

When we start, all the values in the Q-table are zeros.
There is an iterative process of updating the values. As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.
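For example, when states and actions can be indexed by small integers, the Q-table can simply start as an array of zeros; the sizes below are arbitrary placeholders, not values from the text:

```python
import numpy as np

n_states, n_actions = 16, 4                 # placeholder sizes for a small task
Q_table = np.zeros((n_states, n_actions))   # every Q-value starts at zero

# Q_table[s, a] holds the current estimate of Q(s, a); it is refined by the
# temporal difference update each time a new experience ⟨s, a, r, s'⟩ arrives.
```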
Q-learning algorithm process

- Q-Learning is a value-based reinforcement learning algorithm which is used to find the optimal action-selection policy using a Q function.
- Our goal is to maximize the value function Q.
- The Q table helps us to find the best action for each state.
- It helps to maximize the expected reward by selecting the best of all possible actions.
- Q(state, action) returns the expected future reward of that action at that state.
- This function can be estimated using Q-Learning, which iteratively updates Q(s,a) using the Bellman equation.
- Initially we explore the environment and update the Q-table. Once the Q-table is ready, the agent starts to exploit the environment and takes better actions; a minimal sketch of this loop is given below.
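To make the explore-then-exploit process concrete, here is a minimal sketch of a complete tabular Q-learning loop with ε-greedy action selection; the toy chain environment, the hyperparameters, and the episode and step limits are all illustrative assumptions rather than anything prescribed above.

```python
import random

# Toy chain environment (illustrative): states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward 1 and ends the episode; every other step yields 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s_next, (1.0 if s_next == GOAL else 0.0), s_next == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # Q-table initialized to zeros
alpha, gamma, epsilon = 0.1, 0.9, 0.1              # step size, discount factor, exploration rate

for episode in range(500):
    s = 0
    for t in range(200):                            # cap episode length so the loop always ends
        # ε-greedy: explore with probability epsilon, otherwise exploit the Q-table.
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            best = max(Q[s])
            a = random.choice([i for i in range(N_ACTIONS) if Q[s][i] == best])
        s_next, r, done = step(s, a)
        # Temporal difference update toward r + γ max_a' Q[s', a'].
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
        if done:
            break

# After training, the greedy policy (argmax_a Q[s][a] in each state) moves right toward the goal.
```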
Linear Value Function Approximation
Each action gets its own vector of n weights, and its Q-value is the dot product of that weight vector with the state's feature vector.
The weights give the importance of each feature in contributing to an action's value.
Approximation accuracy is fundamentally limited by the information provided by the features.
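As a rough sketch of this idea (the feature dimension, the semi-gradient update, and the use of numpy are illustrative assumptions), each action's weight vector is combined with the state's features by a dot product:

```python
import numpy as np

n_features, n_actions = 8, 4               # illustrative sizes
W = np.zeros((n_actions, n_features))      # one weight vector per action

def q_value(features, a):
    """Approximate Q(s, a) as the dot product of action a's weights with the state features."""
    return W[a] @ features

def update(features, a, r, next_features, alpha=0.01, gamma=0.9):
    """Q-learning-style update for the linear approximator (semi-gradient form)."""
    target = r + gamma * max(q_value(next_features, b) for b in range(n_actions))
    td_error = target - q_value(features, a)
    W[a] += alpha * td_error * features    # for a linear model, the gradient w.r.t. W[a] is the feature vector
```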

Some terminology used in Q Learning equations
- s : Current state of the agent.
- a : Current action, picked according to some policy.
- s’ : Next state where the agent ends up.
- a’ : Next best action, picked using the current Q-value estimate, i.e. the action with the maximum Q-value in the next state.
- r : Current reward observed from the environment in response to the current action.
- γ (>0 and ≤1) : Discount factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since a Q-value is an estimate of the expected reward from a state, the discounting rule applies here as well.
- α : Step size (learning rate) used to update the estimate of Q(s,a).
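As a quick worked example with arbitrary numbers: with α = 0.1, γ = 0.9, r = 5, Q[s,a] = 2, and max_a’ Q[s’,a’] = 10, the update gives Q[s,a] ← 2 + 0.1 × (5 + 0.9 × 10 − 2) = 3.2.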