Q-Learning with delayed updates

I’m sure someone has thought of this, but I didn’t look.

So, what if we delayed the update of the Q-value and the policy for x time steps?  During that window we would keep track of the reward history and average it.  The policy starts with probability 1/|A| for each action, so that we can begin with a large x.  At the end of each window we update the Q-values and the policy for each action based on that average reward.  Finally, we can vary x over time according to a damped sine wave.
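Here is a rough sketch of how that loop might look. It is only one reading of the idea, and it assumes several things I did not pin down above: a single-state (stateless) game, a separate average per action rather than one pooled average, a softmax to turn Q-values into a policy, and a sine that is phase-shifted so x starts at its largest value. The names `window_length`, `delayed_q_learning`, `reward_fn`, and all parameter defaults are made up for illustration.

```python
import math
import random


def window_length(k, x0=50, decay=0.05, freq=0.5):
    """Delay x for the k-th window: a damped sine, phase-shifted to start at x0."""
    wave = math.sin(freq * k + math.pi / 2)  # phase shift so the wave starts at 1
    return max(1, round(x0 * math.exp(-decay * k) * abs(wave)))


def softmax(q, temperature=1.0):
    """Turn Q-values into action probabilities (an assumed policy rule)."""
    m = max(q)
    exps = [math.exp((v - m) / temperature) for v in q]
    total = sum(exps)
    return [e / total for e in exps]


def delayed_q_learning(reward_fn, n_actions, n_windows=200, alpha=0.1, seed=0):
    rng = random.Random(seed)
    q = [0.0] * n_actions
    policy = [1.0 / n_actions] * n_actions  # uniform start: 1/|A| per action

    for k in range(n_windows):
        x = window_length(k)
        totals = [0.0] * n_actions
        counts = [0] * n_actions

        # Act for x steps without touching q or the policy; only record history.
        for _ in range(x):
            a = rng.choices(range(n_actions), weights=policy)[0]
            totals[a] += reward_fn(a, rng)
            counts[a] += 1

        # Delayed update: move each played action's Q-value toward its average reward.
        for a in range(n_actions):
            if counts[a]:
                avg = totals[a] / counts[a]
                q[a] += alpha * (avg - q[a])

        policy = softmax(q)

    return q, policy
```

For example, `delayed_q_learning(lambda a, rng: 1.0 if a == 1 else 0.0, n_actions=2)` plays a toy game where only action 1 pays off. Against a real opponent, `reward_fn` would be replaced by whatever the game returns for the joint action.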

This is similar to a lenient learner, but it may also work against competitive players.  The window over which I average acts as a learning period for working out what type of opponent or player I am facing, and that is the part that resembles the lenient learner.

