I’m sure someone has thought of this, but I didn’t look.
So, what if we delayed updating the Q-values and the policy for x time steps? We would keep a history of the rewards over that window and compute an average reward. Initialize the policy so each action has probability 1/|A|, which lets us start with a large x. Then we update the Q-values and the policy for each action based on that average reward. Finally, we shrink x over time according to a damped sine wave.
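Here is a rough tabular Python sketch of the idea. The class name, the sine-wave constants, and the policy-nudge step size are placeholder choices I picked just for illustration, not fixed parts of the proposal:

```python
import math
import random
from collections import defaultdict

class DelayedQAgent:
    """Buffers rewards for x steps, then applies one Q/policy update
    from the averaged reward; x shrinks over time via a damped sine wave."""

    def __init__(self, actions, alpha=0.1, base_x=50, decay=0.001):
        self.actions = list(actions)
        self.alpha = alpha              # learning rate
        self.base_x = base_x            # initial delay length x
        self.decay = decay              # damping rate for the sine wave
        self.q = defaultdict(float)     # Q-value per (state, action)
        # uniform start: every action gets probability 1/|A|
        self.policy = defaultdict(
            lambda: {a: 1.0 / len(self.actions) for a in self.actions}
        )
        self.history = []               # (state, action, reward) awaiting update
        self.t = 0                      # global time step

    def delay_length(self):
        # damped sine wave: an oscillating delay that shrinks as t grows
        x = self.base_x * math.exp(-self.decay * self.t) * abs(math.sin(0.01 * self.t)) + 1
        return max(1, int(x))

    def act(self, state):
        probs = self.policy[state]
        return random.choices(self.actions, weights=[probs[a] for a in self.actions])[0]

    def observe(self, state, action, reward):
        self.history.append((state, action, reward))
        self.t += 1
        if len(self.history) >= self.delay_length():
            self._flush()

    def _flush(self):
        # one update from the averaged reward over the whole delay window
        avg_reward = sum(r for _, _, r in self.history) / len(self.history)
        for state, action, _ in self.history:
            self.q[(state, action)] += self.alpha * (avg_reward - self.q[(state, action)])
            # nudge the policy toward the currently best-valued action
            best = max(self.actions, key=lambda a: self.q[(state, a)])
            for a in self.actions:
                target = 1.0 if a == best else 0.0
                self.policy[state][a] += 0.05 * (target - self.policy[state][a])
        self.history.clear()
```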
This is similar to a lenient learner, but it may also work against competitive players. The averaging window acts as a kind of learning period for figuring out what type of opponent I am playing against, and that is the part that resembles the lenient learner.