Regarding the stagnation aspect...
In the usage section you wrote:
try a tradeoff between exploration with 1 and exploitation with 0 by choosing e.g. an epsilon of 0.5. This basically helps to avoid to be stuck in local maxima and avoids the stagnation of the system. Sometimes stagnation might be wanted, in our case not as it would mean that first visitors of the installation could teach the swarm while later the visitors can only experience and not change the selection of actions anymore.
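For reference, epsilon-greedy selection as described in that passage can be sketched as follows (function and variable names are mine, not from your report):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))            # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

With epsilon = 1 the agent only explores, with epsilon = 0 it only exploits, and 0.5 gives the tradeoff you describe.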
While you are right about the ratio between exploration and exploitation, I don't think the conclusion is correct.
Even if your epsilon is 0 and the agent therefore always picks the same action, you can still punish the agent for doing so, which modifies the Q-table. Once the old maximum has been pushed low enough, a different Q-value becomes the maximum, leading to a new behaviour.
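A minimal single-state sketch of this mechanism, using the standard tabular Q-learning update (the concrete Q-values and reward are hypothetical, just to illustrate):

```python
def q_update(q, action, reward, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update in a single-state setting:
    Q(a) <- Q(a) + alpha * (reward + gamma * max(Q) - Q(a))."""
    q[action] += alpha * (reward + gamma * max(q) - q[action])

q = [1.0, 0.5, 0.2]  # action 0 is currently the greedy choice
greedy = lambda: max(range(len(q)), key=lambda a: q[a])

# epsilon = 0: the agent keeps picking action 0, but each punishment
# lowers Q(0) until another action becomes the maximum.
while greedy() == 0:
    q_update(q, 0, reward=-1.0)
```

After the loop, the greedy action is no longer action 0, even though no exploration ever happened.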