First, we need to define the states, actions, rewards, and transition probabilities of the problem. The states must be defined so that the Markov property holds: the next state and reward depend only on the current state and the action taken, not on the earlier history. Then we can use dynamic programming within the reinforcement learning framework [1] to compute the optimal state-value function, which gives the maximum expected long-term reward obtainable from each state.
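As a rough illustration of this recipe, here is a minimal sketch of value iteration, one of the dynamic programming methods described in [1]. The three-state MDP, its transition probabilities, its rewards, and the discount factor `GAMMA` are all made up for this example, and the names `P`, `value_iteration`, and `greedy_policy` are hypothetical, not taken from any library.

```python
# Hypothetical 3-state, 2-action MDP; every number here is an assumption
# chosen only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples, so the
# next state and reward depend only on the current state and action
# (the Markov property).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 1.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing state, no reward
}
GAMMA = 0.9  # assumed discount factor


def action_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(action_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        if delta < tol:
            return V


def greedy_policy(P, V, gamma):
    """Pick, in each state, an action that maximizes the backed-up value."""
    return {
        s: max(P[s], key=lambda a: action_value(P, V, s, a, gamma))
        for s in P
    }


V = value_iteration(P, GAMMA)
policy = greedy_policy(P, V, GAMMA)
print(V)
print(policy)
```

The sketch assumes the transition probabilities and rewards are known in advance, which is what makes dynamic programming applicable. Once the optimal state-value function has converged, acting greedily with respect to it yields an optimal policy.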
References
[1] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.