Interface for exploration policy #10
Comments
Yeah. We also need an interface for custom decay schedules for epsilon.
Hmm... yes, this is a good question. I think the third option is reasonable. We might consider calling it `ExplorationStrategy`, I think. We should also think about exactly what the arguments should be. Is `action(::ExplorationStrategy, on_policy_action, obs, rng)` the right signature? We could also consider leaving out the on-policy action from the call altogether (note: I might be saying some of the wrong words because I have less experience with RL).
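As a minimal sketch of what that signature could look like (the `ExplorationStrategy` subtype, its fields, and the action set here are illustrative assumptions, not an existing API):

```julia
using Random

abstract type ExplorationStrategy end

# hypothetical epsilon-greedy strategy; the name and fields are illustrative only
struct EpsGreedy{A} <: ExplorationStrategy
    eps::Float64
    actions::Vector{A}
end

# the signature discussed above: the on-policy action is passed in explicitly
function action(strat::EpsGreedy, on_policy_action, obs, rng::AbstractRNG)
    if rand(rng) < strat.eps
        return rand(rng, strat.actions) # explore: uniform random action
    else
        return on_policy_action         # exploit: keep the on-policy action
    end
end
```

Leaving `on_policy_action` out of the call would instead require the strategy to query the current policy itself.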
We can't pass in just the actions, because certain exploration strategies need access to the policy's values, not just the on-policy action.
Ah, I see - that makes sense
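For illustration, here is a hedged sketch of a strategy that needs the policy's values rather than just an action. It assumes an accessor along the lines of `actionvalues(policy, obs)`; the strategy type, the accessor, and the argument list are assumptions for this sketch, not part of any package:

```julia
using Random

abstract type ExplorationStrategy end

# hypothetical Boltzmann/softmax-style strategy: it needs the policy's action
# values, so passing only the on-policy action (or only the action set) is not enough
struct BoltzmannExploration <: ExplorationStrategy
    temperature::Float64
end

function action(strat::BoltzmannExploration, policy, actions, obs, rng::AbstractRNG)
    vals = actionvalues(policy, obs)          # assumed accessor for the policy's values
    weights = exp.(vals ./ strat.temperature) # softmax weights over the actions
    weights ./= sum(weights)
    i = something(findfirst(cumsum(weights) .>= rand(rng)), length(actions))
    return actions[i]                         # sample proportionally to the weights
end
```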
Did you mean "Should that be in another package then?" or "Should we create another package?" I think yes, it should be somewhere besides one of the learning packages, but I would hope to not create a new one. My philosophy on packages has changed a lot since we broke up POMDPToolbox. Now I think it would have been much better to create better documentation and perhaps use submodules than to have a bunch of small packages!
For now I think this could live in POMDPPolicies.jl.
Do we really want to have ...?
Suggestion:

```julia
using POMDPs
using Random

abstract type AbstractSchedule end # define linear decay, exponential decay, and more

# each concrete schedule implements this and returns the updated value of epsilon
function update_value! end # update_value!(::AbstractSchedule, ::Real)

mutable struct EpsGreedyPolicy{A} <: Policy
    eps::Real
    schedule::AbstractSchedule
    policy::Policy
    rng::AbstractRNG
    actions::Vector{A}
end

function POMDPs.action(p::EpsGreedyPolicy, s)
    p.eps = update_value!(p.schedule, p.eps) # update the value of epsilon according to the schedule
    if rand(p.rng) < p.eps
        return rand(p.rng, p.actions) # exploratory random action
    else
        return action(p.policy, s)    # greedy action from the wrapped policy
    end
end
```
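To make the sketch above concrete, a schedule under that interface might look like the following; the `LinearDecaySchedule` name, its fields, and the usage lines are illustrative, not an existing implementation:

```julia
# hypothetical concrete schedule: decrease epsilon by a fixed step on every
# call, down to a minimum value
struct LinearDecaySchedule <: AbstractSchedule
    step::Float64
    eps_min::Float64
end

update_value!(s::LinearDecaySchedule, eps::Real) = max(eps - s.step, s.eps_min)

# illustrative usage, assuming `greedy` is some existing Policy over actions 1:4
# p = EpsGreedyPolicy(1.0, LinearDecaySchedule(1e-4, 0.05), greedy, MersenneTwister(1), collect(1:4))
# a = action(p, s)
```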
Should ...?
No! I will submit a proper PR next week; this is just to give you an overview of the idea.
What would be a good interface for specifying the exploration policy?

It is implemented differently here and in DeepQLearning.jl:

- Here, the implementation relies on `EpsGreedyPolicy` and uses the internals of that policy to access the Q values. I think this is pretty bad: `EpsGreedyPolicy` should be agnostic to the type of policy used for the greedy part (right now it assumes a tabular policy, I think), and if we improve `EpsGreedyPolicy` then the code here will break.
- In DeepQLearning.jl, the user must pass in a function `f`, and `f(policy, env, obs, global_step, rng)` will be called to return the action. I took inspiration from MCTS.jl for this. However, it is not super convenient to define a decaying epsilon schedule with this approach.
- A third option would be `action(::ExplorationPolicy, current_policy, env, obs, rng)`. Dispatching on the type of `ExplorationPolicy` and having users implement their own type seems more Julian than passing a function. On the other hand, this `action` method is not super consistent with the rest of the POMDPs.jl interface, since it takes the current policy and the environment as input.

Any thoughts?
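A minimal sketch of what the third option could look like in user code, assuming a hypothetical `EpsGreedyExploration` type and an `actions(env)` accessor for the environment's action set (none of these names are an existing API):

```julia
using Random

abstract type ExplorationPolicy end

# a user-defined exploration type under the proposed interface (illustrative)
struct EpsGreedyExploration <: ExplorationPolicy
    eps::Float64
end

# proposed signature: dispatch on the exploration type; the current (greedy)
# policy and the environment are passed in explicitly
function action(e::EpsGreedyExploration, current_policy, env, obs, rng::AbstractRNG)
    if rand(rng) < e.eps
        return rand(rng, collect(actions(env))) # assumes an `actions(env)` accessor
    else
        return action(current_policy, obs)      # defer to the current policy
    end
end
```

The appeal is that adding a new exploration scheme only requires a new subtype and one `action` method; the cost, as noted above, is the extra `current_policy` and `env` arguments compared to the rest of the POMDPs.jl `action` interface.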