# MineRL RLlib Baselines Implementation Goals

## Environments

### Treechop

  1. MineRLTreechopVectorObf-v0

### Navigate

  1. MineRLNavigateVectorObf-v0
  2. MineRLNavigateExtremeVectorObf-v0
  3. MineRLNavigateDenseVectorObf-v0
  4. MineRLNavigateExtremeDenseVectorObf-v0

### Obtain Diamond

  1. MineRLObtainDiamondVectorObf-v0 (competition evaluation task)
  2. MineRLObtainDiamondDenseVectorObf-v0

### Obtain Iron Pickaxe

We will skip this category because it is a task subset of Obtain Diamond. However, the human demonstration data for this category may still be valuable.

  1. MineRLObtainIronPickaxeVectorObf-v0
  2. MineRLObtainIronPickaxeDenseVectorObf-v0
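
All of the environments above are registered with Gym when the `minerl` package is imported. A minimal sketch of instantiating the competition evaluation task (assumes a working MineRL/Malmo installation):

```python
import gym
import minerl  # noqa: F401 -- importing registers the MineRL environments with gym

env = gym.make("MineRLObtainDiamondVectorObf-v0")
obs = env.reset()
# Obfuscated-vector observations are a dict with a 64x64x3 "pov" image
# and a 64-dimensional "vector".
print(obs["pov"].shape, obs["vector"].shape)
```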

## Metrics (TODO)

  1. Final episode reward
  2. Final episode reward with human-normalized performance (see the sketch after this list)
  3. Sample efficiency (episode reward after 0, 100k, 1M, and 8M environment samples)
  4. Episode reward curves
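
A minimal sketch of one possible human-normalized score, assuming per-task baseline scores for a random policy and for the human demonstrations are available; the exact normalization is still TODO and this formula is only an assumption:

```python
def human_normalized_score(agent_reward: float,
                           random_reward: float,
                           human_reward: float) -> float:
    """Hypothetical normalization: 0.0 = random policy, 1.0 = average human.

    `random_reward` and `human_reward` would be measured per task, e.g. by
    evaluating a random policy and averaging the human demonstration returns.
    """
    return (agent_reward - random_reward) / (human_reward - random_reward)
```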

## Action Space

  1. Continuous
    1. Naturally supported by the MineRL obfuscated vector action space of shape (64,)
  2. Discrete
    1. Run k-means clustering on the human demonstration actions and use the cluster centers as a discrete action space (see the sketch after this list)
    2. Helps lessen the exploration problem at the cost of restricting the action space
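
A minimal sketch of the k-means discretization, assuming the human demonstrations have been downloaded and `MINERL_DATA_ROOT` points at them; the cluster count of 32 is an arbitrary placeholder:

```python
import numpy as np
import minerl
from sklearn.cluster import KMeans

# Collect the 64-d obfuscated action vectors from the human demonstrations.
data = minerl.data.make("MineRLObtainDiamondVectorObf-v0")
actions = []
for _, action, _, _, _ in data.batch_iter(batch_size=16, seq_len=32, num_epochs=1):
    actions.append(action["vector"].reshape(-1, 64))
actions = np.concatenate(actions)

# The cluster centers become the discrete action set.
kmeans = KMeans(n_clusters=32, random_state=0).fit(actions)
discrete_actions = kmeans.cluster_centers_  # shape (32, 64)

# At rollout time, discrete action index i maps back to the continuous
# MineRL action {"vector": discrete_actions[i]}.
```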

## Observation Space

  1. Tuple observations: image (64, 64, 3) and vector (64,) (sketched below)
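
Expressed as Gym spaces, a sketch of this tuple; the vector bounds are an assumption, and MineRL itself exposes a dict observation with `pov` and `vector` keys that would be unpacked into this tuple:

```python
import numpy as np
from gym import spaces

observation_space = spaces.Tuple((
    spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8),       # pov image
    spaces.Box(low=-np.inf, high=np.inf, shape=(64,), dtype=np.float32),  # obf vector
))
```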

## Model

  1. Convolutional Neural Network for image observations
  2. Concatenate the CNN's hidden output with the vector observations
  3. Feed the concatenation into a feed-forward network (and possibly an RNN) to produce a latent state representation
  4. Use the latent state representation as input to the policy, value, and Q networks (see the sketch after this list)
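
A rough PyTorch sketch of this architecture; layer sizes and the discrete action count are placeholders, and the optional RNN is omitted:

```python
import torch
import torch.nn as nn

class MineRLNet(nn.Module):
    """CNN over the 64x64x3 image, concatenated with the 64-d vector obs."""

    def __init__(self, latent_dim: int = 256, num_actions: int = 32):
        super().__init__()
        # Convolutional encoder for the image observation (channels-first input).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        cnn_out = 64 * 4 * 4  # feature size for a 64x64 input with the strides above
        # Feed-forward trunk over [image features ; vector obs] -> latent state.
        self.trunk = nn.Sequential(
            nn.Linear(cnn_out + 64, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
        )
        # Heads consuming the shared latent state representation.
        self.policy_head = nn.Linear(latent_dim, num_actions)
        self.value_head = nn.Linear(latent_dim, 1)

    def forward(self, image: torch.Tensor, vector: torch.Tensor):
        features = self.cnn(image)                                   # (B, cnn_out)
        latent = self.trunk(torch.cat([features, vector], dim=-1))   # (B, latent_dim)
        return self.policy_head(latent), self.value_head(latent)
```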

## RL Setting

### Online

  1. Online exploration
    1. Use each algorithm's default exploration strategy
  2. Learn the policy from data sampled from the environment
  3. Both on-policy and off-policy RL algorithms (example config after this list)
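
A minimal RLlib sketch of the online setting with PPO as an example on-policy algorithm; it assumes a wrapper that converts MineRL's dict observations/actions into the spaces described above has already been registered, which is omitted here:

```python
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",  # example on-policy algorithm; off-policy ones (e.g. SAC) also apply
    config={
        "env": "MineRLTreechopVectorObf-v0",  # or a registered wrapper env
        "framework": "torch",
        "num_workers": 2,
        # Exploration is left at the algorithm's defaults.
    },
    stop={"timesteps_total": 8_000_000},
)
```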

### Offline

  1. No exploration
  2. Learn the policy from the human demonstration data
  3. Only off-policy RL algorithms (example config after this list)
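
A sketch of the offline setting through RLlib's offline input API, assuming the human demonstrations have been converted to RLlib's JSON batch format at the hypothetical path below; DQN stands in for any off-policy algorithm:

```python
from ray import tune

tune.run(
    "DQN",  # example off-policy algorithm
    config={
        "env": "MineRLObtainDiamondVectorObf-v0",
        "framework": "torch",
        "input": "/data/minerl_obtain_diamond_rllib_json",  # hypothetical converted human data
        "explore": False,          # no exploration in the offline setting
        "input_evaluation": [],    # skip off-policy estimation for simplicity
    },
)
```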

### Mixed

  1. Online exploration in the environment
  2. Learn from both environment-sampled data and human demonstration data
  3. Only off-policy RL algorithms (example config below)
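
A sketch of the mixed setting using RLlib's weighted-input feature, which mixes environment sampling ("sampler") with the logged human data in fixed proportions; the path is hypothetical and the 50/50 split is a placeholder:

```python
from ray import tune

tune.run(
    "DQN",  # off-policy algorithm required when mixing in logged data
    config={
        "env": "MineRLObtainDiamondVectorObf-v0",
        "framework": "torch",
        "input": {
            "sampler": 0.5,                                 # online environment samples
            "/data/minerl_obtain_diamond_rllib_json": 0.5,  # human demonstrations
        },
    },
)
```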