Replacing done with truncated and terminated #10

Open
ashok-arora opened this issue Jul 1, 2024 · 9 comments
@ashok-arora

Hey Kevin,
I hope you are doing well. I noticed a small bug where the step function returns only obs, reward, done, info instead of obs, reward, terminated, truncated, info. I came across this article from Gymnasium that emphasised the need for both terminated and truncated. Can I help in updating the codebase?
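For reference, here is a minimal, self-contained sketch (toy classes, not code from DTQN or gym) of the two step() signatures being contrasted:

```python
# Hypothetical toy classes illustrating the two APIs; not from the DTQN codebase.

class OldStyleEnv:
    """Old gym API: step() returns (obs, reward, done, info)."""
    def step(self, action):
        obs, reward, done = 0.0, 0.0, False
        info = {"TimeLimit.truncated": False}  # truncation is hidden inside info
        return obs, reward, done, info

class NewStyleEnv:
    """New gymnasium API: step() returns (obs, reward, terminated, truncated, info)."""
    def step(self, action):
        obs, reward, terminated, truncated, info = 0.0, 0.0, False, False, {}
        return obs, reward, terminated, truncated, info

# The training loop has to unpack a different tuple depending on which API is used.
obs, reward, done, info = OldStyleEnv().step(0)
obs, reward, terminated, truncated, info = NewStyleEnv().step(0)
done = terminated or truncated
```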

@kevslinger
Owner

Hi Ashok,

This is a somewhat complicated issue. In the older versions of gym, this truncated variable was stored inside info; see DTQN/run.py, line 371 at commit 79ccf8b:

    if info.get("TimeLimit.truncated", False):

This new gym version with 5 return values is a breaking change, and updating to it would mean one could no longer run DTQN on the original environments. At the same time, leaving it in its current state would mean you cannot run DTQN on "modern" (updated) environments.

With that in mind, there are a few ways we can proceed with this:
1.) Do nothing, and leave the repo in its current (outdated) state.
2.) Update the repo to account for the new version of gym, while keeping the paper branch in its original, old form. Then experiments could be run on the main branch with new environments, and the paper branch with old environments.
3.) Create a new branch from either the main branch or the paper branch to accommodate the new gym environments, and allow people to switch between branches based on the version of their environment.
4.) Come up with a sort of gym wrapper to interface between the two styles of environments. We could either wrap the old environments so that they now return obs, reward, terminated, truncated, info instead of obs, reward, done, info, and then update the codebase accordingly, OR we could wrap the new environments so that they return obs, reward, done, info and put truncated inside the info variable, the same as the old method.

Probably option 4 is the best way. Let me know what you think. I will, of course, welcome pull requests towards this issue.
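To make the second variant of option 4 concrete, here is a rough sketch (the class name is made up, and it only illustrates the tuple conversion, not a drop-in implementation): wrap a new-style environment so it presents the old 4-tuple interface and stashes truncation inside info, the same way the old TimeLimit wrapper did.

```python
class OldStepAPI:
    """Hypothetical adapter: expose a new-style (5-tuple) env through the old
    4-tuple interface, moving truncation into info["TimeLimit.truncated"]."""

    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        result = self.env.reset(**kwargs)
        # New-style reset() returns (obs, info); old callers expect just obs.
        return result[0] if isinstance(result, tuple) else result

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        if truncated and not terminated:
            info["TimeLimit.truncated"] = True
        return obs, reward, done, info

    def __getattr__(self, name):
        # Delegate everything else (action_space, observation_space, ...) to the wrapped env.
        return getattr(self.env, name)
```

With something like this, the rest of the codebase could keep unpacking obs, reward, done, info unchanged.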

@ashok-arora
Author

Thank you for the detailed response.

Yeah, the fourth option seems like the most straightforward way, since splitting branches may lead to confusion for future users. To handle both cases, we can pass an arg of maybe --gym old (returns obs, reward, done, info) or --gym new (returns obs, reward, terminated, truncated, info) that handles it accordingly, with --gym old being the default to maintain compatibility with the paper branch.
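One possible wiring for that flag (the argument name and default are just the proposal above, and the adapter name refers to the hypothetical sketch in the previous comment):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag: default to the old 4-tuple behaviour so the paper-branch
# experiments keep running unchanged.
parser.add_argument(
    "--gym",
    choices=["old", "new"],
    default="old",
    help="Which step() API the chosen environment uses.",
)
args = parser.parse_args(["--gym", "new"])  # example invocation

# if args.gym == "new":
#     env = OldStepAPI(env)  # adapter sketched earlier: squeeze 5 return values back to 4
```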

On a sidenote, I am wondering how the check in DTQN/run.py, line 371 at commit 79ccf8b:

    if info.get("TimeLimit.truncated", False):

is able to know the maximum time after which the episode should be truncated. Would it make more sense to define the time limit ourselves, depending on the velocity and the distance between heaven and hell?

@kevslinger
Owner

That sounds like a good idea; we can use the gym-version flag to determine whether we need to wrap the environment in a gym wrapper to shrink the returned tuple from 5 values to 4.

As for the TimeLimit.truncated part, that is a bit of old gym magic -- by supplying the max_episode_steps argument when registering an environment with gym, the backend wraps the environment in a TimeLimit wrapper, which then ends the episode and sets TimeLimit.truncated: True in the info as part of the return of the step function. For example,

    max_episode_steps=200,

means the episode will always end after 200 steps. I don't think we need to do anything to change that.
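Roughly what that registration looks like in old gym (the id and entry_point below are made up for illustration):

```python
from gym.envs.registration import register

# Hypothetical registration; passing max_episode_steps makes gym.make() wrap
# the environment in TimeLimit, which ends the episode after 200 steps and
# sets info["TimeLimit.truncated"] = True when the limit (rather than a
# terminal state) is the reason the episode ended.
register(
    id="HeavenHell-v0",               # made-up id
    entry_point="envs.my_env:MyEnv",  # made-up module path
    max_episode_steps=200,
)
```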

@ashok-arora
Author

> That sounds like a good idea; we can use the gym-version flag to determine whether we need to wrap the environment in a gym wrapper to shrink the returned tuple from 5 values to 4.

Okay sure, I'll send in a PR with the gym-version flag.

> As for the TimeLimit.truncated part, that is a bit of old gym magic -- by supplying the max_episode_steps argument when registering an environment with gym, the backend wraps the environment in a TimeLimit wrapper, which then ends the episode and sets TimeLimit.truncated: True in the info as part of the return of the step function. For example, max_episode_steps=200 means the episode will always end after 200 steps. I don't think we need to do anything to change that.

And what is the basis for the choice of 200 for max_episode_steps, and similarly for the values used in the other environments?

@kevslinger
Owner

In general, we want to give the agents the ability to successfully end an episode while exploring, so max_episode_steps needs to be large enough for that to be possible. However, we don't want to allow episodes to go on forever, as this may fill the replay buffer with many useless experiences. This is not a value I spent a lot of time tuning.

@ashok-arora
Author

ashok-arora commented Aug 21, 2024

Apologies for the delay; I got sidetracked due to coursework. I am working on this now, and the info dict of the gridverse envs is empty ({}) for all timesteps. Is it supposed to be so?

@kevslinger
Owner

Hi Ashok, no problem. The gym-gridverse environments have a time limit of 500; see

    env = TimeLimit(GridVerseWrapper(env), max_episode_steps=500)

So after 500 timesteps the info dict should be {"TimeLimit.truncated": True}. Let me know if you are still seeing an empty dictionary for info even after 500 timesteps, as that would be a bug we'd need to address.

@ashok-arora
Author

Hey Kevin, I tried to modify the code in two places: first in the gv_wrapper.py file and then in dtqn/agents/dqn.py. Following is the output:

1. Changes to envs/gv_wrapper.py:

       diff --git a/envs/gv_wrapper.py b/envs/gv_wrapper.py
       index df03c00..0b9a21a 100644
       --- a/envs/gv_wrapper.py
       +++ b/envs/gv_wrapper.py
       @@ -17,14 +17,24 @@ class GridVerseWrapper(gym.Wrapper):
                        )
                    ]
                )
       +        self.time = 0
        
            def reset(self, seed: Optional[int] = None):
       +        self.time = 0
                if seed is not None:
                    self._np_random, seed = seeding.np_random(seed)
                obs = self.env.reset()
                return (obs["grid"][:, :, 0] + obs["grid"][:, :, 2]).flatten()
        
            def step(self, action: Union[int, np.ndarray]):
       +        self.time += 1
                obs, reward, done, info = self.env.step(action)
                obs = (obs["grid"][:, :, 0] + obs["grid"][:, :, 2]).flatten()
       +        print(self.time, info, done)
                return obs, reward, done, info

   which gave output:

       495 {} False
       496 {} False
       497 {} False
       498 {} False
       499 {} False
       500 {} False
       [ September 01, 19:28:36 ] Eval #2 Success Rate: 0.00, Return: -25.00, Episode Length: 500.00, Hours: 0.07

2. Changes to dtqn/agents/dqn.py:

       diff --git a/dtqn/agents/dqn.py b/dtqn/agents/dqn.py
       index ca3ec34..90dd4ba 100644
       --- a/dtqn/agents/dqn.py
       +++ b/dtqn/agents/dqn.py
       @@ -87,6 +87,7 @@ class DqnAgent:
        
                # Reset the environment
                self.current_obs = np.array([self.env.reset()]).flatten().copy()
       +        self.time = 0
        
            def prepopulate(self, prepop_steps: int) -> None:
                """Prepopulate the replay buffer with `prepop_steps` of experience"""
       @@ -115,6 +116,7 @@ class DqnAgent:
                self.current_obs = obs.copy()
        
            def step(self) -> bool:
       +        self.time += 1
                """Take one step of the environment"""
                action = self.get_action(
                    np.expand_dims(self.current_obs, axis=0), epsilon=self.exp_coef.val
       @@ -122,6 +124,8 @@ class DqnAgent:
                obs, reward, done, info = self.env.step(action)
                obs = np.array([obs]).flatten().copy()
        
       +        print(self.time, info.get("TimeLimit.truncated", False), done)
                if info.get("TimeLimit.truncated", False):
                    buffer_done = False
                else:
       @@ -139,6 +143,10 @@ class DqnAgent:
                    self.current_episode += 1
                    obs = self.env.reset()
        
       +        if done or info.get("TimeLimit.truncated", False):
       +            self.time = 0
                self.current_obs = np.array([obs]).flatten().copy()

   which gave output:

       498 False False
       499 False False
       500 True True
    

It seems a bit inconsistent because the wrapper should have reflected the correct info dict in 1 and both values shouldn't have been True in 2. What do you think?

@kevslinger
Owner

Hi Ashok, very sorry I have left you hanging for so long.

It may seem surprising, but I believe this is the intended behaviour for the 2 places you put print statements. For the first example, it returns an empty dict and False because the GridverseWrapper class is the inner env for the TimeLimit wrapper:

    env = TimeLimit(GridVerseWrapper(env), max_episode_steps=500)

We can take a look at the step function of the TimeLimit class from gym 0.18:

    def step(self, action):
        assert self._elapsed_steps is not None, "Cannot call env.step() before calling reset()"
        observation, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self._max_episode_steps:
            info['TimeLimit.truncated'] = not done
            done = True
        return observation, reward, done, info

We can see it calls the inner (in this case, GridverseWrapper) step function first, which returns an empty dict and False, and only then adds the "TimeLimit.truncated" key and sets done = True.

This also explains why both are True in the 2nd example. Again, sorry for taking so long to respond. I hope this helps.
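A tiny, self-contained sketch of that ordering (toy classes standing in for the real gym/gridverse ones, just to show where the mutation happens):

```python
# Toy illustration: the inner wrapper's print never sees "TimeLimit.truncated",
# because the outer TimeLimit-style wrapper only mutates info *after* the
# inner step() has returned.

class ToyEnv:
    def step(self, action):
        return 0, 0.0, False, {}           # obs, reward, done, info

class InnerWrapper:
    def __init__(self, env):
        self.env = env
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        print("inner sees:", info, done)   # prints {} False, even on the last step
        return obs, reward, done, info

class ToyTimeLimit:
    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed = 0
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_episode_steps:
            info["TimeLimit.truncated"] = not done
            done = True
        return obs, reward, done, info     # only the caller sees the mutation

env = ToyTimeLimit(InnerWrapper(ToyEnv()), max_episode_steps=3)
for t in range(3):
    obs, reward, done, info = env.step(0)
print("agent sees:", info, done)            # {'TimeLimit.truncated': True} True
```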
