Replacing done with truncated and terminated #10

Open
ashok-arora opened this issue Jul 1, 2024 · 9 comments
@ashok-arora

Hey Kevin,
I hope you are doing well. I noticed a small bug where the step function returns only obs, reward, done, info instead of obs, reward, terminated, truncated, info. I came across this article from Gymnasium that emphasised the need for both terminated and truncated. Can I help in updating the codebase?
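For reference, here is a minimal, self-contained sketch (toy classes, not code from DTQN or gym) of the two step() signatures being contrasted:

```python
# Hypothetical toy classes illustrating the two APIs; not from the DTQN codebase.

class OldStyleEnv:
    """Old gym API: step() returns (obs, reward, done, info)."""
    def step(self, action):
        obs, reward, done = 0.0, 0.0, False
        info = {"TimeLimit.truncated": False}  # truncation is hidden inside info
        return obs, reward, done, info

class NewStyleEnv:
    """New gymnasium API: step() returns (obs, reward, terminated, truncated, info)."""
    def step(self, action):
        obs, reward, terminated, truncated, info = 0.0, 0.0, False, False, {}
        return obs, reward, terminated, truncated, info

# The training loop has to unpack a different tuple depending on which API is used.
obs, reward, done, info = OldStyleEnv().step(0)
obs, reward, terminated, truncated, info = NewStyleEnv().step(0)
done = terminated or truncated
```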

@kevslinger
Owner

Hi Ashok,

This is a somewhat complicated issue. In the older versions of gym, this truncated variable was stored inside info; see DTQN/run.py, line 371 at commit 79ccf8b:

    if info.get("TimeLimit.truncated", False):

This new gym version with 5 return values is a breaking change, and updating to it would mean one could no longer run DTQN on the original environments. At the same time, leaving it in its current state would mean you cannot run DTQN on "modern" (updated) environments.

With that in mind, there are a few ways we can proceed with this:
1.) Do nothing, and leave the repo in its current (outdated) state.
2.) Update the repo to account for the new version of gym, while keeping the paper branch in its original, old form. Then experiments could be run on the main branch with new environments, and the paper branch with old environments.
3.) Create a new branch from either the main branch or the paper branch to accommodate the new gym environments, and allow people to switch between branches based on the version of their environment.
4.) Come up with a sort of gym wrapper to interface between the two styles of environments. We could either wrap the old environments so that they now return obs, reward, terminated, truncated, info instead of obs, reward, done, info, and then update the codebase accordingly, OR we could wrap the new environments so that they return obs, reward, done, info and put truncated inside the info variable, the same as the old method.

Probably option 4 is the best way. Let me know what you think. I will, of course, welcome pull requests towards this issue.
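To make the second variant of option 4 concrete, here is a rough sketch (the class name is made up, and it only illustrates the tuple conversion, not a drop-in implementation): wrap a new-style environment so it presents the old 4-tuple interface and stashes truncation inside info, the same way the old TimeLimit wrapper did.

```python
class OldStepAPI:
    """Hypothetical adapter: expose a new-style (5-tuple) env through the old
    4-tuple interface, moving truncation into info["TimeLimit.truncated"]."""

    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        result = self.env.reset(**kwargs)
        # New-style reset() returns (obs, info); old callers expect just obs.
        return result[0] if isinstance(result, tuple) else result

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        if truncated and not terminated:
            info["TimeLimit.truncated"] = True
        return obs, reward, done, info

    def __getattr__(self, name):
        # Delegate everything else (action_space, observation_space, ...) to the wrapped env.
        return getattr(self.env, name)
```

With something like this, the rest of the codebase could keep unpacking obs, reward, done, info unchanged.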

@ashok-arora
Author

Thank you for the detailed response.

Yeah, the fourth option seems like the most straightforward way, since splitting branches may lead to confusion for future users. To handle both cases, we can pass an arg of maybe --gym old (returns obs, reward, done, info) or --gym new (returns obs, reward, terminated, truncated, info) that handles it accordingly, with --gym old being the default to maintain compatibility with the paper branch.
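One possible wiring for that flag (the argument name and default are just the proposal above, and the adapter name refers to the hypothetical sketch in the previous comment):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag: default to the old 4-tuple behaviour so the paper-branch
# experiments keep running unchanged.
parser.add_argument(
    "--gym",
    choices=["old", "new"],
    default="old",
    help="Which step() API the chosen environment uses.",
)
args = parser.parse_args(["--gym", "new"])  # example invocation

# if args.gym == "new":
#     env = OldStepAPI(env)  # adapter sketched earlier: squeeze 5 return values back to 4
```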

On a sidenote, I am wondering how the check in DTQN/run.py, line 371 at commit 79ccf8b:

    if info.get("TimeLimit.truncated", False):

is able to know the maximum time after which the episode should be truncated. Would it make more sense to define the time limit ourselves, depending on the velocity and the distance between heaven and hell?

@kevslinger
Owner

That sounds like a good idea; we can use the gym-version flag to determine whether we need to wrap the environment in a gym wrapper to shrink the returned tuple from 5 values to 4.

As for the TimeLimit.truncated part, that is a bit of old gym magic -- by supplying the max_episode_steps argument when registering an environment with gym, the backend wraps the environment in a TimeLimit wrapper, which then ends the episode and sets TimeLimit.truncated: True in the info as part of the return of the step function. For example,

    max_episode_steps=200,

means the episode will always end after 200 steps. I don't think we need to do anything to change that.
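Roughly what that registration looks like in old gym (the id and entry_point below are made up for illustration):

```python
from gym.envs.registration import register

# Hypothetical registration; passing max_episode_steps makes gym.make() wrap
# the environment in TimeLimit, which ends the episode after 200 steps and
# sets info["TimeLimit.truncated"] = True when the limit (rather than a
# terminal state) is the reason the episode ended.
register(
    id="HeavenHell-v0",               # made-up id
    entry_point="envs.my_env:MyEnv",  # made-up module path
    max_episode_steps=200,
)
```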

@ashok-arora
Author

> That sounds like a good idea; we can use the gym-version flag to determine whether we need to wrap the environment in a gym wrapper to shrink the returned tuple from 5 values to 4.

Okay sure, I'll send in a PR with the gym-version flag.

> As for the TimeLimit.truncated part, that is a bit of old gym magic -- by supplying the max_episode_steps argument when registering an environment with gym, the backend wraps the environment in a TimeLimit wrapper, which then ends the episode and sets TimeLimit.truncated: True in the info as part of the return of the step function. For example, max_episode_steps=200 means the episode will always end after 200 steps. I don't think we need to do anything to change that.

And what is the basis for the choice of 200 for max_episode_steps, and similarly for the values used in the other environments?

@kevslinger
Owner

In general, we want to give the agents the ability to successfully end an episode while exploring, so max_episode_steps needs to be large enough for that to be possible. However, we don't want to allow episodes to go on forever, as this may fill the replay buffer with many useless experiences. This is not a value I spent a lot of time tuning.

@ashok-arora
Author

ashok-arora commented Aug 21, 2024

Apologies for the delay; I got sidetracked due to coursework. I am working on this now, and the info dict of the gridverse envs is empty ({}) for all timesteps. Is it supposed to be so?

@kevslinger
Owner

Hi Ashok, no problem. The gym-gridverse environments have a time limit of 500; see

    env = TimeLimit(GridVerseWrapper(env), max_episode_steps=500)

So after 500 timesteps the info dict should be {"TimeLimit.truncated": True}. Let me know if you are still seeing an empty dictionary for info even after 500 timesteps, as that would be a bug we'd need to address.

@ashok-arora
Author

Hey Kevin, I tried to modify the code in two places: first in the gv_wrapper.py file and then in dtqn/agents/dqn.py. Following is the output:

1. Changes to envs/gv_wrapper.py:

       diff --git a/envs/gv_wrapper.py b/envs/gv_wrapper.py
       index df03c00..0b9a21a 100644
       --- a/envs/gv_wrapper.py
       +++ b/envs/gv_wrapper.py
       @@ -17,14 +17,24 @@ class GridVerseWrapper(gym.Wrapper):
                        )
                    ]
                )
       +        self.time = 0
        
            def reset(self, seed: Optional[int] = None):
       +        self.time = 0
                if seed is not None:
                    self._np_random, seed = seeding.np_random(seed)
                obs = self.env.reset()
                return (obs["grid"][:, :, 0] + obs["grid"][:, :, 2]).flatten()
        
            def step(self, action: Union[int, np.ndarray]):
       +        self.time += 1
                obs, reward, done, info = self.env.step(action)
                obs = (obs["grid"][:, :, 0] + obs["grid"][:, :, 2]).flatten()
       +        print(self.time, info, done)
                return obs, reward, done, info

   which gave output:

       495 {} False
       496 {} False
       497 {} False
       498 {} False
       499 {} False
       500 {} False
       [ September 01, 19:28:36 ] Eval #2 Success Rate: 0.00, Return: -25.00, Episode Length: 500.00, Hours: 0.07

2. Changes to dtqn/agents/dqn.py:

       diff --git a/dtqn/agents/dqn.py b/dtqn/agents/dqn.py
       index ca3ec34..90dd4ba 100644
       --- a/dtqn/agents/dqn.py
       +++ b/dtqn/agents/dqn.py
       @@ -87,6 +87,7 @@ class DqnAgent:
        
                # Reset the environment
                self.current_obs = np.array([self.env.reset()]).flatten().copy()
       +        self.time = 0
        
            def prepopulate(self, prepop_steps: int) -> None:
                """Prepopulate the replay buffer with `prepop_steps` of experience"""
       @@ -115,6 +116,7 @@ class DqnAgent:
                self.current_obs = obs.copy()
        
            def step(self) -> bool:
       +        self.time += 1
                """Take one step of the environment"""
                action = self.get_action(
                    np.expand_dims(self.current_obs, axis=0), epsilon=self.exp_coef.val
       @@ -122,6 +124,8 @@ class DqnAgent:
                obs, reward, done, info = self.env.step(action)
                obs = np.array([obs]).flatten().copy()
        
       +        print(self.time, info.get("TimeLimit.truncated", False), done)
                if info.get("TimeLimit.truncated", False):
                    buffer_done = False
                else:
       @@ -139,6 +143,10 @@ class DqnAgent:
                    self.current_episode += 1
                    obs = self.env.reset()
        
       +        if done or info.get("TimeLimit.truncated", False):
       +            self.time = 0
                self.current_obs = np.array([obs]).flatten().copy()

   which gave output:

       498 False False
       499 False False
       500 True True
    

It seems a bit inconsistent because the wrapper should have reflected the correct info dict in 1 and both values shouldn't have been True in 2. What do you think?

@kevslinger
Owner

Hi Ashok, very sorry I have left you hanging for so long.

It may seem surprising, but I believe this is the intended behaviour for the 2 places you put print statements. For the first example, it returns an empty dict and False because the GridverseWrapper class is the inner env for the TimeLimit wrapper:

    env = TimeLimit(GridVerseWrapper(env), max_episode_steps=500)

We can take a look at the step function of the TimeLimit class from gym 0.18:

    def step(self, action):
        assert self._elapsed_steps is not None, "Cannot call env.step() before calling reset()"
        observation, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self._max_episode_steps:
            info['TimeLimit.truncated'] = not done
            done = True
        return observation, reward, done, info

We can see it calls the inner (in this case, GridverseWrapper) step function first, which returns an empty dict and False, and only then adds the "TimeLimit.truncated" key and sets done = True.

This also explains why both are True in the 2nd example. Again, sorry for taking so long to respond. I hope this helps.
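A tiny, self-contained sketch of that ordering (toy classes standing in for the real gym/gridverse ones, just to show where the mutation happens):

```python
# Toy illustration: the inner wrapper's print never sees "TimeLimit.truncated",
# because the outer TimeLimit-style wrapper only mutates info *after* the
# inner step() has returned.

class ToyEnv:
    def step(self, action):
        return 0, 0.0, False, {}           # obs, reward, done, info

class InnerWrapper:
    def __init__(self, env):
        self.env = env
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        print("inner sees:", info, done)   # prints {} False, even on the last step
        return obs, reward, done, info

class ToyTimeLimit:
    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed = 0
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_episode_steps:
            info["TimeLimit.truncated"] = not done
            done = True
        return obs, reward, done, info     # only the caller sees the mutation

env = ToyTimeLimit(InnerWrapper(ToyEnv()), max_episode_steps=3)
for t in range(3):
    obs, reward, done, info = env.step(0)
print("agent sees:", info, done)            # {'TimeLimit.truncated': True} True
```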
