
Policy Gradient, SAC doesn't learn #65

Open
Ling01234 opened this issue Apr 6, 2023 · 3 comments

Comments

@Ling01234

Ling01234 commented Apr 6, 2023

Hi! I have a few more questions about the code that I don't quite get.

First, I was wondering what pybullet_envs is for. I installed the library but got errors when I tried to import it, and I also don't see where it's being used.

Second, I was getting really bad scores when I ran the code. I cloned the code from your repo and changed a few things. The first thing I changed is the environment: I switched it to env = gym.make("InvertedPendulum-v4"), and as a result I also changed the following: obs, _ = env.reset() and obs_, reward, done, *_ = env.step(action). Finally, I commented out the lines in sac_torch.py where reparameterize=True is used, since I ran into some NaN tensors when calling rsample().
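For reference, here is roughly how my modified loop looks (a rough sketch; the environment name and the reset/step unpacking are exactly the changes described above, while the surrounding loop structure and the agent method names are just how I assume the repo's main script is organized):

```python
import gym

# InvertedPendulum-v4 is the MuJoCo environment I swapped in for the pybullet one
env = gym.make("InvertedPendulum-v4")

obs, _ = env.reset()                # new API: reset() returns (obs, info)
done = False
score = 0
while not done:
    action = agent.choose_action(obs)            # agent built from the repo's sac_torch.py
    # new API: step() returns 5 values; *_ swallows truncated and info here
    obs_, reward, done, *_ = env.step(action)
    score += reward
    agent.remember(obs, action, reward, obs_, done)
    agent.learn()
    obs = obs_
```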

That's all I've changed, and when I run the code, the score actually decreases (oddly enough). It starts at roughly 10, like a random agent, and drops to 3 or 4 after 250 episodes.

Would you have any idea of why this is happening? It would be so greatly appreciated!

Thanks a lot for your time

@philtabor
Owner

pybullet_envs was originally used for the test environment, InvertedPendulumBulletEnv-v0, but unfortunately pybullet hasn't updated its code to be compliant with the new gym API. Hence the errors when you try to import it.

I'm dealing with this problem in my Academy right now (and, spoiler, I'm writing a deep RL framework that I'll release a 0.1.dev build of very soon) and will be able to address these particular issues, and more, in the coming days.

But, to get you started, you want to make sure that you actually read the "truncated" boolean flag returned by env.step(). The reason is that the done flag doesn't flip to True when max_steps is reached; the truncated flag takes care of that. So your while loop should be while not (done or truncated), so that you don't get an infinite loop.
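Something like this, as a rough sketch (the agent method names are the ones from my main script; exact details in your local copy may differ):

```python
obs, _ = env.reset()
done, truncated = False, False
score = 0
while not (done or truncated):
    action = agent.choose_action(obs)
    # unpack all five return values so truncation at max_steps ends the episode
    obs_, reward, done, truncated, info = env.step(action)
    score += reward
    agent.remember(obs, action, reward, obs_, done)
    agent.learn()
    obs = obs_
```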

As far as learning issues, I'll have to come back and update. I'm validating the initial commit of my framework, and will test SAC today.

@Ling01234
Author

Thank you so much for your answer!

Please keep me updated on the learning issues whenever you have time to test it, it'd be greatly appreciated. I hope you have a good day!

@roefer

roefer commented Nov 14, 2024

In ReinforcementLearning/PolicyGradient/SAC/tf2/networks.py:94, you basically compute the following:

tf.math.log(1 - tf.math.pow(tf.math.tanh(actions) * self.max_action, 2) + self.noise)

By multiplying by self.max_action (which is 3, at least with MuJoCo), you increase the likelihood that the argument of tf.math.log() is negative. If that is the case, tf.math.log() returns NaN.
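One possible fix, as a sketch (variable names are taken from the snippet above; the surrounding lines of networks.py are my assumption): apply the tanh correction to the unscaled tanh(actions) and multiply by self.max_action only when forming the action itself, so the argument of tf.math.log() stays strictly positive:

```python
# squash first, scale separately
squashed = tf.math.tanh(actions)
action = squashed * self.max_action

# log_prob of the unsquashed sample, as in the surrounding code (assumed)
log_probs = probabilities.log_prob(actions)
# change-of-variables correction on the unscaled tanh: 1 - tanh(u)^2 lies in [0, 1],
# so adding self.noise keeps the log argument positive
log_probs -= tf.math.log(1.0 - tf.math.pow(squashed, 2) + self.noise)
log_probs = tf.math.reduce_sum(log_probs, axis=1, keepdims=True)
```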
