
Policy Gradient, SAC doesn't learn #65

Open
Ling01234 opened this issue Apr 6, 2023 · 3 comments

Comments

@Ling01234

Ling01234 commented Apr 6, 2023

Hi! I have a few more questions about the code that I don't quite get.

First, I was wondering what pybullet_envs is for. I installed the library but got errors when I tried to import it, and I also don't see where it's being used.

Second, I was getting really bad scores when I ran the code. I cloned the code from your repo and changed a few things. The first thing I changed is the environment: I switched it to env = gym.make("InvertedPendulum-v4"), and as a result I also changed the following: obs, _ = env.reset() and obs_, reward, done, *_ = env.step(action). Finally, I commented out the lines in sac_torch.py where reparameterize=True is used, since I ran into some NaN tensors when calling rsample().
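For reference, here is roughly how my modified loop looks (a rough sketch; the environment name and the reset/step unpacking are exactly the changes described above, while the surrounding loop structure and the agent method names are just how I assume the repo's main script is organized):

```python
import gym

# InvertedPendulum-v4 is the MuJoCo environment I swapped in for the pybullet one
env = gym.make("InvertedPendulum-v4")

obs, _ = env.reset()                # new API: reset() returns (obs, info)
done = False
score = 0
while not done:
    action = agent.choose_action(obs)            # agent built from the repo's sac_torch.py
    # new API: step() returns 5 values; *_ swallows truncated and info here
    obs_, reward, done, *_ = env.step(action)
    score += reward
    agent.remember(obs, action, reward, obs_, done)
    agent.learn()
    obs = obs_
```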

That's all I've changed, and when I run the code, the score actually decreases (oddly enough). It starts at roughly 10, like a random agent, and drops to 3 or 4 after 250 episodes.

Would you have any idea of why this is happening? It would be so greatly appreciated!

Thanks a lot for your time

@philtabor
Owner

pybullet_envs was originally used for the test environment, InvertedPendulumBulletEnv-v0, but unfortunately pybullet hasn't updated its code to be compliant with the new gym API. Hence the errors when you try to import it.

I'm dealing with this problem in my Academy right now (and, spoiler, I'm writing a deep RL framework that I'll release a 0.1.dev build of very soon) and will be able to address these particular issues, and more, in the coming days.

But, to get you started, you want to make sure that you actually read the "truncated" boolean flag returned by env.step(). The reason is that the done flag doesn't flip to True when max_steps is reached; the truncated flag takes care of that. So your while loop should be while not (done or truncated), so that you don't get an infinite loop.
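Something like this, as a rough sketch (the agent method names are the ones from my main script; exact details in your local copy may differ):

```python
obs, _ = env.reset()
done, truncated = False, False
score = 0
while not (done or truncated):
    action = agent.choose_action(obs)
    # unpack all five return values so truncation at max_steps ends the episode
    obs_, reward, done, truncated, info = env.step(action)
    score += reward
    agent.remember(obs, action, reward, obs_, done)
    agent.learn()
    obs = obs_
```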

As far as learning issues, I'll have to come back and update. I'm validating the initial commit of my framework, and will test SAC today.

@Ling01234
Author

Thank you so much for your answer!

Please keep me updated on the learning issues whenever you have time to test it, it'd be greatly appreciated. I hope you have a good day!

@roefer

roefer commented Nov 14, 2024

In ReinforcementLearning/PolicyGradient/SAC/tf2/networks.py:94, you basically compute the following:

tf.math.log(1 - tf.math.pow(tf.math.tanh(actions) * self.max_action, 2) + self.noise)

By multiplying by self.max_action (which is 3, at least with MuJoCo), you increase the likelihood that the argument of tf.math.log() is negative. If that is the case, tf.math.log() returns NaN.
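One possible fix, as a sketch (variable names are taken from the snippet above; the surrounding lines of networks.py are my assumption): apply the tanh correction to the unscaled tanh(actions) and multiply by self.max_action only when forming the action itself, so the argument of tf.math.log() stays strictly positive:

```python
# squash first, scale separately
squashed = tf.math.tanh(actions)
action = squashed * self.max_action

# log_prob of the unsquashed sample, as in the surrounding code (assumed)
log_probs = probabilities.log_prob(actions)
# change-of-variables correction on the unscaled tanh: 1 - tanh(u)^2 lies in [0, 1],
# so adding self.noise keeps the log argument positive
log_probs -= tf.math.log(1.0 - tf.math.pow(squashed, 2) + self.noise)
log_probs = tf.math.reduce_sum(log_probs, axis=1, keepdims=True)
```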
