Getting Nan loss when training dlrm with Kaggle Criteo dataset #363
Comments
It seems like running |
What happens when you run the test and bench script as shown in the documentation? |
Hi, I also get NaN when I run it in DLRCs with TorchRec. Did you solve it? I found that there are some -inf values in the Kaggle Criteo dataset. I'm not sure if the torch team handled it. |
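A check that only looks for NaN will not flag -inf entries, so it can help to count each kind of non-finite value separately. The following is a minimal sketch, assuming the preprocessed dense features are available as a NumPy float array; the helper name `count_non_finite` and the sample array are hypothetical and not part of TorchRec.

```python
import numpy as np

def count_non_finite(arr):
    """Return counts of NaN, +inf and -inf entries in a float array."""
    return {
        "nan": int(np.isnan(arr).sum()),
        "posinf": int(np.isposinf(arr).sum()),
        "neginf": int(np.isneginf(arr).sum()),
    }

# Hypothetical stand-in for a preprocessed dense-feature array.
sample = np.array([[0.0, 1.5], [-np.inf, 2.0]], dtype=np.float32)
counts = count_non_finite(sample)
print(counts)
```

An array that passes an `np.isnan` check can still contain -inf, which is enough to turn the loss into NaN once it reaches the model.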
I think it is one preprocessing operation in the script that is causing the problem. I ended up using some custom preprocessing steps instead of torchrec.datasets.scripts.npy_preproc_criteo. |
I'm also trying to do that. If you still have that script, would you mind sharing it with me? Thank you very much for responding. |
Sorry, I'm not working on this now so I didn't keep a copy of the code. I remember I used part of the torchrec.datasets.scripts.npy_preproc_criteo code to decode the text to values and got a bunch of numpy files, and then normalized the dense values. Hope this helps! |
It's ok. Thank you very much. |
The original script simply adds 3 to each dense value before taking the log. As a result, a value of -3 in the data becomes log(0) = -inf during preprocessing. This problem was mentioned in the issue facebookresearch/dlrm#363 (comment). I changed the preprocessing operation to dense_np -= (dense_np.min() - 2) in the tsv_to_npys function, which shifts the smallest value to 2 before the log, and the Criteo Kaggle dataset is then handled correctly.
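The effect of the fixed offset versus the minimum-based shift can be reproduced on a tiny array. This is only a sketch under assumptions: it assumes the log is applied directly after the offset, the two function names are hypothetical, and the sample values are made up rather than taken from the Criteo data.

```python
import numpy as np

# Hypothetical dense feature values; -3 is the problematic one.
dense = np.array([-3, 0, 5, 100], dtype=np.int32)

def log_with_fixed_offset(x):
    # Behaviour described in the comment above: add 3, then take the log.
    # errstate silences the divide-by-zero warning raised by log(0).
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log((x + 3).astype(np.float32))

def log_with_min_shift(x):
    # Proposed change: shift so the smallest value becomes 2, then take the log.
    shifted = x - (x.min() - 2)
    return np.log(shifted.astype(np.float32))

before = log_with_fixed_offset(dense)
after = log_with_min_shift(dense)
print(before)
print(after)
```

With the fixed offset, the -3 entry maps to -inf; with the shift, every entry is at least log(2), so the output stays finite. Note that `x.min()` here is the minimum over the whole array, matching the `dense_np.min()` expression quoted above.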
Hello,
I'm running some training with the Kaggle Criteo dataset, and here is the command I ran:
The model hyperparameters I chose follow this example script. I'm getting NaN results for some iterations. The preprocessed dataset does not contain NaN values, and I have tried 0.1, 0.01, and 0.001 for the starting learning rate, but I always get NaN results. Is there something I'm doing wrong here? What might be causing this issue?
Thanks!