Inconsistent json file from BaiduYun and Huggingface #1

Open
cheliu-computation opened this issue Jan 5, 2024 · 3 comments

Comments

@cheliu-computation

Thanks for your impressive work and the effort on the dataset construction!

I have downloaded the dataset from BaiduYun and the JSON files from Hugging Face.
The JSON from BaiduYun lists 2,893 test cases and 32,891 train cases, but the last case in the train set is broken by an unterminated string.
The JSON from Hugging Face, however, lists 7,772 test cases and 31,086 train cases.
The paper reports 39,026 cases in total, but the BaiduYun files add up to 35,784 and the Hugging Face files to 38,858.
This mismatch makes the work quite hard to reproduce.
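The counts above can be rechecked with a short script. This is only a sketch: it assumes each split file holds one top-level JSON container with one entry per case, which the thread does not confirm, and the names `count_cases` and `summarize` are hypothetical.

```python
import json


def count_cases(path):
    """Return the number of top-level entries in a split file, or None if it does not parse."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as err:
        # A truncated file (for example an unterminated string) ends up here.
        print(f"{path}: invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None
    return len(data)


def summarize(split_paths, expected_total):
    """Count every split and return (counts, expected_total - sum of counts).

    The gap is None when any split failed to parse.
    """
    counts = {name: count_cases(path) for name, path in split_paths.items()}
    if any(n is None for n in counts.values()):
        return counts, None
    return counts, expected_total - sum(counts.values())
```

Called as `summarize({"train": "train.json", "test": "test.json"}, 39026)` (file names are placeholders), the reported Hugging Face counts would leave a gap of 168 cases and the BaiduYun counts a gap of 3,242.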

So, if we want to reproduce this work from the 20 zip files on BaiduYun, which split file should we use?
Also, I found that subfolder1.zip and subfolder2.zip are corrupted. Even after repairing them with the 'zip -F' command, I am not sure that all files were restored. I would be really grateful if the authors could fix this.
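A quick integrity check on each archive is possible with Python's standard `zipfile` module. The sketch below uses `ZipFile.testzip()`, which reads every member and returns the name of the first one that fails its CRC or header check. The helper name `check_zip` is hypothetical.

```python
import zipfile


def check_zip(path):
    """Return (ok, detail) for one archive."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the first bad member name, or None if all pass.
            bad = zf.testzip()
            if bad is not None:
                return False, f"first corrupt member: {bad}"
            return True, f"{len(zf.namelist())} members passed the CRC check"
    except zipfile.BadZipFile as err:
        # Raised when the central directory itself cannot be read.
        return False, f"not a readable zip archive: {err}"
```

Note that a passing check on a repaired archive only shows that the members still listed are intact. It cannot tell whether the repair dropped members, so comparing the member count against the JSON is still needed.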

After unzipping, there are multiple folders, but they do not match the paths in the JSON file.
Taking a path from the JSON such as "/processed_images/2534/1/CT_0", I guess '2534' is the folder that holds the image. Is that correct?

The path formats in the two sources also differ:
The JSON from Hugging Face: "image_path": [ "/remote-home/share/data200/172.16.11.200/zhengqiaoyu//processed_file/npys/32940/1/MRI_0.nii.gz", "/remote-home/share/data200/172.16.11.200/zhengqiaoyu//processed_file/npys/32940/1/MRI_1.nii.gz",
The JSON from BaiduYun: "image_path": [ "/processed_images/1/1/CT_0", "/processed_images/1/1/CT_1", "/processed_images/1/1/CT_2", "/processed_images/1/1/CT_3" ]

It seems the Hugging Face JSON points to the original medical images, while the BaiduYun JSON covers the JPGs only?
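To compare entries across the two files, both path styles could be reduced to a common key. The sketch below assumes that the trailing `<number>/<number>/<modality>_<index>` components identify the same image in both sources. That is a guess based on the two examples above and is not confirmed anywhere in this thread. The names `path_key`, `case_id` and `sub_id` are hypothetical.

```python
import re

# Trailing ".../<digits>/<digits>/<letters>_<digits>" with an optional extension
# such as ".nii.gz". The meaning of the two numeric folders is an assumption.
_KEY = re.compile(r"/(\d+)/(\d+)/([A-Za-z]+)_(\d+)(?:\.[A-Za-z0-9.]+)?$")


def path_key(image_path):
    """Return (case_id, sub_id, modality, index), or None if the path does not fit."""
    match = _KEY.search(image_path)
    if match is None:
        return None
    case_id, sub_id, modality, index = match.groups()
    return int(case_id), int(sub_id), modality, int(index)
```

For example, `path_key("/processed_images/1/1/CT_0")` gives `(1, 1, "CT", 0)`, and the Hugging Face path ending in `npys/32940/1/MRI_0.nii.gz` gives `(32940, 1, "MRI", 0)`.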

Looking forward to your response!

@cheliu-computation
Author

Additionally, the Hugging Face JSON provides an 'icd10s' field, but the BaiduYun JSON does not have it.

@qiaoyu-zheng
Owner

I'm sorry, the JSON file from BaiduYun contains some errors; please use the JSON file from Hugging Face. Because of an upload problem, there may be some small mismatches in the total number of cases, and we will check this again later. However, the existing version is already sufficient for reproduction. If you hit a matching error in the dataloader, you can simply remove those cases; this will not significantly affect the reproduction results. Again, we will fix this error later.
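Removing the unmatched cases before they reach the dataloader could look roughly like the sketch below. It is not the repository's code. It assumes the split is a list of dicts, each with an "image_path" list as in the snippets quoted earlier in this thread. `drop_unmatched` and the `resolve` callback are hypothetical names, and `resolve` has to be adapted to wherever the images were actually unpacked.

```python
import os


def drop_unmatched(cases, image_root, resolve=lambda p: p.lstrip("/")):
    """Split cases into (kept, dropped) by whether every image path exists on disk.

    resolve maps a path string from the JSON to a path relative to image_root.
    The default only strips leading slashes, which is an assumption.
    """
    kept, dropped = [], []
    for case in cases:
        paths = case.get("image_path", [])
        found = [os.path.exists(os.path.join(image_root, resolve(p))) for p in paths]
        if paths and all(found):
            kept.append(case)
        else:
            # Cases with no paths or with any missing path are set aside.
            dropped.append(case)
    return kept, dropped
```

Logging `len(dropped)` makes it possible to report how many cases were excluded when comparing results with the paper.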

@SZUHvern

Hello,

Thank you very much for your work!
I wanted to check whether the corrected data and JSON files are now available.
I am eagerly awaiting your response. Many thanks!
