
How to set up the camelyon17 dataset #151

Open
yuu-Wang opened this issue Jul 25, 2024 · 14 comments

Comments

@yuu-Wang
Hello, if I want to use Camelyon17, how should the dataset directory be structured? Is DomainBed/domainbed/data/camelyon17/ correct? It keeps reporting that the dataset cannot be found.

Traceback (most recent call last):
  File "/root/wangxy/AlignClip/main.py", line 155, in <module>
    main(args)
  File "/root/wangxy/AlignClip/main.py", line 44, in main
    train_iter, val_loader, test_loaders, train_class_names, template = get_dataset(args)
  File "/root/wangxy/AlignClip/engine.py", line 59, in get_dataset
    converter_domainbed.get_domainbed_datasets(dataset_name=args.data, root=args.root, targets=args.targets,
  File "/root/wangxy/AlignClip/converter_domainbed.py", line 21, in get_domainbed_datasets
    datasets = vars(dbdatasets)[dataset_name](root, targets, hparams)
  File "/root/wangxy/AlignClip/DomainBed/domainbed/datasets.py", line 347, in __init__
    dataset = Camelyon17Dataset(root_dir=root)
  File "/root/anaconda3/envs/pytorch_2.0.1/lib/python3.8/site-packages/wilds/datasets/camelyon17_dataset.py", line 64, in __init__
    self._data_dir = self.initialize_data_dir(root_dir, download)
  File "/root/anaconda3/envs/pytorch_2.0.1/lib/python3.8/site-packages/wilds/datasets/wilds_dataset.py", line 341, in initialize_data_dir
    self.download_dataset(data_dir, download)
  File "/root/anaconda3/envs/pytorch_2.0.1/lib/python3.8/site-packages/wilds/datasets/wilds_dataset.py", line 368, in download_dataset
    raise FileNotFoundError(
FileNotFoundError: The camelyon17 dataset could not be found in DomainBed/domainbed/data/camelyon17_v1.0. Initialize the dataset with download=True to download the dataset. If you are using the example script, run with --download. This might take some time for large datasets.
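The path in the error message comes from WILDS appending a versioned folder name to whatever root is passed in. A stdlib-only sketch of that resolution (the helper names are mine; the folder name and version are taken from the traceback above):

```python
# Hedged sketch: mimics how WILDS builds the data directory before raising
# FileNotFoundError. root_dir must be the *parent* of camelyon17_v1.0/.
import os

def resolve_camelyon17_dir(root_dir: str, version: str = "1.0") -> str:
    # WILDS appends a versioned folder name to the root you pass in.
    return os.path.join(root_dir, f"camelyon17_v{version}")

def dataset_present(root_dir: str) -> bool:
    # True only if <root_dir>/camelyon17_v1.0/ exists on disk.
    return os.path.isdir(resolve_camelyon17_dir(root_dir))
```

So a relative root like DomainBed/domainbed/data/ is resolved against the current working directory, which is one way the "could not be found" error can occur even when the data exists.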

@piotr-teterwak
Collaborator

Hi, unfortunately I only speak English, but it looks like you're having issues using Camelyon17 because it hasn't been downloaded?

I would run this script, with line 304 uncommented, to download the dataset: https://github.com/facebookresearch/DomainBed/blob/main/domainbed/scripts/download.py#L304
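The programmatic equivalent of uncommenting that line would look roughly like this (a hedged sketch: it assumes the `wilds` package is installed, and the wrapper function name is my own):

```python
# Hedged sketch: equivalent of uncommenting line 304 of DomainBed's
# domainbed/scripts/download.py. Assumes the `wilds` package is installed.
def download_camelyon17(data_dir: str):
    # Imported lazily so the sketch can be read without wilds installed.
    from wilds.datasets.camelyon17_dataset import Camelyon17Dataset

    # download=True fetches the patch archive (several GB) into
    # <data_dir>/camelyon17_v1.0/ on first use, then reuses it afterwards.
    return Camelyon17Dataset(root_dir=data_dir, download=True)

# Usage (downloads a large archive on first run):
# download_camelyon17("DomainBed/domainbed/data")
```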

@yuu-Wang
Author

Hello, I have already downloaded the dataset through this link, and the path is /root/wangxy/AlignClip/DomainBed/domainbed/data/camelyon17_v1.0/. However, I keep getting an error that says camelyon17 cannot be found. I'm not sure if my file naming is correct, but other datasets like PACS and OfficeHome ran successfully. Do I need to handle the Camelyon dataset separately?

@piotr-teterwak
Collaborator

Could you post here the command you use to run main.py, and the directory you run it from? To me it looks like args.root is set to DomainBed/domainbed/data/ instead of '/root/wangxy/AlignClip/DomainBed/domainbed/data/'.

@yuu-Wang
Author

main.zip
This is the main file, and these are the parameters I need to run: DomainBed/domainbed/data/ -d WILDSCamelyon --task domain_shift --targets 0 -b 36 --lr 5e-6 --epochs 10 --beta 0.5. This is the location where I placed my dataset.

@piotr-teterwak
Collaborator

Can you run with /root/wangxy/AlignClip/DomainBed/domainbed/data/ -d WILDSCamelyon --task domain_shift --targets 0 -b 36 --lr 5e-6 --epochs 10 --beta 0.5 instead? See the data path in the first parameter.

@yuu-Wang
Author

Hello, it's running now, but there's a problem. The camelyon17_v1.0 dataset contains the raw patches, organized in the format patient_00X_node_X, which is the layout it expects. However, I had already re-split the dataset into hospital0, hospital1, hospital2, hospital3, and hospital4, and now it's giving an error. Could you please explain why this is happening?

@piotr-teterwak
Collaborator

Could you post the stack trace so I can have more information?

@yuu-Wang
Author

Hello, why is the test dataset empty here? I reclassified the camelyon17 dataset and regenerated the metadata.csv file. Do I need to create separate CSV files for the test, validation, and training sets? (screenshots attached)

@piotr-teterwak
Collaborator

Hi @yuu-Wang ,

It's pretty hard to understand what exactly is going on here without more details. In order to help you, I will need a minimal reproducible example including:

  1. How you download the data.
  2. How you generate the new metadata.csv file
  3. A few lines of code, of how you load the data.

However, taking a quick look, I don't think the images need to be split into directories based on their hospital source.
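To illustrate why no per-hospital directories are needed: WILDS reads each patch's hospital assignment from the `center` column of metadata.csv rather than from the directory layout. A hedged sketch (column names are based on the Camelyon17 v1.0 metadata file; the helper name and sample rows are mine):

```python
# Hedged sketch: derive per-hospital patch counts from a metadata.csv,
# the way the domain split is driven by the `center` column rather than
# by where the image files live on disk.
import csv
import io
from collections import Counter

def patches_per_center(metadata_csv: str) -> Counter:
    # Count rows per hospital ("center") from the CSV text.
    reader = csv.DictReader(io.StringIO(metadata_csv))
    return Counter(row["center"] for row in reader)

# Hypothetical minimal metadata with the relevant columns.
sample = """patient,node,center,tumor
004,4,0,0
009,1,0,1
024,2,1,0
"""
```

Re-splitting the patches into hospital0/…/hospital4 folders breaks the correspondence between metadata.csv rows and file paths, which would explain the empty test split.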

@yuu-Wang
Author

hi,
1. The dataset I downloaded is from line 304 of download.py (the commented-out call # Camelyon17Dataset(root_dir=args.data_dir, download=True)).
2. Since I noticed that the WILDS wrapper in the file (class WILDSCamelyon(WILDSDataset)) requires four domains (hospital0, hospital1, hospital2, hospital3), I used this code (https://github.com/jameszhou-gl/gpt-4v-distribution-shift/blob/ccfcf00851ccd8867de7c6d92591eaedd8a66d0d/data/process_wilds.py#L21) to divide the downloaded dataset into those four domains.
3. I used this (https://github.com/thuml/CLIPood/blob/bc0d8745e8b0d97b0873bd8ed8589793abd1c1a7/engine.py#L53) and this (https://github.com/thuml/CLIPood/blob/bc0d8745e8b0d97b0873bd8ed8589793abd1c1a7/converter_domainbed.py#L18) to divide the dataset into training, validation, and test sets.

@piotr-teterwak
Collaborator

piotr-teterwak commented Jul 29, 2024

Steps 2 and 3 are not needed; the code takes care of this internally. Could you re-download with step 1, skip steps 2 and 3, and try again? If this does not work, could you please send a few lines of code of how exactly you are loading the dataset in your training code?

@yuu-Wang
Author

Sure, I downloaded it directly according to step one and then ran main.py
(https://github.com/thuml/CLIPood/blob/bc0d8745e8b0d97b0873bd8ed8589793abd1c1a7/main.py#L104).
Is it only this dataset that exceeds the length limit? (screenshot attached)

@piotr-teterwak
Collaborator

I see that you're running main.py from another repository, CLIPood, and not the DomainBed repository. I'm not very familiar with the CLIPood code. Could you run from an unmodified DomainBed codebase?

@yuu-Wang
Author

Thank you very much for your patience. It works now.
