Uploading and converting a local webdataset with MDSWriter, as in the code below, produces huge temp files: the webdataset is 800 GB, and the temp files stored in AppData/Local/Temp grow to 1.8 TB, eventually crashing the upload.
Code
import webdataset as wds
from streaming import MDSWriter
from tqdm import tqdm

file = r"file:d:/Datasets/shards50m/{00000..04999}.tar"
# Decode every image to a PIL Image and pair it with its caption.
dataset = wds.WebDataset(file).decode("pil").to_tuple("jpg", "txt")
data_dir = "s3://50m/mds/"
columns = {
    'image': 'pil',
    'caption': 'str'
}
with MDSWriter(out=data_dir, columns=columns, progress_bar=True) as out:
    try:
        for sample in tqdm(dataset):
            try:
                if len(sample) != 2:
                    print("Skipping sample, missing 'txt' or 'jpg'.")
                    continue
                img, caption = sample
                sample = {
                    'image': img,
                    'caption': caption,
                }
                out.write(sample)
            except Exception as e:
                # A single bad sample should not abort the whole conversion.
                print(f"Error processing sample: {e}")
    except Exception as e:
        # Raised while iterating the dataset itself (for example, an unreadable shard).
        print(f"Error processing sample: {e}")
Thank you @MaxxP0 for reporting the issue! By "800gb", do you mean the size of the .tar dataset or of the uncompressed raw webdataset?
It may be an OS thing where temp files are handled differently. But first, can we check that keep_local is actually False? When keep_local=False, a local file is removed once upload_file is done.
To debug this, can you add some printouts so we can first make sure that keep_local=False and that the local removal is actually happening?
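One way to get such printouts without touching the library is to measure the temp directory from the conversion script. The following is only a minimal sketch: `dir_size_bytes` and `report_temp_usage` are hypothetical helper names, and it assumes the shard files end up under the directory returned by `tempfile.gettempdir()` (typically AppData/Local/Temp on Windows), which should be verified on the affected machine.

```python
import os
import tempfile


def dir_size_bytes(path):
    """Return the total size in bytes of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file can disappear between listing and stat,
                # for example when it is removed after its upload.
                pass
    return total


def report_temp_usage(path=None):
    """Print and return the current size of the temp directory."""
    path = path or tempfile.gettempdir()
    size = dir_size_bytes(path)
    print(f"{path}: {size / 1e9:.2f} GB")
    return size
```

Calling `report_temp_usage()` every few thousand samples inside the write loop would show whether the directory shrinks as uploads finish or only ever grows.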
Thank you for the quick response. Yes, the webdataset has 4999 tar files with a total size of 800 GB. Yes, I have keep_local=False, and I was able to reduce the problem by saving the bytes directly instead of decoding them to PIL Images and then storing them as PIL Images using the MDSWriter. The temp directory keeps growing until it reaches about 300 GB, after which it slowly shrinks as files are uploaded. So thankfully this doesn't fill up my disk completely and the code doesn't crash. But it would probably be useful to have something in place that limits temp buildup to a certain extent, if this is not already happening. Thanks a lot for your help.
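For anyone hitting the same thing, the workaround described above could look roughly like the sketch below. It is not the exact code that was run: `to_mds_sample` and `convert` are hypothetical names, and it assumes that skipping `.decode("pil")` makes webdataset yield the raw file contents as `bytes`, and that the MDS `'bytes'` column type stores them unchanged. The likely reason it helps is that a decoded PIL Image is written as uncompressed pixel data, which is far larger than the original JPEG.

```python
def to_mds_sample(jpg_bytes, txt_bytes):
    """Build an MDS sample that keeps the image as its original encoded bytes."""
    return {
        'image': jpg_bytes,                    # still JPEG-compressed
        'caption': txt_bytes.decode('utf-8'),  # the caption arrives as raw bytes too
    }


def convert(shards, out_dir):
    # Imported lazily so the helper above works without these packages installed.
    import webdataset as wds
    from streaming import MDSWriter

    columns = {'image': 'bytes', 'caption': 'str'}
    # No .decode("pil") here, so each field stays as undecoded bytes.
    dataset = wds.WebDataset(shards).to_tuple("jpg", "txt")
    with MDSWriter(out=out_dir, columns=columns) as out:
        for jpg_bytes, txt_bytes in dataset:
            out.write(to_mds_sample(jpg_bytes, txt_bytes))
```

The trade-off is that the image has to be decoded at read time, for example with `PIL.Image.open(io.BytesIO(sample['image']))`.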