From 1cef8037def6c38452ab88e7de2f5f634342ffb6 Mon Sep 17 00:00:00 2001
From: Li Bo
Date: Wed, 8 Nov 2023 01:38:08 +0000
Subject: [PATCH] Update mimicit_format.md

---
 docs/mimicit_format.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/mimicit_format.md b/docs/mimicit_format.md
index d21ab271..04bcf93e 100755
--- a/docs/mimicit_format.md
+++ b/docs/mimicit_format.md
@@ -1,5 +1,7 @@
 # Breaking Down the MIMIC-IT Format
 
+❗❗❗ We replaced the previous `images.json` with `images.parquet`. Both store multiple `key:base64` pairs, but the parquet file consumes far less CPU memory and loads faster with `pandas.DataFrame`. This lets us train on larger datasets more conveniently.
+
 We mainly use one integrate dataset format and we refer it to MIMIC-IT format since. The mimic-it format contains the following data yaml file. Within this data yaml file, you could assign the path of the instruction json file and the image parquet file, and also the number of samples you want to use. The number of samples within each group will be uniformly sampled, and the `number_samples / total_numbers`` will decide sampling ratio of each dataset.
 
 
@@ -86,4 +88,4 @@ parquet_file_path = os.path.join(
     parquet_root_path, os.path.basename(json_file_path).split(".")[0].replace("_image", "") + ".parquet"
 )
 df.to_parquet(parquet_file_path, engine="pyarrow")
-```
\ No newline at end of file
+```
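
Note on the added paragraph: the `key:base64` layout it describes can be illustrated with a minimal sketch. The helper names (`images_to_frame`, `load_image_bytes`, `save_and_reload`) and the single `base64` column name are assumptions made for illustration and are not taken from this patch; only `df.to_parquet(..., engine="pyarrow")` appears in the patched document itself.

```python
import base64
from typing import Mapping

import pandas as pd


def images_to_frame(images: Mapping[str, bytes]) -> pd.DataFrame:
    """Build a one-column frame: index = image key, 'base64' = encoded image.

    The column name 'base64' is an assumption, not confirmed by the patch.
    """
    keys = list(images.keys())
    encoded = [base64.b64encode(images[k]).decode("ascii") for k in keys]
    return pd.DataFrame({"base64": encoded}, index=keys)


def load_image_bytes(df: pd.DataFrame, key: str) -> bytes:
    """Look up one image by its key and decode it back to raw bytes."""
    return base64.b64decode(df.loc[key, "base64"])


def save_and_reload(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write the frame to parquet and read it back (requires pyarrow)."""
    df.to_parquet(path, engine="pyarrow")
    return pd.read_parquet(path, engine="pyarrow")
```

A caller would build the frame once with `images_to_frame`, persist it with `save_and_reload(df, "images.parquet")`, and then fetch individual images by key with `load_image_bytes`.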