Release v1.0.1

Latest

BeachWang released this 06 Dec 09:09

· 7 commits to main since this release

9f1b0c8

Major Updates

🚀 Support automatically ordering operators from fastest to slowest based on their execution speed, and automatically adapting each operator's batch size to that speed. #464
🚀 [UnitTest] Add a performance benchmark for efficiency tests of 4 modalities. Reports are uploaded to an internal wandb server. #483
💥 Add several useful OPs, including one for constructing DPO training data and a lightweight, user-customizable OP interface. See the OPs section below for details. #491 #492 #493
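
As a rough illustration of the first item, the sketch below orders operators by a measured throughput and derives a batch size from it. It is not Data-Juicer's implementation: `probe_speed`, `order_by_speed`, `suggest_batch_size`, and the one-second target per batch are hypothetical names and values chosen for this example, and the sketch ignores whether the operators can safely be reordered.

```python
import time


def probe_speed(op, probe_samples):
    """Measure samples processed per second for `op` on a small probe set."""
    start = time.perf_counter()
    for sample in probe_samples:
        op(sample)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very fast operators.
    return len(probe_samples) / max(elapsed, 1e-9)


def order_by_speed(speeds):
    """Return operator names sorted from fastest to slowest.

    `speeds` maps an operator name to its throughput in samples per second.
    """
    return sorted(speeds, key=lambda name: speeds[name], reverse=True)


def suggest_batch_size(speed, target_seconds=1.0, min_size=1, max_size=1000):
    """Pick a batch size so that one batch takes roughly `target_seconds`.

    The bounds and the target are assumptions made for this sketch.
    """
    return max(min_size, min(max_size, int(speed * target_seconds)))


if __name__ == "__main__":
    # Hypothetical operators: plain callables applied to a sample dict.
    ops = {
        "strip_text": lambda s: s["text"].strip(),
        "count_words": lambda s: len(s["text"].split()),
    }
    probe = [{"text": "  hello world  "}] * 100
    speeds = {name: probe_speed(fn, probe) for name, fn in ops.items()}
    for name in order_by_speed(speeds):
        print(name, suggest_batch_size(speeds[name]))
```

Keeping the measurement (`probe_speed`) separate from the ordering and batch-size decisions makes the decision logic easy to test with fixed numbers.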

OPs

Text OPs

pair_preference_mapper: Mapper to construct preference answers for QA pairs. #491

Script OPs

python_lambda_mapper: Mapper for executing customized Python lambda functions on data samples. #492
python_file_mapper: Mapper for executing customized Python functions on data samples. #493
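
To make the idea of a lambda-based mapper concrete, here is a minimal standalone sketch of turning a user-supplied lambda string into a function and applying it to samples. It does not use or reproduce the Data-Juicer API: `build_lambda` and `apply_mapper` are hypothetical helpers, and the assumption that a sample is a dict mapped to another dict is made only for this example.

```python
import ast


def build_lambda(lambda_str):
    """Compile `lambda_str` only if it is a single one-argument lambda.

    Note: this checks the shape of the expression, it is not a sandbox.
    The lambda body can still run arbitrary Python.
    """
    tree = ast.parse(lambda_str.strip(), mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a single lambda expression")
    if len(tree.body.args.args) != 1:
        raise ValueError("the lambda must take exactly one argument")
    return eval(compile(tree, "<lambda_str>", "eval"))


def apply_mapper(fn, samples):
    """Apply `fn` to a copy of each sample and collect the results."""
    results = []
    for sample in samples:
        mapped = fn(dict(sample))
        if not isinstance(mapped, dict):
            raise TypeError("the mapper must return a dict")
        results.append(mapped)
    return results


if __name__ == "__main__":
    fn = build_lambda("lambda d: {**d, 'text': d['text'].strip().lower()}")
    print(apply_mapper(fn, [{"text": "  Hello "}]))
```

Validating the parsed expression before compiling it rejects strings that are not lambdas early, with a clear error message.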

Bugs Fixed

Add an argument that controls whether the Monitor is enabled during data processing. It is True by default. #483
Set the multiprocessing start method of the Monitor to "spawn" on Windows and "fork" on other systems. #483
Update the required transformers version to >=4.47.0 to avoid the "shape mismatch" bug in the older version 4.46.3. #483
Fix logic errors in Turbo acceleration and batch processing, and make map and filter consistent in this part of the logic. #504
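
The platform-dependent start method mentioned above can be sketched with the standard library alone. This is a generic illustration, not the Monitor's actual code: `pick_start_method` and `square` are hypothetical names.

```python
import multiprocessing as mp
import platform


def pick_start_method():
    """Return "spawn" on Windows (where "fork" is unavailable), else "fork"."""
    return "spawn" if platform.system() == "Windows" else "fork"


def square(x):
    # Defined at module level so it can be pickled under "spawn".
    return x * x


if __name__ == "__main__":
    # A context keeps the choice local instead of changing the global default.
    ctx = mp.get_context(pick_start_method())
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # prints [1, 4, 9]
```

Using `mp.get_context` rather than `mp.set_start_method` avoids affecting other code in the same process that relies on the default start method.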

Others

Pin the PyAV version to prevent inconsistent updates. #504
Skip some unit tests for audio OPs to avoid lazy_loader failures during multiprocessing. #503
Remove unnecessary UNFORKABLE marks for some OPs. #491
Refine the Docker image build: add a new self-hosted runner for building images, optimize the build logic for automatic image builds on release, and change the default full image to a GPU-version image. #494 #501

Acknowledgment

We thank the public contributors whose PRs and issues help make Data-Juicer better!
