Major Updates
- 🚀 Operators can now be reordered automatically from fastest to slowest based on their execution speed, and each operator's batch size can be set automatically according to that speed. #464
- 🚀 [UnitTest] Added a performance benchmark for efficiency tests of 4 modalities. Reports are uploaded to an internal wandb server. #483
- 💥 Added some useful OPs, including the construction of DPO training data and a lightweight user-customizable OP interface. See more details below~ #491 #492 #493
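The speed-based ordering and batch sizing from the first item above can be pictured with a small sketch. This is an illustration only, not Data-Juicer's implementation: the function names, the op names, the one-second target, and the size limits are all hypothetical, and the speeds are assumed to come from a prior probe run.

```python
def order_ops_by_speed(speeds):
    """Return op names sorted from fastest to slowest.

    `speeds` maps a (hypothetical) op name to measured samples per second.
    """
    return sorted(speeds, key=speeds.get, reverse=True)


def batch_size_for(speed, target_seconds=1.0, min_size=1, max_size=1000):
    """Pick a batch size that takes roughly `target_seconds` at `speed` samples/s.

    The result is clamped to the range [min_size, max_size].
    """
    return max(min_size, min(max_size, int(speed * target_seconds)))


# Assumed probe results (samples per second); the names are placeholders.
probe_speeds = {"op_a": 50.0, "op_b": 4000.0, "op_c": 0.5}

ordered = order_ops_by_speed(probe_speeds)
batch_sizes = {name: batch_size_for(probe_speeds[name]) for name in ordered}
```

Faster operators end up first and receive larger batches, while very slow ones fall back to the minimum batch size.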
OPs
Text OPs
- pair_preference_mapper: Mapper to construct preference answers for QA pairs. #491
Script OPs
- python_lambda_mapper: Mapper for executing customized Python lambda functions on data samples. #492
- python_file_mapper: Mapper for executing customized Python functions on data samples. #493
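As a rough picture of what a lambda-based mapper does, the sketch below turns a lambda given as a string into a function applied to each sample. It is not the Data-Juicer API: `make_lambda_mapper` and the sample layout are hypothetical, and the sketch assumes that samples are dicts and that the lambda returns a dict.

```python
import ast


def make_lambda_mapper(lambda_str):
    """Build a per-sample mapper from a string holding one lambda expression."""
    tree = ast.parse(lambda_str, mode="eval")
    # Reject anything that is not a single lambda expression.
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a single lambda expression")
    fn = eval(compile(tree, "<lambda_str>", "eval"))

    def mapper(sample):
        result = fn(sample)
        if not isinstance(result, dict):
            raise ValueError("the lambda must return a dict")
        return result

    return mapper


# Hypothetical usage: upper-case the "text" field of a sample.
to_upper = make_lambda_mapper('lambda s: {**s, "text": s["text"].upper()}')
out = to_upper({"text": "hello", "id": 1})
```

Checking the parsed expression type only guards against malformed input. Because the lambda body is still evaluated as arbitrary Python, it should come from a trusted source.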
Bugs Fixed
- Add an argument to control whether to open Monitor for data processing. It's True by default. #483
- For the mp start method of Monitor, set it to "spawn" for Windows systems and "fork" for others. #483
- Update the transformers version to >=4.47.0 to avoid the "shape mismatch" bug in the older version 4.46.3. #483
- Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504
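The start-method rule for Monitor in the list above can be sketched with the standard `multiprocessing` module. This is a minimal illustration rather than the project's code: `pick_start_method` and `monitor_context` are hypothetical helper names.

```python
import multiprocessing as mp
import platform


def pick_start_method(system=None):
    """Return "spawn" on Windows and "fork" on any other system."""
    system = system or platform.system()
    return "spawn" if system == "Windows" else "fork"


def monitor_context():
    """Return a multiprocessing context that uses the chosen start method."""
    return mp.get_context(pick_start_method())
```

Using `mp.get_context` keeps the choice local to the monitor process instead of changing the global start method for the whole program.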
Others
- Pin the PyAV version to prevent inconsistent updates. #504
- Skip some unit tests for audio OPs to avoid lazy_loader failure during multiprocessing. #503
- Remove unnecessary UNFORKABLE marks for some OPs. #491
- Refine the docker image building: add a new self-hosted runner for docker image building, optimize the building logic for auto docker image building on release, and change the default full image to a GPU-version image. #494 #501
Acknowledgment
We thank the public contributors whose PRs and issues help make Data-Juicer better!