
Release v1.0.1

@BeachWang BeachWang released this 06 Dec 09:09
· 7 commits to main since this release
9f1b0c8

Major Updates

  • 🚀 Operators can now be reordered automatically from fastest to slowest based on their execution speed, and each operator's batch size can be set automatically according to that speed. #464
  • 🚀 [UnitTest] Added a performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to an internal wandb server. #483
  • 💥 Added several useful OPs, including one for constructing DPO training data and a lightweight, user-customizable OP interface. See the OPs section below for details. #491 #492 #493
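
The speed-based ordering and batch sizing described in the first item can be pictured with a small, self-contained sketch. It is not Data-Juicer's implementation: the helper names (`probe_seconds_per_sample`, `order_fastest_first`, `batch_size_for`) and the `target_seconds` and `max_batch` parameters are hypothetical, and it assumes each operator is a plain callable on a sample dict whose cost can be estimated on a small probe set.

```python
import time
from typing import Callable, Dict, List

Sample = Dict[str, str]
Op = Callable[[Sample], Sample]


def probe_seconds_per_sample(op: Op, probe: List[Sample]) -> float:
    """Time one op on a small probe set; return mean seconds per sample."""
    start = time.perf_counter()
    for sample in probe:
        op(dict(sample))  # copy so the probe set itself is not mutated
    return (time.perf_counter() - start) / max(len(probe), 1)


def order_fastest_first(ops: List[Op], probe: List[Sample]) -> List[Op]:
    """Return the ops sorted from fastest to slowest on the probe set."""
    return sorted(ops, key=lambda op: probe_seconds_per_sample(op, probe))


def batch_size_for(seconds_per_sample: float,
                   target_seconds: float = 1.0,
                   max_batch: int = 1000) -> int:
    """Pick a batch size so one batch takes roughly target_seconds (hypothetical rule)."""
    if seconds_per_sample <= 0:
        return max_batch
    return max(1, min(max_batch, int(target_seconds / seconds_per_sample)))
```

Under these assumptions, slower operators end up later in the pipeline and receive smaller batches, so a cheap operator that discards samples runs before an expensive one sees them.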

OPs

Text OPs

  • pair_preference_mapper: Mapper to construct preference answers for QA pairs. #491

Script OPs

  • python_lambda_mapper: Mapper for executing customized Python lambda functions on data samples. #492
  • python_file_mapper: Mapper for executing customized Python functions on data samples. #493
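
To illustrate the idea behind a lambda-based mapper, the sketch below parses a lambda given as a string, checks its shape, and applies it to each sample. It is a generic, standard-library-only illustration rather than the OP's actual code; `compile_lambda`, `apply_mapper`, and the rule that the lambda takes one sample dict and returns a dict are assumptions made for this example.

```python
import ast
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]


def compile_lambda(lambda_str: str) -> Callable[[Sample], Sample]:
    """Parse a string, verify it is a one-argument lambda, and compile it."""
    tree = ast.parse(lambda_str, mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a lambda expression")
    if len(tree.body.args.args) != 1:
        raise ValueError("the lambda must take exactly one argument")
    # Evaluating the compiled expression yields the lambda function object.
    return eval(compile(tree, "<lambda_str>", "eval"))


def apply_mapper(fn: Callable[[Sample], Sample], samples: List[Sample]) -> List[Sample]:
    """Apply the function to every sample and require a dict result."""
    out = []
    for sample in samples:
        result = fn(sample)
        if not isinstance(result, dict):
            raise TypeError("the lambda must return a dict")
        out.append(result)
    return out


strip_text = compile_lambda("lambda d: {**d, 'text': d['text'].strip()}")
cleaned = apply_mapper(strip_text, [{"text": "  hello  "}])
```

Checking the parsed tree before evaluating rejects strings that are not a single lambda, but it does not sandbox the lambda body, so such strings should come only from trusted configuration.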

Bugs Fixed

  • Add an argument to control whether the Monitor is enabled during data processing. It is True by default. #483
  • Set the multiprocessing start method of the Monitor to "spawn" on Windows and "fork" on other systems. #483
  • Update the required transformers version to >=4.47.0 to avoid the "shape mismatch" bug in the older version 4.46.3. #483
  • Fix logic errors in Turbo acceleration and batch processing, and make map and filter consistent in this part of the logic. #504
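
The platform-dependent start method for the Monitor can be expressed with the standard `multiprocessing` and `platform` modules. This is a minimal sketch of the rule stated above, not the project's code; `pick_start_method` and `monitor_context` are hypothetical names, and the optional `system` argument exists only so the choice can be checked without running on each platform.

```python
import multiprocessing
import platform


def pick_start_method(system: str = "") -> str:
    """Return "spawn" on Windows and "fork" on any other system."""
    system = system or platform.system()
    return "spawn" if system == "Windows" else "fork"


def monitor_context(system: str = ""):
    """Build a multiprocessing context that uses the chosen start method."""
    return multiprocessing.get_context(pick_start_method(system))
```

Using `get_context` keeps the choice local to the caller instead of changing the process-wide default with `set_start_method`.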

Others

  • Pin the PyAV version to prevent inconsistent updates. #504
  • Skip some unit tests for audio OPs to avoid lazy_loader failures during multiprocessing. #503
  • Remove unnecessary UNFORKABLE marks for some OPs. #491
  • Refine the docker image build: add a new self-hosted runner for building docker images, optimize the logic for automatically building docker images on release, and change the default full image to a GPU-version image. #494 #501

Acknowledgment

We thank the public contributors whose PRs and issues make Data-Juicer better!