Sort samples by length in `TextToEmbeddingModelPipeline` #40

avidale · 2024-09-17T08:32:21Z

When encoding texts, the pipeline (https://github.com/facebookresearch/SONAR/blob/main/sonar/inference_pipelines/text.py#L169) currently reads them in the order provided, groups them into batches, and collates each batch by padding every text to the length of the longest text in that batch. When text lengths vary widely, this can produce batches that consist mostly of pad tokens, and the computation spent on those tokens is wasted.

To avoid this waste and speed up the pipeline, we could sort the texts by length before batching them, and then return the embeddings in the original input order.
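A rough sketch of the idea, not the actual SONAR implementation: `encode_sorted_by_length`, `encode_batch`, and `count_pad_tokens` are hypothetical names, and character length stands in for token length here (the real pipeline would presumably sort on token counts after tokenization).

```python
from typing import Callable, List, Optional, Sequence


def count_pad_tokens(lengths: Sequence[int], batch_size: int) -> int:
    """Count pad positions when consecutive items are padded to the batch max."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


def encode_sorted_by_length(
    texts: Sequence[str],
    encode_batch: Callable[[List[str]], List[List[float]]],  # hypothetical encoder
    batch_size: int = 32,
) -> List[Optional[List[float]]]:
    """Batch texts of similar length together, then restore the input order."""
    # Indices of the texts, ordered from shortest to longest.
    # Assumption: len(text) is a reasonable proxy for the token count.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))

    results: List[Optional[List[float]]] = [None] * len(texts)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        batch_out = encode_batch([texts[i] for i in batch_idx])
        # Write each output back to the position of its original text.
        for i, out in zip(batch_idx, batch_out):
            results[i] = out
    return results
```

With lengths `[1, 10, 2, 9]` and a batch size of 2, `count_pad_tokens` gives 16 pad positions in the original order and 2 after sorting, which illustrates the saving the issue is after.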

avidale added the `enhancement` and `good first issue` labels · Sep 17, 2024
