Releases: FlagOpen/FlagEmbedding
v1.3.4
What's Changed
- Inference docstring by @ZiyiXia in #1186
- delete useless parameters for embedder classes by @hanhainebula in #1189
- Bug of BGE M3 training by @baochi0212 in #1183
- feat:add bce-embedding-base_v1 by @zhudongwork in #1198
- Docstring by @ZiyiXia in #1200
- Update AbsDataset.py by @jhyeom1545 in #1204
- Fix bugs by @hanhainebula in #1211
- fixed a bug in AbsReranker.py for mps device support by @Swgj in #1216
- Fix bugs by @hanhainebula in #1219
- update stop pool by @545999961 in #1221
- update mteb eval by @545999961 in #1227
- update adjust batch size by @545999961 in #1229
- update mteb eval by @545999961 in #1230
- fix bugs and refactor code by @hanhainebula in #1231
- update mteb eval by @545999961 in #1235
- release training data for bge-multilingual-gemma2 by @hanhainebula in #1245
- add missed trust_remote_code for finetune code by @hanhainebula in #1248
- fix DecoderOnlyEmbedderICLSameDatasetTrainDataset category index error by @billvsme in #1232
- Clean code by @hanhainebula in #1250
- Fix bugs by @hanhainebula in #1253
- update examples by @545999961 in #1254
- update examples by @545999961 in #1255
- Fix air-bench eval bugs: AIRBenchEvalArgs by @hanhainebula in #1256
- Fix air-bench eval bugs: AIRBenchEvalArgs by @hanhainebula in #1257
- update code and README for scripts by @hanhainebula in #1258
- update examples by @545999961 in #1261
- update `C_MTEB` reference by @emmanuel-ferdman in #1296
- [Bugfix] Typehint error on py38 by @DrDavidS in #1300
- Update model_mapping.py by @pengjunfeng11 in #1311
- fix bugs for embedder finetune by @hanhainebula in #1328
- fix a bug in icl/dataset.py by @hanhainebula in #1330
- Fix bugs by @hanhainebula in #1340
- fix beir data_loader.py: dev -> validation by @hanhainebula in #1341
- update embedder finetune code by @hanhainebula in #1342
- Fix Bug: OOM by @545999961 in #1349
- fix transformers 4.48.0 by @Hypothesis-Z in #1343
- Fix a bug in beir evaluation and release v1.3.4 by @hanhainebula in #1359
- del dp code by @hanhainebula in #1360
- support musa backend in FlagEmbedding by @qiyulei-mt in #1350
- docs: fix link to https://bge-model.com/ within NEWS section by @bufferoverflow in #1355
- fix/reranking tutorial typos by @rendyfebry in #1313
New Contributors
- @baochi0212 made their first contribution in #1183
- @zhudongwork made their first contribution in #1198
- @jhyeom1545 made their first contribution in #1204
- @Swgj made their first contribution in #1216
- @billvsme made their first contribution in #1232
- @emmanuel-ferdman made their first contribution in #1296
- @DrDavidS made their first contribution in #1300
- @pengjunfeng11 made their first contribution in #1311
- @Hypothesis-Z made their first contribution in #1343
- @qiyulei-mt made their first contribution in #1350
- @bufferoverflow made their first contribution in #1355
- @rendyfebry made their first contribution in #1313
Full Changelog: v1.3.2-BGE-Update...v1.3.4
1.3.2
We have completely updated the BGE code repository, including the following key improvements:
Inference Code
- Added `FlagAutoModel` and `FlagAutoReranker`, making it easier to use the models.
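A minimal usage sketch, assuming the `FlagAutoModel.from_finetuned` entry point and a BGE checkpoint such as `BAAI/bge-base-en-v1.5` (the reranker side works analogously through `FlagAutoReranker.from_finetuned`):

```python
from FlagEmbedding import FlagAutoModel

# FlagAutoModel resolves the right embedder class for the given checkpoint.
model = FlagAutoModel.from_finetuned(
    "BAAI/bge-base-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    use_fp16=True,  # halves memory at a small precision cost
)

queries = ["What is BGE?"]
passages = ["BGE is a family of embedding models released by BAAI."]

q_emb = model.encode_queries(queries)   # the instruction is prepended to queries
p_emb = model.encode_corpus(passages)
print(q_emb @ p_emb.T)                  # similarity scores (embeddings are normalized)
```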
Inference Optimization
- Implemented multi-GPU support.
- Introduced dynamic batch sizing to prevent out-of-memory (OOM) issues.
- Optimized padding to improve efficiency.
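A hedged sketch of how these optimizations surface at the call site; the `devices` argument and the exact dynamic-batching behavior are assumptions based on the new embedder interface, not a documented contract:

```python
from FlagEmbedding import FlagAutoModel

corpus = [f"passage {i}: FlagEmbedding now supports multi-GPU encoding." for i in range(1000)]

# Multi-GPU inference: pass a list of devices (assumed `devices` parameter);
# encoding work is distributed across them.
model = FlagAutoModel.from_finetuned(
    "BAAI/bge-base-en-v1.5",
    devices=["cuda:0", "cuda:1"],
)

# `batch_size` acts as an upper bound: with dynamic batch sizing the library
# can shrink the effective batch when long inputs would otherwise cause OOM.
embeddings = model.encode_corpus(corpus, batch_size=256, max_length=512)
print(embeddings.shape)
```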
Evaluation Code
- Integrated support for common evaluation datasets to enhance user convenience.
- Provided a custom evaluation interface: datasets organized according to the specified data standard can be evaluated directly, simplifying the evaluation process.
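Purely as illustration of what a data organization standard of this kind looks like — the file names and JSON fields below are assumptions, not the documented layout; consult the evaluation README for the actual standard:

```python
import json
import pathlib

# Hypothetical layout: one JSONL file each for the corpus, the queries, and
# the relevance judgments. Names and fields are illustrative assumptions only.
data_dir = pathlib.Path("my_eval_dataset")
data_dir.mkdir(exist_ok=True)

(data_dir / "corpus.jsonl").write_text(
    json.dumps({"id": "doc-0", "text": "BGE is an embedding model series from BAAI."}) + "\n"
)
(data_dir / "queries.jsonl").write_text(
    json.dumps({"id": "q-0", "text": "what is bge"}) + "\n"
)
(data_dir / "qrels.jsonl").write_text(
    json.dumps({"qid": "q-0", "docid": "doc-0", "relevance": 1}) + "\n"
)
```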
Project Structure Organization
- Reorganized the project to streamline processes related to inference, fine-tuning, and evaluation.
Release BGE-M3 and Activation Beacon
BGE-M3
A new member of the BGE model series! BGE-M3 stands for Multi-Linguality, Multi-Granularity (input length up to 8192), and Multi-Functionality (unification of dense, lexical, and multi-vector retrieval). It is the first embedding model that supports all three retrieval methods.
For more details, please refer to the Technical Report and Code.
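A minimal sketch of requesting all three representations at once, assuming the `BGEM3FlagModel` wrapper and the `BAAI/bge-m3` checkpoint:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

output = model.encode(
    ["BGE-M3 unifies dense, lexical, and multi-vector retrieval."],
    return_dense=True,         # one vector per sentence (dense retrieval)
    return_sparse=True,        # token-to-weight map (lexical matching)
    return_colbert_vecs=True,  # per-token vectors (multi-vector scoring)
)

print(output["dense_vecs"].shape)       # dense embeddings
print(output["lexical_weights"][0])     # sparse lexical weights
print(output["colbert_vecs"][0].shape)  # multi-vector representation
```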
Activation Beacon
An effective, efficient, compatible, and low-cost (to train) method to extend the context length of LLMs by up to 100x. We extend the context length of Llama-2-chat-7b from 4K to 400K.
For more details, please refer to the paper and code.
Feedback is welcome.
Release LM-Cocktail
LM-Cocktail
Merge language models (e.g., Llama, BGE) to improve the general ability of models.
This method can be used to:
- Mitigate the problem of catastrophic forgetting
- Improve performance on new tasks without fine-tuning
- Approximate multi-task learning or model ensembling
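A minimal sketch of weighted model merging, assuming the `mix_models` entry point of the `LM_Cocktail` package; the model paths and weights are illustrative:

```python
from LM_Cocktail import mix_models

# Merge a base embedding model with a fine-tuned variant by weighted
# parameter averaging; the weights are illustrative and should sum to 1.
model = mix_models(
    model_names_or_paths=["BAAI/bge-base-en-v1.5", "./my-finetuned-bge"],
    model_type="encoder",         # "decoder" would target Llama-style models
    weights=[0.5, 0.5],
    output_path="./mixed_model",  # merged weights are saved here
)
```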
FlagEmbedding 1.1.2
Created the first release (#131).
FlagEmbedding
- Updated embedding models `bge-*-v1.5`:
  - alleviate the issue of the similarity distribution
  - the new models can perform retrieval without an instruction; using an instruction is still recommended, as it can yield better performance
- New models `bge-reranker-*`: cross-encoders that can rerank the top-k retrieved results
- Normalization is now specified in the sentence-transformers configuration, thanks to @skirres. Users no longer need to set `normalize_embeddings=True` manually when using sentence-transformers.
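A minimal reranking sketch, assuming the `FlagReranker` class and the `BAAI/bge-reranker-large` checkpoint:

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# A cross-encoder scores each (query, passage) pair jointly;
# a higher score means the passage is more relevant to the query.
pairs = [
    ["what is a panda?", "The giant panda is a bear species endemic to China."],
    ["what is a panda?", "pandas is a Python library for data analysis."],
]
scores = reranker.compute_score(pairs)
print(scores)  # sort the top-k candidates by these scores to rerank
```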
C-MTEB
- Added two cross-lingual reranking tasks: T2RerankingZh2En and T2RerankingEn2Zh.
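A hedged sketch of evaluating on the new tasks via the `mteb` package; that importing `C_MTEB` registers these tasks with `mteb`, and the model choice, are assumptions:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

import C_MTEB  # assumed to register the Chinese/cross-lingual tasks with mteb

model = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # illustrative model choice

# Evaluate on the two new cross-lingual reranking tasks.
evaluation = MTEB(tasks=["T2RerankingZh2En", "T2RerankingEn2Zh"])
evaluation.run(model, output_folder="results/bge-base-zh-v1.5")
```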