Created chinese readme.md #64

Closed
wants to merge 1 commit into from
223 changes: 223 additions & 0 deletions README_Chinese.md
@@ -0,0 +1,223 @@
# Awesome Semantic Search [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) [![Conventional Commits](https://img.shields.io/badge/常规%20提交-1.0.0-yellow.svg)](https://conventionalcommits.org)

<img src ="logo.svg" />

Logo made by [@createdbytango](https://instagram.com/createdbytango).

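**Looking for more papers to add. PS: open a pull request (PR).**

This repository aims to serve as a meta-repository for tasks related to [semantic search](https://en.wikipedia.org/wiki/Semantic_search) and [semantic similarity](http://nlpprogress.com/english/semantic_textual_similarity.html).

Semantic search is not limited to text! It can also be applied to images, speech, and more. Semantic search has many different use cases and applications.

Pull requests (PRs) are welcome on this repository!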

## Table of Contents

- [Papers](#papers)
  - [2010](#2010)
  - [2014](#2014)
  - [2015](#2015)
  - [2016](#2016)
  - [2017](#2017)
  - [2018](#2018)
  - [2019](#2019)
  - [2020](#2020)
  - [2021](#2021)
  - [2022](#2022)
  - [2023](#2023)
- [Articles](#articles)
- [Libraries and Tools](#libraries-and-tools)
- [Datasets](#datasets)
- [Milestones](#milestones)

## Papers

### 2010
- [Priority Range Trees](https://arxiv.org/abs/1009.3527)

### 2014
- [A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf) 📄

### 2015
- [Skip-Thought Vectors](https://arxiv.org/pdf/1506.06726.pdf) 📄
- [Practical and Optimal LSH for Angular Distance](https://proceedings.neurips.cc/paper/2015/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html)

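### 2016
- [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759) 📄
- [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) 📄
- [Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs](https://arxiv.org/abs/1603.09320)
- [On Approximately Searching for Similar Word Embeddings](https://www.aclweb.org/anthology/P16-1214.pdf)
- [Learning Distributed Representations of Sentences from Unlabelled Data](https://arxiv.org/abs/1602.03483) 📄
- [Approximate Nearest Neighbor Search on High Dimensional Data --- Experiments, Analyses, and Improvement](https://arxiv.org/abs/1610.02455)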

### 2017
- [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://research.fb.com/wp-content/uploads/2017/09/emnlp2017.pdf) 📄
- [Semantic Textual Similarity for Hindi](https://www.semanticscholar.org/paper/Semantic-Textual-Similarity-For-Hindi-Mujadia-Mamidi/372f615ce36d7543512b8e40d6de51d17f316e0b) 📄
- [Efficient Natural Language Response Suggestion for Smart Reply](https://arxiv.org/abs/1705.00652) 📃

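### 2018
- [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) 📄
- [Learning Semantic Textual Similarity from Conversations](https://arxiv.org/pdf/1804.07754.pdf) 📄
- [Google AI Blog: Advances in Semantic Textual Similarity](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html) 📄
- [Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech](https://arxiv.org/abs/1803.08976) 🔊
- [Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data](https://arxiv.org/abs/1810.07355) 🔊
- [Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph](http://www.vldb.org/pvldb/vol12/p461-fu.pdf)
- [The Case for Learned Index Structures](https://dl.acm.org/doi/10.1145/3183713.3196909)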

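### 2019
- [LASER: Language-Agnostic Sentence Representations](https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/) 📄
- [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375) 📄
- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/pdf/1908.10084.pdf) 📄
- [Multi-Stage Document Ranking with BERT](https://arxiv.org/abs/1910.14424) 📄
- [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://arxiv.org/abs/1906.00300)
- [End-to-End Open-Domain Question Answering with BERTserini](https://www.aclweb.org/anthology/N19-4013/)
- [BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining](https://arxiv.org/abs/1901.08746) 📄
- [Analyzing and Improving Representations with the Soft Nearest Neighbor Loss](https://arxiv.org/pdf/1902.01889.pdf) 📷
- [DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)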

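### 2020
- [Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset: Preliminary Thoughts and Lessons Learned](https://arxiv.org/abs/2004.05125) 📄
- [Passage Re-ranking with BERT](https://arxiv.org/pdf/1901.04085.pdf) 📄
- [CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization](https://arxiv.org/pdf/2006.09595.pdf) 📄
- [LaBSE: Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852) 📄
- [Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset](https://arxiv.org/abs/2007.07846) 📄
- [DeText: A Deep NLP Framework for Intelligent Text Understanding](https://engineering.linkedin.com/blog/2020/open-sourcing-detext) 📄
- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/pdf/2004.09813.pdf) 📄
- [Pretrained Transformers for Text Ranking: BERT and Beyond](https://arxiv.org/abs/2010.06467) 📄
- [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)
- [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB) 📄
- [Improving Deep Learning for Airbnb Search](https://arxiv.org/pdf/2002.05515)
- [Managing Diversity in Airbnb Search](https://arxiv.org/abs/2004.02621) 📄
- [Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval](https://arxiv.org/abs/2007.00808v1) 📄
- [Unsupervised Image Style Embeddings for Retrieval and Recognition Tasks](https://openaccess.thecvf.com/content_WACV_2020/papers/Gairola_Unsupervised_Image_Style_Embeddings_for_Retrieval_and_Recognition_Tasks_WACV_2020_paper.pdf) 📷
- [DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations](https://arxiv.org/abs/2006.03659) 📄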

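### 2021
- [Hybrid Approach for Semantic Similarity Calculation between Tamil Words](https://www.researchgate.net/publication/350112163_Hybrid_approach_for_semantic_similarity_calculation_between_Tamil_words) 📄
- [Augmented SBERT](https://arxiv.org/pdf/2010.08240.pdf) 📄
- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) 📄
- [Compatibility-aware Heterogeneous Visual Search](https://arxiv.org/abs/2105.06047) 📷
- [Learning Personal Style from Few Examples](https://chuanenlin.com/personalstyle) 📷
- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) 📄
- [A Survey of Transformers](https://arxiv.org/abs/2106.04554) 📄📷
- [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://dl.acm.org/doi/10.1145/3404835.3463098) 📄
- [High Quality Related Search Query Suggestions using Deep Reinforcement Learning](https://arxiv.org/abs/2108.04452v1)
- [Embedding-based Product Retrieval in Taobao Search](https://arxiv.org/pdf/2106.09297.pdf) 📄📷
- [TPRM: A Topic-based Personalized Ranking Model for Web Search](https://arxiv.org/abs/2108.06014) 📄
- [mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset](https://arxiv.org/abs/2108.13897) 📄
- [Database Reasoning Over Text](https://aclanthology.org/2021.acl-long.241.pdf) 📄
- [How Does Adversarial Fine-Tuning Benefit BERT?](https://arxiv.org/abs/2108.13602) 📄
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409) 📄
- [Primer: Searching for Efficient Transformers for Language Modeling](https://arxiv.org/abs/2109.08668) 📄
- [How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings](https://arxiv.org/pdf/2109.10179.pdf) 🔊
- [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821#) 📄
- [Compositional Attention: Disentangling Search and Retrieval](https://arxiv.org/abs/2110.09419) 📄📷
- [SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search](https://arxiv.org/abs/2111.08566)
- [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577) 📄
- [Generative Search Engines: Initial Experiments](https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_50.pdf) 📷
- [Rethinking Search: Making Domain Experts out of Dilettantes](https://dl.acm.org/doi/10.1145/3476415.3476428)
- [WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach](https://arxiv.org/abs/2104.01767)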


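### 2022
- [Text and Code Embeddings by Contrastive Pre-Training](https://arxiv.org/abs/2201.10005) 📄
- [RELiC: Retrieving Evidence for Literary Claims](https://arxiv.org/abs/2203.10053) 📄
- [Trans-Encoder: Unsupervised Sentence-pair Modelling through Self- and Mutual-distillations](https://arxiv.org/abs/2109.13059) 📄
- [SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation](https://arxiv.org/abs/2205.08180) 🔊
- [An Analysis of Fusion Functions for Hybrid Retrieval](https://arxiv.org/abs/2210.11934) 📄
- [Out-of-Distribution Detection with Deep Nearest Neighbors](https://arxiv.org/abs/2204.06507)
- [ESB: A Benchmark for Multi-Domain End-to-End Speech Recognition](https://arxiv.org/abs/2210.13352) 🔊
- [Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models](https://arxiv.org/pdf/2210.16043.pdf) 🔊
- [Rethinking with Retrieval: Faithful Large Language Model Inference](https://arxiv.org/abs/2301.00303) 📄
- [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/pdf/2212.10496.pdf) 📄
- [Transformer Memory as a Differentiable Search Index](https://arxiv.org/abs/2202.06991) 📄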

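### 2023
- [FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search](https://dl.acm.org/doi/10.1145/3543507.3583318) 📄
- ["Low-Resource" Text Classification: A Parameter-Free Classification Method with Compressors](https://aclanthology.org/2023.findings-acl.426/) 📄
- [SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval](https://dl.acm.org/doi/pdf/10.1145/3539618.3592065) 📄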

## Articles
- [Tackling Semantic Search](https://adityamalte.substack.com/p/tackle-semantic-search/)
- [Semantic Search in Azure Cognitive Search](https://docs.microsoft.com/en-us/azure/search/semantic-search-overview)
- [How We Used Semantic Search to Make Our Search 10x Smarter](https://zilliz.com/blog/How-we-used-semantic-search-to-make-our-search-10-x-smarter/)
- [Stanford AI Blog: Building Scalable, Explainable, and Adaptive NLP Models with Retrieval](https://ai.stanford.edu/blog/retrieval-based-NLP/)
- [Building a Semantic Search Engine with Dual Space Word Embeddings](https://m.mage.ai/building-a-semantic-search-engine-with-dual-space-word-embeddings-f5a596eb6d90)
- [Billion-scale Semantic Similarity Search with FAISS+SBERT](https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2)
- [Some Observations about Similarity Search Thresholds](https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html)
- [Near Duplicate Image Search with Locality Sensitive Hashing](https://keras.io/examples/vision/near_dup_search/)
- [Free Course on Vector Similarity Search and Faiss](https://link.medium.com/HtFoFKlKvkb)
- [Comprehensive Guide to Approximate Nearest Neighbors Algorithms](https://link.medium.com/V62Z8drvEkb)
- [Introducing the Hybrid Index to Enable Keyword-aware Semantic Search](https://www.pinecone.io/learn/hybrid-search/?utm_medium=email&_hsmi=0&_hsenc=p2ANqtz--zLu9hiyh-y_XTa7FCEpi8JESJKmif5dhpYtAxTWka8PIttaTOGE21LMZlg9EOZyPYpCm6GDvYy57tlGRwH6TjgLCsJg&utm_content=231741722&utm_source=hs_email)
- [Argilla Semantic Search](https://docs.argilla.io/en/latest/guides/features/semantic-search.html)
- [Co:here's Multilingual Text Understanding Model](https://txt.cohere.ai/multilingual/)
- [Simplify Search with Multilingual Embedding Models](https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/)

## Libraries and Tools
- [fastText](https://fasttext.cc/)
- [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4)
- [SBERT](https://www.sbert.net/)
- [ELECTRA](https://github.com/google-research/electra)
- [LaBSE](https://tfhub.dev/google/LaBSE/2)
- [LASER](https://github.com/facebookresearch/LASER)
- [Relevance AI - Vector Platform from Experimentation to Deployment](https://relevance.ai)
- [Haystack](https://github.com/deepset-ai/haystack/)
- [Jina.AI](https://jina.ai/)
- [pinecone](https://www.pinecone.io/)
- [SentEval Toolkit](https://github.com/facebookresearch/SentEval?utm_source=catalyzex.com)
- [ranx](https://github.com/AmenRa/ranx)
- [BEIR: Benchmarking IR](https://github.com/UKPLab/beir)
- [RELiC: Retrieving Evidence for Literary Claims dataset](https://relic.cs.umass.edu/)
- [matchzoo-py](https://github.com/NTMC-Community/MatchZoo-py)
- [deep_text_matching](https://github.com/wangle1218/deep_text_matching)
- [Which Frame?](http://whichframe.com/)
- [lexica.art](https://lexica.art/)
- [emoji semantic search](https://github.com/lilianweng/emoji-semantic-search)
- [PySerini](https://github.com/castorini/pyserini)
- [BERTSerini](https://github.com/rsvp-ai/bertserini)
- [BERTSimilarity](https://github.com/Brokenwind/BertSimilarity)
- [milvus](https://www.milvus.io/)
- [NeuroNLP++](https://plusplus.neuronlp.fruitflybrain.org/)
- [weaviate](https://github.com/semi-technologies/weaviate)
- [Semantic Search through Wikipedia with Weaviate](https://github.com/semi-technologies/semantic-search-through-wikipedia-with-weaviate)
- [Natural Language YouTube Search](https://github.com/haltakov/natural-language-youtube-search)
- [same.energy](https://www.same.energy/about)
- [ann benchmarks](http://ann-benchmarks.com/)
- [scaNN](https://github.com/google-research/google-research/tree/master/scann)
- [REALM](https://github.com/google-research/language/tree/master/language/realm)
- [annoy](https://github.com/spotify/annoy)
- [pynndescent](https://github.com/lmcinnes/pynndescent)
- [nsg](https://github.com/ZJULearning/nsg)
- [FALCONN](https://github.com/FALCONN-LIB/FALCONN)
- [redis HNSW](https://github.com/zhao-lang/redis_hnsw)
- [autofaiss](https://github.com/criteo/autofaiss)
- [DPR](https://github.com/facebookresearch/DPR)
- [rank_BM25](https://github.com/dorianbrown/rank_bm25)
- [nearPy](http://pixelogik.github.io/NearPy/)
- [vearch](https://github.com/vearch/vearch)
- [vespa](https://github.com/vespa-engine/vespa)
- [PyNNDescent](https://github.com/lmcinnes/pynndescent)
- [pgANN](https://github.com/netrasys/pgANN)
- [Tensorflow Similarity](https://github.com/tensorflow/similarity)
- [opensemanticsearch.org](https://www.opensemanticsearch.org/)
- [GPT3 Semantic Search](https://gpt3demo.com/category/semantic-search)
- [searchy](https://github.com/lubianat/searchy)
- [txtai](https://github.com/neuml/txtai)
- [HyperTag](https://github.com/Ravn-Tech/HyperTag)
- [vectorai](https://github.com/vector-ai/vectorai)
- [embeddinghub](https://github.com/featureform/embeddinghub)
- [AquilaDb](https://github.com/Aquila-Network/AquilaDB)
- [STripNet](https://github.com/stephenleo/stripnet)

## Datasets
- [Semantic Text Similarity Dataset Hub](https://github.com/brmson/dataset-sts)
- [Facebook AI Image Similarity Challenge](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/?fbclid=IwAR31vRV0EdxRdrxtPy12neZtBJQ0H9qdLHm8Wl2DjHY09PtQdn1nEEIJVUo)
- [WIT : Wikipedia-based Image Text Dataset](https://github.com/google-research-datasets/wit)
- [BEIR](https://github.com/beir-cellar/beir)
- MTEB

## Milestones

See the [Project Board](https://github.com/Agrover112/awesome-semantic-search/projects/1) for a list of tasks if you would like to contribute to any of the open issues.