update for embedding

wangwei1237 · Oct 27, 2023 · 31ab4e6 · 31ab4e6
1 parent d5a60d8
commit 31ab4e6
Show file tree

Hide file tree

Showing 8 changed files with 131 additions and 3 deletions.
diff --git a/_quarto.yml b/_quarto.yml
@@ -43,7 +43,7 @@ book:
     - references.qmd
   appendices: 
     - glossary.qmd
-    - install_langchain.qmd
+    - langchain_install.qmd
     - milvus_install.qmd
   site-url: https://wangwei1237.github.io/LLM_in_Action/
   navbar:
@@ -74,4 +74,4 @@ format:
 comments:
   utterances:
     repo: wangwei1237/LLM_in_Action
-    label: comment
+    label: comment
diff --git a/code/test_embedding_visualization.py b/code/test_embedding_visualization.py
@@ -0,0 +1,37 @@
+#encoding: utf-8
+
+"""
+@discribe: demo for the embedding visualization.
+@author: [email protected]
+"""
+
+import matplotlib.pyplot as plt
+import numpy as np
+from pymilvus import connections
+from pymilvus import Collection
+from sklearn.manifold import TSNE
+
+connections.connect(
+  host='127.0.0.1',
+  port='8081'
+)  # <1>
+
+collection = Collection("LangChainCollection") # <2>
+
+res = collection.query(
+  expr = "pk >= 0",
+  offset = 0,
+  limit = 500, 
+  output_fields = ["vector", "text", "source", "title"],
+) # <3>
+
+vector_list = [i["vector"] for i in res] # <4>
+
+matrix = np.array(vector_list) # <5>
+
+tsne = TSNE(n_components=2, perplexity=15, random_state=42, init='random', learning_rate=200)
+vis_dims = tsne.fit_transform(matrix) # <6>
+
+plt.scatter(vis_dims[:, 0], vis_dims[:, 1]) # <7>
+plt.title("embedding visualized using t-SNE")
+plt.show()
diff --git a/embedding.qmd b/embedding.qmd
@@ -1,3 +1,65 @@
+---
+filters:
+   - include-code-files
+---
+
 # Embedding
 
 在机器学习和自然语言处理中，embedding 是指将高维度的数据（例如文字、图片、音频）映射到低维度空间的过程。embedding 向量通常是一个由实数构成的向量，它将输入的数据表示成一个连续的数值空间中的点。简单来说，embedding 就是一个N维的实值向量，它几乎可以用来表示任何事情，如文本、音乐、视频等。
+
+对数据进行 embedding 的目的在于保留数据的内容或者其含义的各个特征。和不相关的数据相比，相似数据的 embedding 的大小和方向更接近，因此可以用于表述文本的相关性。
+
+Embedding 的应用场景：
+
+* 搜索：根据与查询字符串的相关性对结果进行排序
+* 聚类：对数据按相似性分组
+* 推荐：推荐具有相关内容的数据项
+* 分类：对数据按其最相似的标签进行分类
+* 异常检测：识别出相关性很小的异常值
+
+Embedding 是一个浮点数类型的向量或列表。可以用向量之间的距离来测量它们的相关性：距离越小，表示相关性越高；距离越大，相关性越低。
+
+## 获取 Embedding
+可以根据 [Embedding-V1 API 文档](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/alj562vvu) 的介绍，来获取基于百度文心大模型的字符串 Embedding。
+
+还可以使用 @lst-langchain_embed_query_wx 的方式来获取相同的基于文心大模型的 Embedding。
+
+```python
+embeddings = QianfanEmbeddingsEndpoint()
+query_result = embeddings.embed_query("你是谁？")
+```
+
+```bash
+[0.02949424833059311, -0.054236963391304016, -0.01735987327992916, 
+ 0.06794580817222595, -0.00020318820315878838, 0.04264984279870987, 
+ -0.0661700889468193, ……
+ ……]
+```
+
+## 可视化
+Embedding 一般是一种高维数据，为了将这种高维数据可视化，我们可以使用 t-SNE [-@tsne_online] 算法将数据进行降维，然后再做可视化处理。
+
+利用 @lst-langchain_milvus_embedding 对文档进行向量化，然后将向量数据存储于 Milvus 向量数据库中（默认采用的 Collection 为 LangChainCollection）。
+
+可以通过 Milvus 提供的 HTTP API 来查看指定的 Collection 的结构：
+
+```bash
+http://{{MILVUS_URI}}/v1/vector/collections/describe?collectionName=LangChainCollection
+```
+
+![Milvus向量数据的结构](./images/lc_milvus_coll_demo.jpg){#fig-lc_milvus_coll_demo}
+
+```{#lst-embedding_visualization .python include="./code/test_embedding_visualization.py" code-line-numbers="true" lst-cap="向量数据可视化"}
+```
+
+1. 初始化 Milvus 链接
+2. 选择 LangChainCollection
+3. 从 LangChainCollection 中检索特定数据
+4. 只提取结果中的 vector 字段，并生成新的列表
+5. 将 python 列表转换成矩阵
+6. 对向量数据进行降维
+7. 对低维数据进行可视化
+
+结果如 @fig-vfe 所示：
+
+![向量数据可视化结果](./images/vfe.png){#fig-vfe}
diff --git a/glossary.qmd b/glossary.qmd
@@ -12,10 +12,16 @@ LM(Language Model)：语言模型
 
 LLM(Large Language Model)：大语言模型
 
+NSFW(Not Safe for Work)：用于提醒内容不适合公开场合浏览
+
 Prompt：提示词
 
 Prompt Engineering：提示词工程
 
 RAG(Retrieval Augmented Generation)：检索式增强生成
 
 RATG(Retrieval Augmented Text Generation)：检索增强式文本生成
+
+SD(Stable Diffusion)：
+
+t-SNE(t-Distributed, Stochastic neighbor Embedding)：T分布和随机近邻嵌入
diff --git a/images/lc_milvus_coll_demo.jpg b/images/lc_milvus_coll_demo.jpg
diff --git a/images/vfe.png b/images/vfe.png
diff --git a/langchain_retrieval.qmd b/langchain_retrieval.qmd
@@ -224,6 +224,22 @@ vector_db = Milvus.from_documents(
 http://127.0.0.1:8081/v1/vector/collections/describe?collectionName=LangChainCollection
 ```
 
+:::{.callout-note title="修改 Collection 名称"}
+为了方便使用，可以使用 `collection_name` 参数以实现将不同的专有数据源存储在不同的 Collection。
+
+```python
+vector_db = Milvus.from_documents(
+    texts,
+    QianfanEmbeddingsEndpoint(),
+    connection_args={"host": "127.0.0.1", "port": "8081"},
+    collection_name="test", # <1>
+)
+```
+
+1. 设置数据存储的 Collection，类似于在关系数据库中，将数据存储在不同的表中。
+
+:::
+
 :::{.callout-warning}
 使用千帆进行 Embedding 时，每次 Embedding 的 token 是有长度限制的，目前的最大限制是 384 个 token。因此，我们在使用 `RecursiveCharacterTextSplitter` 进行文档拆分的时候要特别注意拆分后文档的长度。
 

diff --git a/references.bib b/references.bib
@@ -145,4 +145,11 @@ @online{yao2022react_online
   author={Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan},
   year={2022},
   url={https://react-lm.github.io/}
-}
+}
+
+@online{tsne_online,
+  title={Introduction to t-SNE},
+  author={Abid Ali Awan},
+  year={2023},
+  url={https://www.datacamp.com/tutorial/introduction-t-sne}
+}