1 parent 6c35ac4, commit cbfe634. Showing 8 changed files with 260 additions and 6 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
# Kernel Enhancements | ||
# In-House Features | ||
|
||
- [High Performance](./performance/README.md) | ||
- [High Availability](./availability/README.md) | ||
- [Security](./security/README.md) | ||
- [Elastic Cross-Node Parallel Query (ePQ)](./epq/README.md) | ||
- [Third-Party Extensions](./extensions/README.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Third-Party Extensions | ||
|
||
- [pgvector](./pgvector.md) <Badge type="tip" text="V11 / v1.1.35-" vertical="top" /> | ||
- [smlar](./smlar.md) <Badge type="tip" text="V11 / v1.1.28-" vertical="top" /> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
--- | ||
author: 山现 | ||
date: 2023/12/25 | ||
minute: 10 | ||
--- | ||
|
||
# pgvector | ||
|
||
<Badge type="tip" text="V11 / v1.1.35-" vertical="top" /> | ||
|
||
<ArticleInfo :frontmatter=$frontmatter></ArticleInfo> | ||
|
||
[[toc]] | ||
|
||
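## Background | ||
|
||
[`pgvector`](https://github.com/pgvector/pgvector) is an efficient vector database extension. Built on PostgreSQL's extension mechanism and implemented in C, it provides several vector data types and operations, and it efficiently stores and queries AI embeddings represented as vectors. | ||
|
||
`pgvector` supports IVFFlat indexes. An IVFFlat index partitions the vector space into a number of regions, each containing some vectors, and builds an inverted index over them to quickly find vectors similar to a given vector. IVFFlat is a simplified version of the IVFADC index. It suits scenarios that demand high recall precision but tolerate query latency on the order of 100 ms. Compared with other index types, IVFFlat offers high recall, high precision, a simple algorithm with few parameters, and a small storage footprint. | ||
|
||
The algorithm used by `pgvector` works as follows: | ||
|
||
1. Points in a high-dimensional space tend to have an implicit cluster structure, so the vectors are clustered with an algorithm such as K-Means, giving each cluster a center point | ||
2. To search, first compute the distance from the target vector to every cluster center and find the n nearest centers | ||
3. Scan all elements in the clusters that belong to those n centers, then sort them globally to obtain the k nearest vectors | ||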
|
||
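The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not `pgvector`'s C implementation: the helper names (`build_lists`, `search`), the toy data, and the use of Euclidean distance with plain K-Means are all assumptions made for the example.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centroids):
    """Put every vector into the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        buckets[nearest].append(v)
    return buckets


def build_lists(vectors, lists, iters=10, seed=0):
    """Step 1: plain K-Means. Returns (centroids, buckets)."""
    centroids = random.Random(seed).sample(vectors, lists)
    for _ in range(iters):
        buckets = assign(vectors, centroids)
        # New centroid = mean of the bucket; an empty bucket keeps its old centroid.
        centroids = [
            tuple(sum(col) / len(b) for col in zip(*b)) if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    return centroids, assign(vectors, centroids)


def search(query, centroids, buckets, probes, k):
    """Steps 2 and 3: probe the nearest `probes` clusters, then rank their members."""
    nearest_lists = sorted(
        range(len(centroids)), key=lambda i: l2(query, centroids[i])
    )[:probes]
    candidates = [v for i in nearest_lists for v in buckets[i]]
    return sorted(candidates, key=lambda v: l2(query, v))[:k]


# Hypothetical toy data: two loose groups of 2-D points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, buckets = build_lists(data, lists=2)
print(search((8.0, 8.0), centroids, buckets, probes=1, k=2))
```

Probing fewer clusters than exist is what makes the search approximate: a true neighbor that sits in an unprobed cluster is missed. Probing every cluster degenerates to an exact scan.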
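## Usage | ||
|
||
`pgvector` can search high-dimensional vectors either by sequential scan or by index scan. For index types and further parameters, see the [README](https://github.com/pgvector/pgvector/blob/master/README.md) in the extension's source code. | ||
|
||
### Install the extension | ||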
|
||
```sql:no-line-numbers | ||
CREATE EXTENSION vector; | ||
``` | ||
|
||
### Vector operations | ||
|
||
Run the following command to create a table with a vector column: | ||
|
||
```sql:no-line-numbers | ||
CREATE TABLE t (val vector(3)); | ||
``` | ||
|
||
Run the following command to insert vector data: | ||
|
||
```sql:no-line-numbers | ||
INSERT INTO t (val) VALUES ('[0,0,0]'), ('[1,2,3]'), ('[1,1,1]'), (NULL); | ||
``` | ||
|
||
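Create an IVFFlat index: | ||
|
||
1. `val vector_ip_ops` means that the index is built on column `val` with the operator class `vector_ip_ops`, which compares vectors by inner product. `pgvector` provides separate operator classes for the other metrics: `vector_l2_ops` for Euclidean distance and `vector_cosine_ops` for cosine distance | ||
2. `WITH (lists = 1)` sets the number of partitions to 1, so all vectors are assigned to the same partition. In practice, tune the number of partitions to the data size and the required query performance | ||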
|
||
```sql:no-line-numbers | ||
CREATE INDEX ON t USING ivfflat (val vector_ip_ops) WITH (lists = 1); | ||
``` | ||
|
||
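As a rough guide for choosing `lists`, the `pgvector` README suggests starting from `rows / 1000` for up to one million rows and `sqrt(rows)` beyond that, and starting the number of probed lists at `sqrt(lists)`. The sketch below only encodes that rule of thumb; the function names are hypothetical, and the results are starting points to tune, not required values.

```python
import math


def suggested_lists(rows):
    """Starting value for the IVFFlat `lists` option (rule of thumb only)."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return math.isqrt(rows)


def suggested_probes(lists):
    """Starting value for the number of lists probed per query."""
    return max(1, math.isqrt(lists))


for rows in (3, 500_000, 4_000_000):
    lists = suggested_lists(rows)
    print(rows, lists, suggested_probes(lists))
```

For the three-row table in this walkthrough the rule gives `lists = 1`, which matches the index created above.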
Query the most similar vectors: | ||
|
||
```sql:no-line-numbers | ||
=> SELECT * FROM t ORDER BY val <#> '[3,3,3]'; | ||
val | ||
--------- | ||
[1,2,3] | ||
[1,1,1] | ||
[0,0,0] | ||
(4 rows) | ||
``` | ||
|
||
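The ordering in the query above follows from how `<#>` is defined: it returns the negative inner product, so an ascending sort puts the largest inner product first. The Python sketch below reproduces that ordering for the four inserted rows. It assumes PostgreSQL's default behavior of sorting `NULL` last in ascending order, and the helper name is made up for the example.

```python
def neg_inner_product(a, b):
    """Value returned by pgvector's <#> operator: the negated inner product."""
    return -sum(x * y for x, y in zip(a, b))


rows = [(0, 0, 0), (1, 2, 3), (1, 1, 1), None]  # None stands in for the NULL row
query = (3, 3, 3)

# Ascending sort; rows without a vector go last, as NULLs do by default in PostgreSQL.
ordered = sorted(
    rows,
    key=lambda v: (v is None, neg_inner_product(v, query) if v is not None else 0),
)

for v in ordered:
    print(v, None if v is None else neg_inner_product(v, query))
```

The three vectors score -18, -9, and 0 respectively, which is why `[1,2,3]` comes first; the fourth row is the `NULL` value.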
### Uninstall the extension | ||
|
||
```sql:no-line-numbers | ||
DROP EXTENSION vector; | ||
``` | ||
|
||
## Notes | ||
|
||
- [ePQ](../epq/README.md) can scan high-dimensional vectors by sorting, but does not support querying vector types through an index |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,143 @@ | ||
--- | ||
author: 棠羽 | ||
date: 2022/10/05 | ||
minute: 10 | ||
--- | ||
|
||
# smlar | ||
|
||
<Badge type="tip" text="V11 / v1.1.28-" vertical="top" /> | ||
|
||
<ArticleInfo :frontmatter=$frontmatter></ArticleInfo> | ||
|
||
[[toc]] | ||
|
||
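## Background | ||
|
||
Computing similarity over large data sets is a key technical problem in e-commerce and search engines. Naive similarity implementations are both slow and resource-hungry. [`smlar`](https://github.com/jirutka/smlar) is an open-source third-party extension for PostgreSQL that provides functions for computing similarity efficiently inside the database, along with similarity operators that support GiST and GIN indexes. The extension currently supports all of PostgreSQL's built-in data types. | ||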
|
||
::: warning | ||
Because the `%` operator of smlar conflicts with the `%` operator of the RUM extension, smlar and RUM cannot both be created in the same schema. | ||
::: | ||
|
||
## Functions and operators | ||
|
||
- **`float4 smlar(anyarray, anyarray)`** | ||
|
||
Computes the similarity of two arrays; both arrays must have the same data type. | ||
|
||
- **`float4 smlar(anyarray, anyarray, bool useIntersect)`** | ||
|
||
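Computes the similarity of two arrays of a user-defined composite type. The `useIntersect` parameter controls whether only the overlapping elements or all elements take part in the calculation. The composite type can be defined as follows: | ||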
|
||
```sql:no-line-numbers | ||
CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4); | ||
``` | ||
|
||
- **`float4 smlar(anyarray a, anyarray b, text formula);`** | ||
|
||
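Computes the similarity of two arrays using the given formula; both arrays must have the same data type. The formula can use the following predefined variables: | ||
|
||
- `N.i`: number of elements common to both arrays (the intersection) | ||
- `N.a`: number of unique elements in the first array | ||
- `N.b`: number of unique elements in the second array | ||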
|
||
```sql:no-line-numbers | ||
SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)'); | ||
``` | ||
|
||
- **`anyarray % anyarray`** | ||
|
||
Returns `TRUE` when the similarity of the two arrays exceeds the threshold, and `FALSE` otherwise. | ||
|
||
- **`text[] tsvector2textarray(tsvector)`** | ||
|
||
Converts a `tsvector` value to an array of strings. | ||
|
||
- **`anyarray array_unique(anyarray)`** | ||
|
||
Sorts the array and removes duplicates. | ||
|
||
- **`float4 inarray(anyarray, anyelement)`** | ||
|
||
Returns `1.0` if the element is present in the array, and `0` otherwise. | ||
|
||
- **`float4 inarray(anyarray, anyelement, float4, float4)`** | ||
|
||
Returns the third argument if the element is present in the array, and the fourth argument otherwise. | ||
|
||
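To make the semantics of these helpers concrete, here is a minimal Python model of `array_unique`, both forms of `inarray`, and the `%` operator. It is a sketch under assumptions, not the extension's code: it uses the cosine formula `N.i / sqrt(N.a * N.b)` for similarity, takes the threshold as an explicit argument instead of reading `smlar.threshold`, and treats "exceeds" as a strict comparison, so the behavior when the similarity equals the threshold exactly is a guess.

```python
import math


def array_unique(arr):
    """Sort the array and drop duplicates."""
    return sorted(set(arr))


def inarray(arr, elem, found=1.0, missing=0.0):
    """Return `found` if elem is in arr, else `missing` (defaults model the 2-argument form)."""
    return found if elem in arr else missing


def cosine_similarity(a, b):
    """N.i / sqrt(N.a * N.b) over the unique elements of both arrays."""
    ua, ub = set(a), set(b)
    if not ua or not ub:
        return 0.0
    return len(ua & ub) / math.sqrt(len(ua) * len(ub))


def is_similar(a, b, threshold):
    """Model of `a % b`; strict comparison at the boundary is an assumption."""
    return cosine_similarity(a, b) > threshold


print(array_unique([3, 1, 3, 2]))
print(inarray([1, 2, 3], 2), inarray([1, 2, 3], 9, 0.5, -1.0))
print(is_similar([1, 4, 6], [5, 4, 6], threshold=0.6))
```

With two of three elements shared, the last call computes a similarity of about 0.667, so it passes a threshold of 0.6 and would fail a threshold of 0.7.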
## Configuration parameters | ||
|
||
- **`smlar.threshold FLOAT`** | ||
|
||
Similarity threshold that the `%` operator uses to decide whether two arrays are similar. | ||
|
||
- **`smlar.persistent_cache BOOL`** | ||
|
||
Whether the cache of global statistics is kept in transaction-independent memory. | ||
|
||
- **`smlar.type STRING`**: the similarity formula. The available similarity types are: | ||
|
||
- [cosine](https://en.wikipedia.org/wiki/Cosine_similarity) (default) | ||
- [tfidf](https://zh.wikipedia.org/zh-cn/Tf-idf) | ||
- [overlap](https://en.wikipedia.org/wiki/Overlap_coefficient) | ||
|
||
- **`smlar.stattable STRING`** | ||
|
||
Name of the table that stores set-wide statistics. The table is defined as follows: | ||
|
||
```sql:no-line-numbers | ||
CREATE TABLE table_name ( | ||
value data_type UNIQUE, | ||
ndoc int4 NOT NULL CHECK (ndoc > 0) -- bigint may be used instead of int4 | ||
); | ||
``` | ||
|
||
- **`smlar.tf_method STRING`**: how the term frequency (TF) is computed. Possible values: | ||
|
||
- `n`: simple count (default) | ||
- `log`: `1 + log(n)` | ||
- `const`: the frequency is always `1` | ||
|
||
- **`smlar.idf_plus_one BOOL`**: how the inverse document frequency (IDF) is computed. Possible values: | ||
|
||
- `FALSE`: `log(d / df)` (default) | ||
- `TRUE`: `log(1 + d / df)` | ||
|
||
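The TF and IDF variants listed above are small formulas, and writing them out may make the options easier to compare. The Python sketch below encodes them as stated. The function names are hypothetical, `log` is assumed to be the natural logarithm, `d` is taken to be the total number of documents and `df` the number of documents containing the term. How smlar combines the two values into a final score is not covered here.

```python
import math


def term_frequency(n, method="n"):
    """TF for a term that occurs n times, per smlar.tf_method."""
    if method == "n":
        return float(n)           # simple count (default)
    if method == "log":
        return 1.0 + math.log(n)  # dampened count
    if method == "const":
        return 1.0                # presence only
    raise ValueError(f"unknown tf_method: {method}")


def inverse_document_frequency(d, df, plus_one=False):
    """IDF for a term found in df of d documents, per smlar.idf_plus_one."""
    ratio = d / df
    return math.log(1.0 + ratio) if plus_one else math.log(ratio)


for method in ("n", "log", "const"):
    print(method, term_frequency(4, method))
print(inverse_document_frequency(1000, 10), inverse_document_frequency(1000, 10, True))
```

One visible difference between the IDF variants: a term that appears in every document gets an IDF of 0 with the default formula, while the plus-one form still gives it a positive weight of `log(2)`.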
## Basic usage | ||
|
||
### Install the extension | ||
|
||
```sql:no-line-numbers | ||
CREATE EXTENSION smlar; | ||
``` | ||
|
||
### Similarity calculation | ||
|
||
Use the functions described above to compute the similarity of two arrays: | ||
|
||
```sql | ||
SELECT smlar('{3,2}'::int[], '{3,2,1}'); | ||
smlar | ||
---------- | ||
0.816497 | ||
(1 row) | ||
|
||
SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)' ); | ||
smlar | ||
---------- | ||
0.333333 | ||
(1 row) | ||
``` | ||
|
||
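Both results above can be checked by hand from the `N.i`, `N.a`, and `N.b` counts. The Python sketch below does that arithmetic. It assumes the first call, which passes no formula, uses the default cosine type in the form `N.i / sqrt(N.a * N.b)`; that assumption is consistent with the output shown. The `counts` helper is made up for the example.

```python
import math


def counts(a, b):
    """The predefined formula variables for two arrays."""
    ua, ub = set(a), set(b)
    return {"N.i": len(ua & ub), "N.a": len(ua), "N.b": len(ub)}


# smlar('{3,2}', '{3,2,1}'): N.i = 2, N.a = 2, N.b = 3
n1 = counts([3, 2], [3, 2, 1])
cosine = n1["N.i"] / math.sqrt(n1["N.a"] * n1["N.b"])

# smlar('{1,4,6}', '{5,4,6}', 'N.i / (N.a + N.b)'): N.i = 2, N.a = 3, N.b = 3
n2 = counts([1, 4, 6], [5, 4, 6])
custom = n2["N.i"] / (n2["N.a"] + n2["N.b"])

print(f"{cosine:.6f}")  # 0.816497
print(f"{custom:.6f}")  # 0.333333
```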
### Uninstall the extension | ||
|
||
```sql:no-line-numbers | ||
DROP EXTENSION smlar; | ||
``` | ||
|
||
## Principles and design | ||
|
||
[GitHub - jirutka/smlar](https://github.com/jirutka/smlar) | ||
|
||
[PGCon 2012 - Finding Similar: Effective similarity search in database](https://www.pgcon.org/2012/schedule/track/Hacking/443.en.html) ([slides](https://www.pgcon.org/2012/schedule/attachments/252_smlar-2012.pdf)) |