Commit e716b1a (parent 118d902): Add articles on Content Defined Chunking in English and Chinese. 2 changed files, 266 additions, 0 deletions.
---
title: "Content Defined Chunking (CDC)"
description: "Under the hood of CDC"
date: "2023-05-24" # The date the post was first published.
category: "拿来主义" # [Optional] The category of the post.
tags: # [Optional] The tags of the post.
- "Storage"
- "CDC"
---
> This article was created with the assistance of AI.

# Why Use Chunking?

By chunking data, we isolate changes. When a file is modified, only the modified chunks need to be updated.
## How to Chunk?

### Fixed-Length Chunking

Suppose we split a file into fixed-length chunks. If the file content is `abcdefg` and the chunk size is four bytes, chunking yields `abcd|efg`. Now add a single character at the beginning, making the content `0abcdefg`: the chunks become `0abc|defg`, and both differ completely from the previous ones. When syncing the modification to a network file system, both chunks must be re-uploaded.
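The boundary-shift problem is easy to reproduce. Below is a minimal sketch (the helper `chunk_fixed` is ours, not from any library) showing that a one-byte insertion at the front changes every fixed-length chunk:

```cpp
#include <string>
#include <vector>

// Split `data` into consecutive chunks of at most `size` bytes.
std::vector<std::string> chunk_fixed(const std::string& data, size_t size) {
    std::vector<std::string> chunks;
    for (size_t i = 0; i < data.size(); i += size)
        chunks.push_back(data.substr(i, size));
    return chunks;
}
```

`chunk_fixed("abcdefg", 4)` yields `{"abcd", "efg"}`, while `chunk_fixed("0abcdefg", 4)` yields `{"0abc", "defg"}`: the two chunk sets share nothing, so both chunks of the modified file must be uploaded.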
### Content-Defined Variable-Length Chunking

If chunk boundaries are instead derived from the content, say by cutting after every `d`, the chunks become `0abcd|efg`. Only one chunk differs from the previous set, so only that chunk needs to be re-uploaded, a significant improvement over fixed-length chunking.
#### Problem

With extremely low probability, this scheme produces many short chunks: `dddd`, cut on `d`, becomes `d|d|d|d`. If the file content is `dd...d`, it is chunked into `d|d|...|d`, each chunk holding a single character, which wastes space, produces far too many chunks to manage, and defeats the purpose of chunking. Clearly we cannot always use the same fixed content as the breakpoint.

To address this, we need a way to select breakpoints that behaves like a random choice, so that chunks have a predictable average size, while each breakpoint is still derived from a property of the content itself.
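One standard trick, sketched here as an illustration of the idea rather than any particular CDC implementation: hash a small window of recent bytes at every position and cut wherever the low bits of the hash are all zero. If the hash behaves randomly, a cut occurs with probability \( 2^{-b} \) per byte, giving an average chunk size of about \( 2^b \) bytes regardless of content. The hash below is a toy multiplicative hash, not the rolling Rabin fingerprint discussed later:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy content-defined chunker: cut when the low `bits` bits of a hash of the
// recent bytes are zero, but never before `min_len` bytes have accumulated.
std::vector<std::string> chunk_cdc(const std::string& data,
                                   size_t min_len, unsigned bits) {
    const uint32_t mask = (1u << bits) - 1;
    std::vector<std::string> chunks;
    uint32_t h = 0;
    size_t start = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        h = h * 31 + (unsigned char)data[i];  // toy hash, not a true rolling hash
        if (i + 1 - start >= min_len && (h & mask) == 0) {
            chunks.push_back(data.substr(start, i + 1 - start));
            start = i + 1;
            h = 0;
        }
    }
    if (start < data.size())
        chunks.push_back(data.substr(start));  // final partial chunk
    return chunks;
}
```

The `min_len` guard is what rules out the degenerate `d|d|d|d` case: no chunk can be shorter than the minimum, whatever the content.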
---

# Hashing

Hashing maps an input of any length to a fixed-length output, with the following properties:

- Fast in the forward direction
- Difficult to reverse
- Sensitive to input changes
- Avoids collisions
- Can be computed in a rolling fashion (the property we need here)
### Objective: Optimizing String Matching with Hashing

Strings can be matched by comparing hash values: given a pattern of length \( n \), take each length-\( n \) substring of the text, compute its hash, and compare it with the pattern's hash. With high probability, a hash match means the substring equals the pattern. But hashing every substring from scratch is no better than brute-force matching, so an optimization is needed.

A rolling hash solves this with a sliding window of length \( n \): the hash of each new window is obtained from the previous one by removing the contribution of the outgoing character and adding the contribution of the incoming character, greatly reducing the computation.
#### Rolling Hash Formula

Let the substring in the window before sliding be \( s_{i...i+n} \), and let \( M \) be the modulus (a prime here; later, an irreducible polynomial). The hash is:

\[
\text{hash}(s_{i...i+n}) = \left(s_i \cdot a^n + s_{i+1} \cdot a^{n-1} + \dots + s_{i+n-1} \cdot a + s_{i+n}\right) \bmod M
\]
After sliding, the hash of \( s_{i+1...i+n+1} \) is:

\[
\text{hash}(s_{i+1...i+n+1}) = \left(s_{i+1} \cdot a^n + s_{i+2} \cdot a^{n-1} + \dots + s_{i+n} \cdot a + s_{i+n+1}\right) \bmod M
\]
Thus, the recurrence relation is:

\[
\text{hash}(s_{i+1...i+n+1}) = \left(a \cdot \left(\text{hash}(s_{i...i+n}) - s_i \cdot a^n\right) + s_{i+n+1}\right) \bmod M
\]

The first window's hash costs \( O(n) \); each subsequent window is then obtained in \( O(1) \), so hashing all \( m \) windows costs \( O(m + n) \) in total.
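The recurrence can be checked directly. This sketch uses an integer polynomial hash; the constants `A` and `M` are our choices, not from the article, and the window is written as `n` characters `s[i..i+n)`, so the leading power is \( a^{n-1} \). `roll` applies the \( O(1) \) update and is verified against hashing from scratch:

```cpp
#include <cstdint>
#include <string>

// Polynomial hash of the n-character window s[i..i+n):
// hash = (s[i]*A^(n-1) + s[i+1]*A^(n-2) + ... + s[i+n-1]) mod M
const uint64_t A = 256, M = 1000000007ULL;  // base and prime modulus (our choices)

uint64_t hash_window(const std::string& s, size_t i, size_t n) {
    uint64_t h = 0;
    for (size_t k = 0; k < n; ++k)
        h = (h * A + (unsigned char)s[i + k]) % M;
    return h;
}

// Slide the window one position right in O(1):
// h_new = (A * (h_old - s[i]*A^(n-1)) + s[i+n]) mod M
uint64_t roll(const std::string& s, size_t i, size_t n,
              uint64_t h, uint64_t a_pow) {        // a_pow = A^(n-1) mod M
    uint64_t drop = (unsigned char)s[i] * a_pow % M;
    h = (h + M - drop) % M;                         // remove the outgoing character
    return (h * A + (unsigned char)s[i + n]) % M;   // shift in the new character
}
```

Each `roll` does a constant amount of arithmetic, so hashing all windows of a text of length `m` costs \( O(m + n) \) rather than \( O(mn) \).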
### Rabin-Karp Algorithm

The Rabin-Karp algorithm implements string matching with a rolling hash. It relies on an efficient hash function, specifically the **Rabin fingerprint**.

#### Rabin Fingerprint

The Rabin fingerprint is a polynomial hash over the finite field \( GF(2) \). A polynomial such as \( f(x) = x^3 + x^2 + 1 \) is represented by its coefficients, \( 1101 \) in binary.

In \( GF(2) \), addition and subtraction are both XOR, which simplifies computation by eliminating carries. Multiplication and division (remainder), however, still take \( O(k) \) time, where \( k \) is the polynomial's degree.
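A quick illustration of this arithmetic on the binary representation (a minimal sketch; `gf2_mul` is a plain shift-and-XOR carry-less multiply, restricted here to operands small enough that the product fits in 64 bits):

```cpp
#include <cstdint>

// GF(2) polynomials as bitmasks: bit k holds the coefficient of x^k.
// Addition and subtraction are both XOR. Multiplication is carry-less:
// for each set coefficient of g, XOR in a shifted copy of f.
uint64_t gf2_mul(uint64_t f, uint64_t g) {
    uint64_t r = 0;
    for (int k = 0; k < 64; ++k)
        if ((g >> k) & 1)
            r ^= f << k;
    return r;
}
```

For example, `0b1101 ^ 0b0110 == 0b1011`, i.e. \( (x^3+x^2+1) + (x^2+x) = x^3+x+1 \), and `gf2_mul(0b11, 0b11) == 0b101`, i.e. \( (x+1)^2 = x^2+1 \), because the two cross terms cancel under XOR.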
---

### Implementation Example

Choose a polynomial \( M(x) \) with \( k = 64 \), i.e., a 64-bit binary number:

```cpp
uint64_t poly = 0xbfe6b8a5bf378d83LL;
```
The recurrence relation is:

\[
H = \left(a(x) \cdot \left(H_\text{old} - s_i \cdot a^n(x)\right) + s_{i+n}\right) \bmod M(x)
\]

Key optimizations include:

1. **The multiplication by \( a(x) \)**: choose \( a(x) = x^8 \), so that multiplying by \( a(x) \) becomes a left shift by 8 bits, and precompute a 256-entry table of \( s_i \cdot a^n(x) \) for the outgoing byte.
2. **The modulo operation**: precompute a 256-entry table indexed by the fingerprint's top byte, reducing the reduction modulo \( M(x) \) to a single table lookup and XOR.
---

### Code for Finding Polynomial Degree

Below is the C++ implementation of finding the highest degree of a polynomial, equivalent to finding the most significant set bit of a binary number (`byteMSB` is a precomputed 256-entry table giving the most significant bit position within a single byte):
```cpp
uint32_t RabinChecksum::find_last_set(uint32_t value) {
    if (value & 0xffff0000) {
        if (value & 0xff000000)
            return 24 + byteMSB[value >> 24];  // MSB is in bits 24..31
        else
            return 16 + byteMSB[value >> 16];  // MSB is in bits 16..23
    } else {
        if (value & 0x0000ff00)
            return 8 + byteMSB[value >> 8];    // MSB is in bits 8..15
        else
            return byteMSB[value];             // MSB is in bits 0..7
    }
}

uint32_t RabinChecksum::find_last_set(uint64_t v) {
    uint32_t h = v >> 32;                      // high 32 bits of v
    if (h)
        return 32 + find_last_set(h);          // MSB lies in the high half
    else
        return find_last_set((uint32_t)v);     // otherwise search the low half
}
```
Using this, polynomial multiplication can be implemented efficiently.
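What the degree-finder actually buys is cheap long division over \( GF(2) \): repeatedly cancel the dividend's leading term with a shifted copy of the modulus until the degree drops below the modulus's. A sketch of the remainder step that modular multiplication is built on (we substitute a portable loop-based `msb64` for the table-based `find_last_set`):

```cpp
#include <cstdint>

// Degree of a nonzero GF(2) polynomial = index of its most significant set bit
// (a portable stand-in for the table-based find_last_set above).
int msb64(uint64_t v) {
    int d = -1;
    while (v) { v >>= 1; ++d; }
    return d;
}

// Remainder of f(x) divided by m(x) over GF(2): XOR in a copy of m(x)
// shifted so that its leading term cancels f's leading term, and repeat
// until deg(f) < deg(m).
uint64_t gf2_mod(uint64_t f, uint64_t m) {
    const int dm = msb64(m);
    while (f != 0 && msb64(f) >= dm)
        f ^= m << (msb64(f) - dm);
    return f;
}
```

With \( m(x) = x^3 + x + 1 \) (`0b1011`), `gf2_mod(0b1000, 0b1011)` returns `0b011`, i.e. \( x^3 \equiv x + 1 \pmod{m(x)} \).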
[Original Article (Chinese)](https://blog.csdn.net/cyk0620/article/details/120813255)
---
title: "CDC: Content-Based Variable-Length Chunking"
description: "An explanation of the principles behind content-defined variable-length chunking (CDC)"
date: "2023-05-24" # The date the post was first published.
category: "拿来主义" # [Optional] The category of the post.
tags: # [Optional] The tags of the post.
- "Storage"
- "CDC"
---
> This article was created with the assistance of AI.

# Why Chunk?

Chunking isolates changes: when a file is modified, only the modified chunks need to be updated.

## How to Chunk?

### Fixed-Length Chunking

Suppose we split a file into fixed-length chunks. If the file content is `abcdefg` and each chunk is four bytes, chunking yields `abcd|efg`. If a character is added at the beginning, changing the content to `0abcdefg`, chunking yields `0abc|defg`: both chunks differ completely from before. To sync this modification to a network file system, both chunks would have to be re-uploaded.

### Content-Defined Variable-Length Chunking

If we instead chunk based on content, cutting after every `d`, the chunks become `0abcd|efg`. Only one chunk differs from the previous chunking, so only that chunk needs to be re-uploaded, a large efficiency gain over fixed-length chunking.

#### Problem

With extremely low probability, this produces many short chunks: `dddd`, cut on `d`, becomes `d|d|d|d`, and too many chunks are hard to maintain. Clearly we cannot always use the same fixed content as the breakpoint: a file consisting of `dd...d` would be chunked into `d|d|...|d`, each chunk a single character, wasting space, complicating management, and defeating the purpose of chunking.

To solve this, we need a way to pick breakpoints that behaves like a random choice, giving chunks a predictable average size, while each breakpoint still satisfies some common, content-derived property.
# Hashing

A hash maps input of any length to a fixed-length output:

- Fast in the forward direction
- Difficult to reverse
- Sensitive to input changes
- Avoids collisions
- Can be computed in a rolling fashion
Objective: optimize hash-based string matching.

We match strings via their hash values. Given a pattern of length $$n$$, we can take each length-$$n$$ substring of the text, hash it, and compare with the pattern's hash; when the hashes collide, the substring is, with overwhelming probability, equal to the pattern. But naively hashing every candidate substring from scratch is no better than brute-force matching, so it needs optimization.

With a rolling hash, a sliding window of length $$n$$ solves this: each time the window advances one character, we remove the old character's contribution from the hash and add the new character's, obtaining the new hash at far lower cost.
The rolling hash needs a concrete scheme for accounting for these old and new character contributions.

To construct such a hash function, we map the input through a polynomial over a [[prime field]]. The [[Rabin fingerprint]] works on the same principle: encode the input as the coefficients of a polynomial, then take that polynomial modulo ($$mod$$) another, fixed polynomial.

In rolling-hash terms: before the window slides, encode the window contents as a polynomial and reduce it modulo a predefined polynomial $$M$$; after the window slides one position, encode the new contents and reduce modulo the same $$M$$. Let the string in the window before sliding be $$s_{i...i+n}$$, where n is the window length and M is a polynomial over a prime field. The hash before sliding is:
$$hash(s_{i...i+n}) = (s_i a^{n}+s_{i+1} a^{n-1}+\dots+s_{i+n-1}a + s_{i+n}) \pmod M$$

Similarly, the hash after sliding is:

$$hash(s_{i+1...i+n+1}) = (s_{i+1} a^{n}+s_{i+2} a^{n-1}+\dots+s_{i+n}a + s_{i+n+1}) \pmod M$$

This yields the recurrence:

$$hash(s_{i+1\dots i+n+1})=(a\cdot(hash(s_{i\dots i+n}) - s_i a^n) + s_{i+n+1})\pmod{M}$$

So after computing the first hash in O(n), each subsequent window's hash follows in O(1); with m windows in total, the overall complexity is O(m+n).
The [[Rabin-Karp algorithm]] implements string matching with a rolling hash. It needs a reliable, efficient hash function, namely the [[Rabin fingerprint]].

The Rabin fingerprint is also a polynomial hash, but it does not work over a prime field $\small M$. It uses polynomials over the [[finite field]] $\small GF(2)$, for example $f(x)=x^3+x^2+1$, which can be written in binary as $\small 1101$.

This representation is used because, compared to ordinary integer arithmetic, $\small GF(2)$ polynomial arithmetic is simpler: addition and subtraction are both XOR, so carries never need to be considered, while multiplication and division otherwise behave much like their integer counterparts. Even without carries, though, multiplication and division (remainder) still cost $\small O(k)$, where $\small k$ is the polynomial's highest degree.
The Rabin fingerprint's hash function is as follows (as with a prime field, the modulus must be an irreducible polynomial):

$$hash(s_{i\dots i+n})=(s_{i}a(x)^n+s_{i+1}a(x)^{n-1}+\dots +s_{i+n-1}a(x)+s_{i+n})\pmod{M(x)}$$

The recurrence is:

$$hash(s_{i+1\dots i+n+1})=(a(x)\cdot(hash(s_{i\dots i+n})-s_ia^{n}(x))+s_{i+n+1})\pmod{M(x)}$$

## Rabin Fingerprint Implementation
Choose a polynomial $$M(x)$$ with $$k=64$$, i.e., a 64-bit binary number:

```c++
uint64_t poly = 0xbfe6b8a5bf378d83LL;
```
Assume the hash of the window before sliding is known. By the fingerprint recurrence, the hash after sliding is $$H=(a(x)\cdot(H_{old} - s_i a^n(x))+s_{i+n}) \pmod{M(x)}$$

This expression has three parts.

**The multiplication $$a(x) \cdot H_{old}$$.** This cannot be fully precomputed, because the old hash is not known in advance: overall we must evaluate $$(p(x) \cdot a(x)) \pmod{M(x)}$$ for an arbitrary fingerprint $$p(x)$$.

Optimizing the multiplication: choose $$a(x)$$ carefully. The window slides one byte (8 bits) at a time, so let $$a(x)=x^8$$, which turns multiplication by $$a(x)$$ into a left shift by 8 bits in binary.
Optimizing the mod operation: write the fingerprint as $$p = j \cdot x^{k-8} + r$$, where $$k$$ is the degree of $$M(x)$$, $$j$$ is the polynomial formed by the top 8 bits of $$p$$, and $$\deg r < k-8$$. Multiplying by $$a(x)=x^8$$ gives

$$p \cdot x^8 = j \cdot x^{k} + r \cdot x^8$$

Since $$\deg(r \cdot x^8) < k$$, that term is already reduced; only $$j \cdot x^k$$ needs reduction modulo $$M(x)$$:

$$p \cdot x^8 \pmod{M(x)} = r \cdot x^8 + (j \cdot x^k \pmod{M(x)})$$

Because $$j$$ is a single byte, it takes only 256 values, so $$j \cdot x^k \pmod{M(x)}$$ can be precomputed. Folding the cancellation of the top byte into the table, define

$$T[j] = j \cdot x^{k} + (j \cdot x^k \pmod{M(x)})$$

In binary, `T[j] = (j << k) | ((j << k) mod M)`, where the OR is safe because the residue has degree below $$k$$. Since `p << 8` equals `(j << k) ^ (r << 8)` in binary, the whole update becomes

```c++
(p << 8) ^ T[p >> (k - 8)]
```

A single XOR with the precomputed entry simultaneously clears the top byte's overflow and adds its residue back in. The remaining shift is O(1), so with the table the entire expression is computed in O(1).
**The multiplication $$s_i a^n(x)$$.** This can be precomputed: first implement polynomial multiplication over $$GF(2)$$ to obtain $$a^n(x)$$, then enumerate all 256 values of $$s_i a^n(x)$$ and cache the results in a table.

**The addition of $$s_{i+n}$$.** Reducing this part needs only a single XOR, constant time, and can be ignored.
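The table construction and the resulting O(1) byte-append step can be verified end to end. This is a scaled-down sketch: we use an arbitrary degree-16 toy modulus instead of the article's 64-bit `poly`, so every value fits comfortably in 32 bits (the identity being tested holds for any modulus of degree `K`, irreducible or not):

```cpp
#include <cstdint>

const int K = 16;                // degree of the toy modulus M(x) (our choice)
const uint32_t POLY = 0x1100b;   // M(x) = x^16 + x^12 + x^3 + x + 1 (arbitrary)

// Degree of a nonzero GF(2) polynomial = index of its most significant set bit.
int deg(uint32_t v) { int d = -1; while (v) { v >>= 1; ++d; } return d; }

// Polynomial remainder over GF(2) by explicit long division.
uint32_t gf2_mod(uint32_t f, uint32_t m) {
    const int dm = deg(m);
    while (f != 0 && deg(f) >= dm)
        f ^= m << (deg(f) - dm);
    return f;
}

// T[j] clears the top byte j (sitting in bits K..K+7 after the shift) and
// XORs in its residue (j * x^K) mod M(x), all in a single operation.
uint32_t T[256];
void build_table() {
    for (uint32_t j = 0; j < 256; ++j)
        T[j] = (j << K) | gf2_mod(j << K, POLY);
}

// One O(1) step: p = (p * x^8 + b) mod M(x), using only shift, OR and XOR.
uint32_t append_byte(uint32_t p, uint8_t b) {
    return ((p << 8) | b) ^ T[p >> (K - 8)];
}
```

Feeding bytes through `append_byte` yields the same fingerprint as naively multiplying by $$x^8$$, adding the byte, and long-dividing at every step, but each step is constant time.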
Code: first, a function that finds the degree of a polynomial's highest term, which is essentially finding the position of the most significant set bit of a binary number:
```c++
/// <summary>
/// find last set
/// For a uint32, find the position of the most significant set bit.
/// </summary>
/// <param name="value">the number to examine</param>
/// <returns>the position of the highest set bit</returns>
uint32_t RabinChecksum::find_last_set(uint32_t value)
{
    // If the upper 16 bits of this 32-bit integer are nonzero
    if (value & 0xffff0000)
    {
        if (value & 0xff000000)
            // The top 8 bits are nonzero.
            // Shifting right by 24 leaves only those 8 bits (range 0-255).
            // byteMSB gives the highest set bit within that byte (0-7);
            // adding 24 maps the result into the range 24-31, which is where
            // the MSB must lie, since we know the top byte is nonzero.
            return 24 + byteMSB[value >> 24];
        else
            // The top 8 bits are zero: shift right by 16 to keep bits 16-23,
            // look up that byte's MSB, and add 16 to map the result to 16-23.
            return 16 + byteMSB[value >> 16];
    }
    else
    {
        // The upper 16 bits are all zero; only the lower 16 remain.
        if (value & 0x0000ff00)
            // Bits 8-15 are nonzero.
            return 8 + byteMSB[value >> 8];
        else
            // Bits 8-15 are zero too; only the lowest byte remains.
            return byteMSB[value];
    }
}

/// <summary>
/// For a uint64, find the position of the most significant set bit.
/// </summary>
/// <param name="value">the number to examine</param>
/// <returns>the position of the highest set bit</returns>
uint32_t RabinChecksum::find_last_set(uint64_t v)
{
    uint32_t h = v >> 32; // h is the upper 32 bits of v
    if (h)
    {
        // The upper 32 bits are nonzero and equal h,
        // so the MSB is h's MSB plus 32.
        return 32 + find_last_set(h);
    }
    else
    {
        // The upper 32 bits are zero: truncate and search the lower 32 bits.
        return find_last_set((uint32_t)v);
    }
}
```
With this degree-finding function, polynomial multiplication can be implemented.

[Original article (Chinese)](https://blog.csdn.net/cyk0620/article/details/120813255)