Commit

Add articles on Content Defined Chunking in English and Chinese

charlieJ107 committed Jan 28, 2025
1 parent 118d902 commit e716b1a
Showing 2 changed files with 266 additions and 0 deletions.
136 changes: 136 additions & 0 deletions content/blog/en/Content Defined Chunk.md
@@ -0,0 +1,136 @@
---
title: "Content Defined Chunk (CDC)"
description: "Under the hood of CDC"
date: "2023-05-24" # The date the post was first published.
category: "拿来主义" # [Optional] The category of the post.
tags: # [Optional] The tags of the post.
- "Storage"
- "CDC"
---

> This article was created with the assistance of AI.
# Why Use Chunking?

By chunking data, we isolate changes. When a file is modified, only the modified chunks need to be updated.

## How to Chunk?

### Fixed-Length Chunking

Suppose a file is chunked into fixed-length blocks. For example, if the file content is `abcdefg` and it is divided into chunks of four bytes, we get `abcd|efg`. If a character is added at the beginning, making the content `0abcdefg`, the chunks become `0abc|defg`. Both chunks differ completely from the previous ones, so when syncing the modification to a network file system, both chunks must be re-uploaded.
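
A minimal sketch of this fixed-length scheme (the function name and the four-byte default are ours, for illustration only):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split data into fixed-size chunks; the last chunk may be shorter.
std::vector<std::string> fixed_chunks(const std::string& data, std::size_t chunk_size = 4) {
    std::vector<std::string> chunks;
    for (std::size_t i = 0; i < data.size(); i += chunk_size)
        chunks.push_back(data.substr(i, chunk_size));
    return chunks;
}

// fixed_chunks("abcdefg")  -> {"abcd", "efg"}
// fixed_chunks("0abcdefg") -> {"0abc", "defg"}   // every chunk changes after a 1-byte insert
```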

### Content-Defined Variable-Length Chunking

If chunks are instead defined based on content, using `d` as a breakpoint and cutting right after it, the chunks become `0abcd|efg`. Only one chunk differs from the previous set, so only that chunk needs to be re-uploaded, which is significantly more efficient than fixed-length chunking.

#### Problem

With extremely low probability, many short chunks can be produced: with `d` as the breakpoint, `dddd` is divided into `d|d|d|d`. Such an explosion of chunks becomes hard to manage. Clearly, we cannot always use the same content as the breakpoint: if the file content is `dd...d`, it is chunked into `d|d|...|d|`, each chunk holding a single character, which wastes space and defeats the original purpose of chunking.

To address this issue, we need a way to choose breakpoints randomly so that chunks have a controlled average size while every breakpoint still satisfies the same property.
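
As a rough sketch of how such a chunker can look in practice: a rolling hash is computed over a small window, a boundary is declared whenever the hash's low bits are all zero (which fixes the average chunk size), and minimum/maximum chunk sizes guard against the degenerate cases above. All names and constants here (`cdc_chunks`, `WINDOW`, `MASK`, ...) are illustrative assumptions, and the hash is taken mod \( 2^{64} \) rather than mod a prime.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t   WINDOW    = 16;      // bytes covered by the rolling hash
constexpr std::uint64_t MASK      = 0x1FFF;  // 13 low bits -> ~8 KiB average chunks
constexpr std::size_t   MIN_CHUNK = 2048;    // guards against runs of tiny chunks
constexpr std::size_t   MAX_CHUNK = 65536;   // forces a boundary eventually
constexpr std::uint64_t A         = 6364136223846793005ULL;  // hash multiplier

std::vector<std::string> cdc_chunks(const std::string& data) {
    // A^WINDOW, used to drop the byte leaving the window (arithmetic wraps mod 2^64).
    std::uint64_t aW = 1;
    for (std::size_t k = 0; k < WINDOW; ++k) aW *= A;

    std::vector<std::string> chunks;
    std::size_t start = 0;
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::size_t len = i - start + 1;                       // bytes in the current chunk
        h = h * A + static_cast<unsigned char>(data[i]);       // bring in the new byte
        if (len > WINDOW)                                      // push out the old byte
            h -= aW * static_cast<unsigned char>(data[i - WINDOW]);
        // Cut when the hash's low bits are zero (content-defined), or when forced.
        if ((len >= MIN_CHUNK && (h & MASK) == 0) || len >= MAX_CHUNK) {
            chunks.emplace_back(data, start, len);
            start = i + 1;
            h = 0;                                             // restart the window
        }
    }
    if (start < data.size())
        chunks.emplace_back(data, start, std::string::npos);   // trailing chunk
    return chunks;
}
```

`MIN_CHUNK` prevents the `d|d|d|...` degenerate case, `MAX_CHUNK` bounds the other extreme, and the mask fixes the expected chunk size, so boundaries depend only on local content rather than absolute offsets.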

---

# Hashing

Hashing maps an input of any length to a fixed-length output with the following properties:

- Fast in the forward direction
- Difficult to reverse
- Sensitive to input changes
- Avoids collisions
- Rolling hash

### Objective: Optimizing String Matching with Hashing

We match strings by their hash values: given a pattern of length \( n \), we take each length-\( n \) substring of the text, compute its hash, and compare it with the pattern's hash. If the hashes are equal, the substring matches the pattern with high probability. However, naively recomputing every substring's hash from scratch is no better than brute-force matching, so optimization is required.

Using a rolling hash, a sliding window of length \( n \) calculates the hash of each substring by removing the effect of the old character and adding the effect of the new character, significantly reducing computational complexity.

#### Rolling Hash Formula

Let the substring within the window before sliding be \( s_{i \dots i+n} \), with \( a \) a fixed base and \( M \) the modulus (a large prime, or an irreducible polynomial over a prime field). The hash is:

\[
\text{hash}(s_{i...i+n}) = \left(s_i \cdot a^n + s_{i+1} \cdot a^{n-1} + \dots + s_{i+n-1} \cdot a + s_{i+n}\right) \bmod M
\]

After sliding, the hash of \( s_{i+1 \dots i+n+1} \) is:

\[
\text{hash}(s_{i+1...i+n+1}) = \left(s_{i+1} \cdot a^n + s_{i+2} \cdot a^{n-1} + \dots + s_{i+n} \cdot a + s_{i+n+1}\right) \bmod M
\]

Thus, the recurrence relation is:

\[
\text{hash}(s_{i+1...i+n+1}) = \left(a \cdot \text{hash}(s_{i...i+n}) - s_i \cdot a^n + s_{i+n+1}\right) \bmod M
\]

This allows the next hash value to be computed in \( O(1) \): the first window costs \( O(n) \), each subsequent window \( O(1) \), giving \( O(m + n) \) total complexity for \( m \) windows.
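
A minimal sketch of this rolling-hash matching over integers modulo a large prime; `BASE`, `MOD`, and `find_all` are illustrative names, not from the article:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::uint64_t BASE = 257;
constexpr std::uint64_t MOD  = 1'000'000'007ULL;  // a large prime modulus M

std::vector<std::size_t> find_all(const std::string& text, const std::string& pattern) {
    std::vector<std::size_t> hits;
    const std::size_t n = pattern.size();
    if (n == 0 || text.size() < n) return hits;

    // a^(n-1) mod M, needed to remove the outgoing character's contribution.
    std::uint64_t high = 1;
    for (std::size_t i = 0; i + 1 < n; ++i) high = high * BASE % MOD;

    // O(n): hash of the pattern and of the first window.
    std::uint64_t hp = 0, hw = 0;
    for (std::size_t i = 0; i < n; ++i) {
        hp = (hp * BASE + static_cast<unsigned char>(pattern[i])) % MOD;
        hw = (hw * BASE + static_cast<unsigned char>(text[i])) % MOD;
    }

    for (std::size_t i = 0; ; ++i) {
        // Equal hashes almost certainly mean a match; compare to rule out collisions.
        if (hw == hp && text.compare(i, n, pattern) == 0) hits.push_back(i);
        if (i + n >= text.size()) break;
        // O(1) roll: drop text[i], shift, add text[i + n].
        hw = (hw + MOD - static_cast<unsigned char>(text[i]) * high % MOD) % MOD;
        hw = (hw * BASE + static_cast<unsigned char>(text[i + n])) % MOD;
    }
    return hits;
}
```

On an equal hash the candidate is re-checked byte for byte, so rare collisions cannot produce false matches.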

### Rabin-Karp Algorithm

The Rabin-Karp algorithm implements string matching using rolling hash. It relies on an efficient hash function, specifically the **Rabin Fingerprint**.

#### Rabin Fingerprint

The Rabin fingerprint is a polynomial hash over the finite field \( GF(2) \). For example, \( f(x) = x^3 + x^2 + 1 \) is represented as \( 1101 \) in binary.

Addition and subtraction are XOR operations, simplifying computation by avoiding carry-over concerns. However, multiplication and division require \( O(k) \) complexity (where \( k \) is the polynomial's degree).
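
A small sketch of this arithmetic, representing a \( GF(2) \) polynomial as a bit pattern where bit \( i \) is the coefficient of \( x^i \) (`gf2_add` and `gf2_mul` are our names, for illustration):

```cpp
#include <cstdint>

// Addition and subtraction over GF(2) are both XOR: there are no carries to propagate.
inline std::uint32_t gf2_add(std::uint32_t a, std::uint32_t b) { return a ^ b; }

// Carry-less "schoolbook" multiplication: one shift-and-XOR per set bit of b,
// so the cost grows with the degree k of the operands.
inline std::uint64_t gf2_mul(std::uint32_t a, std::uint32_t b) {
    std::uint64_t r = 0;
    for (int i = 0; i < 32; ++i)
        if (b & (1u << i))
            r ^= static_cast<std::uint64_t>(a) << i;   // add a * x^i
    return r;
}

// Example: f(x) = x^3 + x^2 + 1 is 0b1101, and (x + 1)^2 = x^2 + 1,
// i.e. gf2_mul(0b11, 0b11) == 0b101.
```

Multiplication modulo \( M(x) \) additionally needs a reduction step, which is sketched after the degree-finding code below.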

---

### Implementation Example

For a polynomial \( M(x) \) of degree \( 64 \):

```cpp
uint64_t poly = 0xbfe6b8a5bf378d83LL;
```

The recurrence relation is:

\[
H = \left(a(x) \cdot H_\text{old} - s_i \cdot a^n(x) + s_{i+n}\right) \bmod M(x)
\]

Key optimizations include:

1. **Multiplication \( a(x) \cdot H_\text{old} \)**: since the window slides one byte at a time, choose \( a(x) = x^8 \), so multiplying by \( a(x) \) is just a left shift by 8 bits.
2. **Efficient modulo operations**: after the shift, the reduction mod \( M(x) \) depends only on the top byte of the old fingerprint (the part around its highest-degree term \( g(x) \)), so those contributions can be precomputed into a lookup table, making the whole update \( O(1) \).
3. **The outgoing term \( s_i \cdot a^n(x) \)**: implement \( GF(2) \) polynomial multiplication once to obtain \( a^n(x) \), then precompute \( s_i \cdot a^n(x) \bmod M(x) \) for every possible byte \( s_i \) and cache it in a table.

---

### Code for Finding Polynomial Degree

Below is the C++ implementation for finding the highest degree of a polynomial, which is equivalent to finding the position of the most significant set bit of a binary number (`byteMSB` is a precomputed lookup table giving the most significant set bit of a single byte):

```cpp
uint32_t RabinChecksum::find_last_set(uint32_t value) {
    if (value & 0xffff0000) {
        // The upper 16 bits are non-zero.
        if (value & 0xff000000)
            // The top byte is non-zero: look up its MSB and offset by 24.
            return 24 + byteMSB[value >> 24];
        else
            // Bits 16-23 hold the highest set bit: look them up and offset by 16.
            return 16 + byteMSB[value >> 16];
    } else {
        // The upper 16 bits are zero.
        if (value & 0x0000ff00)
            // Bits 8-15 hold the highest set bit: look them up and offset by 8.
            return 8 + byteMSB[value >> 8];
        else
            // Only the lowest byte remains: look it up directly.
            return byteMSB[value];
    }
}

uint32_t RabinChecksum::find_last_set(uint64_t v) {
    uint32_t h = v >> 32;           // upper 32 bits of v
    if (h)
        // The upper half is non-zero: its MSB position, offset by 32.
        return 32 + find_last_set(h);
    else
        // The upper half is zero: search the lower 32 bits.
        return find_last_set((uint32_t)v);
}
```
Using this, polynomial multiplication can be implemented efficiently.
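
As a sketch of how such a degree finder is used: modular reduction repeatedly cancels the leading term, and multiplication mod \( M(x) \) reduces after every shift. The helper names below are our assumptions, and `__builtin_clzll` (a GCC/Clang builtin) stands in for the table-driven `find_last_set` above.

```cpp
#include <cstdint>

// Degree of a GF(2) polynomial stored as a bit pattern (-1 for the zero polynomial).
inline int deg(std::uint64_t p) { return p ? 63 - __builtin_clzll(p) : -1; }

// p mod m over GF(2): repeatedly cancel p's leading term with a shifted copy of m (m != 0).
std::uint64_t gf2_mod(std::uint64_t p, std::uint64_t m) {
    while (deg(p) >= deg(m))
        p ^= m << (deg(p) - deg(m));
    return p;
}

// (a * b) mod m over GF(2); every intermediate stays below deg(m), so it fits in 64 bits.
std::uint64_t gf2_mulmod(std::uint64_t a, std::uint64_t b, std::uint64_t m) {
    a = gf2_mod(a, m);
    std::uint64_t r = 0;
    while (b != 0) {
        if (b & 1) r ^= a;              // add (XOR) the current multiple of a
        b >>= 1;
        a <<= 1;                        // multiply a by x ...
        if (deg(a) == deg(m)) a ^= m;   // ... and reduce immediately
    }
    return r;
}

// Example: the per-byte table for the outgoing term could be built as
//   table[s] = gf2_mulmod(s, a_to_the_n, poly);   // a_to_the_n = a^n(x) mod M(x)
```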
[Original Article (Chinese)](https://blog.csdn.net/cyk0620/article/details/120813255)
130 changes: 130 additions & 0 deletions content/blog/zh/CDC基于内容的可变长度分块.md
@@ -0,0 +1,130 @@
---
title: "CDC: Content-Defined Variable-Length Chunking"
description: "An explanation of the principles behind Content Defined Chunking (CDC)"
date: "2023-05-24" # The date the post was first published.
category: "拿来主义" # [Optional] The category of the post.
tags: # [Optional] The tags of the post.
- "Storage"
- "CDC"
---

> This article was created with the assistance of AI.
# Why Use Chunking?
Chunking isolates changes: when a file is modified, only the modified chunks need to be updated.
How should we chunk?
Fixed-length chunking: suppose we chunk a file into fixed lengths. If the file content is `abcdefg` and every four bytes form a chunk, the result is `abcd|efg`. If a character is added at the front, changing the content to `0abcdefg`, the chunks become `0abc|defg`, and both chunks are completely different from before. This means that syncing the modification to a network file system requires re-uploading both chunks.
Content-defined variable-length chunking: suppose we chunk based on content instead, using `d` as the breakpoint and cutting right after each `d`. The chunks then become `0abcd|efg`. Only one chunk differs from the previous chunking, so only that inconsistent chunk needs to be re-uploaded, which is far more efficient than fixed-length chunking.
Problem: with extremely low probability, many short chunks appear. For example, if `dddd` is cut at every `d`, we get `d|d|d|d`. Too many chunks become hard to maintain. Clearly we cannot always use the same content as the breakpoint: if the file content is `dd...d`, the chunks are `d|d|...|d|`, each containing a single character, which wastes space, is hard to manage, and defeats the original purpose of chunking.
To solve this, we need a way to pick breakpoints randomly so that chunks have a controlled average size while every breakpoint still satisfies the same property.
# Hashing
Arbitrary-length input, fixed-length output
- Fast in the forward direction
- Difficult to reverse
- Sensitive to input changes
- Collision avoidance
- Rolling hash
Objective: optimize the hash-based string matching problem
We match strings by their hash values. Given a pattern of length n, we can take each length-n substring of the text, compute its hash, and compare it with the pattern's hash; if they are equal, we have a match. With overwhelming probability, a substring whose hash collides with the pattern's is identical to the pattern. But the brute-force approach of simply hashing every possible substring is no better than brute-force matching, so it needs to be optimized.
With a rolling hash, a sliding window of length n solves this: each time the window advances by one character, we remove the old character's contribution from the previous hash and add the new character's contribution to obtain the new hash, which greatly reduces the amount of computation.
Implementing a rolling hash requires a scheme for accounting for this removal of the old character and addition of the new one.
To construct such a hash function, we use a polynomial over a [[prime field]] as the mapping. Looking at how the [[Rabin fingerprint]] works, the essence is to encode the input to be hashed as the coefficients of a polynomial, construct another polynomial, and take the remainder ( $$mod$$ ) of the first modulo the second.
Back to the rolling hash: before the window slides, the window's contents are encoded into a polynomial and reduced mod a predefined polynomial $$M$$; the window then slides by one position, the new contents are encoded into a polynomial, and reduced mod the same polynomial. Let the string in the window before sliding be $$s_{i...i+n}$$, where n is the window length and M is a polynomial over a prime field. The hash before sliding is
$$hash(s_{i...i+n}) = (s_i a^{n}+s_{i+1} a^{n-1}+...+s_{i+n-1}a + s_{i+n}) \pmod M$$ Likewise, the hash of the string in the window after sliding is:
$$hash(s_{i+1...i+n+1}) = (s_{i+1} a^{n}+s_{i+2} a^{n-1}+...+s_{i+n}a + s_{i+n+1}) \pmod M$$ This gives the recurrence:
$$hash(s_{i+1\dots i+n+1})=(a \cdot hash(s_{i\dots i+n}) - s_i a^n + s_{i+n+1})\pmod{M}$$ With this, each step requires only one computation, so the next window's hash is obtained directly in O(1): the first hash costs O(n), every subsequent window is rolled in O(1), and with m windows in total the overall complexity is O(m+n).
The [[Rabin-Karp algorithm]] is an implementation of string matching via a rolling hash. It requires a reliable, efficient hash function, namely the [[Rabin fingerprint]].
The Rabin fingerprint is also a polynomial hash map, but it is not over a prime field $\small M$, and its result is not a single value. It uses polynomials over the [[finite field]] $\small GF(2)$, for example $f(x)=x^3+x^2+1$, which can be written in binary as $\small 1101$.
The reason for this polynomial representation is that, compared with ordinary arithmetic on values, polynomial arithmetic over $\small GF(2)$ is simpler: addition and subtraction are both XOR, so carries never need to be considered, and multiplication and division of such polynomials behave much like those of integers. Even so, multiplication and division (remainder) can only be done in $\small O(k)$ time, where $\small k$ is the highest degree of the polynomial.
The Rabin fingerprint's hash function is as follows (as with the prime field, the modulus must be an irreducible polynomial):
$$hash(s_{i\dots i+n})=(s_{i}a(x)^n+s_{i+1}a(x)^{n-1}+\dots +s_{i+n-1}a(x)+s_{i+n})\pmod{M(x)}$$ The recurrence is:
$$hash(s_{i+1\dots i+n+1})=\left(\left(a(x) \cdot hash(s_{i\dots i+n})\right) \bmod M(x) - s_i a^{n}(x)+s_{i+n+1}\right)\pmod{M(x)}$$

## Rabin Fingerprint Implementation
Choose a polynomial $$M(x)$$ with $$k=64$$, i.e., a 64-bit binary number:
```c++
uint64_t poly = 0xbfe6b8a5bf378d83LL;
```
We assume the hash of the string in the window before sliding is known. By the Rabin fingerprint recurrence, the hash after sliding is $$H=(a(x)\times H_{old} - s_i a^n(x)+s_{i+n}) \pmod{M(x)}$$
This expression breaks into three parts.
The multiplication part $$a(x) \times H_{old}$$ is hard to precompute, because the old hash cannot be known in advance. As a whole we need to compute $$(p(x) \cdot a(x)) \pmod{M(x)}$$.
Optimizing the multiplication part:
For the power of $$a(x)$$: choose the polynomial carefully. Since the window slides one byte (8 bits) at a time, we can let $$a(x)=x^8$$, so that multiplying by $$a(x)$$ becomes a left shift by 8 bits in binary.
Optimizing the mod (remainder) operation:
Let $$g(x)$$ be the highest-degree term of $$p(x)$$ (i.e., of the form $$g \cdot x^{\text{some}}$$); then
$$g(x)\le p(x)$$, and the original expression can be rewritten as
$$((p-p \pmod{g/a} + p \pmod{g/a}) \cdot a ) + p$$ which can be transformed into
$$(p \pmod{g/a} \cdot a) \pmod{g/a} - ( (p - p \pmod{g/a}) \cdot a ) \pmod {M}$$
$$( (p-(p - p \pmod{g/a})) \cdot a) \pmod{g/a} + ( ( p - p \pmod{g/a} ) \cdot a) \pmod{M}$$ Since $$( (p-(p - p \pmod{g/a})) \cdot a) \pmod{g/a} $$ is necessarily less than $$g$$, it is also necessarily less than $$a$$.
Let $$j \cdot (\frac{g}{a}) = p - p \pmod {g/a}$$; that is, we get
$$((p-j\cdot (\frac{g}{a})) \cdot a ) + (g \cdot j) \pmod{M}$$ where $$g/a = x^{shiftx-8}$$, $$x^{shiftx}$$ is the highest-degree term of the polynomial p(x), and shiftx is the degree of that term; therefore $$p-p \pmod{g/a}$$ is just p with only its top 8 bits kept. Rewriting each piece in binary and using bit operations, this can be written as
```c++
(p^j) << 8 + g * j % (xshift - 8)
```
Rewriting further and factoring out p, this becomes
```c++
(p<<8) ^ (g * j % (xshift - 8) | j << (xshift + 8))
```
So we precompute `g * j % (xshift - 8) | j << (xshift + 8)` (the second half of the expression above) and store it in a T table in advance. The first half is only a single shift, which is O(1), so the whole expression can be evaluated in O(1).
The multiplication part $$s_i a^n (x)$$ can be precomputed:
first implement polynomial multiplication over $$GF(2)$$, which gives us $$a^n(x)$$, then enumerate $$s_i a^n(x)$$ and cache the results in a table.
The addition part $$s_{i+n}$$ needs only a single XOR for its modular step, constant time, and can be ignored.
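
Putting the three parts together, below is a self-contained sketch of a rolling Rabin fingerprint with the two precomputed tables described above: `T_` folds a fingerprint's top byte back in after the multiply-by- $$x^8$$ shift, and `OUT_` removes the outgoing byte's $$s_i \cdot a^n(x)$$ term. The class and all names are ours, the convention here keeps the $$x^{64}$$ coefficient of $$M(x)$$ implicit so fingerprints fit in a `uint64_t`, and the irreducibility of the example polynomial is assumed rather than verified.

```c++
#include <cstdint>
#include <cstddef>
#include <deque>

class RollingRabin {
public:
    explicit RollingRabin(std::uint64_t poly, std::size_t window)
        : poly_(poly), window_(window) {
        // x^(64+i) mod M for i = 0..7: x^64 ≡ poly (mod M), then repeatedly multiply by x.
        std::uint64_t x64[8];
        std::uint64_t p = poly_;
        for (int i = 0; i < 8; ++i) { x64[i] = p; p = mul_x(p); }
        // T[j]: contribution of a fingerprint's top byte j after multiplying by x^8.
        for (int j = 0; j < 256; ++j) {
            std::uint64_t t = 0;
            for (int i = 0; i < 8; ++i)
                if (j & (1 << i)) t ^= x64[i];
            T_[j] = t;
        }
        // OUT[b] = (b * x^(8*window)) mod M: the outgoing byte's term,
        // obtained by pushing `window` zero bytes behind it.
        for (int b = 0; b < 256; ++b) {
            std::uint64_t f = static_cast<std::uint64_t>(b);
            for (std::size_t k = 0; k < window_; ++k) f = append(f, 0);
            OUT_[b] = f;
        }
    }

    // Feed one byte; once the window is full, the oldest byte's term is removed,
    // so the fingerprint always covers exactly the last `window` bytes.
    std::uint64_t roll(unsigned char in) {
        fp_ = append(fp_, in);
        bytes_.push_back(in);
        if (bytes_.size() > window_) {
            fp_ ^= OUT_[bytes_.front()];
            bytes_.pop_front();
        }
        return fp_;
    }

private:
    // Multiply a reduced polynomial by x, reducing mod M = x^64 + poly.
    std::uint64_t mul_x(std::uint64_t v) const {
        std::uint64_t carry = v >> 63;          // coefficient that would become x^64
        return (v << 1) ^ (carry ? poly_ : 0);
    }
    // (f * x^8 + b) mod M in O(1): shift the low 56 bits, fold the top byte via T.
    std::uint64_t append(std::uint64_t f, unsigned char b) const {
        return ((f & 0x00FFFFFFFFFFFFFFULL) << 8) ^ b ^ T_[f >> 56];
    }

    std::uint64_t poly_;
    std::size_t window_;
    std::uint64_t T_[256] = {};
    std::uint64_t OUT_[256] = {};
    std::uint64_t fp_ = 0;
    std::deque<unsigned char> bytes_;
};

// Usage: RollingRabin rh(0xbfe6b8a5bf378d83ULL, 48); then call rh.roll(byte) per input byte.
```

A chunker would then test `roll()`'s return value against a mask to decide chunk boundaries, exactly as in the content-defined chunking discussion at the top of the article.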
Code
First, implement a function that finds the degree of a polynomial's highest-degree term, which is essentially finding the position of the most significant set bit of a binary number.
```c++
/// <summary>
/// find last set
/// For a uint32, find the position of the most significant set bit.
/// </summary>
/// <param name="value">the number to examine</param>
/// <returns>the position of the most significant set bit</returns>
uint32_t RabinChecksum::find_last_set(uint32_t value)
{
    // The upper 16 bits of this 32-bit integer are non-zero
    if (value & 0xffff0000)
    {
        if (value & 0xff000000)
            // The top 8 bits are non-zero.
            // Shifting value right by 24 bits leaves only those 8 bits (range 0-255);
            // look up the position of their most significant 1, a result in 0-8 (0000-1000).
            // 24 in binary is 0001 1000, so adding it maps the result into the range 24-32,
            // which is where the most significant 1 must be (we know the top 8 bits are non-zero).
            return 24 + byteMSB[value >> 24];
        else
            // The top 8 bits are zero: shift right by 16 to keep the next 8 bits and look them up.
            // 16 in binary is 0001 0000; any result is mapped into the range 16-31.
            return 16 + byteMSB[value >> 16];
    }
    else
    {
        // The upper 16 bits of this 32-bit integer are all zero; only the lower 16 remain.
        if (value & 0x0000ff00)
            // The upper 8 of these lower 16 bits are non-zero.
            return 8 + byteMSB[value >> 8];
        else
            // The upper 8 of the lower 16 bits are zero; only the last 8 bits remain, look them up directly.
            return byteMSB[value];
    }
}
/// <summary>
/// For a uint64, find the position of the most significant set bit.
/// </summary>
/// <param name="value">the number to examine</param>
/// <returns>the position of the most significant set bit</returns>
uint32_t RabinChecksum::find_last_set(uint64_t v)
{
    uint32_t h = v >> 32; // h is the upper 32 bits of v
    if (h)
    {
        // The upper 32 bits of v are non-zero and equal to h,
        // so the answer is the position of h's most significant bit plus 32.
        return 32 + find_last_set(h);
    }
    else
    {
        // The upper 32 bits are zero: truncate and search the lower 32 bits.
        return find_last_set((uint32_t)v);
    }
}
```
Using this degree-finding function, polynomial multiplication can then be implemented.
[Original article (Chinese)](https://blog.csdn.net/cyk0620/article/details/120813255)
