Commit

Add articles on Content Defined Chunking in English and Chinese

charlieJ107 committed Jan 28, 2025
1 parent 118d902 commit e716b1a
Showing 2 changed files with 266 additions and 0 deletions.
136 changes: 136 additions & 0 deletions content/blog/en/Content Defined Chunk.md
@@ -0,0 +1,136 @@
---
title: "Content Defined Chunk (CDC)"
description: "Under the hood of CDC"
date: "2023-05-24" # The date the post was first published.
category: "拿来主义" # [Optional] The category of the post.
tags: # [Optional] The tags of the post.
- "Storage"
- "CDC"
---

> This article was created with the assistance of AI.
# Why Use Chunking?

By chunking data, we isolate changes. When a file is modified, only the modified chunks need to be updated.

## How to Chunk?

### Fixed-Length Chunking

Suppose a file is chunked into fixed-length blocks. For example, if the file content is `abcdefg` and it is divided into chunks of four bytes, we get `abcd|efg`. If a character is added at the beginning, making the content `0abcdefg`, the chunks become `0abc|defg`. Both chunks differ completely from the previous ones, so when syncing the modification to a network file system, both chunks must be re-uploaded.
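
A minimal sketch of this fixed-length scheme (the function name and the four-byte default are ours, for illustration only):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split data into fixed-size chunks; the last chunk may be shorter.
std::vector<std::string> fixed_chunks(const std::string& data, std::size_t chunk_size = 4) {
    std::vector<std::string> chunks;
    for (std::size_t i = 0; i < data.size(); i += chunk_size)
        chunks.push_back(data.substr(i, chunk_size));
    return chunks;
}

// fixed_chunks("abcdefg")  -> {"abcd", "efg"}
// fixed_chunks("0abcdefg") -> {"0abc", "defg"}   // every chunk changes after a 1-byte insert
```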

### Content-Defined Variable-Length Chunking

If chunks are instead defined based on content, using `d` as a breakpoint and cutting right after it, the chunks become `0abcd|efg`. Only one chunk differs from the previous set, so only that chunk needs to be re-uploaded, which is significantly more efficient than fixed-length chunking.

#### Problem

With extremely low probability, many short chunks can be produced: with `d` as the breakpoint, `dddd` is divided into `d|d|d|d`. Such an explosion of chunks becomes hard to manage. Clearly, we cannot always use the same content as the breakpoint: if the file content is `dd...d`, it is chunked into `d|d|...|d|`, each chunk holding a single character, which wastes space and defeats the original purpose of chunking.

To address this issue, we need a way to choose breakpoints randomly so that chunks have a controlled average size while every breakpoint still satisfies the same property.
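
As a rough sketch of how such a chunker can look in practice: a rolling hash is computed over a small window, a boundary is declared whenever the hash's low bits are all zero (which fixes the average chunk size), and minimum/maximum chunk sizes guard against the degenerate cases above. All names and constants here (`cdc_chunks`, `WINDOW`, `MASK`, ...) are illustrative assumptions, and the hash is taken mod \( 2^{64} \) rather than mod a prime.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t   WINDOW    = 16;      // bytes covered by the rolling hash
constexpr std::uint64_t MASK      = 0x1FFF;  // 13 low bits -> ~8 KiB average chunks
constexpr std::size_t   MIN_CHUNK = 2048;    // guards against runs of tiny chunks
constexpr std::size_t   MAX_CHUNK = 65536;   // forces a boundary eventually
constexpr std::uint64_t A         = 6364136223846793005ULL;  // hash multiplier

std::vector<std::string> cdc_chunks(const std::string& data) {
    // A^WINDOW, used to drop the byte leaving the window (arithmetic wraps mod 2^64).
    std::uint64_t aW = 1;
    for (std::size_t k = 0; k < WINDOW; ++k) aW *= A;

    std::vector<std::string> chunks;
    std::size_t start = 0;
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::size_t len = i - start + 1;                       // bytes in the current chunk
        h = h * A + static_cast<unsigned char>(data[i]);       // bring in the new byte
        if (len > WINDOW)                                      // push out the old byte
            h -= aW * static_cast<unsigned char>(data[i - WINDOW]);
        // Cut when the hash's low bits are zero (content-defined), or when forced.
        if ((len >= MIN_CHUNK && (h & MASK) == 0) || len >= MAX_CHUNK) {
            chunks.emplace_back(data, start, len);
            start = i + 1;
            h = 0;                                             // restart the window
        }
    }
    if (start < data.size())
        chunks.emplace_back(data, start, std::string::npos);   // trailing chunk
    return chunks;
}
```

`MIN_CHUNK` prevents the `d|d|d|...` degenerate case, `MAX_CHUNK` bounds the other extreme, and the mask fixes the expected chunk size, so boundaries depend only on local content rather than absolute offsets.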

---

# Hashing

Hashing maps an input of any length to a fixed-length output with the following properties:

- Fast in the forward direction
- Difficult to reverse
- Sensitive to input changes
- Avoids collisions
- Rolling hash

### Objective: Optimizing String Matching with Hashing

We match strings by their hash values: given a pattern of length \( n \), we take each length-\( n \) substring of the text, compute its hash, and compare it with the pattern's hash. If the hashes are equal, the substring matches the pattern with high probability. However, naively recomputing every substring's hash from scratch is no better than brute-force matching, so optimization is required.

Using a rolling hash, a sliding window of length \( n \) calculates the hash of each substring by removing the effect of the old character and adding the effect of the new character, significantly reducing computational complexity.

#### Rolling Hash Formula

Let the substring within the window before sliding be \( s_{i \dots i+n} \), with \( a \) a fixed base and \( M \) the modulus (a large prime, or an irreducible polynomial over a prime field). The hash is:

\[
\text{hash}(s_{i...i+n}) = \left(s_i \cdot a^n + s_{i+1} \cdot a^{n-1} + \dots + s_{i+n-1} \cdot a + s_{i+n}\right) \bmod M
\]

After sliding, the hash of \( s_{i+1 \dots i+n+1} \) is:

\[
\text{hash}(s_{i+1...i+n+1}) = \left(s_{i+1} \cdot a^n + s_{i+2} \cdot a^{n-1} + \dots + s_{i+n} \cdot a + s_{i+n+1}\right) \bmod M
\]

Thus, the recurrence relation is:

\[
\text{hash}(s_{i+1...i+n+1}) = \left(a \cdot \text{hash}(s_{i...i+n}) - s_i \cdot a^n + s_{i+n+1}\right) \bmod M
\]

This allows the next hash value to be computed in \( O(1) \): the first window costs \( O(n) \), each subsequent window \( O(1) \), giving \( O(m + n) \) total complexity for \( m \) windows.
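
A minimal sketch of this rolling-hash matching over integers modulo a large prime; `BASE`, `MOD`, and `find_all` are illustrative names, not from the article:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::uint64_t BASE = 257;
constexpr std::uint64_t MOD  = 1'000'000'007ULL;  // a large prime modulus M

std::vector<std::size_t> find_all(const std::string& text, const std::string& pattern) {
    std::vector<std::size_t> hits;
    const std::size_t n = pattern.size();
    if (n == 0 || text.size() < n) return hits;

    // a^(n-1) mod M, needed to remove the outgoing character's contribution.
    std::uint64_t high = 1;
    for (std::size_t i = 0; i + 1 < n; ++i) high = high * BASE % MOD;

    // O(n): hash of the pattern and of the first window.
    std::uint64_t hp = 0, hw = 0;
    for (std::size_t i = 0; i < n; ++i) {
        hp = (hp * BASE + static_cast<unsigned char>(pattern[i])) % MOD;
        hw = (hw * BASE + static_cast<unsigned char>(text[i])) % MOD;
    }

    for (std::size_t i = 0; ; ++i) {
        // Equal hashes almost certainly mean a match; compare to rule out collisions.
        if (hw == hp && text.compare(i, n, pattern) == 0) hits.push_back(i);
        if (i + n >= text.size()) break;
        // O(1) roll: drop text[i], shift, add text[i + n].
        hw = (hw + MOD - static_cast<unsigned char>(text[i]) * high % MOD) % MOD;
        hw = (hw * BASE + static_cast<unsigned char>(text[i + n])) % MOD;
    }
    return hits;
}
```

On an equal hash the candidate is re-checked byte for byte, so rare collisions cannot produce false matches.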

### Rabin-Karp Algorithm

The Rabin-Karp algorithm implements string matching using rolling hash. It relies on an efficient hash function, specifically the **Rabin Fingerprint**.

#### Rabin Fingerprint

The Rabin fingerprint is a polynomial hash over the finite field \( GF(2) \). For example, \( f(x) = x^3 + x^2 + 1 \) is represented as \( 1101 \) in binary.

Addition and subtraction are XOR operations, simplifying computation by avoiding carry-over concerns. However, multiplication and division require \( O(k) \) complexity (where \( k \) is the polynomial's degree).
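
A small sketch of this arithmetic, representing a \( GF(2) \) polynomial as a bit pattern where bit \( i \) is the coefficient of \( x^i \) (`gf2_add` and `gf2_mul` are our names, for illustration):

```cpp
#include <cstdint>

// Addition and subtraction over GF(2) are both XOR: there are no carries to propagate.
inline std::uint32_t gf2_add(std::uint32_t a, std::uint32_t b) { return a ^ b; }

// Carry-less "schoolbook" multiplication: one shift-and-XOR per set bit of b,
// so the cost grows with the degree k of the operands.
inline std::uint64_t gf2_mul(std::uint32_t a, std::uint32_t b) {
    std::uint64_t r = 0;
    for (int i = 0; i < 32; ++i)
        if (b & (1u << i))
            r ^= static_cast<std::uint64_t>(a) << i;   // add a * x^i
    return r;
}

// Example: f(x) = x^3 + x^2 + 1 is 0b1101, and (x + 1)^2 = x^2 + 1,
// i.e. gf2_mul(0b11, 0b11) == 0b101.
```

Multiplication modulo \( M(x) \) additionally needs a reduction step, which is sketched after the degree-finding code below.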

---

### Implementation Example

For a polynomial \( M(x) \) of degree \( 64 \):

```cpp
uint64_t poly = 0xbfe6b8a5bf378d83LL;
```

The recurrence relation is:

\[
H = \left(a(x) \cdot H_\text{old} - s_i \cdot a^n(x) + s_{i+n}\right) \bmod M(x)
\]

Key optimizations include:

1. **Multiplication \( a(x) \cdot H_\text{old} \)**: since the window slides one byte at a time, choose \( a(x) = x^8 \), so multiplying by \( a(x) \) is just a left shift by 8 bits.
2. **Efficient modulo operations**: after the shift, the reduction mod \( M(x) \) depends only on the top byte of the old fingerprint (the part around its highest-degree term \( g(x) \)), so those contributions can be precomputed into a lookup table, making the whole update \( O(1) \).
3. **The outgoing term \( s_i \cdot a^n(x) \)**: implement \( GF(2) \) polynomial multiplication once to obtain \( a^n(x) \), then precompute \( s_i \cdot a^n(x) \bmod M(x) \) for every possible byte \( s_i \) and cache it in a table.

---

### Code for Finding Polynomial Degree

Below is the C++ implementation for finding the highest degree of a polynomial, which is equivalent to finding the position of the most significant set bit of a binary number (`byteMSB` is a precomputed lookup table giving the most significant set bit of a single byte):

```cpp
uint32_t RabinChecksum::find_last_set(uint32_t value) {
    if (value & 0xffff0000) {
        // The upper 16 bits are non-zero.
        if (value & 0xff000000)
            // The top byte is non-zero: look up its MSB and offset by 24.
            return 24 + byteMSB[value >> 24];
        else
            // Bits 16-23 hold the highest set bit: look them up and offset by 16.
            return 16 + byteMSB[value >> 16];
    } else {
        // The upper 16 bits are zero.
        if (value & 0x0000ff00)
            // Bits 8-15 hold the highest set bit: look them up and offset by 8.
            return 8 + byteMSB[value >> 8];
        else
            // Only the lowest byte remains: look it up directly.
            return byteMSB[value];
    }
}

uint32_t RabinChecksum::find_last_set(uint64_t v) {
    uint32_t h = v >> 32;           // upper 32 bits of v
    if (h)
        // The upper half is non-zero: its MSB position, offset by 32.
        return 32 + find_last_set(h);
    else
        // The upper half is zero: search the lower 32 bits.
        return find_last_set((uint32_t)v);
}
```
Using this, polynomial multiplication can be implemented efficiently.
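
As a sketch of how such a degree finder is used: modular reduction repeatedly cancels the leading term, and multiplication mod \( M(x) \) reduces after every shift. The helper names below are our assumptions, and `__builtin_clzll` (a GCC/Clang builtin) stands in for the table-driven `find_last_set` above.

```cpp
#include <cstdint>

// Degree of a GF(2) polynomial stored as a bit pattern (-1 for the zero polynomial).
inline int deg(std::uint64_t p) { return p ? 63 - __builtin_clzll(p) : -1; }

// p mod m over GF(2): repeatedly cancel p's leading term with a shifted copy of m (m != 0).
std::uint64_t gf2_mod(std::uint64_t p, std::uint64_t m) {
    while (deg(p) >= deg(m))
        p ^= m << (deg(p) - deg(m));
    return p;
}

// (a * b) mod m over GF(2); every intermediate stays below deg(m), so it fits in 64 bits.
std::uint64_t gf2_mulmod(std::uint64_t a, std::uint64_t b, std::uint64_t m) {
    a = gf2_mod(a, m);
    std::uint64_t r = 0;
    while (b != 0) {
        if (b & 1) r ^= a;              // add (XOR) the current multiple of a
        b >>= 1;
        a <<= 1;                        // multiply a by x ...
        if (deg(a) == deg(m)) a ^= m;   // ... and reduce immediately
    }
    return r;
}

// Example: the per-byte table for the outgoing term could be built as
//   table[s] = gf2_mulmod(s, a_to_the_n, poly);   // a_to_the_n = a^n(x) mod M(x)
```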
[Original Article (Chinese)](https://blog.csdn.net/cyk0620/article/details/120813255)
130 changes: 130 additions & 0 deletions content/blog/zh/CDC基于内容的可变长度分块.md
@@ -0,0 +1,130 @@
---
title: "CDC: Content-Defined Variable-Length Chunking"
description: "An explanation of the principles behind Content Defined Chunking (CDC)"
date: "2023-05-24" # The date the post was first published.
category: "拿来主义" # [Optional] The category of the post.
tags: # [Optional] The tags of the post.
- "Storage"
- "CDC"
---

> This article was created with the assistance of AI.
# Why Use Chunking?
Chunking isolates changes: when a file is modified, only the modified chunks need to be updated.
How should we chunk?
Fixed-length chunking: suppose we chunk a file into fixed lengths. If the file content is `abcdefg` and every four bytes form a chunk, the result is `abcd|efg`. If a character is added at the front, changing the content to `0abcdefg`, the chunks become `0abc|defg`, and both chunks are completely different from before. This means that syncing the modification to a network file system requires re-uploading both chunks.
Content-defined variable-length chunking: suppose we chunk based on content instead, using `d` as the breakpoint and cutting right after each `d`. The chunks then become `0abcd|efg`. Only one chunk differs from the previous chunking, so only that inconsistent chunk needs to be re-uploaded, which is far more efficient than fixed-length chunking.
Problem: with extremely low probability, many short chunks appear. For example, if `dddd` is cut at every `d`, we get `d|d|d|d`. Too many chunks become hard to maintain. Clearly we cannot always use the same content as the breakpoint: if the file content is `dd...d`, the chunks are `d|d|...|d|`, each containing a single character, which wastes space, is hard to manage, and defeats the original purpose of chunking.
To solve this, we need a way to pick breakpoints randomly so that chunks have a controlled average size while every breakpoint still satisfies the same property.
# Hashing
Arbitrary-length input, fixed-length output
- Fast in the forward direction
- Difficult to reverse
- Sensitive to input changes
- Collision avoidance
- Rolling hash
Objective: optimize the hash-based string matching problem
We match strings by their hash values. Given a pattern of length n, we can take each length-n substring of the text, compute its hash, and compare it with the pattern's hash; if they are equal, we have a match. With overwhelming probability, a substring whose hash collides with the pattern's is identical to the pattern. But the brute-force approach of simply hashing every possible substring is no better than brute-force matching, so it needs to be optimized.
With a rolling hash, a sliding window of length n solves this: each time the window advances by one character, we remove the old character's contribution from the previous hash and add the new character's contribution to obtain the new hash, which greatly reduces the amount of computation.
Implementing a rolling hash requires a scheme for accounting for this removal of the old character and addition of the new one.
To construct such a hash function, we use a polynomial over a [[prime field]] as the mapping. Looking at how the [[Rabin fingerprint]] works, the essence is to encode the input to be hashed as the coefficients of a polynomial, construct another polynomial, and take the remainder ( $$mod$$ ) of the first modulo the second.
Back to the rolling hash: before the window slides, the window's contents are encoded into a polynomial and reduced mod a predefined polynomial $$M$$; the window then slides by one position, the new contents are encoded into a polynomial, and reduced mod the same polynomial. Let the string in the window before sliding be $$s_{i...i+n}$$, where n is the window length and M is a polynomial over a prime field. The hash before sliding is
$$hash(s_{i...i+n}) = (s_i a^{n}+s_{i+1} a^{n-1}+...+s_{i+n-1}a + s_{i+n}) \pmod M$$ Likewise, the hash of the string in the window after sliding is:
$$hash(s_{i+1...i+n+1}) = (s_{i+1} a^{n}+s_{i+2} a^{n-1}+...+s_{i+n}a + s_{i+n+1}) \pmod M$$ This gives the recurrence:
$$hash(s_{i+1\dots i+n+1})=(a \cdot hash(s_{i\dots i+n}) - s_i a^n + s_{i+n+1})\pmod{M}$$ With this, each step requires only one computation, so the next window's hash is obtained directly in O(1): the first hash costs O(n), every subsequent window is rolled in O(1), and with m windows in total the overall complexity is O(m+n).
The [[Rabin-Karp algorithm]] is an implementation of string matching via a rolling hash. It requires a reliable, efficient hash function, namely the [[Rabin fingerprint]].
The Rabin fingerprint is also a polynomial hash map, but it is not over a prime field $\small M$, and its result is not a single value. It uses polynomials over the [[finite field]] $\small GF(2)$, for example $f(x)=x^3+x^2+1$, which can be written in binary as $\small 1101$.
The reason for this polynomial representation is that, compared with ordinary arithmetic on values, polynomial arithmetic over $\small GF(2)$ is simpler: addition and subtraction are both XOR, so carries never need to be considered, and multiplication and division of such polynomials behave much like those of integers. Even so, multiplication and division (remainder) can only be done in $\small O(k)$ time, where $\small k$ is the highest degree of the polynomial.
The Rabin fingerprint's hash function is as follows (as with the prime field, the modulus must be an irreducible polynomial):
$$hash(s_{i\dots i+n})=(s_{i}a(x)^n+s_{i+1}a(x)^{n-1}+\dots +s_{i+n-1}a(x)+s_{i+n})\pmod{M(x)}$$ The recurrence is:
$$hash(s_{i+1\dots i+n+1})=\left(\left(a(x) \cdot hash(s_{i\dots i+n})\right) \bmod M(x) - s_i a^{n}(x)+s_{i+n+1}\right)\pmod{M(x)}$$

## Rabin Fingerprint Implementation
Choose a polynomial $$M(x)$$ with $$k=64$$, i.e., a 64-bit binary number:
```c++
uint64_t poly = 0xbfe6b8a5bf378d83LL;
```
We assume the hash of the string in the window before sliding is known. By the Rabin fingerprint recurrence, the hash after sliding is $$H=(a(x)\times H_{old} - s_i a^n(x)+s_{i+n}) \pmod{M(x)}$$
This expression breaks into three parts.
The multiplication part $$a(x) \times H_{old}$$ is hard to precompute, because the old hash cannot be known in advance. As a whole we need to compute $$(p(x) \cdot a(x)) \pmod{M(x)}$$.
Optimizing the multiplication part:
For the power of $$a(x)$$: choose the polynomial carefully. Since the window slides one byte (8 bits) at a time, we can let $$a(x)=x^8$$, so that multiplying by $$a(x)$$ becomes a left shift by 8 bits in binary.
Optimizing the mod (remainder) operation:
Let $$g(x)$$ be the highest-degree term of $$p(x)$$ (i.e., of the form $$g \cdot x^{\text{some}}$$); then
$$g(x)\le p(x)$$, and the original expression can be rewritten as
$$((p-p \pmod{g/a} + p \pmod{g/a}) \cdot a ) + p$$ which can be transformed into
$$(p \pmod{g/a} \cdot a) \pmod{g/a} - ( (p - p \pmod{g/a}) \cdot a ) \pmod {M}$$
$$( (p-(p - p \pmod{g/a})) \cdot a) \pmod{g/a} + ( ( p - p \pmod{g/a} ) \cdot a) \pmod{M}$$ Since $$( (p-(p - p \pmod{g/a})) \cdot a) \pmod{g/a} $$ is necessarily less than $$g$$, it is also necessarily less than $$a$$.
Let $$j \cdot (\frac{g}{a}) = p - p \pmod {g/a}$$; that is, we get
$$((p-j\cdot (\frac{g}{a})) \cdot a ) + (g \cdot j) \pmod{M}$$ where $$g/a = x^{shiftx-8}$$, $$x^{shiftx}$$ is the highest-degree term of the polynomial p(x), and shiftx is the degree of that term; therefore $$p-p \pmod{g/a}$$ is just p with only its top 8 bits kept. Rewriting each piece in binary and using bit operations, this can be written as
```c++
(p^j) << 8 + g * j % (xshift - 8)
```
Rewriting further and factoring out p, this becomes
```c++
(p<<8) ^ (g * j % (xshift - 8) | j << (xshift + 8))
```
So we precompute `g * j % (xshift - 8) | j << (xshift + 8)` (the second half of the expression above) and store it in a T table in advance. The first half is only a single shift, which is O(1), so the whole expression can be evaluated in O(1).
The multiplication part $$s_i a^n (x)$$ can be precomputed:
first implement polynomial multiplication over $$GF(2)$$, which gives us $$a^n(x)$$, then enumerate $$s_i a^n(x)$$ and cache the results in a table.
The addition part $$s_{i+n}$$ needs only a single XOR for its modular step, constant time, and can be ignored.
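
Putting the three parts together, below is a self-contained sketch of a rolling Rabin fingerprint with the two precomputed tables described above: `T_` folds a fingerprint's top byte back in after the multiply-by- $$x^8$$ shift, and `OUT_` removes the outgoing byte's $$s_i \cdot a^n(x)$$ term. The class and all names are ours, the convention here keeps the $$x^{64}$$ coefficient of $$M(x)$$ implicit so fingerprints fit in a `uint64_t`, and the irreducibility of the example polynomial is assumed rather than verified.

```c++
#include <cstdint>
#include <cstddef>
#include <deque>

class RollingRabin {
public:
    explicit RollingRabin(std::uint64_t poly, std::size_t window)
        : poly_(poly), window_(window) {
        // x^(64+i) mod M for i = 0..7: x^64 ≡ poly (mod M), then repeatedly multiply by x.
        std::uint64_t x64[8];
        std::uint64_t p = poly_;
        for (int i = 0; i < 8; ++i) { x64[i] = p; p = mul_x(p); }
        // T[j]: contribution of a fingerprint's top byte j after multiplying by x^8.
        for (int j = 0; j < 256; ++j) {
            std::uint64_t t = 0;
            for (int i = 0; i < 8; ++i)
                if (j & (1 << i)) t ^= x64[i];
            T_[j] = t;
        }
        // OUT[b] = (b * x^(8*window)) mod M: the outgoing byte's term,
        // obtained by pushing `window` zero bytes behind it.
        for (int b = 0; b < 256; ++b) {
            std::uint64_t f = static_cast<std::uint64_t>(b);
            for (std::size_t k = 0; k < window_; ++k) f = append(f, 0);
            OUT_[b] = f;
        }
    }

    // Feed one byte; once the window is full, the oldest byte's term is removed,
    // so the fingerprint always covers exactly the last `window` bytes.
    std::uint64_t roll(unsigned char in) {
        fp_ = append(fp_, in);
        bytes_.push_back(in);
        if (bytes_.size() > window_) {
            fp_ ^= OUT_[bytes_.front()];
            bytes_.pop_front();
        }
        return fp_;
    }

private:
    // Multiply a reduced polynomial by x, reducing mod M = x^64 + poly.
    std::uint64_t mul_x(std::uint64_t v) const {
        std::uint64_t carry = v >> 63;          // coefficient that would become x^64
        return (v << 1) ^ (carry ? poly_ : 0);
    }
    // (f * x^8 + b) mod M in O(1): shift the low 56 bits, fold the top byte via T.
    std::uint64_t append(std::uint64_t f, unsigned char b) const {
        return ((f & 0x00FFFFFFFFFFFFFFULL) << 8) ^ b ^ T_[f >> 56];
    }

    std::uint64_t poly_;
    std::size_t window_;
    std::uint64_t T_[256] = {};
    std::uint64_t OUT_[256] = {};
    std::uint64_t fp_ = 0;
    std::deque<unsigned char> bytes_;
};

// Usage: RollingRabin rh(0xbfe6b8a5bf378d83ULL, 48); then call rh.roll(byte) per input byte.
```

A chunker would then test `roll()`'s return value against a mask to decide chunk boundaries, exactly as in the content-defined chunking discussion at the top of the article.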
Code
First, implement a function that finds the degree of a polynomial's highest-degree term, which is essentially finding the position of the most significant set bit of a binary number.
```c++
/// <summary>
/// find last set
/// For a uint32, find the position of the most significant set bit.
/// </summary>
/// <param name="value">the number to examine</param>
/// <returns>the position of the most significant set bit</returns>
uint32_t RabinChecksum::find_last_set(uint32_t value)
{
    // The upper 16 bits of this 32-bit integer are non-zero
    if (value & 0xffff0000)
    {
        if (value & 0xff000000)
            // The top 8 bits are non-zero.
            // Shifting value right by 24 bits leaves only those 8 bits (range 0-255);
            // look up the position of their most significant 1, a result in 0-8 (0000-1000).
            // 24 in binary is 0001 1000, so adding it maps the result into the range 24-32,
            // which is where the most significant 1 must be (we know the top 8 bits are non-zero).
            return 24 + byteMSB[value >> 24];
        else
            // The top 8 bits are zero: shift right by 16 to keep the next 8 bits and look them up.
            // 16 in binary is 0001 0000; any result is mapped into the range 16-31.
            return 16 + byteMSB[value >> 16];
    }
    else
    {
        // The upper 16 bits of this 32-bit integer are all zero; only the lower 16 remain.
        if (value & 0x0000ff00)
            // The upper 8 of these lower 16 bits are non-zero.
            return 8 + byteMSB[value >> 8];
        else
            // The upper 8 of the lower 16 bits are zero; only the last 8 bits remain, look them up directly.
            return byteMSB[value];
    }
}
/// <summary>
/// For a uint64, find the position of the most significant set bit.
/// </summary>
/// <param name="value">the number to examine</param>
/// <returns>the position of the most significant set bit</returns>
uint32_t RabinChecksum::find_last_set(uint64_t v)
{
    uint32_t h = v >> 32; // h is the upper 32 bits of v
    if (h)
    {
        // The upper 32 bits of v are non-zero and equal to h,
        // so the answer is the position of h's most significant bit plus 32.
        return 32 + find_last_set(h);
    }
    else
    {
        // The upper 32 bits are zero: truncate and search the lower 32 bits.
        return find_last_set((uint32_t)v);
    }
}
```
Using this degree-finding function, polynomial multiplication can then be implemented.
[Original article (Chinese)](https://blog.csdn.net/cyk0620/article/details/120813255)
