
Optimize word splitter with state machine, replacing regex #58

Open · wants to merge 16 commits into base: wbrown.fix-streaming-add-md

Conversation


@Rexwang8 Rexwang8 commented Nov 25, 2024

Summary

Profiling dataset_tokenizer shows that it is mainly bottlenecked on the lines -> words step handled by Go's regexp package in the wordSplitter function. In this PR, we propose optimizing this step by replacing the regex call with a simple state machine that splits lines into words.

Changes

The state machine replaces the regex call in gpt_bpe.go's makeWordSplitter function. It operates mostly in rune-space rather than string-space, which reduces computation time. The machine is built by decomposing the provided (or default) regex pattern with Go's regexp/syntax package, and it implements the subset of regex features needed to support all tokenizer word splitters.
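
For illustration, here is a minimal sketch (not the PR's actual code) of the decomposition step: it parses a splitter-style pattern fragment with regexp/syntax and walks the resulting AST, which is the kind of structure the state machine is built from. The walk function and the pattern fragment are assumptions for demonstration only.

```go
package main

import (
	"fmt"
	"regexp/syntax"
	"strings"
)

// walk prints each AST node's operation and runes; a builder would map these
// onto state-machine states instead of printing them.
func walk(node *syntax.Regexp, depth int) {
	fmt.Printf("%s%v runes=%q\n", strings.Repeat("  ", depth), node.Op, string(node.Rune))
	for _, sub := range node.Sub {
		walk(sub, depth+1)
	}
}

func main() {
	// A fragment of a GPT-2-style splitter pattern, used here only as an example.
	ast, err := syntax.Parse(`'s|'t|'re|'ve| ?\pL+| ?\pN+`, syntax.Perl)
	if err != nil {
		panic(err)
	}
	walk(ast.Simplify(), 0)
}
```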

As part of the word-splitter change, a modified rune tree, dubbed a Contraction Tree, was created in runetree.go to represent the tree of choices for contractions.
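
As a rough illustration of the idea, here is a self-contained sketch of a rune-based prefix tree for matching contraction suffixes; the names, fields, and API below are assumptions and do not reflect the actual runetree.go implementation.

```go
package main

import "fmt"

// contractionNode is one node in a rune-keyed prefix tree of contractions.
type contractionNode struct {
	children map[rune]*contractionNode
	terminal bool // a complete contraction ends at this node
}

// newContractionTree builds the tree from a list of contraction suffixes.
func newContractionTree(contractions []string) *contractionNode {
	root := &contractionNode{children: map[rune]*contractionNode{}}
	for _, c := range contractions {
		n := root
		for _, r := range c {
			if n.children[r] == nil {
				n.children[r] = &contractionNode{children: map[rune]*contractionNode{}}
			}
			n = n.children[r]
		}
		n.terminal = true
	}
	return root
}

// matchLen returns the length in runes of the longest contraction that
// prefixes the input, or 0 if none matches.
func (root *contractionNode) matchLen(runes []rune) int {
	n, best := root, 0
	for i, r := range runes {
		if n = n.children[r]; n == nil {
			break
		}
		if n.terminal {
			best = i + 1
		}
	}
	return best
}

func main() {
	tree := newContractionTree([]string{"'s", "'t", "'re", "'ve", "'m", "'ll", "'d"})
	fmt.Println(tree.matchLen([]rune("'re going"))) // prints 3
}
```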

Results

(Rough performance)

Before the changes, when tested on a Linux VDI, the benchmark yielded roughly 1.7 million words/second and the dataset test yielded around 4 million tokens/second.

After changes, the benchmark yields roughly 2.25-2.5 million words/second using gpt2 as the baseline, and 8.5-9.5 million tokens/second on the dataset test.

The dataset test was run with 2 reader threads, 16 tokenizer threads, and streaming encode on the Gutenberg dataset, by building dataset_tokenizer and running it from the command line.

Wes tested this setup with 64 tokenizer threads on a CPU node and reached as high as 67 million tokens/second, over 3x the previous maximum; we speculate this starts to bump up against OS file-operation rate limits.

@Rexwang8 Rexwang8 self-assigned this Nov 25, 2024
@Rexwang8 Rexwang8 requested a review from wbrown November 25, 2024 20:41
@Rexwang8 Rexwang8 marked this pull request as ready for review November 25, 2024 20:41

@sangstar sangstar left a comment


Looks like a very ambitious and cool feature! A few requests to make sure this code is clean and idiomatic.

(Resolved review threads on gpt_bpe.go and runetree.go.)
@Rexwang8 Rexwang8 requested a review from sangstar December 27, 2024 16:01

@sangstar sangstar left a comment


Great job! Some comments that are mostly nitpicks, good just to document since it's already performing well when profiled. Once these are resolved I'll approve this.

The only thing I'll mention is that traverseRegexTree is doing an enormous amount of heavy lifting and would benefit significantly from being modularized into small functions. I understand if you're concerned about function-call overhead being a potential bottleneck here (although if they're small enough, the compiler should still inline them), so at the very least I would comment it heavily, one comment per groupable code chunk.

Comment on lines +1295 to +1296
word := wordsBuffer[idx-1]
return &word

@sangstar sangstar Jan 8, 2025


What's the purpose of this change?

Comment on lines +1377 to +1380
func isNewLine(r rune) bool {
// While \n is often considered a whitespace, we treat it as a symbol
// to ensure it is always a separate token.
return r == '\n'

\n is technically considered whitespace, as you say. It might be a good idea to change the name of isWhitespace to account for the odd implication that isWhitespace('\n') returns false.
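
For example, something along these lines; the name and the use of unicode.IsSpace are purely illustrative, not the current implementation:

```go
package main

import (
	"fmt"
	"unicode"
)

// isWhitespaceExceptNewline makes the newline carve-out explicit in the name
// instead of hiding it inside isWhitespace.
func isWhitespaceExceptNewline(r rune) bool {
	// '\n' is excluded so that newlines always become their own token.
	return r != '\n' && unicode.IsSpace(r)
}

func main() {
	fmt.Println(isWhitespaceExceptNewline(' '), isWhitespaceExceptNewline('\n')) // true false
}
```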

// Process replacements and normalization
for replaced, replacement := range encoder.replacements {
line = strings.ReplaceAll(line, replaced, replacement)
// AppendBatch appends a batch of words to the wordBatch and flushes

Suggested change
// AppendBatch appends a batch of words to the wordBatch and flushes
// appendBatch appends a batch of words to the wordBatch and flushes

line = strings.ReplaceAll(line, replaced, replacement)
// AppendBatch appends a batch of words to the wordBatch and flushes
// the batch if it is full.
// ForceFlush forces the batch to be flushed.

Suggested change
// ForceFlush forces the batch to be flushed.
// forceFlush forces the batch to be flushed.

Although technically this should be described for the definition of forceFlush itself rather than here.

if len(v) == 0 {
continue
}
if runes[i] == []rune(k)[0] {

Interesting idea to only compare strings if the first runes for each match. Do you know if this is faster than some direct string comparison function from the standard library or Go's builtins?
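
One way to check would be a quick benchmark sketch like the following, run with go test -bench=. (the sample data, names, and package placement are illustrative assumptions, not code from the PR):

```go
package gpt_bpe

import (
	"strings"
	"testing"
)

var contractions = []string{"'s", "'t", "'re", "'ve", "'m", "'ll", "'d"}
var sample = []rune("'re going")

// Only falls through to the full comparison when the first runes match.
func BenchmarkFirstRuneGuard(b *testing.B) {
	s := string(sample)
	for i := 0; i < b.N; i++ {
		for _, k := range contractions {
			if sample[0] == []rune(k)[0] {
				_ = strings.HasPrefix(s, k)
			}
		}
	}
}

// Always calls the standard-library comparison, with no guard.
func BenchmarkHasPrefixOnly(b *testing.B) {
	s := string(sample)
	for i := 0; i < b.N; i++ {
		for _, k := range contractions {
			_ = strings.HasPrefix(s, k)
		}
	}
}
```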

return root
}

func (runeTree *RegexNode) createTree(AST *syntax.Regexp, ASTPath []string) {

I really do not recommend calling your receiver runeTree. It only adds confusion, since the receiver for RuneNode is also called runeTree, and struct receivers are usually abbreviated anyway, so it would be rt or something. These are different structs and really shouldn't share the same receiver name.

runeArray: sub.Rune,
parent: runeTree,
children: make([]*RegexNode, 0),
terminal: sub.Op == syntax.OpCharClass,

The assumption here is that termination always is from a syntax.OpCharClass op. Why is this?

Comment on lines +374 to +379
func (runeTree *RegexNode) PrintTree() {
// Print the tree
sb := strings.Builder{}
runeTree.string(0, &sb)
fmt.Printf("%s\n", sb.String())
}

Suggested change
func (runeTree *RegexNode) PrintTree() {
// Print the tree
sb := strings.Builder{}
runeTree.string(0, &sb)
fmt.Printf("%s\n", sb.String())
}
func (runeTree *RegexNode) String() string {
// Print the tree
sb := strings.Builder{}
runeTree.string(0, &sb)
return sb.String()
}

Comment on lines +362 to +366
currentPath = append(currentPath, parentIndex)

// If not already in the map, add the current path
pathCopy := make([]int, len(currentPath))
copy(pathCopy, currentPath)

Another slight nitpick here. Take with a grain of salt as profiling looks good.

Instead of copying, you could've also created pathCopy with something like:

		pathCopy := append([]int{}, currentPath...)
		pathCopy = append(pathCopy, parentIndex)

Which might be slightly more idiomatic, but I suspect there would be a negligible performance difference. Just documenting this.

}
level += 1
thisNodeMap := matchVars.pathMap[matchVars.currentNodeIdx]
lastNodeMap := make([]int, 0)

Tiny nitpick again, but wouldn't this technically be a wasted allocation for currentNodeIdx = 0?

@harubaru harubaru self-requested a review January 24, 2025 01:08