Optimize word splitter with state machine, replacing regex #58
base: wbrown.fix-streaming-add-md
Conversation
Looks like a very ambitious and cool feature! A few requests to make sure this code is clean and idiomatic.
tests: Cache load encoders when not benchmarking
Great job! Some comments that are mostly nitpicks and good just to document, because it's already doing well when profiled. Once these are resolved I'll approve this.
The only thing I'll mention is that traverseRegexTree is doing an enormous amount of heavy lifting, and would benefit significantly from modularizing into small functions. I understand if you're concerned about function call overhead being a potential bottleneck in this case (although if they're small enough, your compiler should still inline them), so at the very least I would very heavily comment each group-able code chunk.
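As an aside on the inlining point: the Go compiler reports its inlining decisions, so whether extracted helpers stay inlined is easy to verify. A minimal, self-contained sketch (matchRune is a hypothetical helper, not the PR's code):

```go
// Build with inlining diagnostics to see the compiler's decisions:
//
//	go build -gcflags=-m .
//
// gc will print "can inline matchRune", confirming that small leaf
// helpers extracted from a large traversal cost no call overhead.
package main

import "fmt"

// matchRune is the kind of tiny helper a big traversal loop could be
// decomposed into; it is well under the inliner's budget.
func matchRune(r, lo, hi rune) bool {
	return r >= lo && r <= hi
}

func main() {
	fmt.Println(matchRune('k', 'a', 'z')) // true
}
```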
word := wordsBuffer[idx-1]
return &word
What's the purpose of this change?
func isNewLine(r rune) bool {
	// While \n is often considered a whitespace, we treat it as a symbol
	// to ensure it is always a separate token.
	return r == '\n'
\n is technically considered whitespace, as you say. It might be a good idea to change the name of isWhitespace to account for the weird implication that isWhitespace('\n') is falsy.
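One way to resolve the naming, as a sketch (isNonNewlineWhitespace is a hypothetical name, not from the PR): make the '\n' carve-out explicit in the predicate's name.

```go
package main

import (
	"fmt"
	"unicode"
)

// isNewLine treats '\n' as a symbol so it always becomes its own token.
func isNewLine(r rune) bool {
	return r == '\n'
}

// isNonNewlineWhitespace names the '\n' exclusion explicitly, avoiding
// the surprise of a whitespace predicate that is falsy for '\n'.
func isNonNewlineWhitespace(r rune) bool {
	return unicode.IsSpace(r) && r != '\n'
}

func main() {
	fmt.Println(isNewLine('\n'), isNonNewlineWhitespace(' ')) // true true
	fmt.Println(isNonNewlineWhitespace('\n'))                 // false
}
```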
// Process replacements and normalization
for replaced, replacement := range encoder.replacements {
	line = strings.ReplaceAll(line, replaced, replacement)
// AppendBatch appends a batch of words to the wordBatch and flushes
Suggested change:
-// AppendBatch appends a batch of words to the wordBatch and flushes
+// appendBatch appends a batch of words to the wordBatch and flushes
	line = strings.ReplaceAll(line, replaced, replacement)
// AppendBatch appends a batch of words to the wordBatch and flushes
// the batch if it is full.
// ForceFlush forces the batch to be flushed.
Suggested change:
-// ForceFlush forces the batch to be flushed.
+// forceFlush forces the batch to be flushed.

Although technically this should be documented on the definition of forceFlush itself rather than here.
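On that note, Go's convention is that the doc comment lives on the definition and begins with the identifier, so the suggested rename would read like this at forceFlush itself (the receiver type and body here are hypothetical stand-ins):

```go
package tokenizer

// wordBatch is a stand-in type for this sketch.
type wordBatch struct {
	words []string
}

// forceFlush forces the batch to be flushed.
// (Doc comment on the definition, starting with the function name,
// per Go documentation convention.)
func (wb *wordBatch) forceFlush() {
	wb.words = wb.words[:0] // placeholder body
}
```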
if len(v) == 0 {
	continue
}
if runes[i] == []rune(k)[0] {
Interesting idea to only compare strings if the first runes for each match. Do you know if this is faster than some direct string comparison function from the standard library or Go's builtins?
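Not something the PR has to answer, but a quick way to check empirically is a micro-benchmark; this is a hypothetical sketch (save as, say, splitter_bench_test.go and run `go test -bench=.`), with illustrative keys and input:

```go
package tokenizer

import (
	"strings"
	"testing"
)

var (
	keys  = []string{"'s", "'t", "'re", "'ve", "'m", "'ll", "'d"}
	input = "'ll go"
	sink  bool
)

// BenchmarkFirstRuneGate mirrors the snippet above: compare first runes
// and only fall through to the full comparison on a match.
func BenchmarkFirstRuneGate(b *testing.B) {
	first := []rune(input)[0]
	for i := 0; i < b.N; i++ {
		for _, k := range keys {
			if first == []rune(k)[0] {
				sink = strings.HasPrefix(input, k)
			}
		}
	}
}

// BenchmarkHasPrefix relies on the standard library's prefix check alone.
func BenchmarkHasPrefix(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, k := range keys {
			sink = strings.HasPrefix(input, k)
		}
	}
}
```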
	return root
}

func (runeTree *RegexNode) createTree(AST *syntax.Regexp, ASTPath []string) {
I really do not recommend calling your receiver runeTree. The name only adds confusion with the receiver for RuneNode also being called runeTree, and usually struct receivers are abbreviated anyway, so it would be rt or something similar. These are different structs and as such really shouldn't take the same receiver name.
runeArray: sub.Rune,
parent:    runeTree,
children:  make([]*RegexNode, 0),
terminal:  sub.Op == syntax.OpCharClass,
The assumption here is that termination always comes from a syntax.OpCharClass op. Why is this?
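For context on where that op comes from: when regexp/syntax parses a pattern, character classes like [a-z] become syntax.OpCharClass nodes, which can be inspected directly. A small sketch:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

func main() {
	// Parse a tiny pattern; the + wraps a character class.
	ast, err := syntax.Parse(`[a-z]+`, syntax.Perl)
	if err != nil {
		panic(err)
	}
	// The root is the repetition; its child is the char class.
	fmt.Println(ast.Op == syntax.OpPlus)             // true
	fmt.Println(ast.Sub[0].Op == syntax.OpCharClass) // true
	fmt.Println(ast.Sub[0].Rune)                     // [97 122] (lo, hi pairs)
}
```

Whether every terminal in the tokenizer patterns really ends in a char class is a property of those specific patterns, not of regexp/syntax in general.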
func (runeTree *RegexNode) PrintTree() {
	// Print the tree
	sb := strings.Builder{}
	runeTree.string(0, &sb)
	fmt.Printf("%s\n", sb.String())
}
Suggested change:
-func (runeTree *RegexNode) PrintTree() {
-	// Print the tree
-	sb := strings.Builder{}
-	runeTree.string(0, &sb)
-	fmt.Printf("%s\n", sb.String())
-}
+func (runeTree *RegexNode) String() string {
+	// Print the tree
+	sb := strings.Builder{}
+	runeTree.string(0, &sb)
+	return sb.String()
+}
currentPath = append(currentPath, parentIndex)

// If not already in the map, add the current path
pathCopy := make([]int, len(currentPath))
copy(pathCopy, currentPath)
Another slight nitpick here; take it with a grain of salt, as profiling looks good.
Instead of copying, you could've also created pathCopy with something like:

pathCopy := append([]int{}, currentPath...)
pathCopy = append(pathCopy, parentIndex)

which might be slightly more idiomatic, but I suspect the performance difference would be negligible. Just documenting this.
}
level += 1
thisNodeMap := matchVars.pathMap[matchVars.currentNodeIdx]
lastNodeMap := make([]int, 0)
Tiny nitpick again, but wouldn't this technically be a wasted allocation when currentNodeIdx = 0?
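For what it's worth, Go's nil slices make the lazy version trivial: append works on a nil slice, so declaring without make defers the allocation until first use. A standalone sketch (not the PR's code):

```go
package main

import "fmt"

func main() {
	var lastNodeMap []int // nil: len 0, cap 0, no backing array yet
	fmt.Println(lastNodeMap == nil, len(lastNodeMap)) // true 0

	// The first append performs the only allocation; if this branch
	// never runs (e.g. currentNodeIdx == 0), nothing is allocated.
	lastNodeMap = append(lastNodeMap, 42)
	fmt.Println(lastNodeMap) // [42]
}
```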
Summary
Profiling the dataset_tokenizer, we find that it is mainly bottlenecked on the lines -> words portion handled by the Go regexp package in the wordSplitter function. In this PR, we propose that this process can be optimized by replacing the regex with a simple state machine that handles the splitting of lines into words.

Changes
This state machine replaces the regex line in gpt_bpe.go's makeWordSplitter function. The state machine operates mostly in rune-space instead of string-space, which helps with computation times. It is created by decomposing the provided or default regex pattern with the regexp/syntax Go package, and implements a subset of regex features that should support all tokenizer word splitters.

A modified runetree, dubbed a Contraction Tree, was created in runetree.go to represent the tree of choices for contractions, as part of the word splitter change.
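To sketch the core idea (a minimal illustration, not the PR's implementation, and it omits the contraction handling described above): classify each rune, extend the current word while the class is unchanged, and let a single leading space attach to the following word, mirroring the ` ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+` alternatives of the GPT-2 pattern.

```go
package main

import (
	"fmt"
	"unicode"
)

type runeClass int

const (
	classLetter runeClass = iota
	classNumber
	classSpace
	classOther
)

func classify(r rune) runeClass {
	switch {
	case unicode.IsLetter(r):
		return classLetter
	case unicode.IsNumber(r):
		return classNumber
	case unicode.IsSpace(r):
		return classSpace
	default:
		return classOther
	}
}

// splitWords walks the line once in rune-space, emitting a word each
// time the rune class changes; a single ' ' may prefix the next word,
// mirroring the " ?\p{L}+"-style alternatives of the GPT-2 pattern.
func splitWords(line string) []string {
	runes := []rune(line)
	var words []string
	for i := 0; i < len(runes); {
		j := i
		// Attach one leading space to a following non-space run.
		if runes[j] == ' ' && j+1 < len(runes) && classify(runes[j+1]) != classSpace {
			j++
		}
		c := classify(runes[j])
		for j++; j < len(runes) && classify(runes[j]) == c; j++ {
		}
		words = append(words, string(runes[i:j]))
		i = j
	}
	return words
}

func main() {
	fmt.Printf("%q\n", splitWords("Hello world, it's 2024!"))
	// ["Hello" " world" "," " it" "'" "s" " 2024" "!"]
	// (the real splitter's contraction tree would keep "'s" together)
}
```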
Results
(Rough performance)

Before the changes, when tested on a Linux VDI, the benchmark yielded roughly 1.7 million words/second and the dataset test yielded around 4 million tokens/second.

After the changes, the benchmark yields roughly 2.25-2.5 million words/second using gpt2 as the baseline, and 8.5-9.5 million tokens/second on the dataset test.

The dataset test was run with 2 reader threads, 16 tokenizer threads, and streaming encode on the Gutenberg dataset, by building dataset_tokenizer and running it on the command line.

Wes tested this setup with 64 tokenizer threads on a CPU node and reached as high as 67 million tokens/second, over 3x the previous maximum; we speculate that this starts to bump up against OS file-operation rate limits.