Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search suffix tree implementation #48652

Merged
merged 101 commits into from
Oct 17, 2024
Merged
Show file tree
Hide file tree
Changes from 61 commits
Commits
Show all changes
101 commits
Select commit Hold shift + click to select a range
5425082
suffix tree impl
hannojg Sep 5, 2024
54a7b60
add some helpful comments
hannojg Sep 5, 2024
8622670
example implementation usage of Suffixtree
hannojg Sep 5, 2024
01162fe
fix: resolved one TODO
kirillzyusko Sep 11, 2024
09e8aa7
fix: reduce code duplication
kirillzyusko Sep 12, 2024
fa81e13
refactor: O(2) -> O(1)
kirillzyusko Sep 12, 2024
e33142f
refactor: minus one TODO
kirillzyusko Sep 12, 2024
30424a9
fix: bring back userToInvite
kirillzyusko Sep 13, 2024
1d11ed2
fix: make Marc discoverable again (when we search in second array, th…
kirillzyusko Sep 13, 2024
b502edc
Merge branch 'main' into perf/search-suffix-ukkonen-tree
kirillzyusko Sep 20, 2024
c0e38b8
add debug timings
hannojg Sep 24, 2024
16e4078
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Sep 24, 2024
955cdcb
adjust values for search
hannojg Sep 24, 2024
d7c8d88
correct display name for personal details search
hannojg Sep 24, 2024
d07940d
deduplicate search results
hannojg Sep 24, 2024
84d3231
unify string cleaning
hannojg Sep 24, 2024
eb6b371
fix search function temporarily
hannojg Sep 24, 2024
6002101
make it work with numbers by base26 everything thats bigger than our …
hannojg Sep 25, 2024
131c303
don't add delimiter char at the end
hannojg Sep 25, 2024
0a89089
support for unicode
hannojg Sep 25, 2024
def772b
extended test case to make sure right item is returned
hannojg Sep 25, 2024
0806fb0
use concat instead of for-loop
hannojg Sep 25, 2024
c660312
Revert "use concat instead of for-loop"
hannojg Sep 25, 2024
24a59b7
add note
hannojg Sep 25, 2024
146d51b
wip: make t dynamic instead of preallocating
hannojg Sep 25, 2024
a2cdaf9
wip: refactor r list
hannojg Sep 25, 2024
f6dbecf
wip: refactor l list
hannojg Sep 25, 2024
bda1e34
wip: refactor p list
hannojg Sep 25, 2024
2508d55
wip: refactor s list
hannojg Sep 25, 2024
c32d57f
removed preallocations!
hannojg Sep 25, 2024
f4812c0
better naming
hannojg Sep 25, 2024
de0b467
renamings
hannojg Sep 25, 2024
c17a95b
add userToToInvite to header count back
hannojg Sep 25, 2024
0f94c05
remove debug logs
hannojg Sep 25, 2024
fc54faa
cleanup
hannojg Sep 25, 2024
96c93e7
add unit test for special char handling
hannojg Sep 25, 2024
81dffdd
wip: document suffix tree functions
hannojg Sep 25, 2024
25909ba
fix recursive search function
hannojg Sep 25, 2024
b85cfd4
small cleanup
hannojg Sep 25, 2024
5de310c
clean search input string
hannojg Sep 25, 2024
d51a4c3
fixed flaky test
hannojg Sep 25, 2024
9e9b574
made test more reliable
hannojg Sep 25, 2024
4fa8390
migrate to useOnyx
hannojg Sep 25, 2024
8e583cb
ignore warning, this algorithm is optimized for speed
hannojg Sep 25, 2024
56a1e8a
add timings to new search
hannojg Sep 25, 2024
309556a
refactor: separate suffix tree from search logic
hannojg Sep 25, 2024
f7528f4
add explanations + fix test
hannojg Sep 25, 2024
8b5b77f
code documentation
hannojg Sep 25, 2024
a6b7939
add test verifying identical words work in suffix tree
hannojg Sep 25, 2024
8b41e88
add failing unit tests for duplicated string issue
hannojg Sep 25, 2024
4e273df
add comment
hannojg Sep 25, 2024
57af9b1
fix search not returning results with same search value
hannojg Sep 26, 2024
8b26a1f
perf: precalculate base26 map, optimize math operations, letter looku…
hannojg Sep 26, 2024
af43c93
experiment: use array buffers
hannojg Sep 26, 2024
615a237
add correctChar optimization back
hannojg Sep 26, 2024
2ededdd
wip: optimizing search string construction
hannojg Sep 26, 2024
02f562d
add unit test
hannojg Sep 26, 2024
e894a7f
Optimize performance by using one ArrayBuffer for search values and s…
hannojg Sep 27, 2024
54dcea8
ignore eslint warning; performance optimization
hannojg Sep 27, 2024
8bb5b1e
rename test file
hannojg Sep 27, 2024
b331193
add note
hannojg Sep 27, 2024
18de46d
Update src/pages/ChatFinderPage/index.tsx
hannojg Sep 27, 2024
a91ffc9
wording
hannojg Sep 27, 2024
8277d37
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
d1d028a
Update src/libs/FastSearch.ts
hannojg Sep 30, 2024
e829066
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
3b69cf4
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
8951708
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Sep 30, 2024
4c4f1dc
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Sep 30, 2024
050a320
use Uint8Array for search numeric representation
hannojg Oct 1, 2024
017e9f6
fix utils
hannojg Oct 1, 2024
4d5ad3d
add comment
hannojg Oct 1, 2024
1dc2a5e
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Oct 1, 2024
fdac05f
implementation using none negative values
hannojg Oct 1, 2024
e8fdd5c
adjust array sizes to be correct
hannojg Oct 1, 2024
418e2cd
fix: can't find by first letter
hannojg Oct 1, 2024
49692fd
remove unnecessary initializations
hannojg Oct 1, 2024
1f90a13
fix: can't find by first letter was only a problem in the unit test
hannojg Oct 1, 2024
ee553da
cleanup
hannojg Oct 1, 2024
29a1934
explain in comments
hannojg Oct 1, 2024
1aec94a
+1 not +2
hannojg Oct 1, 2024
55acdc5
order options
hannojg Oct 1, 2024
a1103a3
skip white spaces
hannojg Oct 1, 2024
41207c6
don't return PD, concat with recentReports
hannojg Oct 1, 2024
736650f
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Oct 3, 2024
4d115ed
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 3, 2024
35b2555
Merge branch 'perf/search-suffix-ukkonen-tree' of github.com:margelo/…
hannojg Oct 3, 2024
6f5be58
disable .at() rule
hannojg Oct 3, 2024
34d2dbd
rename parameter
hannojg Oct 3, 2024
6160173
step A: bring changes over to SearchRouter
hannojg Oct 3, 2024
fabe796
reorder
hannojg Oct 3, 2024
3539a57
fix dot use case
hannojg Oct 3, 2024
fdf63af
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 4, 2024
fa68c64
remove eslint disabling
hannojg Oct 4, 2024
162ad83
Merge branch 'main' of github.com:margelo/expensify-app-fork into per…
hannojg Oct 10, 2024
fd76f43
Update src/libs/FastSearch.ts
hannojg Oct 10, 2024
7c44fa7
improve error messages
hannojg Oct 10, 2024
29e5288
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Oct 10, 2024
de7f3b5
fix variable name usage
hannojg Oct 10, 2024
f757fd2
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 17, 2024
59d2562
fix duplicate imports after merge
hannojg Oct 17, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions src/CONST.ts
Original file line number Diff line number Diff line change
Expand Up @@ -1096,6 +1096,9 @@ const CONST = {
SEARCH_OPTION_LIST_DEBOUNCE_TIME: 300,
RESIZE_DEBOUNCE_TIME: 100,
UNREAD_UPDATE_DEBOUNCE_TIME: 300,
SEARCH_CONVERT_SEARCH_VALUES: 'search_convert_search_values',
SEARCH_MAKE_TREE: 'search_make_tree',
roryabraham marked this conversation as resolved.
Show resolved Hide resolved
SEARCH_BUILD_TREE: 'search_build_tree',
SEARCH_FILTER_OPTIONS: 'search_filter_options',
USE_DEBOUNCED_STATE_DELAY: 300,
},
Expand Down
142 changes: 142 additions & 0 deletions src/libs/FastSearch.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
import CONST from '@src/CONST';
import Timing from './actions/Timing';
import SuffixUkkonenTree from './SuffixUkkonenTree';

type SearchableData<T> = {
/**
* The data that should be searchable
*/
data: T[];
/**
* A function that generates a string from a data entry. The string's value is used for searching.
* If you have multiple fields that should be searchable, simply concat them to the string and return it.
*/
toSearchableString: (data: T) => string;
};

// There are certain characters appear very often in our search data (email addresses), which we don't need to search for.
const charSetToSkip = new Set(['@', '#', '$', '%', '&', '*', '+', '-', '/', ':', ';', '<', '=', '>', '?', '_', '~', '!']);

/**
* Creates a new "FastSearch" instance. "FastSearch" uses a suffix tree to search for (sub-)strings in a list of strings.
hannojg marked this conversation as resolved.
Show resolved Hide resolved
* You can provide multiple datasets. The search results will be returned for each dataset.
*
* Note: Creating a FastSearch instance with a lot of data is computationally expensive. You should create an instance once and reuse it.
* Searches will be very fast though, even with a lot of data.
*/
function createFastSearch<T>(dataSets: Array<SearchableData<T>>) {
// Create a numeric list for the suffix tree, and a look up indexes array
Timing.start(CONST.TIMING.SEARCH_CONVERT_SEARCH_VALUES);
const maxNumericListSize = 400_000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we exceed this? Or is that pretty unlikely? What happens if we do?

Copy link
Contributor Author

@hannojg hannojg Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I made this number to be based on the most extreme case i have seen so far (the search string generated for ten thousands of reports and personal details was below 400.000, don't remember the exact number anymore).
In case the search string is longer, we'd loose search results. I think no error would be thrown, but if the user you are searching for is on position 400.010 you wouldn't be able to find that person.

Let me brainstorm quickly with Szymon if we can do better!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We think it would be best to implement a resize mechanism. If we hit the limit, we'd resize the arrays to be bigger. However, that comes with a performance overhead. So we should pick a default size that makes sense for most users (400_00 might actually be a bit too much for the average case, but for some extreme cases it might be too little).
However, as this isn't directly breaking anything, would you be fine with us tweaking this in a follow up PR? I think the Pr is already quite huge and hard to understand

Copy link
Contributor Author

@hannojg hannojg Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Another solution for performance improvements on this could also be to store the array's arraybuffer in mmkv and rehydrate from there [mmkv has first class support for ArrayBuffers], but
again, it would be better to experiment with that in a follow up PR - i can open a follow up issue ticket, i think that would be best?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to keep it as 400K for now, but have a follow-up issue to make it resizable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks! I think this is something we can wait and see about - my initial thinking is that we still don't have any pagination for either personal details or reports. And we have also spent a lot of time tweaking things for some very high capacity usage.

Though, it was hard for me to tell how many personal details or reports we are talking about here when we'd be hitting that limit. If it's over 10k reports then nobody actually has that AFAIK.

// The user might provide multiple data sets, but internally, the search values will be stored in this one list:
let concatenatedNumericList = new Int8Array(maxNumericListSize);
// Here we store the index of the data item in the original data list, so we can map the found occurrences back to the original data:
const occurrenceToIndex = new Int32Array(maxNumericListSize * 4);
// As we are working with ArrayBuffers, we need to keep track of the current offset:
const offset = {value: 0};
// We store the last offset for a dataSet, so we can map the found occurrences to the correct dataSet:
const listOffsets: number[] = [];

for (const {data, toSearchableString} of dataSets) {
// Performance critical: the array parameters are out parameters, so we don't want to create new arrays every time:
dataToNumericRepresentation(concatenatedNumericList, occurrenceToIndex, offset, {data, toSearchableString});
listOffsets.push(offset.value);
}
concatenatedNumericList[offset.value++] = SuffixUkkonenTree.END_CHAR_CODE;
listOffsets[listOffsets.length - 1] = offset.value;
Timing.end(CONST.TIMING.SEARCH_CONVERT_SEARCH_VALUES);

// The list might be larger than necessary, so we clamp it to the actual size:
concatenatedNumericList = concatenatedNumericList.slice(0, offset.value);

// Create & build the suffix tree:
Timing.start(CONST.TIMING.SEARCH_MAKE_TREE);
const tree = SuffixUkkonenTree.makeTree(concatenatedNumericList);
Timing.end(CONST.TIMING.SEARCH_MAKE_TREE);

Timing.start(CONST.TIMING.SEARCH_BUILD_TREE);
tree.build();
Timing.end(CONST.TIMING.SEARCH_BUILD_TREE);

/**
* Searches for the given input and returns results for each dataset.
*/
function search(searchInput: string): T[][] {
const cleanedSearchString = cleanString(searchInput);
const {numeric} = SuffixUkkonenTree.stringToNumeric(cleanedSearchString, {
charSetToSkip,
// stringToNumeric might return a list that is larger than necessary, so we clamp it to the actual size
// (otherwise the search could fail as we include in our search empty array values):
clamp: true,
});
const result = tree.findSubstring(Array.from(numeric));

const resultsByDataSet = Array.from({length: dataSets.length}, () => new Set<T>());
// eslint-disable-next-line @typescript-eslint/prefer-for-of
for (let i = 0; i < result.length; i++) {
const occurrenceIndex = result[i];
const itemIndexInDataSet = occurrenceToIndex[occurrenceIndex];
const dataSetIndex = listOffsets.findIndex((listOffset) => occurrenceIndex < listOffset);

if (dataSetIndex === -1) {
throw new Error('Programmatic error, this should never ever happen');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it should never ever happen then why is the code necessary ? :trollface:

Maybe we say something less generic? Like, Oh no. Ask Hanno how this code works.

Just kidding, but seriously maybe "Search failed due to range error" or something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with more specific errors. Considering right now we have 2 with the same text, it would be hard to get to know where it failed.

Copy link
Contributor Author

@hannojg hannojg Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let me update this. Actually this might be removable completely. Its a remnant of when the dataSetIndex could be undefined, which we knew could never happen and we only added that statement to satisfy typescript.

However, i might keep the checks in place because .... you never know, right

}
const item = dataSets[dataSetIndex].data[itemIndexInDataSet];
if (!item) {
throw new Error('Programmatic error, this should never ever happen');
}
resultsByDataSet[dataSetIndex].add(item);
}

return resultsByDataSet.map((set) => Array.from(set));
}

return {
search,
};
}

/**
* The suffix tree can only store string like values, and internally stores those as numbers.
* This function converts the user data (which are most likely objects) to a numeric representation.
* Additionally a list of the original data and their index position in the numeric list is created, which is used to map the found occurrences back to the original data.
*/
function dataToNumericRepresentation<T>(concatenatedNumericList: Int8Array, occurrenceToIndex: Int32Array, offset: {value: number}, {data, toSearchableString}: SearchableData<T>): void {
// const searchIndexList: Array<T | undefined> = [];

data.forEach((option, index) => {
const searchStringForTree = toSearchableString(option);
const cleanedSearchStringForTree = cleanString(searchStringForTree);

if (cleanedSearchStringForTree.length === 0) {
return;
}

SuffixUkkonenTree.stringToNumeric(cleanedSearchStringForTree, {
charSetToSkip,
out: {
outArray: concatenatedNumericList,
offset,
outOccurrenceToIndex: occurrenceToIndex,
index,
},
});
// eslint-disable-next-line no-param-reassign
occurrenceToIndex[offset.value] = index;
// eslint-disable-next-line no-param-reassign
concatenatedNumericList[offset.value++] = SuffixUkkonenTree.DELIMITER_CHAR_CODE;
});
}

/**
* Everything in the tree is treated as lowercase.
*/
function cleanString(input: string) {
return input.toLowerCase();
}

const FastSearch = {
createFastSearch,
};

export default FastSearch;
37 changes: 27 additions & 10 deletions src/libs/OptionsListUtils.ts
Original file line number Diff line number Diff line change
Expand Up @@ -2380,6 +2380,31 @@ function getPersonalDetailSearchTerms(item: Partial<ReportUtils.OptionData>) {
function getCurrentUserSearchTerms(item: ReportUtils.OptionData) {
return [item.text ?? '', item.login ?? '', item.login?.replace(CONST.EMAIL_SEARCH_REGEX, '') ?? ''];
}

type PickUserToInviteParams = {
canInviteUser: boolean;
recentReports: ReportUtils.OptionData[];
personalDetails: ReportUtils.OptionData[];
searchValue: string;
config?: FilterOptionsConfig;
optionsToExclude: Option[];
};

const pickUserToInvite = ({canInviteUser, recentReports, personalDetails, searchValue, config, optionsToExclude}: PickUserToInviteParams) => {
let userToInvite = null;
if (canInviteUser) {
if (recentReports.length === 0 && personalDetails.length === 0) {
userToInvite = getUserToInviteOption({
searchValue,
selectedOptions: config?.selectedOptions,
optionsToExclude,
});
}
}

return userToInvite;
};

/**
* Filters options based on the search input value
*/
Expand Down Expand Up @@ -2455,16 +2480,7 @@ function filterOptions(options: Options, searchInputValue: string, config?: Filt
recentReports = orderOptions(recentReports, searchValue);
}

let userToInvite = null;
if (canInviteUser) {
if (recentReports.length === 0 && personalDetails.length === 0) {
userToInvite = getUserToInviteOption({
searchValue,
selectedOptions: config?.selectedOptions,
optionsToExclude,
});
}
}
const userToInvite = pickUserToInvite({canInviteUser, recentReports, personalDetails, searchValue, config, optionsToExclude});

if (maxRecentReportsToShow > 0 && recentReports.length > maxRecentReportsToShow) {
recentReports.splice(maxRecentReportsToShow);
Expand Down Expand Up @@ -2547,6 +2563,7 @@ export {
getEmptyOptions,
shouldUseBoldText,
getAlternateText,
pickUserToInvite,
};

export type {MemberForList, CategorySection, CategoryTreeSection, Options, OptionList, SearchOption, PayeePersonalDetails, Category, Tax, TaxRatesOption, Option, OptionTree};
183 changes: 183 additions & 0 deletions src/libs/SuffixUkkonenTree/index.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,183 @@
/* eslint-disable no-continue */
import {ALPHABET_SIZE, DELIMITER_CHAR_CODE, END_CHAR_CODE, SPECIAL_CHAR_CODE, stringToNumeric} from './utils';

/**
* This implements a suffix tree using Ukkonen's algorithm.
* A good visualization to learn about the algorithm can be found here: https://brenden.github.io/ukkonen-animation/
hannojg marked this conversation as resolved.
Show resolved Hide resolved
* Note: This implementation is optimized for performance, not necessarily for readability.
*
* You probably don't want to use this directly, but rather use @libs/FastSearch.ts as a easy to use wrapper around this.
*/

/**
* Creates a new tree instance that can be used to build a suffix tree and search in it.
* The input is a numeric representation of the search string, which can be create using {@link stringToNumeric}.
hannojg marked this conversation as resolved.
Show resolved Hide resolved
* Separate search values must be separated by the {@link DELIMITER_CHAR_CODE}. The search string must end with the {@link END_CHAR_CODE}.
*
* The tree will be built using the Ukkonen's algorithm: https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
*/
function makeTree(numericSearchValues: Int8Array) {
const maxNodes = 2 * numericSearchValues.length;
hannojg marked this conversation as resolved.
Show resolved Hide resolved
// Allocate an ArrayBuffer to store all transitions (flat buffer), 4 bytes per transition (Uint32)
hannojg marked this conversation as resolved.
Show resolved Hide resolved
const transitionNodes = new Int32Array(maxNodes * ALPHABET_SIZE * 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to call this variable internalNodes? Is this variable representing all non-leaf nodes in the tree?

transitionNodes.fill(-1); // Initialize all transitions to -1 (no transition)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we need to use -1 as the default for "no transition" instead of 0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use 0 but then we would need to move all the nodes as root is marked as 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't really get what you're saying. I'm just assuming that since this is the largest array of nodes, avoiding having to fill them all with -1 would be better. Also it would mean that this can be UInt32Array instead of Int32Array (though I don't know if that matters at all for performance)


const leftEdges = new Int32Array(maxNodes * 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why maxNodes * 4?

const rightEdges = new Int32Array(maxNodes * 4);
const defaultREdgeValue = numericSearchValues.length - 1;

const parent = new Int32Array(maxNodes * 4);
const suffixLink = new Int32Array(maxNodes * 4);

let currentNode = 0;
let currentPosition = 0;
let nodeCounter = 2;
let currentIndex = 0;

function initializeTree() {
rightEdges.fill(numericSearchValues.length - 1);
suffixLink[0] = 1;
leftEdges[0] = -1;
rightEdges[0] = -1;
leftEdges[1] = -1;
rightEdges[1] = -1;
for (let i = 0; i < ALPHABET_SIZE; ++i) {
transitionNodes[ALPHABET_SIZE + i] = 0;
}
}

function processCharacter(char: number) {
// eslint-disable-next-line no-constant-condition
while (true) {
if (rightEdges[currentNode] < currentPosition) {
if (transitionNodes[currentNode * ALPHABET_SIZE + char] === -1) {
createNewLeaf(char);
continue;
}
currentNode = transitionNodes[currentNode * ALPHABET_SIZE + char];
currentPosition = leftEdges[currentNode];
}
if (currentPosition === -1 || char === numericSearchValues[currentPosition]) {
currentPosition++;
} else {
splitEdge(char);
continue;
}
break;
}
}

function createNewLeaf(c: number) {
transitionNodes[currentNode * ALPHABET_SIZE + c] = nodeCounter;
leftEdges[nodeCounter] = currentIndex;
parent[nodeCounter++] = currentNode;
currentNode = suffixLink[currentNode];

currentPosition = rightEdges[currentNode] + 1;
}

function splitEdge(c: number) {
leftEdges[nodeCounter] = leftEdges[currentNode];
rightEdges[nodeCounter] = currentPosition - 1;
parent[nodeCounter] = parent[currentNode];

transitionNodes[nodeCounter * ALPHABET_SIZE + numericSearchValues[currentPosition]] = currentNode;
transitionNodes[nodeCounter * ALPHABET_SIZE + c] = nodeCounter + 1;
leftEdges[nodeCounter + 1] = currentIndex;
parent[nodeCounter + 1] = nodeCounter;
leftEdges[currentNode] = currentPosition;
parent[currentNode] = nodeCounter;

transitionNodes[parent[nodeCounter] * ALPHABET_SIZE + numericSearchValues[leftEdges[nodeCounter]]] = nodeCounter;
nodeCounter += 2;
handleDescent(nodeCounter);
}

function handleDescent(ts: number) {
hannojg marked this conversation as resolved.
Show resolved Hide resolved
currentNode = suffixLink[parent[ts - 2]];
currentPosition = leftEdges[ts - 2];
while (currentPosition <= (rightEdges[ts - 2] ?? defaultREdgeValue)) {
currentNode = transitionNodes[currentNode * ALPHABET_SIZE + numericSearchValues[currentPosition]];
currentPosition += rightEdges[currentNode] - leftEdges[currentNode] + 1;
}
if (currentPosition === (rightEdges[ts - 2] ?? defaultREdgeValue) + 1) {
suffixLink[ts - 2] = currentNode;
} else {
suffixLink[ts - 2] = ts;
}
currentPosition = rightEdges[currentNode] - (currentPosition - (rightEdges[ts - 2] ?? defaultREdgeValue)) + 2;
}

function build() {
initializeTree();
for (currentIndex = 0; currentIndex < numericSearchValues.length; ++currentIndex) {
const c = numericSearchValues[currentIndex];
processCharacter(c);
}
}

/**
* Returns all occurrences of the given (sub)string in the input string.
*
* You can think of the tree that we create as a big string that looks like this:
*
* "banana$pancake$apple|"
* The example delimiter character '$' is used to separate the different strings.
* The end character '|' is used to indicate the end of our search string.
*
* This function will return the index(es) of found occurrences within this big string.
* So, when searching for "an", it would return [1, 3, 8].
*/
function findSubstring(searchValue: number[]) {
const occurrences: number[] = [];

function dfs(node: number, depth: number) {
const leftRange = leftEdges[node];
const rightRange = rightEdges[node] ?? defaultREdgeValue;
const rangeLen = node === 0 ? 0 : rightRange - leftRange + 1;

for (let i = 0; i < rangeLen && depth + i < searchValue.length && leftRange + i < numericSearchValues.length; i++) {
if (searchValue[depth + i] !== numericSearchValues[leftRange + i]) {
return;
}
}

let isLeaf = true;
for (let i = 0; i < ALPHABET_SIZE; ++i) {
const tNode = transitionNodes[node * ALPHABET_SIZE + i];

// Search speed optimization: don't go through the edge if it's different than the next char:
const correctChar = depth + rangeLen >= searchValue.length || i === searchValue[depth + rangeLen];

if (tNode && tNode !== -1 && correctChar) {
isLeaf = false;
dfs(tNode, depth + rangeLen);
}
}

if (isLeaf && depth + rangeLen >= searchValue.length) {
occurrences.push(numericSearchValues.length - (depth + rangeLen));
}
}

dfs(0, 0);
hannojg marked this conversation as resolved.
Show resolved Hide resolved
return occurrences;
}

return {
build,
findSubstring,
};
}

const SuffixUkkonenTree = {
makeTree,

// Re-exported from utils:
DELIMITER_CHAR_CODE,
SPECIAL_CHAR_CODE,
END_CHAR_CODE,
stringToNumeric,
};

export default SuffixUkkonenTree;
Loading
Loading