Search suffix tree implementation #48652

Merged
merged 101 commits into from
Oct 17, 2024
Changes from 7 commits
5425082
suffix tree impl
hannojg Sep 5, 2024
54a7b60
add some helpful comments
hannojg Sep 5, 2024
8622670
example implementation usage of Suffixtree
hannojg Sep 5, 2024
01162fe
fix: resolved one TODO
kirillzyusko Sep 11, 2024
09e8aa7
fix: reduce code duplication
kirillzyusko Sep 12, 2024
fa81e13
refactor: O(2) -> O(1)
kirillzyusko Sep 12, 2024
e33142f
refactor: minus one TODO
kirillzyusko Sep 12, 2024
30424a9
fix: bring back userToInvite
kirillzyusko Sep 13, 2024
1d11ed2
fix: make Marc discoverable again (when we search in second array, th…
kirillzyusko Sep 13, 2024
b502edc
Merge branch 'main' into perf/search-suffix-ukkonen-tree
kirillzyusko Sep 20, 2024
c0e38b8
add debug timings
hannojg Sep 24, 2024
16e4078
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Sep 24, 2024
955cdcb
adjust values for search
hannojg Sep 24, 2024
d7c8d88
correct display name for personal details search
hannojg Sep 24, 2024
d07940d
deduplicate search results
hannojg Sep 24, 2024
84d3231
unify string cleaning
hannojg Sep 24, 2024
eb6b371
fix search function temporarily
hannojg Sep 24, 2024
6002101
make it work with numbers by base26 everything thats bigger than our …
hannojg Sep 25, 2024
131c303
don't add delimiter char at the end
hannojg Sep 25, 2024
0a89089
support for unicode
hannojg Sep 25, 2024
def772b
extended test case to make sure right item is returned
hannojg Sep 25, 2024
0806fb0
use concat instead of for-loop
hannojg Sep 25, 2024
c660312
Revert "use concat instead of for-loop"
hannojg Sep 25, 2024
24a59b7
add note
hannojg Sep 25, 2024
146d51b
wip: make t dynamic instead of preallocating
hannojg Sep 25, 2024
a2cdaf9
wip: refactor r list
hannojg Sep 25, 2024
f6dbecf
wip: refactor l list
hannojg Sep 25, 2024
bda1e34
wip: refactor p list
hannojg Sep 25, 2024
2508d55
wip: refactor s list
hannojg Sep 25, 2024
c32d57f
removed preallocations!
hannojg Sep 25, 2024
f4812c0
better naming
hannojg Sep 25, 2024
de0b467
renamings
hannojg Sep 25, 2024
c17a95b
add userToToInvite to header count back
hannojg Sep 25, 2024
0f94c05
remove debug logs
hannojg Sep 25, 2024
fc54faa
cleanup
hannojg Sep 25, 2024
96c93e7
add unit test for special char handling
hannojg Sep 25, 2024
81dffdd
wip: document suffix tree functions
hannojg Sep 25, 2024
25909ba
fix recursive search function
hannojg Sep 25, 2024
b85cfd4
small cleanup
hannojg Sep 25, 2024
5de310c
clean search input string
hannojg Sep 25, 2024
d51a4c3
fixed flaky test
hannojg Sep 25, 2024
9e9b574
made test more reliable
hannojg Sep 25, 2024
4fa8390
migrate to useOnyx
hannojg Sep 25, 2024
8e583cb
ignore warning, this algorithm is optimized for speed
hannojg Sep 25, 2024
56a1e8a
add timings to new search
hannojg Sep 25, 2024
309556a
refactor: separate suffix tree from search logic
hannojg Sep 25, 2024
f7528f4
add explanations + fix test
hannojg Sep 25, 2024
8b5b77f
code documentation
hannojg Sep 25, 2024
a6b7939
add test verifying identical words work in suffix tree
hannojg Sep 25, 2024
8b41e88
add failing unit tests for duplicated string issue
hannojg Sep 25, 2024
4e273df
add comment
hannojg Sep 25, 2024
57af9b1
fix search not returning results with same search value
hannojg Sep 26, 2024
8b26a1f
perf: precalculate base26 map, optimize math operations, letter looku…
hannojg Sep 26, 2024
af43c93
experiment: use array buffers
hannojg Sep 26, 2024
615a237
add correctChar optimization back
hannojg Sep 26, 2024
2ededdd
wip: optimizing search string construction
hannojg Sep 26, 2024
02f562d
add unit test
hannojg Sep 26, 2024
e894a7f
Optimize performance by using one ArrayBuffer for search values and s…
hannojg Sep 27, 2024
54dcea8
ignore eslint warning; performance optimization
hannojg Sep 27, 2024
8bb5b1e
rename test file
hannojg Sep 27, 2024
b331193
add note
hannojg Sep 27, 2024
18de46d
Update src/pages/ChatFinderPage/index.tsx
hannojg Sep 27, 2024
a91ffc9
wording
hannojg Sep 27, 2024
8277d37
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
d1d028a
Update src/libs/FastSearch.ts
hannojg Sep 30, 2024
e829066
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
3b69cf4
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
8951708
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Sep 30, 2024
4c4f1dc
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Sep 30, 2024
050a320
use Uint8Array for search numeric representation
hannojg Oct 1, 2024
017e9f6
fix utils
hannojg Oct 1, 2024
4d5ad3d
add comment
hannojg Oct 1, 2024
1dc2a5e
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Oct 1, 2024
fdac05f
implementation using none negative values
hannojg Oct 1, 2024
e8fdd5c
adjust array sizes to be correct
hannojg Oct 1, 2024
418e2cd
fix: can't find by first letter
hannojg Oct 1, 2024
49692fd
remove unnecessary initializations
hannojg Oct 1, 2024
1f90a13
fix: can't find by first letter was only a problem in the unit test
hannojg Oct 1, 2024
ee553da
cleanup
hannojg Oct 1, 2024
29a1934
explain in comments
hannojg Oct 1, 2024
1aec94a
+1 not +2
hannojg Oct 1, 2024
55acdc5
order options
hannojg Oct 1, 2024
a1103a3
skip white spaces
hannojg Oct 1, 2024
41207c6
don't return PD, concat with recentReports
hannojg Oct 1, 2024
736650f
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Oct 3, 2024
4d115ed
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 3, 2024
35b2555
Merge branch 'perf/search-suffix-ukkonen-tree' of github.com:margelo/…
hannojg Oct 3, 2024
6f5be58
disable .at() rule
hannojg Oct 3, 2024
34d2dbd
rename parameter
hannojg Oct 3, 2024
6160173
step A: bring changes over to SearchRouter
hannojg Oct 3, 2024
fabe796
reorder
hannojg Oct 3, 2024
3539a57
fix dot use case
hannojg Oct 3, 2024
fdf63af
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 4, 2024
fa68c64
remove eslint disabling
hannojg Oct 4, 2024
162ad83
Merge branch 'main' of github.com:margelo/expensify-app-fork into per…
hannojg Oct 10, 2024
fd76f43
Update src/libs/FastSearch.ts
hannojg Oct 10, 2024
7c44fa7
improve error messages
hannojg Oct 10, 2024
29e5288
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Oct 10, 2024
de7f3b5
fix variable name usage
hannojg Oct 10, 2024
f757fd2
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 17, 2024
59d2562
fix duplicate imports after merge
hannojg Oct 17, 2024
310 changes: 310 additions & 0 deletions src/libs/SuffixUkkonenTree.ts
@@ -0,0 +1,310 @@
import enEmojis from '@assets/emojis/en';

const CHAR_CODE_A = 'a'.charCodeAt(0);
const ALPHABET_SIZE = 28;
const DELIMITER_CHAR_CODE = ALPHABET_SIZE - 2;

// TODO:
// make makeTree faster
// how to deal with Unicode characters such as Spanish ones?
// I think we need to support numbers as well

/**
* Converts a string to an array of numbers representing the characters of the string.
* The numbers are offset by the character code of 'a' (97).
* - This is so that the numbers from a-z are in the range 0-25.
* - 26 is for the delimiter character "{",
* - 27 is for the end character "|".
*/
function stringToArray(input: string) {
const res: number[] = [];
for (let i = 0; i < input.length; i++) {
const charCode = input.charCodeAt(i) - CHAR_CODE_A;
if (charCode >= 0 && charCode < ALPHABET_SIZE) {
res.push(charCode);
}
}
return res;
}
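
Note: the encoding described in the doc comment above can be tried in isolation. The following is a minimal sketch, not the PR's code: `encode` is a hypothetical standalone copy of the same loop, reusing the constant values from the diff. It shows that any character outside `a`-`z`, `{` and `|` is silently dropped, which is why both the indexed data and the search input have to be cleaned first.

```typescript
// Hypothetical standalone copy of the encoding loop, for illustration only.
const CHAR_CODE_A = 'a'.charCodeAt(0); // 97
const ALPHABET_SIZE = 28; // a-z (0-25), '{' (26), '|' (27)

function encode(input: string): number[] {
    const codes: number[] = [];
    for (let i = 0; i < input.length; i++) {
        const code = input.charCodeAt(i) - CHAR_CODE_A;
        // Anything outside 0..27 (digits, upper case, punctuation) is skipped.
        if (code >= 0 && code < ALPHABET_SIZE) {
            codes.push(code);
        }
    }
    return codes;
}

console.log(encode('ab{|')); // [0, 1, 26, 27]
console.log(encode('a1-B')); // [0]  ('1', '-' and 'B' fall outside the range)
```

Because dropped characters shift every later position, an input that was not cleaned beforehand would no longer line up with any per-character index list built from the original string.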

const aToZRegex = /[^a-z]/gi;
// The character that separates the different options in the search string
const delimiterChar = '{';

type PrepareDataParams<T> = {
data: T[];
transform: (data: T) => string;
};

function prepareData<T>({data, transform}: PrepareDataParams<T>): [string, Array<T | undefined>] {
const searchIndexList: Array<T | undefined> = [];
const str = data
.map((option) => {
let searchStringForTree = transform(option);
// Remove all non a-z chars:
searchStringForTree = searchStringForTree.toLowerCase().replace(aToZRegex, '');

if (searchStringForTree.length > 0) {
// We need to push an array that has the same length as the length of the string we insert for this option:
const indexes = Array.from({length: searchStringForTree.length}, () => option);
// Note: we add undefined for the delimiter character
searchIndexList.push(...indexes, undefined);
} else {
return undefined;
}

return searchStringForTree;
})
// slightly faster alternative to `.filter(Boolean).join(delimiterChar)`
.reduce((acc: string, curr) => {
if (!curr) {
return acc;
}

if (acc === '') {
return curr;
}

return `${acc}${delimiterChar}${curr}`;
}, '');

return [str, searchIndexList];
}
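
Note: the idea behind `prepareData` is a concatenated search string plus a parallel list that records which option each character belongs to. Below is a minimal sketch of that idea under stated assumptions: `buildSearchIndex`, `toText`, `text` and `owners` are hypothetical names, and this variation adds the `undefined` delimiter slot *before* every item except the first, so that `owners.length` always equals `text.length`. The diff above instead appends `undefined` after every item.

```typescript
const DELIMITER = '{';
const NON_A_TO_Z = /[^a-z]/g;

// Hypothetical helper: returns the joined search text and one owner entry per character.
function buildSearchIndex<T>(items: T[], toText: (item: T) => string): {text: string; owners: Array<T | undefined>} {
    const parts: string[] = [];
    const owners: Array<T | undefined> = [];

    for (const item of items) {
        const cleaned = toText(item).toLowerCase().replace(NON_A_TO_Z, '');
        if (cleaned.length === 0) {
            continue; // nothing searchable in this item
        }
        if (parts.length > 0) {
            owners.push(undefined); // slot for the delimiter that precedes this item
        }
        for (let i = 0; i < cleaned.length; i++) {
            owners.push(item);
        }
        parts.push(cleaned);
    }

    return {text: parts.join(DELIMITER), owners};
}

const people = [{name: 'Banana'}, {name: '123'}, {name: 'Pan cake'}];
const index = buildSearchIndex(people, (person) => person.name);
console.log(index.text); // banana{pancake
console.log(index.owners[7] === people[2]); // true
```

With this layout a match position in `text` can be used directly as an index into `owners`; a position that lands on a delimiter resolves to `undefined`.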

/**
* Makes a tree from an input string
* **Important:** As we only support an alphabet of 26 characters, the input string should only contain characters from a-z.
* Thus, all input data must be cleaned before being passed to this function.
* If you then use this tree for search you should clean your search input as well (so that a search query of "[email protected]" becomes "testusermyemailcom").
*/
function makeTree<T>(compose: Array<PrepareDataParams<T>>) {
const start1 = performance.now();
const strings = [];
const indexes: Array<Array<T | undefined>> = [];

for (const {data, transform} of compose) {
const [str, searchIndexList] = prepareData({data, transform});
strings.push(str);
indexes.push(searchIndexList);
}
const stringToSearch = `${strings.join('')}|`; // End Character
console.log('building search strings', performance.now() - start1);

// CI annotation (GitHub Actions / Run ESLint, failure on line 91): Unexpected console statement

const a = stringToArray(stringToSearch);
const N = 25000; // TODO: I reduced this number from 1_000_000 down to this for faster performance; however, it's possible that it needs to be bigger for larger search strings
const start = performance.now();
const t = Array.from({length: N}, () => Array(ALPHABET_SIZE).fill(-1) as number[]);
const l = Array(N).fill(0) as number[];
const r = Array(N).fill(0) as number[];
const p = Array(N).fill(0) as number[];
const s = Array(N).fill(0) as number[];
const end = performance.now();
console.log('Allocating memory took:', end - start, 'ms');

// CI annotation (GitHub Actions / Run ESLint, failure on line 102): Unexpected console statement

let tv = 0;
let tp = 0;
let ts = 2;
let la = 0;

function initializeTree() {
r.fill(a.length - 1);
s[0] = 1;
l[0] = -1;
r[0] = -1;
l[1] = -1;
r[1] = -1;
t[1].fill(0);
}

function processCharacter(c: number) {
while (true) {

// CI annotation (GitHub Actions / Run ESLint, warning on line 120): Unexpected constant condition
if (r[tv] < tp) {
if (t[tv][c] === -1) {
createNewLeaf(c);
continue;

// CI annotation (GitHub Actions / Run ESLint, failure on line 124): Unexpected use of continue statement
}
tv = t[tv][c];
tp = l[tv];
}
if (tp === -1 || c === a[tp]) {
tp++;
} else {
splitEdge(c);
continue;

// CI annotation (GitHub Actions / Run ESLint, failure on line 133): Unexpected use of continue statement
}
break;
}
if (c === DELIMITER_CHAR_CODE) {
resetTreeTraversal();
}
}

function createNewLeaf(c: number) {
t[tv][c] = ts;
l[ts] = la;
p[ts++] = tv;
tv = s[tv];
tp = r[tv] + 1;
}

function splitEdge(c: number) {
l[ts] = l[tv];
r[ts] = tp - 1;
p[ts] = p[tv];
t[ts][a[tp]] = tv;
t[ts][c] = ts + 1;
l[ts + 1] = la;
p[ts + 1] = ts;
l[tv] = tp;
p[tv] = ts;
t[p[ts]][a[l[ts]]] = ts;
ts += 2;
handleDescent(ts);
}

function handleDescent(ts: number) {

// CI annotation (GitHub Actions / Run ESLint, failure on line 165): 'ts' is already declared in the upper scope on line 106 column 9
tv = s[p[ts - 2]];
tp = l[ts - 2];
while (tp <= r[ts - 2]) {
tv = t[tv][a[tp]];
tp += r[tv] - l[tv] + 1;
}
if (tp === r[ts - 2] + 1) {
s[ts - 2] = tv;
} else {
s[ts - 2] = ts;
}
tp = r[tv] - (tp - r[ts - 2]) + 2;
}

function resetTreeTraversal() {
tv = 0;
tp = 0;
}

function build() {
initializeTree();
for (la = 0; la < a.length; ++la) {
const c = a[la];
processCharacter(c);
}
}

/**
* Returns all occurrences of the given (sub)string in the input string.
*
* You can think of the tree that we create as a big string that looks like this:
*
* "banana{pancake{apple|"
* The delimiter character '{' is used to separate the different strings.
* The end character '|' is used to indicate the end of our search string.
*
* This function will return the index(es) of found occurrences within this big string.
* So, when searching for "an", it would return [1, 3, 8].
*/
function findSubstring(searchString: string) {
const occurrences: number[] = [];

function dfs(node: number, depth: number) {
const leftRange = l[node];
const rightRange = r[node];
const rangeLen = node === 0 ? 0 : rightRange - leftRange + 1;

for (let i = 0; i < rangeLen && depth + i < searchString.length; i++) {
if (searchString.charCodeAt(depth + i) - CHAR_CODE_A !== a[leftRange + i]) {
return;
}
}

let isLeaf = true;
for (let i = 0; i < ALPHABET_SIZE; ++i) {
if (t[node][i] !== -1) {
isLeaf = false;
dfs(t[node][i], depth + rangeLen);
}
}

if (isLeaf && depth >= searchString.length) {
occurrences.push(a.length - (depth + rangeLen));
}
}

dfs(0, 0);
return occurrences;
}

function findInSearchTree(searchInput: string) {
const now = performance.now();
const result = findSubstring(searchInput);
console.log('FindSubstring index result for searchInput', searchInput, result);

// CI annotation (GitHub Actions / Run ESLint, failure on line 239): Unexpected console statement
// Map the results to the original options

const mappedResults: T[][] = Array.from({length: compose.length}, () => []);
console.log({result});

// CI annotation (GitHub Actions / Run ESLint, failure on line 243): Unexpected console statement
result.forEach((index) => {
// const textInSearchString = searchString.substring(index, searchString.indexOf(delimiterChar, index));
// console.log('textInSearchString', textInSearchString);

// TODO: check with Hanno whether we restore the data correctly
let offset = 0;
for (let i = 0; i < indexes.length; i++) {
const relativeIndex = index - offset;
if (relativeIndex < indexes[i].length && relativeIndex >= 0) {
const option = indexes[i][relativeIndex];
if (option) {
mappedResults[i].push(option);
}
} else {
offset += indexes[i].length;
}
}
});

console.log('search', performance.now() - now);

// CI annotation (GitHub Actions / Run ESLint, failure on line 263): Unexpected console statement
return mappedResults;
}

return {
build,
findSubstring,
findInSearchTree,
};
}

function performanceProfile<T>(input: PrepareDataParams<T>, search = 'sasha') {
// TODO: For emojis we could precalculate the makeTree function during build time using a babel plugin
// maybe babel plugin that just precalculates the result of function execution (so that it can be generic purpose plugin)
const {build, findSubstring} = makeTree([input]);

const buildStart = performance.now();
build();
const buildEnd = performance.now();
console.log('Building time:', buildEnd - buildStart, 'ms');

// CI annotation (GitHub Actions / Run ESLint, failure on line 282): Unexpected console statement

const searchStart = performance.now();
const results = findSubstring(search);
const searchEnd = performance.now();
console.log('Search time:', searchEnd - searchStart, 'ms');

// CI annotation (GitHub Actions / Run ESLint, failure on line 287): Unexpected console statement
console.log(results);

return {
buildTime: buildEnd - buildStart,
recursiveSearchTime: searchEnd - searchStart,
};
}

// Demo function testing the performance for emojis
function testEmojis() {
const data = Object.values(enEmojis);
return performanceProfile(
{
data,
transform: ({keywords}) => {
return `${keywords.join('')}{`;
},
},
'smile',
);
}

export {makeTree, prepareData, testEmojis};
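
Note: the doc comment on `findSubstring` describes the expected result as the start positions of every occurrence inside the concatenated string. A brute-force reference is a convenient way to cross-check the tree in a unit test. The sketch below is not part of the PR: `findAllOccurrences` is a hypothetical helper built only on `String.prototype.indexOf`, and it makes no claim about the order in which the tree itself reports matches.

```typescript
// Hypothetical brute-force reference: every start index of `needle` in `haystack`,
// including overlapping matches, in ascending order.
function findAllOccurrences(haystack: string, needle: string): number[] {
    const hits: number[] = [];
    if (needle.length === 0) {
        return hits;
    }
    let from = haystack.indexOf(needle);
    while (from !== -1) {
        hits.push(from);
        // Advance by one (not by needle.length) so overlapping matches are found too.
        from = haystack.indexOf(needle, from + 1);
    }
    return hits;
}

console.log(findAllOccurrences('banana{pancake{apple|', 'an')); // [1, 3, 8]
console.log(findAllOccurrences('banana{pancake{apple|', 'ana')); // [1, 3]
```

In a test, the sorted output of the tree search could be compared against this function for a handful of queries; the linear scan is far too slow for production search but is easy to trust.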
53 changes: 49 additions & 4 deletions src/pages/ChatFinderPage/index.tsx
@@ -18,6 +18,7 @@ import type {RootStackParamList} from '@libs/Navigation/types';
import * as OptionsListUtils from '@libs/OptionsListUtils';
import Performance from '@libs/Performance';
import type {OptionData} from '@libs/ReportUtils';
import {makeTree, prepareData} from '@libs/SuffixUkkonenTree';
import * as Report from '@userActions/Report';
import Timing from '@userActions/Timing';
import CONST from '@src/CONST';
@@ -51,6 +52,8 @@ const setPerformanceTimersEnd = () => {

const ChatFinderPageFooterInstance = <ChatFinderPageFooter />;

const aToZRegex = /[^a-z]/gi;

function ChatFinderPage({betas, isSearchingForReports, navigation}: ChatFinderPageProps) {
const [isScreenTransitionEnd, setIsScreenTransitionEnd] = useState(false);
const {translate} = useLocalize();
@@ -94,6 +97,48 @@ function ChatFinderPage({betas, isSearchingForReports, navigation}: ChatFinderPa
return {...optionList, headerMessage: header};
}, [areOptionsInitialized, betas, isScreenTransitionEnd, options]);

/**
* Builds a suffix tree and returns a function to search in it.
*/
const findInSearchTree = useMemo(() => {
let start = performance.now();
const tree = makeTree([
{
data: searchOptions.personalDetails,
transform: (option) => {
// TODO: there are probably more fields we'd like to add to the search string
return (option.login ?? '') + (option.login !== option.displayName ? option.displayName ?? '' : '');
},
},
{
data: searchOptions.recentReports,
transform: (option) => {
let searchStringForTree = (option.login ?? '') + (option.login !== option.displayName ? option.displayName ?? '' : '');
searchStringForTree += option.reportID ?? '';
searchStringForTree += option.name ?? '';

return searchStringForTree;
},
},
]);
console.log('makeTree', performance.now() - start);
start = performance.now();
tree.build();
console.log('build', performance.now() - start);

function search(searchInput: string) {
start = performance.now();
const [personalDetails, recentReports] = tree.findInSearchTree(searchInput);

return {
personalDetails,
recentReports,
};
}

return search;
}, [searchOptions.personalDetails, searchOptions.recentReports]);

const filteredOptions = useMemo(() => {
if (debouncedSearchValue.trim() === '') {
return {
@@ -105,17 +150,17 @@ function ChatFinderPage({betas, isSearchingForReports, navigation}: ChatFinderPa
}

Timing.start(CONST.TIMING.SEARCH_FILTER_OPTIONS);
- const newOptions = OptionsListUtils.filterOptions(searchOptions, debouncedSearchValue, {sortByReportTypeInSearch: true, preferChatroomsOverThreads: true});
+ const newOptions = findInSearchTree(debouncedSearchValue.toLowerCase().replace(aToZRegex, ''));
Timing.end(CONST.TIMING.SEARCH_FILTER_OPTIONS);

- const header = OptionsListUtils.getHeaderMessage(newOptions.recentReports.length + Number(!!newOptions.userToInvite) > 0, false, debouncedSearchValue);
+ const header = OptionsListUtils.getHeaderMessage(newOptions.recentReports.length > 0, false, debouncedSearchValue);
return {
recentReports: newOptions.recentReports,
personalDetails: newOptions.personalDetails,
- userToInvite: newOptions.userToInvite,
+ userToInvite: undefined, // newOptions.userToInvite,
headerMessage: header,
};
- }, [debouncedSearchValue, searchOptions]);
+ }, [debouncedSearchValue, findInSearchTree]);

const {recentReports, personalDetails: localPersonalDetails, userToInvite, headerMessage} = debouncedSearchValue.trim() !== '' ? filteredOptions : searchOptions;
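
Note: the `[personalDetails, recentReports]` tuple returned by `tree.findInSearchTree` depends on mapping each flat match index back to one of the composed lists. The following is a minimal sketch of that mapping, not the PR's implementation: `mapMatches` and `ownersPerList` are hypothetical names, and it assumes each list contributes exactly one owner entry per character of the concatenated search string (delimiters included, as `undefined`). Unlike the loop in the diff, it stops at the first list that contains the index and skips duplicates.

```typescript
type Owners<T> = Array<T | undefined>;

// Hypothetical helper: groups matched owners by the list they came from.
function mapMatches<T>(matchIndexes: number[], ownersPerList: Array<Owners<T>>): T[][] {
    const results: T[][] = ownersPerList.map(() => []);

    for (const index of matchIndexes) {
        let offset = 0;
        for (let i = 0; i < ownersPerList.length; i++) {
            const owners = ownersPerList[i];
            const relativeIndex = index - offset;
            if (relativeIndex < owners.length) {
                const owner = owners[relativeIndex];
                // `undefined` marks a delimiter position; `includes` de-duplicates repeated hits.
                if (owner !== undefined && !results[i].includes(owner)) {
                    results[i].push(owner);
                }
                break; // the index belongs to this list, no need to look further
            }
            offset += owners.length;
        }
    }

    return results;
}

const grouped = mapMatches([1, 4, 5, 2], [
    ['alice', 'alice', undefined],
    ['bob', 'bob', 'bob', undefined],
]);
console.log(grouped); // [['alice'], ['bob']]
```

The `includes` check is linear in the size of each result list; for large result sets a `Set` per list would be the more usual choice.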
