Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search suffix tree implementation #48652

Merged
merged 101 commits into from
Oct 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
101 commits
Select commit Hold shift + click to select a range
5425082
suffix tree impl
hannojg Sep 5, 2024
54a7b60
add some helpful comments
hannojg Sep 5, 2024
8622670
example implementation usage of Suffixtree
hannojg Sep 5, 2024
01162fe
fix: resolved one TODO
kirillzyusko Sep 11, 2024
09e8aa7
fix: reduce code duplication
kirillzyusko Sep 12, 2024
fa81e13
refactor: O(2) -> O(1)
kirillzyusko Sep 12, 2024
e33142f
refactor: minus one TODO
kirillzyusko Sep 12, 2024
30424a9
fix: bring back userToInvite
kirillzyusko Sep 13, 2024
1d11ed2
fix: make Marc discoverable again (when we search in second array, th…
kirillzyusko Sep 13, 2024
b502edc
Merge branch 'main' into perf/search-suffix-ukkonen-tree
kirillzyusko Sep 20, 2024
c0e38b8
add debug timings
hannojg Sep 24, 2024
16e4078
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Sep 24, 2024
955cdcb
adjust values for search
hannojg Sep 24, 2024
d7c8d88
correct display name for personal details search
hannojg Sep 24, 2024
d07940d
deduplicate search results
hannojg Sep 24, 2024
84d3231
unify string cleaning
hannojg Sep 24, 2024
eb6b371
fix search function temporarily
hannojg Sep 24, 2024
6002101
make it work with numbers by base26 everything thats bigger than our …
hannojg Sep 25, 2024
131c303
don't add delimiter char at the end
hannojg Sep 25, 2024
0a89089
support for unicode
hannojg Sep 25, 2024
def772b
extended test case to make sure right item is returned
hannojg Sep 25, 2024
0806fb0
use concat instead of for-loop
hannojg Sep 25, 2024
c660312
Revert "use concat instead of for-loop"
hannojg Sep 25, 2024
24a59b7
add note
hannojg Sep 25, 2024
146d51b
wip: make t dynamic instead of preallocating
hannojg Sep 25, 2024
a2cdaf9
wip: refactor r list
hannojg Sep 25, 2024
f6dbecf
wip: refactor l list
hannojg Sep 25, 2024
bda1e34
wip: refactor p list
hannojg Sep 25, 2024
2508d55
wip: refactor s list
hannojg Sep 25, 2024
c32d57f
removed preallocations!
hannojg Sep 25, 2024
f4812c0
better naming
hannojg Sep 25, 2024
de0b467
renamings
hannojg Sep 25, 2024
c17a95b
add userToToInvite to header count back
hannojg Sep 25, 2024
0f94c05
remove debug logs
hannojg Sep 25, 2024
fc54faa
cleanup
hannojg Sep 25, 2024
96c93e7
add unit test for special char handling
hannojg Sep 25, 2024
81dffdd
wip: document suffix tree functions
hannojg Sep 25, 2024
25909ba
fix recursive search function
hannojg Sep 25, 2024
b85cfd4
small cleanup
hannojg Sep 25, 2024
5de310c
clean search input string
hannojg Sep 25, 2024
d51a4c3
fixed flaky test
hannojg Sep 25, 2024
9e9b574
made test more reliable
hannojg Sep 25, 2024
4fa8390
migrate to useOnyx
hannojg Sep 25, 2024
8e583cb
ignore warning, this algorithm is optimized for speed
hannojg Sep 25, 2024
56a1e8a
add timings to new search
hannojg Sep 25, 2024
309556a
refactor: separate suffix tree from search logic
hannojg Sep 25, 2024
f7528f4
add explanations + fix test
hannojg Sep 25, 2024
8b5b77f
code documentation
hannojg Sep 25, 2024
a6b7939
add test verifying identical words work in suffix tree
hannojg Sep 25, 2024
8b41e88
add failing unit tests for duplicated string issue
hannojg Sep 25, 2024
4e273df
add comment
hannojg Sep 25, 2024
57af9b1
fix search not returning results with same search value
hannojg Sep 26, 2024
8b26a1f
perf: precalculate base26 map, optimize math operations, letter looku…
hannojg Sep 26, 2024
af43c93
experiment: use array buffers
hannojg Sep 26, 2024
615a237
add correctChar optimization back
hannojg Sep 26, 2024
2ededdd
wip: optimizing search string construction
hannojg Sep 26, 2024
02f562d
add unit test
hannojg Sep 26, 2024
e894a7f
Optimize performance by using one ArrayBuffer for search values and s…
hannojg Sep 27, 2024
54dcea8
ignore eslint warning; performance optimization
hannojg Sep 27, 2024
8bb5b1e
rename test file
hannojg Sep 27, 2024
b331193
add note
hannojg Sep 27, 2024
18de46d
Update src/pages/ChatFinderPage/index.tsx
hannojg Sep 27, 2024
a91ffc9
wording
hannojg Sep 27, 2024
8277d37
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
d1d028a
Update src/libs/FastSearch.ts
hannojg Sep 30, 2024
e829066
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
3b69cf4
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Sep 30, 2024
8951708
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Sep 30, 2024
4c4f1dc
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Sep 30, 2024
050a320
use Uint8Array for search numeric representation
hannojg Oct 1, 2024
017e9f6
fix utils
hannojg Oct 1, 2024
4d5ad3d
add comment
hannojg Oct 1, 2024
1dc2a5e
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Oct 1, 2024
fdac05f
implementation using none negative values
hannojg Oct 1, 2024
e8fdd5c
adjust array sizes to be correct
hannojg Oct 1, 2024
418e2cd
fix: can't find by first letter
hannojg Oct 1, 2024
49692fd
remove unnecessary initializations
hannojg Oct 1, 2024
1f90a13
fix: can't find by first letter was only a problem in the unit test
hannojg Oct 1, 2024
ee553da
cleanup
hannojg Oct 1, 2024
29a1934
explain in comments
hannojg Oct 1, 2024
1aec94a
+1 not +2
hannojg Oct 1, 2024
55acdc5
order options
hannojg Oct 1, 2024
a1103a3
skip white spaces
hannojg Oct 1, 2024
41207c6
don't return PD, concat with recentReports
hannojg Oct 1, 2024
736650f
Update src/libs/SuffixUkkonenTree/index.ts
hannojg Oct 3, 2024
4d115ed
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 3, 2024
35b2555
Merge branch 'perf/search-suffix-ukkonen-tree' of github.com:margelo/…
hannojg Oct 3, 2024
6f5be58
disable .at() rule
hannojg Oct 3, 2024
34d2dbd
rename parameter
hannojg Oct 3, 2024
6160173
step A: bring changes over to SearchRouter
hannojg Oct 3, 2024
fabe796
reorder
hannojg Oct 3, 2024
3539a57
fix dot use case
hannojg Oct 3, 2024
fdf63af
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 4, 2024
fa68c64
remove eslint disabling
hannojg Oct 4, 2024
162ad83
Merge branch 'main' of github.com:margelo/expensify-app-fork into per…
hannojg Oct 10, 2024
fd76f43
Update src/libs/FastSearch.ts
hannojg Oct 10, 2024
7c44fa7
improve error messages
hannojg Oct 10, 2024
29e5288
Update src/libs/SuffixUkkonenTree/utils.ts
hannojg Oct 10, 2024
de7f3b5
fix variable name usage
hannojg Oct 10, 2024
f757fd2
Merge branch 'main' of github.com:Expensify/App into perf/search-suff…
hannojg Oct 17, 2024
59d2562
fix duplicate imports after merge
hannojg Oct 17, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions src/CONST.ts
Original file line number Diff line number Diff line change
Expand Up @@ -1130,6 +1130,9 @@ const CONST = {
SEARCH_OPTION_LIST_DEBOUNCE_TIME: 300,
RESIZE_DEBOUNCE_TIME: 100,
UNREAD_UPDATE_DEBOUNCE_TIME: 300,
SEARCH_CONVERT_SEARCH_VALUES: 'search_convert_search_values',
SEARCH_MAKE_TREE: 'search_make_tree',
roryabraham marked this conversation as resolved.
Show resolved Hide resolved
SEARCH_BUILD_TREE: 'search_build_tree',
SEARCH_FILTER_OPTIONS: 'search_filter_options',
USE_DEBOUNCED_STATE_DELAY: 300,
},
Expand Down
62 changes: 58 additions & 4 deletions src/components/Search/SearchRouter/SearchRouter.tsx
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ import useKeyboardShortcut from '@hooks/useKeyboardShortcut';
import useLocalize from '@hooks/useLocalize';
import useResponsiveLayout from '@hooks/useResponsiveLayout';
import useThemeStyles from '@hooks/useThemeStyles';
import FastSearch from '@libs/FastSearch';
import Log from '@libs/Log';
import * as OptionsListUtils from '@libs/OptionsListUtils';
import {getAllTaxRates} from '@libs/PolicyUtils';
Expand Down Expand Up @@ -63,6 +64,49 @@ function SearchRouter({onRouterClose}: SearchRouterProps) {
return OptionsListUtils.getSearchOptions(options, '', betas ?? []);
}, [areOptionsInitialized, betas, options]);

/**
* Builds a suffix tree and returns a function to search in it.
*/
const findInSearchTree = useMemo(() => {
const fastSearch = FastSearch.createFastSearch([
{
data: searchOptions.personalDetails,
toSearchableString: (option) => {
const displayName = option.participantsList?.[0]?.displayName ?? '';
return [option.login ?? '', option.login !== displayName ? displayName : ''].join();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter that we will have a string with like ,something if the login is an empty string or just , if both are empty? Not sure if it can happen in practice, but maybe we should not include any empty strings at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, is one of the charsets to skip, so I don't think it will matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty strings won't be searchable. No characters would be added for that to the search tree, thus they'd never be retrievable

},
},
{
data: searchOptions.recentReports,
toSearchableString: (option) => {
const searchStringForTree = [option.text ?? '', option.login ?? ''];

if (option.isThread) {
if (option.alternateText) {
searchStringForTree.push(option.alternateText);
}
} else if (!!option.isChatRoom || !!option.isPolicyExpenseChat) {
if (option.subtitle) {
searchStringForTree.push(option.subtitle);
}
}

return searchStringForTree.join();
},
},
Comment on lines +71 to +96
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NAB: considering we're doing the same thing in 2 places, we could all this memo in a single function on OptionsListUtils.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100%, this is actually what i want to do in the next PR!

I wouldn't want to change that now, because:

  • i am addressing this in the next PR
  • the ChatFinderPage is going to be deprecated very soon, so then this code would only exist in SearchRouter

is that fine with you? 😊

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me!

]);
function search(searchInput: string) {
const [personalDetails, recentReports] = fastSearch.search(searchInput);

return {
personalDetails,
recentReports,
};
}

return search;
}, [searchOptions.personalDetails, searchOptions.recentReports]);

const filteredOptions = useMemo(() => {
if (debouncedInputValue.trim() === '') {
return {
Expand All @@ -73,15 +117,25 @@ function SearchRouter({onRouterClose}: SearchRouterProps) {
}

Timing.start(CONST.TIMING.SEARCH_FILTER_OPTIONS);
const newOptions = OptionsListUtils.filterOptions(searchOptions, debouncedInputValue, {sortByReportTypeInSearch: true, preferChatroomsOverThreads: true});
const newOptions = findInSearchTree(debouncedInputValue);
Timing.end(CONST.TIMING.SEARCH_FILTER_OPTIONS);

return {
const recentReports = newOptions.recentReports.concat(newOptions.personalDetails);

const userToInvite = OptionsListUtils.pickUserToInvite({
canInviteUser: true,
recentReports: newOptions.recentReports,
personalDetails: newOptions.personalDetails,
userToInvite: newOptions.userToInvite,
searchValue: debouncedInputValue,
optionsToExclude: [{login: CONST.EMAIL.NOTIFICATIONS}],
});

return {
recentReports,
personalDetails: [],
userToInvite,
};
}, [debouncedInputValue, searchOptions]);
}, [debouncedInputValue, findInSearchTree]);

const recentReports: OptionData[] = useMemo(() => {
const currentSearchOptions = debouncedInputValue === '' ? searchOptions : filteredOptions;
Expand Down
140 changes: 140 additions & 0 deletions src/libs/FastSearch.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
/* eslint-disable rulesdir/prefer-at */
import CONST from '@src/CONST';
import Timing from './actions/Timing';
import SuffixUkkonenTree from './SuffixUkkonenTree';

type SearchableData<T> = {
/**
* The data that should be searchable
*/
data: T[];
/**
* A function that generates a string from a data entry. The string's value is used for searching.
* If you have multiple fields that should be searchable, simply concat them to the string and return it.
*/
toSearchableString: (data: T) => string;
};

// There are certain characters appear very often in our search data (email addresses), which we don't need to search for.
const charSetToSkip = new Set(['@', '.', '#', '$', '%', '&', '*', '+', '-', '/', ':', ';', '<', '=', '>', '?', '_', '~', '!', ' ']);

/**
* Creates a new "FastSearch" instance. "FastSearch" uses a suffix tree to search for substrings in a list of strings.
* You can provide multiple datasets. The search results will be returned for each dataset.
*
* Note: Creating a FastSearch instance with a lot of data is computationally expensive. You should create an instance once and reuse it.
* Searches will be very fast though, even with a lot of data.
*/
function createFastSearch<T>(dataSets: Array<SearchableData<T>>) {
Timing.start(CONST.TIMING.SEARCH_CONVERT_SEARCH_VALUES);
const maxNumericListSize = 400_000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we exceed this? Or is that pretty unlikely? What happens if we do?

Copy link
Contributor Author

@hannojg hannojg Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I made this number to be based on the most extreme case i have seen so far (the search string generated for ten thousands of reports and personal details was below 400.000, don't remember the exact number anymore).
In case the search string is longer, we'd loose search results. I think no error would be thrown, but if the user you are searching for is on position 400.010 you wouldn't be able to find that person.

Let me brainstorm quickly with Szymon if we can do better!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We think it would be best to implement a resize mechanism. If we hit the limit, we'd resize the arrays to be bigger. However, that comes with a performance overhead. So we should pick a default size that makes sense for most users (400_00 might actually be a bit too much for the average case, but for some extreme cases it might be too little).
However, as this isn't directly breaking anything, would you be fine with us tweaking this in a follow up PR? I think the Pr is already quite huge and hard to understand

Copy link
Contributor Author

@hannojg hannojg Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Another solution for performance improvements on this could also be to store the array's arraybuffer in mmkv and rehydrate from there [mmkv has first class support for ArrayBuffers], but
again, it would be better to experiment with that in a follow up PR - i can open a follow up issue ticket, i think that would be best?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to keep it as 400K for now, but have a follow-up issue to make it resizable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks! I think this is something we can wait and see about - my initial thinking is that we still don't have any pagination for either personal details or reports. And we have also spent a lot of time tweaking things for some very high capacity usage.

Though, it was hard for me to tell how many personal details or reports we are talking about here when we'd be hitting that limit. If it's over 10k reports then nobody actually has that AFAIK.

// The user might provide multiple data sets, but internally, the search values will be stored in this one list:
let concatenatedNumericList = new Uint8Array(maxNumericListSize);
// Here we store the index of the data item in the original data list, so we can map the found occurrences back to the original data:
const occurrenceToIndex = new Uint32Array(maxNumericListSize * 4);
// As we are working with ArrayBuffers, we need to keep track of the current offset:
const offset = {value: 1};
// We store the last offset for a dataSet, so we can map the found occurrences to the correct dataSet:
const listOffsets: number[] = [];

for (const {data, toSearchableString} of dataSets) {
// Performance critical: the array parameters are passed by reference, so we don't have to create new arrays every time:
dataToNumericRepresentation(concatenatedNumericList, occurrenceToIndex, offset, {data, toSearchableString});
listOffsets.push(offset.value);
}
concatenatedNumericList[offset.value++] = SuffixUkkonenTree.END_CHAR_CODE;
listOffsets[listOffsets.length - 1] = offset.value;
Timing.end(CONST.TIMING.SEARCH_CONVERT_SEARCH_VALUES);

// The list might be larger than necessary, so we clamp it to the actual size:
concatenatedNumericList = concatenatedNumericList.slice(0, offset.value);

// Create & build the suffix tree:
Timing.start(CONST.TIMING.SEARCH_MAKE_TREE);
const tree = SuffixUkkonenTree.makeTree(concatenatedNumericList);
Timing.end(CONST.TIMING.SEARCH_MAKE_TREE);

Timing.start(CONST.TIMING.SEARCH_BUILD_TREE);
tree.build();
Timing.end(CONST.TIMING.SEARCH_BUILD_TREE);

/**
* Searches for the given input and returns results for each dataset.
*/
function search(searchInput: string): T[][] {
const cleanedSearchString = cleanString(searchInput);
const {numeric} = SuffixUkkonenTree.stringToNumeric(cleanedSearchString, {
charSetToSkip,
// stringToNumeric might return a list that is larger than necessary, so we clamp it to the actual size
// (otherwise the search could fail as we include in our search empty array values):
clamp: true,
});
const result = tree.findSubstring(Array.from(numeric));

const resultsByDataSet = Array.from({length: dataSets.length}, () => new Set<T>());
// eslint-disable-next-line @typescript-eslint/prefer-for-of
for (let i = 0; i < result.length; i++) {
const occurrenceIndex = result[i];
const itemIndexInDataSet = occurrenceToIndex[occurrenceIndex];
const dataSetIndex = listOffsets.findIndex((listOffset) => occurrenceIndex < listOffset);

if (dataSetIndex === -1) {
throw new Error(`[FastSearch] The occurrence index ${occurrenceIndex} is not in any dataset`);
}
const item = dataSets[dataSetIndex].data[itemIndexInDataSet];
if (!item) {
throw new Error(`[FastSearch] The item with index ${itemIndexInDataSet} in dataset ${dataSetIndex} is not defined`);
}
resultsByDataSet[dataSetIndex].add(item);
}

return resultsByDataSet.map((set) => Array.from(set));
}

return {
search,
};
}

/**
* The suffix tree can only store string like values, and internally stores those as numbers.
* This function converts the user data (which are most likely objects) to a numeric representation.
* Additionally a list of the original data and their index position in the numeric list is created, which is used to map the found occurrences back to the original data.
*/
function dataToNumericRepresentation<T>(concatenatedNumericList: Uint8Array, occurrenceToIndex: Uint32Array, offset: {value: number}, {data, toSearchableString}: SearchableData<T>): void {
data.forEach((option, index) => {
const searchStringForTree = toSearchableString(option);
const cleanedSearchStringForTree = cleanString(searchStringForTree);

if (cleanedSearchStringForTree.length === 0) {
return;
}

SuffixUkkonenTree.stringToNumeric(cleanedSearchStringForTree, {
charSetToSkip,
out: {
outArray: concatenatedNumericList,
offset,
outOccurrenceToIndex: occurrenceToIndex,
index,
},
});
// eslint-disable-next-line no-param-reassign
occurrenceToIndex[offset.value] = index;
// eslint-disable-next-line no-param-reassign
concatenatedNumericList[offset.value++] = SuffixUkkonenTree.DELIMITER_CHAR_CODE;
});
}

/**
* Everything in the tree is treated as lowercase.
*/
function cleanString(input: string) {
return input.toLowerCase();
}

const FastSearch = {
createFastSearch,
};

export default FastSearch;
38 changes: 28 additions & 10 deletions src/libs/OptionsListUtils.ts
Original file line number Diff line number Diff line change
Expand Up @@ -2383,6 +2383,31 @@ function getPersonalDetailSearchTerms(item: Partial<ReportUtils.OptionData>) {
function getCurrentUserSearchTerms(item: ReportUtils.OptionData) {
return [item.text ?? '', item.login ?? '', item.login?.replace(CONST.EMAIL_SEARCH_REGEX, '') ?? ''];
}

type PickUserToInviteParams = {
canInviteUser: boolean;
recentReports: ReportUtils.OptionData[];
personalDetails: ReportUtils.OptionData[];
searchValue: string;
config?: FilterOptionsConfig;
optionsToExclude: Option[];
};

const pickUserToInvite = ({canInviteUser, recentReports, personalDetails, searchValue, config, optionsToExclude}: PickUserToInviteParams) => {
let userToInvite = null;
if (canInviteUser) {
if (recentReports.length === 0 && personalDetails.length === 0) {
userToInvite = getUserToInviteOption({
searchValue,
selectedOptions: config?.selectedOptions,
optionsToExclude,
});
}
}

return userToInvite;
};

/**
* Filters options based on the search input value
*/
Expand Down Expand Up @@ -2459,16 +2484,7 @@ function filterOptions(options: Options, searchInputValue: string, config?: Filt
recentReports = orderOptions(recentReports, searchValue);
}

let userToInvite = null;
if (canInviteUser) {
if (recentReports.length === 0 && personalDetails.length === 0) {
userToInvite = getUserToInviteOption({
searchValue,
selectedOptions: config?.selectedOptions,
optionsToExclude,
});
}
}
const userToInvite = pickUserToInvite({canInviteUser, recentReports, personalDetails, searchValue, config, optionsToExclude});

if (maxRecentReportsToShow > 0 && recentReports.length > maxRecentReportsToShow) {
recentReports.splice(maxRecentReportsToShow);
Expand Down Expand Up @@ -2536,6 +2552,7 @@ export {
formatMemberForList,
formatSectionsFromSearchTerm,
getShareLogOptions,
orderOptions,
filterOptions,
createOptionList,
createOptionFromReport,
Expand All @@ -2549,6 +2566,7 @@ export {
getEmptyOptions,
shouldUseBoldText,
getAlternateText,
pickUserToInvite,
hasReportErrors,
};

Expand Down
Loading
Loading