-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(index): lightweight structure #57
Conversation
/// Stores a mapping from hashed instance uri to their types | ||
#[derive(Serialize, Deserialize)] | ||
pub struct TypeIndex { | ||
pub types: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Its not clear from reading the doc string, what types
contains and
what SmallVec
contains (be specific).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome: Some considerations about from_iter
and the insert
function
Nice job also with the benchmark!!! 💯
I took the path of least resistance (and consistency?) in this PR by using
Comparing perf/index against main. TimingsRun time compared using hyperfine Indexing
Pseudonymization
MemoryHeap memory usage compared using heaptrack Indexingmain: peak heap memory consumption: 520.13M Pseudonymizationmain: peak heap memory consumption: 1.02G |
Proposed Changes
So far we used ntriples as the type index serialization, and loaded it into a
HashMap<subject_uri: String, type_uri: String>
. This has two drawbacks:This PR changes the index structure to HashMap<subject_hash: u64, type_indices: SmallVec<[u64; 1]>>`. This has following advantages:
Additionally the default channel size of the log is reduced from 5000000 to 1000, as it was causing an overhead of 500MB RAM to log a single line.
Types of Changes
What types of changes does your code introduce? Put an
x
in the boxes thatapply
functionality to not work as expected).
other choices apply).
Checklist
Put an
x
in the boxes that apply. You can also fill these out after creatingthe PR. If you're unsure about any of them, don't hesitate to ask. We're here to
help! This is simply a reminder of what we are going to look for before merging
your code.
CONTRIBUTING
guidelines.
works.
Further Comments
Benchmark
Results below were generated using the benchmarking script added by this PR in
tools/bench
:Timings
Run time compared using hyperfine
Indexing
main
perf/index
Pseudonymization
main
perf/index
Memory
Heap memory usage compared using heaptrack
Indexing
main: peak heap memory consumption: 520.13M
perf/index: peak heap memory consumption: 344.50M
Pseudonymization
main: peak heap memory consumption: 1.02G
perf/index: peak heap memory consumption: 2.11G
The initial memory peak appears to be caused be deserialization via serde_yml
Once the whole index is loaded, we plateau at 265MB which is great!