compute topic similarity more efficiently? #38
Comments
What about iteratively querying the data instead: when done with the first window, drop the first year and only load the one additional year needed, and so on?
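That incremental approach could look roughly like this. A minimal sketch only: `load_year` is a hypothetical stand-in for the project's actual database query, and `WINDOW = 5` mirrors the ±5-year window mentioned below.

```python
from collections import deque

WINDOW = 5  # years on each side of the focal graduation year

def load_year(year):
    # Hypothetical loader; in the real project this would query
    # one year of topic data from the database.
    return {"year": year}

def iterate_windows(first_year, last_year):
    """Slide a +/-WINDOW-year window over graduation years.

    After the first window is loaded, each step drops the oldest
    year and loads exactly one new year, instead of reloading the
    full 11-year window from scratch.
    """
    window = deque(
        load_year(y) for y in range(first_year - WINDOW, first_year + WINDOW + 1)
    )
    for grad_year in range(first_year, last_year + 1):
        yield grad_year, list(window)
        # advance the window: drop the first year, load one new year
        window.popleft()
        window.append(load_year(grad_year + 1 + WINDOW))
```

For each graduation year after the first, this loads one year of data rather than eleven, at the cost of keeping the window in memory across iterations.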
That is an option, although it would require some rewriting, and the parallelization would "only" be over fields of study. An alternative is DuckDB (#39).
Ah yes, I didn't have the parallelization in mind. But since the results are stored, is there a big benefit to doing it more efficiently?

On 2023-03-09 at 20:07, f-hafner wrote:
> That is an option, although it would require some rewriting, and the parallelization would "only" be over fields of study. An alternative is DuckDB (#39).
Because we calculate the similarities for all graduate–potential-employer pairs, this step only comes after linking. So when we update the linking, we need to rerun the similarity calculations as well. It's not urgent and not very important, but it crossed my mind and I wanted to keep it as an issue for the moment.
Currently, we iterate over each graduation year, and for each iteration we load a window of data ±5 years into memory. If we compute the similarity for two or more neighboring graduation years at once, we only have to load one additional year of data per extra graduation year. This could speed up the calculations. The trade-off is that it needs more memory.
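To make the trade-off concrete, here is a small sketch of how many years of data a batch of consecutive graduation years requires under a ±5-year window (`WINDOW = 5` is taken from the description above; the counts are illustrative, not the project's code):

```python
WINDOW = 5  # years on each side of a graduation year

def years_needed(batch):
    """Years of data required to cover a batch of consecutive
    graduation years, each with a +/-WINDOW window."""
    return list(range(min(batch) - WINDOW, max(batch) + WINDOW + 1))

# One graduation year at a time: 11 years of data per iteration.
single = years_needed([1990])

# Two neighboring years in one batch: 12 years, not 2 * 11 = 22,
# since the windows overlap almost entirely.
pair = years_needed([1990, 1991])
```

So batching grows the in-memory window by only one year per extra graduation year, while roughly halving (or better) the total data loaded across iterations.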