Remove all time inefficient loops from mlbf #22861

Open · wants to merge 1 commit into master from remove-all-quadratics

Conversation

@KevinMind (Contributor) commented on Nov 15, 2024

Fixes: mozilla/addons#15170

Description

  • Extract the database querying logic from MLBFDatabaseLoader into module-scope functions for easier mocking.
  • Reduce the number of loops required to build the initial dataset in MLBFDatabaseLoader to one.
  • Filter not_blocked_items with an O(1) set lookup instead of a list.
  • Refactor generate_diffs to accept a block_type and only diff one set of data at a time.
  • Add an lru_cache to generate_diffs so the computation runs only once per instance + block_type + previous_mlbf (see the sketch after this list).
  • Collect all mlbf tests under the test_mlbf mark.
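
A minimal sketch of the memoization idea from the bullets above, with assumed names and a simplified diff body (not the actual mlbf.py code): functools.lru_cache keys on all of the hashable arguments, so decorating the method caches one result per (instance, block_type, previous) combination.

```python
from functools import lru_cache


class MLBFDataset:  # hypothetical stand-in for the real class
    def __init__(self, data):
        # data maps a block_type string to a list of "guid:version" items
        self.data = data

    @lru_cache(maxsize=None)  # cached per (self, block_type, previous)
    def generate_diffs(self, block_type, previous=None):
        previous_items = [] if previous is None else previous.data.get(block_type, [])
        current_items = self.data.get(block_type, [])
        previous_set = set(previous_items)
        current_set = set(current_items)
        added = [item for item in current_items if item not in previous_set]
        removed = [item for item in previous_items if item not in current_set]
        return added, removed
```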

Context

This patch includes logic that should reduce the number of long loops required to process our mlbf data sets and determine the cache and diff of blocks across different scenarios.
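
As a rough illustration of the single-pass idea (assumed data shapes and helper names, not the actual MLBFDatabaseLoader code): one loop over the blocked rows fills both the per-block-type data and the blocked id set, and that set then filters the not-blocked versions with O(1) membership checks.

```python
from collections import defaultdict


def build_dataset(blocked_rows, all_version_rows):
    # blocked_rows: iterable of (guid, version_string, version_id, block_type)
    # all_version_rows: iterable of (guid, version_string, version_id)
    blocked = defaultdict(list)
    blocked_version_ids = set()

    # One pass builds both structures instead of a separate loop per block type.
    for guid, version_string, version_id, block_type in blocked_rows:
        blocked_version_ids.add(version_id)
        blocked[block_type].append(f'{guid}:{version_string}')

    # Membership checks against a set are O(1) per item, so this stays linear.
    not_blocked = [
        f'{guid}:{version_string}'
        for guid, version_string, version_id in all_version_rows
        if version_id not in blocked_version_ids
    ]
    return blocked, not_blocked
```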

Testing

Run the mlbf suite

make test_mlbf

This will run all of the mlbf related tests. Expect all of them to pass.

Test the timeout for the slow test

  1. Modify the pytest.mark.timeout(120) marker, reducing the time to 3 (or some other low number), as sketched below.
  2. Run the test:
pytest ./src/olympia/blocklist/tests/test_cron.py::TestUploadToRemoteSettings::test_large_data_set
  3. Expect the test to fail due to the timeout:
FAILED src/olympia/blocklist/tests/test_cron.py::TestUploadToRemoteSettings::test_large_data_set - Failed: Timeout >10.0s
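
The change in step 1 is just lowering the value on the existing marker; roughly (the real test is a method on TestUploadToRemoteSettings, this only shows the shape of the edit):

```python
import pytest


@pytest.mark.timeout(3)  # lowered from 120 so the slow path trips the timeout
def test_large_data_set():
    ...
```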

Test in local environment

If you want to test with a real data set, you will have to create a large number of blocks and execute the cron job.

Code snippet

Execute this code. You will need to execute it many times, or bump the `1_000` up to between 1 and 2 million. Grab a coffee; this will take a while.
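
The snippet itself is collapsed in the PR; below is only a rough, hypothetical sketch of the kind of loop meant here. The factory helpers and model fields are assumptions based on olympia's test utilities and blocklist models, so adjust them to whatever your checkout actually exposes. Run it from `./manage.py shell`.

```python
# Hypothetical sketch only; not the snippet from the PR. Factory and model
# names are assumptions, so verify them against your local olympia checkout.
from olympia.amo.tests import addon_factory, user_factory
from olympia.blocklist.models import Block, BlockVersion

user = user_factory()
for _ in range(1_000):  # bump toward 1-2 million for a realistic data set
    addon = addon_factory()
    block = Block.objects.create(guid=addon.guid, updated_by=user)
    BlockVersion.objects.create(block=block, version=addon.current_version)
```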

After that is finished, execute `./manage.py cron upload_mlbf_to_remote_settings` and expect it to take up to 2 minutes to complete.

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

@KevinMind KevinMind force-pushed the remove-all-quadratics branch 2 times, most recently from dfd3c9c to 42483e3 Compare November 15, 2024 20:14
@KevinMind KevinMind marked this pull request as ready for review November 15, 2024 20:42
@KevinMind KevinMind force-pushed the remove-all-quadratics branch 2 times, most recently from b5495d2 to 64858dc Compare November 18, 2024 12:29
@KevinMind KevinMind marked this pull request as draft November 18, 2024 14:01
@KevinMind KevinMind marked this pull request as ready for review November 18, 2024 14:06
@willdurand willdurand requested review from a team and eviljeff and removed request for willdurand and a team November 19, 2024 09:59
@willdurand (Member) commented on Nov 19, 2024

Thanks for adding the new tests, that looks good at first glance. I'm going to focus on #22828 so I'll let someone else take care of this PR.

> Fixes: mozilla/addons#15170

I'd suggest filing a new issue for this; it could be about adding tests (and fixing more things revealed by the tests).

@eviljeff (Member) left a comment:

wdyt?

On @@ -21,15 +22,47 @@ (at `def ordered_diff_lists(`):

Member:

Can we just use sets instead? They're much more efficient for comparison, and, to me, sticking to sets everywhere would be more straightforward.

We only need a specific order when we're writing to JSON (because it doesn't support sets), so we could sort the data just before writing.
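
A tiny sketch of that suggestion (illustrative data only): keep sets while computing and sort only at the point of serialization, since JSON has no set type.

```python
import json

previous_set = {'a@guid:1.0', 'b@guid:2.0'}
current_set = {'b@guid:2.0', 'c@guid:3.0'}

added = current_set - previous_set
with open('diff.json', 'w') as f:
    json.dump(sorted(added), f)  # sorting happens only when writing out
```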

@KevinMind (Contributor, Author) replied:

I'm afraid ChatGPT and I disagree with you on this one. Here's the breakdown.

TL;DR: we save no time (we actually lose a little) using set difference. On top of that, it is less space efficient and, as you said, does not preserve order. I see no reason to prefer it on merit.

In the current implementation we have:

  • Building previous_set: This takes O(len(previous)) time.
  • List comprehension:
    • Iterates over current, which is O(len(current)).
    • The x not in previous_set check is O(1) per element due to the constant-time lookup in a set.

Total Time Complexity: O(len(current) + len(previous))
Space Complexity: O(len(previous)) for previous_set.

In the set-only option:

  • Building set(current): O(len(current))
  • Building set(previous): O(len(previous))
  • Set difference set(current) - set(previous):
    • Iterates over set(current), which is O(len(set(current))).
    • For each element, checks if it's in set(previous), which is O(1) per element.
  • Converting the set difference to a list: O(len(result))

Total Time Complexity: O(2 × len(current) + len(previous)), which simplifies to O(len(current) + len(previous))
Space Complexity: O(len(current) + len(previous)) for both sets.

Comparison:

  • Time Complexity: both approaches have the same time complexity of O(len(current) + len(previous)).
  • Space Complexity: the current implementation uses less additional space because it only builds one set (previous_set), whereas the set-only option builds two sets (set(current) and set(previous)).
  • Order Preservation: the current implementation preserves the order of elements in current, while the set-only option does not, because sets are unordered collections.
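
For reference, the two approaches being compared, sketched with assumed signatures (not the actual mlbf.py code):

```python
def ordered_diff(previous, current):
    # Current implementation: one set for O(1) membership checks,
    # preserving the order of `current`.
    previous_set = set(previous)
    return [item for item in current if item not in previous_set]


def set_only_diff(previous, current):
    # Set-only option: builds two sets; same asymptotic time,
    # more space, and the order of `current` is lost.
    return list(set(current) - set(previous))
```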

(Two further review threads on src/olympia/blocklist/mlbf.py are outdated and resolved.)
Comment on lines 180 to 182:

```python
for guid, version, id, block_type in get_all_blocked_items():
    _blocked_version_ids.append(id)
    _blocked[block_type].append((guid, version))
```

Member suggested this change:

```python
for guid, version_string, version_id, block_type in get_all_blocked_items():
    _blocked_version_ids.append(version_id)
    _blocked[block_type].append((guid, version_string))
```

Member:

Though I do wonder if it's more efficient to generate _blocked_version_ids and _blocked with list/set generator syntax, rather than loop through with a for loop and append one at a time.

@KevinMind (Contributor, Author) replied:

Well, this loop will be O(n) over all blocked items, and there are two O(1) appends per iteration, so it's still O(n).

Do you have an example of something that would be more efficient? I think this doesn't look as pretty, but it is actually the efficient way to go.
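
A small sketch of the trade-off being discussed (illustrative data shapes): a single loop fills both containers in one pass over the rows, while comprehensions need one pass per container (or the rows materialized up front).

```python
rows = [
    ('@guid-1', '1.0', 101, 'blocked'),
    ('@guid-2', '2.0', 202, 'soft_blocked'),
]

# Single pass, two O(1) appends per row: O(n) overall.
_blocked_version_ids = []
_blocked = {'blocked': [], 'soft_blocked': []}
for guid, version_string, version_id, block_type in rows:
    _blocked_version_ids.append(version_id)
    _blocked[block_type].append((guid, version_string))

# Comprehension alternative: still O(n) per container, but each one is a
# separate pass over the rows.
_blocked_version_ids_alt = [version_id for _, _, version_id, _ in rows]
```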

(Further resolved review threads on src/olympia/blocklist/mlbf.py and src/olympia/blocklist/tests/test_cron.py.)
Merging this pull request may close: [Bug]: upload_mlbf_to_remote_settings hangs on the diff