
feat(pacer): add command to fetch docs filtered by page count from PACER #4901

Open · wants to merge 44 commits into base: main from 4839-known-big-docs-retrieval
Conversation

Contributor

@elisa-a-v elisa-a-v commented Jan 8, 2025

Closes #4839

This PR implements a Django command to fetch docs from PACER within a given page_count range.

The command is split into two stages, so instead of processing all tasks for each doc in a single chain—like the do_pacer_fetch method does—we first fetch docs from PACER without processing them. That means executing only the first task (fetch_pacer_doc_by_rd), which is the only one that interacts with the PACER API. This gives us more control to avoid hitting the same court too often, because the processing stage takes up most of the execution time for a given doc, and that time can range from a few seconds to several minutes.

To make sure we don't hit the same court too often, we keep track of the FQs in progress per court, and only try to fetch the next document in the same court if two conditions are met:

  1. the previous FQ in that court has been completed.
  2. enough time has elapsed since its completion (defaults to 2 seconds).

If these conditions are not met, we skip that court in that round and try again the next round. This is retried up to max_retries times (defaults to 5); if the FQ is still not complete after all those checks, we store it in cache as "timed out" and continue with the next doc in said court.
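The per-court skip check described above can be sketched roughly as follows. This is a minimal illustration only: the function name, the bookkeeping structure, and the tuple layout are assumptions, not the PR's actual code.

```python
import time


def should_skip(fetches_in_progress, court_id, interval=2.0, now=None):
    """Return True if this court should be skipped in the current round.

    `fetches_in_progress` maps court_id -> (completed, date_completed),
    where `date_completed` is a Unix timestamp or None.
    """
    now = now if now is not None else time.time()
    previous = fetches_in_progress.get(court_id)
    if previous is None:
        return False  # No FQ pending for this court; safe to fetch.
    completed, date_completed = previous
    if not completed:
        return True  # Condition 1 failed: previous FQ not completed.
    if date_completed is not None and now - date_completed < interval:
        return True  # Condition 2 failed: not enough time elapsed.
    return False
```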

During the fetching stage, we keep track of the fetched docs in cache so we know which ones to process in the second stage, and so we avoid fetching them again in case we need to rerun the first stage for some reason, like an interruption of any sort.

After all docs have been fetched, we run the second stage and process all the fetched docs without any external rate limit. We know which docs to process by checking the cache.

Separating the command into two stages also gives us a chance to verify that all docs were fetched correctly before processing them, to fix anything that needs fixing in case of errors, and to figure out what happened to the timed-out FQs.

…etch

- introduces new build_pdf_retrieval_task_chain method
- refactors do_pacer_fetch to now use that new method instead
Instead of processing all tasks for each doc in a single chain,
now we first fetch docs from PACER without processing them. This
gives us more control to avoid hitting the same court too often.
After all docs have been fetched, we process them all without any
external rate limit.
…gress

A local variable was being used and passed through several methods
as arguments to keep track of the last fetch queue checked per court.
We now use the instance attribute instead.
…e in cache

We first check if the key exists and use the previous value if it does,
otherwise the value is initialized with an empty list.
We then append the value to that key-value pair.

The logic is abstracted from update_cached_docs_to_process which now
uses this new helper function.
The pacer_bulk_fetch command can now be run in two distinct stages,
one at a time, using the --stage arg.

This gives us more control over the command execution, and allows
us time to check and fix any issues when fetching docs before
beginning the processing stage execution.
@elisa-a-v elisa-a-v force-pushed the 4839-known-big-docs-retrieval branch from 1172ba6 to a96d9fe on January 31, 2025 02:48
@elisa-a-v elisa-a-v marked this pull request as ready for review January 31, 2025 20:44
@elisa-a-v elisa-a-v requested a review from mlissner January 31, 2025 20:49
Member

@mlissner mlissner left a comment
Elisa — this looks good! I made a few smallish comments, but I think you've got the design right. A few other thoughts:

  1. I haven't checked the tests. I leave that to Alberto.
  2. I've only skimmed the code for architecture, and anything else that jumps out, but it looks about right.
  3. I think this is a bit overbuilt, and I highlighted a few places where we can just let it crash, simply, etc. (Not the worst thing!)
  4. It looks like there are throttles to prevent hitting the DB and Redis too much, but let's make sure we don't have a loop that hits them non-stop. I think I checked this, but not carefully enough to be sure.
  5. Using redis for this is helpful, but we don't want to make objects in redis that last forever and that exist long after we've run this code. Can you add timeouts to them? Even if they're very long, we want that memory back eventually.

Looks good otherwise. Moby Dick would be proud, I think you caught your white whale.


def setup_celery(self) -> None:
    """Setup Celery by setting the queue_name and throttle."""
    self.queue_name = self.options.get("queue_name", "pacer_bulk_fetch")
Workers won't do tasks in queues they're not configured for, so if you run this code and we don't have a queue named pacer_bulk_fetch, you'll create tasks that never get processed. We have a handful of queues set up that you can use called batch0 to batch5.

But that aside, what you're doing here is setting a default value for a command arg, which should be done up in argparse instead.
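For illustration, the default set in argparse rather than at the call site (flag name from the PR, "batch0" per the queue note above; a plain parser is used here, since Django's add_arguments receives an argparse parser):

```python
import argparse

# Django management commands hand an ArgumentParser to add_arguments();
# a bare parser demonstrates the same idea.
parser = argparse.ArgumentParser()
parser.add_argument("--queue-name", type=str, default="batch0")

args = parser.parse_args([])  # No flag passed: the default applies.
```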

"""Run only the fetching stage."""
logger.info("Starting fetch stage in pacer_bulk_fetch command.")
try:
self.set_user(self.options.get("username", "recap"))
Again, the default should come from argparse.

Comment on lines 131 to 134
if not username:
    raise CommandError(
        "No username provided, cannot create PacerFetchQueues."
    )

If we have a default, then we shouldn't need this code.

Comment on lines 135 to 138
try:
    self.user = User.objects.get(username=username)
except User.DoesNotExist:
    raise CommandError(f"User {username} does not exist")

I think you can just let it crash at this point, honestly, which makes this entire method unneeded.

Comment on lines 142 to 149
filters = [
    Q(pacer_doc_id__isnull=False),
    Q(is_available=False),
]
if self.options.get("min_page_count"):
    filters.append(Q(page_count__gte=self.options["min_page_count"]))
if self.options.get("max_page_count"):
    filters.append(Q(page_count__lte=self.options["max_page_count"]))

If you use argparse to make min_page_count a required parameter and to default max_page_count to 10_000, you can get rid of Q altogether and just build the query the same way every time.
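With min_page_count required and max_page_count defaulted in argparse, the filter can indeed be built the same way every time. A sketch (field names from the snippet above; the helper name is made up for illustration):

```python
def build_page_count_filters(options):
    """Build plain filter kwargs; no Q objects are needed once both
    page-count bounds are always present."""
    return {
        "pacer_doc_id__isnull": False,
        "is_available": False,
        "page_count__gte": options["min_page_count"],
        "page_count__lte": options["max_page_count"],
    }

# Usage sketch: RECAPDocument.objects.filter(**build_page_count_filters(options))
```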


I was tempted to ask for max_page_count to be removed under YAGNI, but I think we will need it when we do shorter docs next.

# Only try again with those that were timed out before:
cached_timed_out = cache.get(self.timed_out_docs_cache_key(), [])
previously_timed_out = [rd_pk for (rd_pk, _) in cached_timed_out]
redundant = set(previously_fetched) - set(previously_timed_out)

Perhaps ids_to_skip is more clear?

Suggested change
redundant = set(previously_fetched) - set(previously_timed_out)
ids_to_skip = set(previously_fetched) - set(previously_timed_out)

in cache so we know which ones to process in a later stage of
this command.
"""
self.throttle.maybe_wait()

I haven't completely thought this through, but it feels like you've got this double-throttled. If you remove this throttle completely, would it work just fine? This throttle is usually used to prevent the queue from growing out of control, but that shouldn't happen anyway, since we're only doing one court at a time.

Comment on lines 360 to 369
def handle_process_docs(self):
    """Run only the processing stage."""
    logger.info("Starting processing stage in pacer_bulk_fetch command.")
    try:
        self.process_docs_fetched()
    except Exception as e:
        logger.error(
            f"Fatal error in process stage: {str(e)}", exc_info=True
        )
        raise e

I don't think you're getting much from this method. Suggest merging with process_docs_fetched (or removing it and letting Sentry catch the error).

@mlissner mlissner assigned albertisfu and unassigned mlissner Jan 31, 2025
Contributor

@albertisfu albertisfu left a comment

Thanks, @elisa-a-v! This looks about right. I only found some issues that might need to be addressed, along with some additional suggestions.

def add_arguments(self, parser) -> None:
    parser.add_argument(
        "--interval",
        type=float,

As Mike suggested, it's better to move the defaults here.

Suggested change
type=float,
type=float,
default=2,

)
parser.add_argument(
"--min-page-count",
type=int,

Here you can add the required parameter for this argument.

Suggested change
type=int,
type=int,
required=True,

)
parser.add_argument(
"--username",
type=str,

Suggested change
type=str,
type=str,
default="recap",

)
parser.add_argument(
"--queue-name",
type=str,

Suggested change
type=str,
type=str,
default="batch0",

recap_document_id=rd_pk,
user_id=self.user.pk,
)
fetch_pacer_doc_by_rd.si(rd_pk, fq.pk).apply_async(

I found an issue related to this line, but it also affects another line. I'll describe the issue here and then provide a related comment below.

The main problem is that PacerFetchQueue is not being marked as successfully completed during the retrieval stage. This process is handled by mark_fq_successful, which is currently called only after the document extraction is finished in the second stage of the command.

This can cause the command to get stuck while waiting for the FQs to be marked as completed. Specifically, in this line:

if skipped_courts == courts_at_start:
    time.sleep(self.interval)

Since all FQs remain in progress, the command keeps waiting unnecessarily.

To prevent this, the FQ should be marked as completed immediately after retrieval in fetch_pacer_doc_by_rd.

I tested this approach:

chain(
    fetch_pacer_doc_by_rd.si(rd_pk, fq.pk).set(queue=self.queue_name),
    mark_fq_successful.si(fq.pk).set(queue=self.queue_name),
).apply_async()

This partially resolves the issue. However, marking the FQ as successful can still be delayed because both fetch_pacer_doc_by_rd and mark_fq_successful are processed by workers in the queue. Depending on worker availability, mark_fq_successful may take a few seconds to execute, leading to unnecessary retries or even exceeding the retry count, causing FQs to be incorrectly marked as timed out.

To fully resolve this, mark_fq_successful should be executed as part of the fetch_pacer_doc_by_rd task.

Currently, the mark_fq_successful task also sends webhooks for the Fetch API, so it makes sense to execute it as a secondary task. However, since webhooks are not required for this retrieval, it seems reasonable to update the FQ date_completed, status, and message directly within fetch_pacer_doc_by_rd.

We just need to ensure that this logic is not duplicated for regular FQ processing, which will continue relying on mark_fq_successful. One approach could be to introduce a new argument in fetch_pacer_doc_by_rd to indicate whether the successful FQ status should be updated within the task. Alternatively, we could move this logic entirely into fetch_pacer_doc_by_rd for all processes using it and simply split the webhook sending functionality into a secondary task, whichever approach you think is best.

This should resolve the issue of incorrectly waiting for an FQ to finish when it was actually completed earlier.
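One possible shape of the flag-based approach, sketched without Celery or the ORM. Everything here is a stand-in for illustration: FetchQueue is a plain object, the status constant is assumed, and the flag name is made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

PROCESSING_SUCCESSFUL = 2  # Stand-in for the real status constant.


@dataclass
class FetchQueue:
    pk: int
    status: int = 0
    date_completed: Optional[datetime] = None
    message: str = ""


def fetch_pacer_doc_by_rd(rd_pk, fq, mark_fq_successful_inline=False):
    # ... fetch the document from PACER here ...
    if mark_fq_successful_inline:
        # Update the FQ in-task so the bulk command sees completion
        # immediately, instead of waiting on a separately queued
        # mark_fq_successful task (which also sends webhooks that this
        # bulk retrieval does not need).
        fq.status = PROCESSING_SUCCESSFUL
        fq.date_completed = datetime.now(timezone.utc)
        fq.message = "Successfully completed fetch."
    return rd_pk
```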

Comment on lines 45 to 49
0: [DocketEntryFactory(docket=cls.dockets[0]) for _ in range(300)],
1: [DocketEntryFactory(docket=cls.dockets[1]) for _ in range(8)],
2: [DocketEntryFactory(docket=cls.dockets[2]) for _ in range(20)],
3: [DocketEntryFactory(docket=cls.dockets[3]) for _ in range(15)],
4: [DocketEntryFactory(docket=cls.dockets[4]) for _ in range(10)],

Is it possible to use fewer factories for these tests? I think we might be able to assert the same behavior with just a few of them. For example, instead of creating 300 for DocketEntry 0, we could create just 10, and so on.

Comment on lines 111 to 136
@patch(
"cl.search.management.commands.pacer_bulk_fetch.Command.should_skip",
return_value=False,
)
@patch(
"cl.search.management.commands.pacer_bulk_fetch.Command.docs_to_process_cache_key",
return_value="pacer_bulk_fetch.test_page_count_filtering.docs_to_process",
)
@patch(
"cl.search.management.commands.pacer_bulk_fetch.Command.timed_out_docs_cache_key",
return_value="pacer_bulk_fetch.test_page_count_filtering.timed_out_docs",
)
@patch(
"cl.search.management.commands.pacer_bulk_fetch.append_value_in_cache",
wraps=append_value_in_cache,
)
@patch(
"cl.search.management.commands.pacer_bulk_fetch.CeleryThrottle.maybe_wait"
)
@patch("cl.search.management.commands.pacer_bulk_fetch.time.sleep")
@patch(
"cl.search.management.commands.pacer_bulk_fetch.get_or_cache_pacer_cookies"
)
@patch(
"cl.search.management.commands.pacer_bulk_fetch.fetch_pacer_doc_by_rd.si"
)

I see that many of these mocks are shared across different tests in the class. You can move these common mocks (as long as they don’t require a specific return_value or method unique to a test) to the class level. This would help reduce duplicated code across test methods.
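mock.patch applied as a class decorator wraps every test method in the class, which is one way to hoist the shared mocks. A generic, self-contained illustration (not the PR's actual tests or patch targets):

```python
import time
import unittest
from unittest import mock


@mock.patch("time.sleep")  # Shared by every test_* method in the class.
class ExampleTests(unittest.TestCase):
    # Class-level patches are passed into each test method, just like
    # method-level decorators, so per-test duplication goes away.
    def test_sleep_is_mocked(self, mock_sleep):
        time.sleep(10)  # Calls the mock, so no real delay happens.
        mock_sleep.assert_called_once_with(10)
```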

mock_fetched_cache_key,
mock_should_skip,
):
"""Test that rate limiting triggers sleep calls (mocked)."""

What is this test intended to check? Is it different from test_page_count_filtering?

},
{
"name": "enough time passed",
"time_elapsed": True,

Is it possible to test the actual enough_time_elapsed method in this test instead of just mocking its return value as True? I think this would be helpful in catching errors like the one mentioned in the previous comment related to enough_time_elapsed; time_machine can be useful for this.
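For illustration, the same idea using only the stdlib: control the clock rather than stubbing the method's return value. The enough_time_elapsed shape below is an assumption, and time_machine would replace the time.time patching in the real tests:

```python
import time
from unittest import mock


def enough_time_elapsed(date_completed, interval=2.0):
    """Assumed shape of the method under test: has `interval` seconds
    passed since `date_completed` (a Unix timestamp)?"""
    return time.time() - date_completed >= interval


# Drive the clock instead of mocking the method itself:
with mock.patch("time.time", return_value=100.0):
    assert enough_time_elapsed(99.5) is False  # Only 0.5s elapsed.
with mock.patch("time.time", return_value=103.0):
    assert enough_time_elapsed(99.5) is True  # 3.5s elapsed.
```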


self.command.fetch_docs_from_pacer()

mock_sleep.assert_not_called()

Could you add a test case for the opposite scenario, where sleep should be called?

@albertisfu albertisfu assigned elisa-a-v and unassigned albertisfu Feb 5, 2025
@mlissner mlissner assigned albertisfu and unassigned elisa-a-v Feb 10, 2025
@patch("cl.recap.tasks.get_pacer_cookie_from_cache")
@patch(
"cl.recap.tasks.is_pacer_court_accessible",
side_effect=lambda a: True,

Semgrep identified an issue in your code:

return only makes sense inside a function

To ignore this finding, leave a nosemgrep comment directly above or at the end of line 522 (# nosemgrep: python.lang.maintainability.return.return-not-in-function), after validating that this is not a true positive.

Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Make a script to get the docs we already know are >= 1000 pages long
3 participants