feat(pacer): add command to fetch docs filtered by page count from PACER #4901
base: main
Is there a reason for collecting logs into a file for this command? We usually monitor logs via Loki in Grafana, so this might not be necessary?
Removed this method, since we don't need to store logs in a file.
Workers won't do tasks in queues they're not configured for, so if you run this code and we don't have a queue named `pacer_bulk_fetch`, you'll create tasks that never get processed. We have a handful of queues set up that you can use, called `batch0` to `batch5`. But that aside, what you're doing here is setting a default value for a command arg, which should be done up in argparse instead.
Solved, now it defaults to `batch0`.
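For reference, a minimal sketch of what setting that default at the argparse level might look like; only the `batch0` default comes from this thread, while the `--queue` flag name and the surrounding command skeleton are assumptions:

```python
# Minimal sketch; only the "batch0" default is taken from the discussion above,
# the flag name and command skeleton are illustrative.
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Fetch documents from PACER, filtered by page count."

    def add_arguments(self, parser):
        parser.add_argument(
            "--queue",
            type=str,
            default="batch0",
            help="Celery queue to enqueue tasks into (one of the existing batch0-batch5 queues).",
        )

    def handle(self, *args, **options):
        queue = options["queue"]  # argparse supplies the default, so no fallback logic is needed here
```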
If we have a default, then we shouldn't need this code.
Yeah, I agree. This code has been removed.
I think you can just let it crash at this point, honestly, which makes this entire method unneeded.
Sure, method removed.
If you use argparse to make `min_page_count` a required parameter and to default `max_page_count` to `10_000`, you can get rid of `Q` altogether and just build the query the same way every time.
I was tempted to ask for `max_page_count` to be removed under YAGNI, but I think we will need it when we do shorter docs next.
`min_page_count` is now a required parameter, but only for `--stage fetch`, since it is not needed for the extraction stage. `--max-page-count` defaults to `10_000`.
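A hedged sketch of that arrangement; the argument names and the `10_000` default come from the thread, while the stage choices, the required-only-for-fetch check, and the filter shape are assumptions:

```python
# Hedged sketch; only the argument names and the 10_000 default come from the thread.
from django.core.management.base import BaseCommand, CommandError


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument("--stage", choices=["fetch", "extract"], required=True)
        parser.add_argument(
            "--min-page-count",
            type=int,
            help="Only fetch documents with at least this many pages (required for --stage fetch).",
        )
        parser.add_argument(
            "--max-page-count",
            type=int,
            default=10_000,
            help="Only fetch documents with at most this many pages.",
        )

    def handle(self, *args, **options):
        if options["stage"] == "fetch" and options["min_page_count"] is None:
            raise CommandError("--min-page-count is required when --stage is fetch.")
        # With both bounds always present, the page-count filter can be built the
        # same way every time, with no conditional Q composition:
        page_filters = {
            "page_count__gte": options["min_page_count"],
            "page_count__lte": options["max_page_count"],
        }
```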
Perhaps `ids_to_skip` is more clear?
Changed.
I haven't completely thought this through, but it feels like you've got this double-throttled. If you remove this throttle completely, would it work just fine? This throttle is usually used to prevent the queue from growing out of control, but that shouldn't happen anyway, since we're only doing one court at a time.
I think this is fine to keep. Even though we're processing one court at a time, if the documents belong to many courts (say around 200), it would still be possible to schedule 200 tasks at once. Using `throttle.maybe_wait()` limits this to a maximum of 100 enqueued tasks.
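For context, a hedged sketch of the throttled enqueue loop being discussed; `CeleryThrottle`'s constructor arguments, the task signature, and the loop variables are assumptions rather than code from this PR:

```python
# Hedged sketch; constructor arguments, task signature, and loop variables are assumed.
from cl.lib.celery_utils import CeleryThrottle
from cl.recap.tasks import fetch_pacer_doc_by_rd

queue = "batch0"
throttle = CeleryThrottle(queue_name=queue)  # caps how many unprocessed tasks may sit in the queue
for rd_pk, fq_pk in docs_to_fetch:  # docs_to_fetch: (RECAPDocument pk, PacerFetchQueue pk) pairs
    # Blocks when the queue is full, so even documents spread across ~200 courts
    # can't pile up past the throttle's limit (about 100 enqueued tasks).
    throttle.maybe_wait()
    fetch_pacer_doc_by_rd.si(rd_pk, fq_pk).apply_async(queue=queue)
```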
I found an issue related to this line, but it also affects another line. I'll describe the issue here and then provide a related comment below.
The main problem is that `PacerFetchQueue` is not being marked as successfully completed during the retrieval stage. This process is handled by `mark_fq_successful`, which is currently called only after the document extraction is finished in the second stage of the command.

This can cause the command to get stuck while waiting for the FQs to be marked as completed. Specifically, in this line:

Since all FQs remain in progress, the command keeps waiting unnecessarily.

To prevent this, the FQ should be marked as completed immediately after retrieval in `fetch_pacer_doc_by_rd`. I tested this approach:

This partially resolves the issue. However, marking the FQ as successful can still be delayed, because both `fetch_pacer_doc_by_rd` and `mark_fq_successful` are processed by workers in the queue. Depending on worker availability, `mark_fq_successful` may take a few seconds to execute, leading to unnecessary retries or even exceeding the retry count, causing FQs to be incorrectly marked as timed out.

To fully resolve this, `mark_fq_successful` should be executed as part of the `fetch_pacer_doc_by_rd` task.

Currently, the `mark_fq_successful` task also sends webhooks for the Fetch API, so it makes sense to execute it as a secondary task. However, since webhooks are not required for this retrieval, it seems reasonable to update the FQ `date_completed`, `status`, and `message` directly within `fetch_pacer_doc_by_rd`.

We just need to ensure that this logic is not duplicated for regular FQ processing, which will continue relying on `mark_fq_successful`. One approach could be to introduce a new argument in `fetch_pacer_doc_by_rd` to indicate whether the successful FQ status should be updated within the task. Alternatively, we could move this logic entirely into `fetch_pacer_doc_by_rd` for all processes using it and simply split the webhook sending functionality into a secondary task, whichever approach you think is best.

This should resolve the issue of incorrectly waiting for an FQ to finish when it was actually completed earlier.
This has been resolved by moving the original `fetch_pacer_doc_by_rd` code to `fetch_pacer_doc_by_rd_base`. Now, `fetch_pacer_doc_by_rd` simply acts as a Celery task wrapper, retaining the exact same arguments and behavior as before.

For the scope of this command, a new Celery task, `fetch_pacer_doc_by_rd_and_mark_fq_completed`, has been introduced. It essentially retains the same functionality as `fetch_pacer_doc_by_rd`, with the addition that it also marks the FQ as successful if the retrieval completes successfully, within the same Celery task. This resolves the issue described here.
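A hedged sketch of the split described above; only the three task/function names come from this thread, while the decorators, signatures, and the helper are illustrative assumptions:

```python
# Hedged sketch; only the three function names below are taken from the discussion.
from cl.celery_init import app  # assumed location of the project's Celery app


def update_fq_status_to_successful(fq_pk):
    """Hypothetical helper: set date_completed, status, and message on the FQ,
    without sending the Fetch API webhooks that mark_fq_successful also sends."""
    ...


def fetch_pacer_doc_by_rd_base(self, rd_pk, fq_pk, magic_number=None):
    """The original retrieval logic, extracted so both tasks below can share it."""
    ...


@app.task(bind=True)
def fetch_pacer_doc_by_rd(self, rd_pk, fq_pk, magic_number=None):
    """Thin Celery wrapper; same arguments and behavior as before the refactor."""
    return fetch_pacer_doc_by_rd_base(self, rd_pk, fq_pk, magic_number)


@app.task(bind=True)
def fetch_pacer_doc_by_rd_and_mark_fq_completed(self, rd_pk, fq_pk, magic_number=None):
    """Used by the bulk-fetch command: retrieve the document, then mark the FQ
    successful inside the same task, so the command never waits on a second task."""
    result = fetch_pacer_doc_by_rd_base(self, rd_pk, fq_pk, magic_number)
    update_fq_status_to_successful(fq_pk)
    return result
```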
As I understand it, this method checks whether the time elapsed since the FQ was marked as completed is more than `self.interval`, right? If so, I think the comparison operator here should be `>`?

Otherwise, if the FQ takes longer to complete than `self.interval`, this could cause the process to get stuck while checking this FQ.
Yeah, the operator was inverted. It's fixed now.
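A hedged sketch of the corrected comparison, assuming `self.interval` is a number of seconds and `date_completed` is a timezone-aware datetime; the real method may differ in its details:

```python
# Hedged sketch of the corrected check; assumes self.interval is in seconds.
from datetime import timedelta

from django.utils import timezone


def enough_time_elapsed(self, date_completed) -> bool:
    """True once more than self.interval seconds have passed since the FQ was marked
    completed. Using ">" means a slow FQ can't wedge the polling loop."""
    return timezone.now() - date_completed > timedelta(seconds=self.interval)
```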
While debugging the issue described in a previous comment regarding `mark_fq_successful`, I noticed one of the reasons the command got stuck in a loop was when this condition was met:

I wondered why this was happening, since there is logic above to abort an FQ in `fetches_in_progress` if `retry_count >= self.max_retries`. However, that was not occurring.

The issue is that the retry count is only updated within `update_fetches_in_progress`, which only seems to set the retry count to 1 for the FQ that was just scheduled. However, it is also necessary to update the `retry_count` for an FQ that remains in progress but has not yet been completed.

I think this logic could be added here (I tested it, and it worked), unless you identify a better place to implement it. One consideration is that retries should only be increased if `fq_in_progress = True` and not when `self.enough_time_elapsed(date_completed) = False`, because in the latter case, the FQ has already been marked as completed and just needs more time to satisfy the `enough_time_elapsed` condition.
This is now resolved. The `retry_count` is now updated within the new `process_skip_reason` method, but only if the previously computed exponential backoff time has elapsed.
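A hedged sketch of how that retry accounting might look; the `process_skip_reason` name and the backoff idea come from the thread, but the bookkeeping structure and the formula are assumptions:

```python
# Hedged sketch; the bookkeeping shape and backoff formula are assumed.
from datetime import timedelta

from django.utils import timezone


def process_skip_reason(self, court_id, fq_in_progress, date_completed):
    """Bump retry_count only for FQs that are still in progress, and only once their
    exponential backoff window has elapsed. Completed FQs that are merely waiting out
    enough_time_elapsed are left untouched."""
    if not fq_in_progress:
        return  # already completed; it just needs more time, not another retry
    entry = self.fetches_in_progress[court_id]  # assumed: {"fq_pk", "retry_count", "last_checked"}
    backoff = timedelta(seconds=self.interval * 2 ** entry["retry_count"])
    if timezone.now() - entry["last_checked"] > backoff:
        entry["retry_count"] += 1
        entry["last_checked"] = timezone.now()
```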
Another issue I noticed is that if the last FQ in a court is not marked as completed for some reason, it will not be added to `pacer_bulk_fetch.timed_out_docs` in Redis. This happens because once it is removed from `fetches_in_progress`, it is no longer checked in `should_skip`.

The `should_skip` method is responsible for verifying whether the FQ has been completed or, if it exceeds the retry limit, adding it to `pacer_bulk_fetch.timed_out_docs`. Since the FQ is removed before this check happens, it never gets added to `timed_out_docs`.

I think a possible solution is to confirm that the FQ has been completed before removing it from `fetches_in_progress`.
This is now handled within the new `cleanup_finished_court` method. An FQ is only removed from `remaining_courts` and `fetches_in_progress` if it has been completed.
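A hedged sketch of that cleanup check; `cleanup_finished_court`, `remaining_courts`, and `fetches_in_progress` are named in the thread, but their shapes and the completion test are assumptions:

```python
# Hedged sketch; the bookkeeping shapes and the completion test are assumed.
from cl.recap.models import PacerFetchQueue


def cleanup_finished_court(self, court_id):
    """Drop a court from the bookkeeping only once its in-flight FQ has actually
    completed, so a straggler can still be caught by should_skip and, if it times
    out, recorded in pacer_bulk_fetch.timed_out_docs."""
    entry = self.fetches_in_progress.get(court_id)
    if entry is None:
        return
    fq = PacerFetchQueue.objects.get(pk=entry["fq_pk"])
    if fq.date_completed is not None:
        self.remaining_courts.pop(court_id, None)
        self.fetches_in_progress.pop(court_id, None)
```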
This should become, since `mark_fq_successful` should happen in the retrieval stage.

Also, it would be a good idea to add a logger below this line to indicate which extraction document ID has been scheduled. Additionally, logging the progress of the tasks, such as the number of tasks scheduled and the number remaining to be scheduled, would be helpful.
Updated and added a progress logger.
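A hedged sketch of what that progress logging might look like in the extraction loop; the extraction task import, the queue name, and the loop variables are illustrative assumptions:

```python
# Hedged sketch; the extraction task, queue name, and loop variables are assumed.
import logging

from cl.scrapers.tasks import extract_recap_pdf  # assumed extraction task

logger = logging.getLogger(__name__)

queue = "batch0"
total = len(rd_pks_to_extract)  # rd_pks_to_extract: document ids queued for extraction
for idx, rd_pk in enumerate(rd_pks_to_extract, start=1):
    extract_recap_pdf.si(rd_pk).apply_async(queue=queue)
    logger.info(
        "Scheduled extraction for RECAPDocument %s (%s/%s scheduled, %s remaining).",
        rd_pk,
        idx,
        total,
        total - idx,
    )
```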
Again, the default should come from argparse.
Solved, now it defaults to `recap`.
I think we can omit the `try/except` block here, since it only logs an error message that might not be necessary. Instead, we can just let it crash, which would have almost the same effect as this block.
Removed the `try/except` block.
I don't think you're getting much from this method. Suggest merging it with `process_docs_fetched` (or removing it and letting Sentry catch the error).
I agree. This method has been removed.