feat(cl_back_scrape_citations): command to scrape citations #4303
Conversation
- Remove "method" argument - Log errors instead of returning the error message - Return the cleaned up content using site.cleanup_content - Update tests - Update opinions and oral arguments scraper caller to reflect changes
We can test this by cloning a few …

**Citations to pre-populate**

**Commands to run the test** (a hypothetical sketch follows)
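The collapsed command block was not preserved in this page. Purely as a hypothetical reconstruction, assuming Django's `call_command` and a `--courts` option like the one the scraper commands take, the run might have looked roughly like this; the juriscraper module path is an assumption chosen to match the `court_id = 'md'` filter in the verification query below.

```python
# Hypothetical reconstruction of the collapsed "Commands to run the test"
# block. Command names come from this PR and the repo; the --courts value
# is an assumed juriscraper module path.
from django.core.management import call_command

COURT = "juriscraper.opinions.united_states.state.md"  # assumed; matches court_id = 'md'

# Scrape the court normally first, so clusters exist to match citations against...
call_command("cl_scrape_opinions", "--courts", COURT)
# ...then run the new command to back-scrape citations for the same court.
call_command("cl_back_scrape_citations", "--courts", COURT)
```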
**Query to check everything went OK** (assumes you are running the query immediately after the scrapers; otherwise, change the interval hours to a bigger value):

```sql
select * from search_citation where cluster_id in (
    select id from search_opinioncluster where docket_id in (
        select id from search_docket
        where court_id = 'md'
          and date_created > (now() - interval '6 hours')
    )
);
```

You should see something like this: the cluster_id should be small for new documents and big for "updated" citations. In this case we have a single big cluster_id, since we only downloaded one cluster where the hash will match.
@mlissner can you review this? When writing the tests I realized that we don't have a backscraper for …
Looks good. A few thoughts for you, but this will be very nice to have.
Reword docstrings, catch exceptions and refactor code following code review
Changes so far look good. Just need that last refactor.
- create cl.scrapers.exceptions file to hold exceptions raised when ingesting a single case
- use the exceptions to bubble errors up to the main loop, to avoid returning break / continue flags (see the sketch after this list)
- refactor DupChecker to raise errors
- refactor get_binary_content to raise errors
- refactor cl_scrape_oral_arguments to the new paradigm
- cl_back_scrape_citations can now re-scrape a single case without re-downloading the binary content or manipulating the site object
- adapted DupChecker and ContentType tests to the changes
- refactored logger calls to use lazy formatting
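A minimal sketch of that control flow, assuming an exception hierarchy along the lines the commit message describes; the real classes live in cl/scrapers/exceptions.py and may differ in name and detail.

```python
# Sketch of exception-based control flow: per-case errors bubble up to the
# main loop instead of being returned as break/continue flags.
import logging

logger = logging.getLogger(__name__)


class SingleCaseError(Exception):
    """Assumed base class for errors raised while ingesting a single case."""


class ConsecutiveDuplicatesError(SingleCaseError):
    """Too many duplicates in a row: the whole crawl should stop."""


def ingest_case(case):
    """Placeholder for the per-case pipeline (DupChecker, get_binary_content, ...)."""


def scrape_court(site):
    for case in site:
        try:
            ingest_case(case)
        except ConsecutiveDuplicatesError:
            break  # the loop decides to stop; no returned "break" flag
        except SingleCaseError as error:
            # Lazy %-style formatting, matching the logger refactor above.
            logger.error("Failed to ingest case: %s", error)
            continue  # skip just this case
```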
@mlissner can you review this again? I updated the PR with the requested changes.
A few thoughts for you. Looks like a nice refactor though.
Very nice. I've updated to HEAD and set for auto-merge. Thank you!
I am using the "dismiss review" button, since I understand this PR has been approved, but @mlissner didn't clear the "changes requested" status and merging is still blocked.
**Suspect Issues**: This pull request was deployed and Sentry observed the following issues:
Related to preventing further duplicates as seen on freelawproject#4376, due to changes introduced in freelawproject#4303.
- Refactor tests for the DupChecker.press_on method: replace fixtures, loops, and if clauses with explicit test objects and explicit press_on calls for each scenario (a toy illustration follows).
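To illustrate the test style this describes, here is a self-contained toy, not the real API: a stand-in DupChecker whose press_on raises after consecutive duplicates (per the refactor above), exercised by one explicit object and one explicit call per scenario. The real DupChecker in cl/scrapers/DupChecker.py has a richer signature.

```python
# Self-contained toy illustrating "explicit objects, explicit calls" tests;
# the class below is a stand-in, not CourtListener's DupChecker.
import unittest


class ConsecutiveDuplicatesError(Exception):
    """Stand-in for the exception added in this PR."""


class ToyDupChecker:
    """Remembers hashes; raises after too many consecutive duplicates."""

    def __init__(self, dup_threshold=2):
        self.seen = set()
        self.dup_count = 0
        self.dup_threshold = dup_threshold

    def press_on(self, sha1):
        if sha1 in self.seen:
            self.dup_count += 1
            if self.dup_count >= self.dup_threshold:
                raise ConsecutiveDuplicatesError(sha1)
        else:
            self.dup_count = 0
            self.seen.add(sha1)


class PressOnTest(unittest.TestCase):
    # One explicit object and one explicit call per scenario -- no fixtures,
    # loops, or if clauses shared across scenarios.
    def test_new_hash_presses_on(self):
        ToyDupChecker().press_on("fresh-sha1")  # should not raise

    def test_consecutive_duplicates_raise(self):
        checker = ToyDupChecker(dup_threshold=2)
        checker.press_on("dup-sha1")
        checker.press_on("dup-sha1")  # first duplicate: tolerated
        with self.assertRaises(ConsecutiveDuplicatesError):
            checker.press_on("dup-sha1")  # second consecutive duplicate: stop


if __name__ == "__main__":
    unittest.main()
```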
Refactor `get_binary_content` to be re-used by the new command and to reduce copy-pasted code. Applies to a specific family of court sites described here: freelawproject/juriscraper#858 (comment)