feat(lactapp_2): new scraper for Louisiana Court of Appeals Second Circuit #1299
base: main
Conversation
This is great, all parts working properly.
There are still some comments to address. Some of them have to do with our documentation not being precise enough to define the content of each field.
The date filtering in the backscraper will be useful for filtering by month, to avoid creating duplicates from different hashes when we hit our data cutoff from a merger.
@@ -205,7 +205,9 @@ tests/fixtures/cassettes/
### Other ###
# File created by Mac OS X
.DS_Store
# Devcontainer folder
What is this precisely for? What extension or app creates this file?
This folder contains a devcontainer.json file (used by Docker, WSL, and VS Code) to set up my development environment. I’ve added it to the .gitignore file since it’s not relevant to the project. Is there a better way to handle this?
"lower_courts": "Franklin",
"summaries": "Published.",
"case_name_shorts": "",
"authors": "COX, J"
Judge names should be cleaned up.
In this case, "COX, J" stands for "Cox, Judge"; we should delete the "J".
In "PITMAN, C.J", "C.J" stands for "Chief Judge".
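A minimal sketch of such a cleanup, assuming the title abbreviation always trails the surname after a comma (the helper name and regex are illustrative, not the PR's code):

```python
import re

def clean_judge_name(raw: str) -> str:
    """Drop a trailing judicial-title abbreviation such as ', J',
    ', J.', ', C.J' or ', C.J.' and normalize the casing,
    e.g. 'COX, J' -> 'Cox' and 'PITMAN, C.J' -> 'Pitman'."""
    name = re.sub(r",\s*(C\.?\s*)?J\.?$", "", raw.strip())
    return name.title()

print(clean_judge_name("COX, J"))       # Cox
print(clean_judge_name("PITMAN, C.J"))  # Pitman
```

Note that `.title()` can mis-case names like "O'NEAL"; a real implementation might prefer juriscraper's judge-string helpers instead.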
return

tables = self.html.cssselect("table#datatable")
if tables and tables[0].cssselect("tbody tr"):
you can use an early return here to save 1 indentation level
if not tables or not tables[0].cssselect("tbody tr"):
return
# Skip if before first opinion date
if case_date < self.first_opinion_date.date():
    continue

# Only apply date filtering during backscrape
if (
    hasattr(self, "back_scrape_iterable")
    and self.back_scrape_iterable
):
    if self.target_date:
        target_month = self.target_date.month
        target_year = self.target_date.year
        if (
            case_date.year != target_year
            or case_date.month != target_month
        ):
            continue
I think all of this should be abstracted out into a function
if skip_row_by_date(case_date):
continue
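One way such a helper could look; this is a sketch, with `FIRST_OPINION_DATE` standing in for the scraper's `first_opinion_date` attribute and `target_date` for the backscrape target:

```python
from datetime import date
from typing import Optional

FIRST_OPINION_DATE = date(2019, 7, 17)  # mirrors the scraper's class attribute

def skip_row_by_date(case_date: date, target_date: Optional[date]) -> bool:
    """Return True if a row should be skipped: it predates the court's
    first opinion, or (during a backscrape) falls outside the target month."""
    if case_date < FIRST_OPINION_DATE:
        return True
    if target_date and (case_date.year, case_date.month) != (
        target_date.year,
        target_date.month,
    ):
        return True
    return False

print(skip_row_by_date(date(2018, 1, 1), None))              # True
print(skip_row_by_date(date(2024, 3, 5), date(2024, 3, 1)))  # False
```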
"author": get_row_column_text(row, 4),
"disposition": get_row_column_text(row, 5),
"lower_court": get_row_column_text(row, 6),
"summary": get_row_column_text(row, 7),
This is not what we consider a "summary"; I would just ignore that column, unless you can find a row with an actual summary.
You can check the definition of fields (most times the juriscraper name matches the CL name, sometimes it is not so straightforward) at https://github.com/freelawproject/courtlistener/blob/main/cl/search/models.py
Ok, will check it, thanks!
"date_filed_is_approximate": false,
"dispositions": "Affirmed",
"docket_numbers": "55,974-CA",
"lower_courts": "Franklin",
For lower_courts we usually want the full name. You can get it by looking into the PDFs themselves
In this case:
"Fifth Judicial District Court for the Parish of Franklin, Louisiana"
You could use a mapper from short name to full name. A similar idea, for the inverse, is used here:
lower_court_to_abbreviation = {
    "USBC - District of New Hampshire": "NH",
    "USBC - District of Massachusetts (Worcester)": "MW",
    "USBC - District of Puerto Rico": "PR",
    "USBC - District of Massachusetts (Boston)": "MB",
    "USBC - District of Maine (Portland)": "EP",
    "USBC - District of Rhode Island": "RI",
    "Bankrupcty Court of ME - Bangor": "EB",
    "Bankruptcy Court of MA - Boston": "MB",
    "Bankruptcy Court of MA - Springfield": "MS",
    "Bankruptcy Court of ME - Portland": "EP",
    "Bankruptcy Court - Rhode Island": "RI",
    "Bankruptcy Court - San Juan Puerto Rico": "PR",
    "Bankruptcy Court of MA - Worcester": "MW",
    "Bankruptcy Court - Ponce Puerto Rico": "PR",
    "Bankruptcy Court of NH, Concord": "NH",
}
You could use such a mapper; but I have seen some irregularities such as
- OUACHITA Monroe City Court
- Ouachita OWC District 1-E
I think we could just keep the most common cases, and ignore weirder ones
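A sketch of that compromise: map only the common short names and let unmapped (irregular) values pass through unchanged. The two entries shown come from the full names quoted in this thread; the rest would be filled in the same way.

```python
# Partial mapping; entries are assumed based on the full names quoted above
lower_court_map = {
    "Franklin": "Fifth Judicial District Court for the Parish of Franklin, Louisiana",
    "Caddo": "First Judicial District Court for the Parish of Caddo, Louisiana",
}

def expand_lower_court(short_name: str) -> str:
    # Irregular values like "OUACHITA Monroe City Court" fall through as-is
    return lower_court_map.get(short_name.strip(), short_name.strip())

print(expand_lower_court("Franklin"))
print(expand_lower_court("OUACHITA Monroe City Court"))  # unchanged
```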
Ok, I’ll create a mapper. Also, I assumed that "Trial Court of Origin" on the court webpage is the same as "lower court". Is that correct @grossir?
It is correct
self.target_date = target_date
self.year = target_date.year
self.url = f"{self.base_url}?opinion_year={self.year}"
self.html = self._download()
You should comment here that pagination is not required, since all data are sent on the first request.
self.html = self._download()
self._process_html()

def make_backscrape_iterable(self, kwargs):
Why not generate only unique years, saving the filter dates and using those dates as filters instead of a month? Since all data for a year is sent on a single request, we are making X extra requests just to make the filtering easier for ourselves.
@grossir Made the updates based on your feedback, let me know if it looks good
Great, thanks @giancohs, I think we are just missing a few details and I will send this for a final review to Bill
class Site(OpinionSiteLinear):
    first_opinion_date = datetime(2019, 7, 17)
    days_interval = 28  # Monthly interval
    abbreviation_to_lower_court = {
This looks good. The district numbering seems odd but can be checked here
https://www.lasc.org/MapsofJudicialDistricts?p=DistrictCourtMap
https://www.lasc.org/MapsofJudicialDistricts?p=COAMap
Thanks! I checked each district abbreviation against a sample PDF opinion to be sure, it should be ok
Take a moment to look at other scrapers: all, or nearly all, follow a common format for the module header comment.
I think it would be good practice to do the same for this new scraper.
"""
Scraper for Armed Services Board of Contract Appeals
CourtID: asbca
Court Short Name: ASBCA
Author: Jon Andersen
Reviewer: mlr
History:
2014-09-11: Created by Jon Andersen
2016-03-17: Website and phone are dead. Scraper disabled in init.py.
"""

def make_backscrape_iterable(self, kwargs):
    super().make_backscrape_iterable(kwargs)
    self.back_scrape_iterable = unique_year_month(
On a second look, I think we should just use the year as a filter, like here
def make_backscrape_iterable(self, kwargs: dict) -> None:
    """Checks if backscrape start and end arguments have been passed
    by caller, and parses them accordingly

    :param kwargs: passed when initializing the scraper, may or
        may not contain backscrape controlling arguments
    :return None
    """
    start = kwargs.get("backscrape_start")
    end = kwargs.get("backscrape_end")
    start = int(start) if start else self.start_year
    end = int(end) + 1 if end else self.current_year
    self.back_scrape_iterable = range(start, end)
You can override make_backscrape_iterable
as in the above example, and it will work this way:
python sample_caller.py -c juriscraper.opinions.united_states.state.lactapp_2 -vvv --backscrape --backscrape-start=2019 --backscrape-end=2023
Then, you can just keep using the hardcoded "start date", as it is already in place:
# Skip if before first opinion date
if case_date < self.first_opinion_date.date():
    return True
So, you could just delete this whole block
# Only apply date filtering during backscrape
if hasattr(self, "back_scrape_iterable") and self.back_scrape_iterable:
    if self.target_date:
        target_month = self.target_date.month
        target_year = self.target_date.year
        if (
            case_date.year != target_year
            or case_date.month != target_month
        ):
            return True
I think this simplifies the code a lot.
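Under that approach, the backscrape iterable reduces to a plain `range` of years; here is a standalone sketch of the logic from the snippet above (the default bounds are assumed values, not the scraper's actual attributes):

```python
def make_year_iterable(backscrape_start, backscrape_end,
                       start_year=2019, current_year=2025):
    """Build the backscrape iterable as a range of years, defaulting to
    the court's first year and the current year when no bounds are given."""
    start = int(backscrape_start) if backscrape_start else start_year
    end = int(backscrape_end) + 1 if backscrape_end else current_year
    return range(start, end)

print(list(make_year_iterable("2019", "2023")))
# [2019, 2020, 2021, 2022, 2023]
```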
Given the changes to parsing, you need to delete and recreate the file lactapp_2_example.compare.json.
Delete it manually and recreate it by running
python -m unittest -v tests.local.test_ScraperExampleTest
It is currently failing the tests
Made the last changes you asked for @grossir. Tests are passing now, let me know if there’s anything else to improve. Thanks!
This looks great. Just a minor comment on unused imports.
I am sending this to @flooie for final approval
@@ -0,0 +1,148 @@
from datetime import date, datetime
Unused imports
- from datetime import date
- from juriscraper.lib.date_utils import unique_year_month

# Swap files
*.swp
*~
*~
What’s happening here?
"Caddo": "First Judicial District Court for the Parish of Caddo, Louisiana",
"Ouachita": "Fourth Judicial District Court for the Parish of Ouachita, Louisiana",
"Bossier": "Twenty-Sixth Judicial District Court for the Parish of Bossier, Louisiana",
"DeSoto": "Forty-Second Judicial District Court for the Parish of DeSoto, Louisiana",
"Lincoln": "Third Judicial District Court for the Parish of Lincoln, Louisiana",
"Webster": "Twenty-Sixth Judicial District Court for the Parish of Webster, Louisiana",
"Franklin": "Fifth Judicial District Court for the Parish of Franklin, Louisiana",
"Richland": "Fifth Judicial District Court for the Parish of Richland, Louisiana",
"Union": "Third Judicial District Court for the Parish of Union, Louisiana",
"Winn": "Eighth Judicial District Court for the Parish of Winn, Louisiana",
"Morehouse": "Fourth Judicial District Court for the Parish of Morehouse, Louisiana",
"Claiborne": "Second Judicial District Court for the Parish of Claiborne, Louisiana",
"Ouachita Monroe City Court": "Monroe City Court for the Parish of Ouachita, Louisiana",
I appreciate you attempting to capture all of these, but this seems to be a fragile approach.
I would strip this out and use extract_from_text to capture lower_court_str
Something like this should work ...
def extract_from_text(self, scraped_text):
    match = re.search(
        r"Appealed from the\s*(.*?\s*),\s*Louisiana", scraped_text, re.DOTALL
    )
    if match:
        result = match.group(1).replace("\n", " ")
        return {
            "Docket": {
                "appeal_from_str": result,
            },
        }
    return {}
It will require a corresponding addition to the extract_from_text tests.
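The suggested regex can be exercised against a fabricated snippet of opinion text (illustrative input, not from an actual PDF):

```python
import re

scraped_text = """55,974-CA
Appealed from the
Fifth Judicial District Court for the
Parish of Franklin, Louisiana"""

# re.DOTALL lets the lazy group span line breaks inside the court name
match = re.search(
    r"Appealed from the\s*(.*?\s*),\s*Louisiana", scraped_text, re.DOTALL
)
result = match.group(1).replace("\n", " ")
print(result)  # Fifth Judicial District Court for the Parish of Franklin
```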
self.base_url = "https://www.la2nd.org/opinions/"
self.year = datetime.now().year
self.url = f"{self.base_url}?opinion_year={self.year}"
self.status = "Published"
Status is not always "Published", so you can't just assign it; there are two opinions in 2024 that do not share that distinction.
Thankfully you can just use
status_str = get_row_column_text(row, 7)
to get that status_str, and do something like this
status_str = get_row_column_text(row, 7)
status = "Published" if "Published" in status_str else "Unpublished"
to get the status.
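That suggestion as a self-contained sketch (the "Published." value comes from the example JSON earlier in this thread):

```python
def parse_status(status_str: str) -> str:
    # Any cell containing "Published" is treated as a published opinion
    return "Published" if "Published" in status_str else "Unpublished"

print(parse_status("Published."))    # Published
print(parse_status("On Rehearing"))  # Unpublished
```

Note that a substring check would also match a hypothetical value like "Not Published"; if the court ever uses such a label, a stricter comparison would be needed.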
def _download(self):
    html = super()._download()
    # Currently there are no opinions for 2025, so we need to go back one year
    if html is not None:
        tables = html.cssselect("table#datatable")
        if not tables or not tables[0].cssselect("tbody tr"):
            self.year -= 1
            self.url = f"{self.base_url}?opinion_year={self.year}"
            return self._download()
    return html
You should just delete this _download method here. If we don't capture anything, that's fine with me.
self.court_id = self.__module__
self.base_url = "https://www.la2nd.org/opinions/"
self.year = datetime.now().year
self.url = f"{self.base_url}?opinion_year={self.year}"
I've been trained by @grossir to use better tools around making URLs
What about something like this
params = {"opinion_year": self.year}
self.url = urljoin(base_url, f"?{urlencode(params)}")
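A runnable sketch of that URL construction, stdlib only:

```python
from urllib.parse import urljoin, urlencode

base_url = "https://www.la2nd.org/opinions/"
params = {"opinion_year": 2024}
# A relative reference consisting only of a query string keeps the
# base path and swaps in the new query
url = urljoin(base_url, f"?{urlencode(params)}")
print(url)  # https://www.la2nd.org/opinions/?opinion_year=2024
```

`urlencode` also takes care of percent-escaping should a parameter ever contain reserved characters.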
if self.html is None:
    return
I think this is redundant; if self.html is None, it won't run.
if not tables or not tables[0].cssselect("tbody tr"):
    return

logger.info(f"Processing cases for year: {self.year}")
same here I think
this could just be
def _process_html(self):
    tables = self.html.cssselect("table#datatable")
    if not tables:
        logger.info(f"No data found, {self.year}")
        return
    logger.info(f"Processing cases for year: {self.year}")
    for row in tables[0].cssselect("tbody tr"):
author = get_row_column_text(row, 4)
clean_author = self.clean_judge_name(author)
I appreciate the clean author method, but you should just chuck it; we have tools to help clean up judge strings. I think
author_str = get_row_column_text(row, 4)
normalize_judge_string(author_str)[0]
will most likely solve it, and it will make the judge name nice and neat.
from juriscraper.lib.judge_parsers import (
    normalize_judge_string,
)
@giancohs this is a great first effort. Thank you for submitting this. I haven't tested the back scraper, but I think we can get this close to the finish line with some of the changes.
Solves #1296