feat(lactapp_2): new scraper for Louisiana Court of Appeals Second Circuit #1299
Changes from all commits
b56a955
75ce2b6
6fda2fc
57abc9b
3c1ce77
aade79e
d2989e8
1c8ba33
3d85f9c
680fb16
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -205,7 +205,9 @@ tests/fixtures/cassettes/ | |
### Other ### | ||
# File created by Mac OS X | ||
.DS_Store | ||
# Devcontainer folder | ||
.devcontainer/ | ||
|
||
# Swap files | ||
*.swp | ||
*~ | ||
*~ | ||
What's happening here?
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -61,6 +61,7 @@ | |
"kyctapp", | ||
"la", | ||
"lactapp_1", | ||
"lactapp_2", | ||
"lactapp_5", | ||
"mass", | ||
"massappct", | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,147 @@ | ||
from datetime import datetime | ||
|
||
from juriscraper.AbstractSite import logger | ||
from juriscraper.lib.html_utils import ( | ||
get_row_column_links, | ||
get_row_column_text, | ||
) | ||
from juriscraper.OpinionSiteLinear import OpinionSiteLinear | ||
|
||
|
||
class Site(OpinionSiteLinear): | ||
first_opinion_date = datetime(2019, 7, 17) | ||
days_interval = 28 # Monthly interval | ||
abbreviation_to_lower_court = { | ||
This looks good. The district numbering seems odd, but it can be checked here.
Thanks! I checked each district abbreviation against a sample PDF opinion to be sure; it should be OK.
Take a moment to look at other scrapers: all, or nearly all, follow a standard format for the comment. I think it would be good practice to do the same for this new scraper.
"""
||
"Caddo": "First Judicial District Court for the Parish of Caddo, Louisiana", | ||
"Ouachita": "Fourth Judicial District Court for the Parish of Ouachita, Louisiana", | ||
"Bossier": "Twenty-Sixth Judicial District Court for the Parish of Bossier, Louisiana", | ||
"DeSoto": "Forty-Second Judicial District Court for the Parish of DeSoto, Louisiana", | ||
"Lincoln": "Third Judicial District Court for the Parish of Lincoln, Louisiana", | ||
"Webster": "Twenty-Sixth Judicial District Court for the Parish of Webster, Louisiana", | ||
"Franklin": "Fifth Judicial District Court for the Parish of Franklin, Louisiana", | ||
"Richland": "Fifth Judicial District Court for the Parish of Richland, Louisiana", | ||
"Union": "Third Judicial District Court for the Parish of Union, Louisiana", | ||
"Winn": "Eighth Judicial District Court for the Parish of Winn, Louisiana", | ||
"Morehouse": "Fourth Judicial District Court for the Parish of Morehouse, Louisiana", | ||
"Claiborne": "Second Judicial District Court for the Parish of Claiborne, Louisiana", | ||
"Ouachita Monroe City Court": "Monroe City Court for the Parish of Ouachita, Louisiana", | ||
Comment on lines +15 to +27
I appreciate you attempting to capture all of these, but this seems to be a fragile approach. I would strip this out and use extract_from_text to capture them.
Something like this should work ...
It will require a corresponding addition to the extract_from_text tests.
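A minimal sketch of the extract_from_text idea, written as a standalone function so it runs without Juriscraper installed. The regex, the assumed "Appealed from the ..." wording in the opinion text, and the `Docket` / `appeal_from_str` return shape are assumptions for illustration; they would need to be checked against real opinion PDFs and the existing scrapers.

```python
import re

# Hypothetical pattern: assumes the opinion text contains a phrase such as
# "Appealed from the First Judicial District Court for the Parish of Caddo, Louisiana".
LOWER_COURT_RE = re.compile(
    r"Appealed\s+from\s+the\s+(?P<lower_court>.+?,\s+Louisiana)",
    re.DOTALL,
)


def extract_from_text(scraped_text: str) -> dict:
    """Pull the lower court name out of the opinion's text, if present."""
    match = LOWER_COURT_RE.search(scraped_text)
    if not match:
        return {}
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    lower_court = " ".join(match.group("lower_court").split())
    # Assumed return shape; confirm the key names against other scrapers.
    return {"Docket": {"appeal_from_str": lower_court}}
```

With this approach the abbreviation dictionary is no longer needed, because the full court name comes from the opinion itself rather than from the table column.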
||
"Bienville": "Second Judicial District Court for the Parish of Bienville, Louisiana", | ||
"Madison": "Sixth Judicial District Court for the Parish of Madison, Louisiana", | ||
"Red River": "Ninth Judicial District Court for the Parish of Red River, Louisiana", | ||
"Tensas": "Sixth Judicial District Court for the Parish of Tensas, Louisiana", | ||
"Jackson": "Second Judicial District Court for the Parish of Jackson, Louisiana", | ||
"Ouachita OWC District 1-E": "Office of Workers' Compensation District 1-E for the Parish of Ouachita, Louisiana", | ||
"Caddo OWC District 1-W": "Office of Workers' Compensation District 1-W for the Parish of Caddo, Louisiana", | ||
"Caldwell": "Thirty-Seventh Judicial District Court for the Parish of Caldwell, Louisiana", | ||
"West Carroll": "Fifth Judicial District Court for the Parish of West Carroll, Louisiana", | ||
"East Carroll": "Sixth Judicial District Court for the Parish of East Carroll, Louisiana", | ||
"Caddo Juvenile Court": "Juvenile Court for the Parish of Caddo, Louisiana", | ||
"Caddo Shreveport City Court": "Shreveport City Court for the Parish of Caddo, Louisiana", | ||
"DeSoto OWC District 1-W": "Office of Workers' Compensation District 1-W for the Parish of DeSoto, Louisiana", | ||
"Lincoln Ruston City Court": "Ruston City Court for the Parish of Lincoln, Louisiana", | ||
"Ouachita West Monroe City Court": "West Monroe City Court for the Parish of Ouachita, Louisiana", | ||
"OUACHITA Monroe City Court": "Monroe City Court for the Parish of Ouachita, Louisiana", | ||
"Franklin OWC District 1-E": "Office of Workers' Compensation District 1-E for the Parish of Franklin, Louisiana", | ||
"Minden City Court Webster": "Minden City Court for the Parish of Webster, Louisiana", | ||
"Morehouse Bastrop City Court": "Bastrop City Court for the Parish of Morehouse, Louisiana", | ||
"Morehouse OWC District 1-E": "Office of Workers' Compensation District 1-E for the Parish of Morehouse, Louisiana", | ||
"Webster Minden City Court": "Minden City Court for the Parish of Webster, Louisiana", | ||
"Winn OWC District 2": "Office of Workers' Compensation District 2 for the Parish of Winn, Louisiana", | ||
} | ||
|
||
def __init__(self, *args, **kwargs): | ||
super().__init__(*args, **kwargs) | ||
self.court_id = self.__module__ | ||
self.base_url = "https://www.la2nd.org/opinions/" | ||
self.year = datetime.now().year | ||
self.url = f"{self.base_url}?opinion_year={self.year}" | ||
I've been trained by @grossir to use better tools for building URLs. What about something like this?
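One possible way to build the URL without string concatenation is `urllib.parse.urlencode` from the standard library. This is only a sketch: `build_year_url` is a hypothetical helper name, and the reviewer's actual suggestion is not shown in this thread.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.la2nd.org/opinions/"


def build_year_url(year: int) -> str:
    """Return the opinions listing URL for a given year."""
    # urlencode handles escaping, so extra parameters can be added safely later.
    query = urlencode({"opinion_year": year})
    return f"{BASE_URL}?{query}"
```

The same helper could then be reused in `__init__` and `_download_backwards`, which currently repeat the f-string.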
|
||
self.status = "Published" | ||
Status is not always "Published", so you can't just assign it. Two opinions in 2024 do not share that status. Thankfully you can just use ... and do something like this ... to get the status.
||
self.target_date = None | ||
self.make_backscrape_iterable(kwargs) | ||
|
||
def _download(self): | ||
html = super()._download() | ||
# Currently there are no opinions for 2025, so we need to go back one year | |
if html is not None: | ||
tables = html.cssselect("table#datatable") | ||
if not tables or not tables[0].cssselect("tbody tr"): | ||
grossir marked this conversation as resolved.
|
||
self.year -= 1 | ||
self.url = f"{self.base_url}?opinion_year={self.year}" | ||
return self._download() | ||
return html | ||
Comment on lines +62 to +71
You should just delete this _download method. If we don't capture anything, that's fine with me.
||
|
||
def _process_html(self): | ||
if self.html is None: | ||
return | ||
|
||
Comment on lines +74 to +76
I think this is redundant: if self.html is None, it won't run.
||
tables = self.html.cssselect("table#datatable") | ||
if not tables or not tables[0].cssselect("tbody tr"): | ||
return | ||
|
||
logger.info(f"Processing cases for year: {self.year}") | ||
Comment on lines +78 to +81
Same here. I think this could just be ...
|
||
for row in tables[0].cssselect("tbody tr"): | ||
case_date = datetime.strptime( | ||
get_row_column_text(row, 1), "%m/%d/%Y" | ||
).date() | ||
|
||
if self.skip_row_by_date(case_date): | ||
continue | ||
|
||
author = get_row_column_text(row, 4) | ||
clean_author = self.clean_judge_name(author) | ||
|
||
Comment on lines +90 to +92
I appreciate the clean author method, but you should just chuck it. We have tools to help clean up judge strings. I think ...
from juriscraper.lib.judge_parsers import (
||
# Get the lower court abbreviation | ||
lower_court_abbr = get_row_column_text(row, 6) | ||
|
||
# Replace abbreviation with full name | ||
lower_court_full = self.abbreviation_to_lower_court.get( | ||
lower_court_abbr, lower_court_abbr | ||
) | ||
|
||
self.cases.append( | ||
{ | ||
"date": get_row_column_text(row, 1), | ||
"docket": get_row_column_text(row, 2), | ||
"name": get_row_column_text(row, 3), | ||
"author": clean_author, | ||
"disposition": get_row_column_text(row, 5), | ||
"lower_court": lower_court_full, | ||
"url": get_row_column_links(row, 8), | ||
} | ||
) | ||
|
||
def skip_row_by_date(self, case_date): | ||
"""Determine if a row should be skipped based on the case date.""" | ||
# Skip if before first opinion date | ||
if case_date < self.first_opinion_date.date(): | ||
return True | ||
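As written, skip_row_by_date returns None (falsy) when a row should be kept. A hedged sketch of the same check with an explicit boolean, using standalone functions and a module-level constant in place of the class attribute; `parse_case_date` is a hypothetical helper name:

```python
from datetime import date, datetime

# Mirrors first_opinion_date from the scraper class.
FIRST_OPINION_DATE = date(2019, 7, 17)


def parse_case_date(raw: str) -> date:
    """Parse the MM/DD/YYYY date string found in the table's date column."""
    return datetime.strptime(raw.strip(), "%m/%d/%Y").date()


def skip_row_by_date(case_date: date) -> bool:
    """Return True when the row predates the first opinion date."""
    return case_date < FIRST_OPINION_DATE
```

Returning the comparison directly keeps the behavior identical while making the return type unambiguous.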
|
||
def clean_judge_name(self, name): | ||
"""Remove everything after a comma in the judge's name.""" | ||
return name.split(",")[0].strip() | ||
|
||
def _download_backwards(self, target_year: int) -> None: | ||
logger.info(f"Backscraping for date: {target_year}") | ||
self.year = target_year | ||
self.url = f"{self.base_url}?opinion_year={self.year}" | ||
|
||
# Pagination not required, all the opinions data is sent in the first request | ||
self.html = self._download() | ||
You should add a comment here that pagination is not required, since all data is sent in the first request.
||
self._process_html() | ||
|
||
def make_backscrape_iterable(self, kwargs: dict) -> None: | ||
"""Sets up the back scrape iterable using start and end year arguments. | ||
|
||
:param kwargs: passed when initializing the scraper, may or | ||
may not contain backscrape controlling arguments | ||
:return: None | ||
""" | ||
start = kwargs.get("backscrape_start") | ||
end = kwargs.get("backscrape_end") | ||
|
||
# Convert start and end to integers, defaulting to the scraper's start and current year | ||
start = int(start) if start else self.first_opinion_date.year | ||
end = int(end) + 1 if end else datetime.now().year + 1 | ||
|
||
# Create a range of years for back scraping | ||
self.back_scrape_iterable = range(start, end) |
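To make the backscrape range easier to check, here is the same logic as a standalone function. `year_range` and its `first_year` parameter are hypothetical names used only for this sketch; the behavior follows make_backscrape_iterable above, including the inclusive end year.

```python
from datetime import datetime


def year_range(kwargs: dict, first_year: int = 2019) -> range:
    """Build the range of years to backscrape from optional start/end arguments."""
    start = kwargs.get("backscrape_start")
    end = kwargs.get("backscrape_end")

    # Fall back to the scraper's first year and the current year.
    start = int(start) if start else first_year
    # Add 1 so the end year itself is included in the range.
    end = int(end) + 1 if end else datetime.now().year + 1
    return range(start, end)
```

For example, passing `{"backscrape_start": "2020", "backscrape_end": "2022"}` yields the years 2020, 2021 and 2022.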
What is this precisely for? What extension or app creates this file?
This folder contains a devcontainer.json file (used by Docker, WSL, and VS Code) to set up my development environment. I’ve added it to the .gitignore file since it’s not relevant to the project. Is there a better way to handle this?