
feat(lactapp_2): new scraper for Louisiana Court of Appeals Second Circuit #1299

Open · wants to merge 10 commits into base: main
4 changes: 3 additions & 1 deletion .gitignore
@@ -205,7 +205,9 @@ tests/fixtures/cassettes/
### Other ###
# File created by Mac OS X
.DS_Store
# Devcontainer folder
Contributor: What is this precisely for? What extension or app creates this file?

Author: This folder contains a devcontainer.json file (used by Docker, WSL, and VS Code) to set up my development environment. I've added it to the .gitignore file since it's not relevant to the project. Is there a better way to handle this?

.devcontainer/

# Swap files
*.swp
*~
*~
Contributor: What's happening here?

1 change: 1 addition & 0 deletions juriscraper/opinions/united_states/state/__init__.py
@@ -61,6 +61,7 @@
"kyctapp",
"la",
"lactapp_1",
"lactapp_2",
"lactapp_5",
"mass",
"massappct",
Expand Down
147 changes: 147 additions & 0 deletions juriscraper/opinions/united_states/state/lactapp_2.py
@@ -0,0 +1,147 @@
from datetime import datetime

from juriscraper.AbstractSite import logger
from juriscraper.lib.html_utils import (
get_row_column_links,
get_row_column_text,
)
from juriscraper.OpinionSiteLinear import OpinionSiteLinear


class Site(OpinionSiteLinear):
first_opinion_date = datetime(2019, 7, 17)
days_interval = 28 # Monthly interval
abbreviation_to_lower_court = {
Contributor: [two screenshot images attached]
Author: Thanks! I checked each district abbreviation against a sample PDF opinion to be sure; it should be OK.

Contributor: Take a moment to look at the other scrapers; they all (or mostly all) follow a common format for the module docstring. I think it would be good practice to do the same for this new scraper.

"""
Scraper for Armed Services Board of Contract Appeals
CourtID: asbca
Court Short Name: ASBCA
Author: Jon Andersen
Reviewer: mlr
History:
2014-09-11: Created by Jon Andersen
2016-03-17: Website and phone are dead. Scraper disabled in init.py.
"""

"Caddo": "First Judicial District Court for the Parish of Caddo, Louisiana",
"Ouachita": "Fourth Judicial District Court for the Parish of Ouachita, Louisiana",
"Bossier": "Twenty-Sixth Judicial District Court for the Parish of Bossier, Louisiana",
"DeSoto": "Forty-Second Judicial District Court for the Parish of DeSoto, Louisiana",
"Lincoln": "Third Judicial District Court for the Parish of Lincoln, Louisiana",
"Webster": "Twenty-Sixth Judicial District Court for the Parish of Webster, Louisiana",
"Franklin": "Fifth Judicial District Court for the Parish of Franklin, Louisiana",
"Richland": "Fifth Judicial District Court for the Parish of Richland, Louisiana",
"Union": "Third Judicial District Court for the Parish of Union, Louisiana",
"Winn": "Eighth Judicial District Court for the Parish of Winn, Louisiana",
"Morehouse": "Fourth Judicial District Court for the Parish of Morehouse, Louisiana",
"Claiborne": "Second Judicial District Court for the Parish of Claiborne, Louisiana",
"Ouachita Monroe City Court": "Monroe City Court for the Parish of Ouachita, Louisiana",
Comment on lines +15 to +27
Contributor: I appreciate you attempting to capture all of these, but this seems to be a fragile approach. I would strip this out and use extract_from_text to capture lower_court_str.

Contributor: Something like this should work:


    def extract_from_text(self, scraped_text):
        match = re.search(r"Appealed from the\s*(.*?\s*),\s*Louisiana", scraped_text, re.DOTALL)
        if match:
            result = match.group(1).replace("\n", " ")
            return {
                "Docket": {
                    "appeal_from_str": result
                },
            }
        return {}

It will require a corresponding addition to the extract_from_text tests.
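As a quick sanity check, the suggested regex can be exercised against a fabricated snippet of opinion text (the snippet below is hypothetical, not taken from a real opinion):

```python
import re

# Hypothetical excerpt; line breaks mimic text extracted from a PDF opinion.
scraped_text = (
    "Appealed from the\n"
    "Second Judicial District Court for the\n"
    "Parish of Claiborne, Louisiana"
)

match = re.search(
    r"Appealed from the\s*(.*?\s*),\s*Louisiana", scraped_text, re.DOTALL
)
# The lazy group captures everything up to the comma before "Louisiana";
# replacing newlines flattens the PDF line wrapping.
result = match.group(1).replace("\n", " ")
print(result)  # Second Judicial District Court for the Parish of Claiborne
```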

"Bienville": "Second Judicial District Court for the Parish of Bienville, Louisiana",
"Madison": "Sixth Judicial District Court for the Parish of Madison, Louisiana",
"Red River": "Ninth Judicial District Court for the Parish of Red River, Louisiana",
"Tensas": "Sixth Judicial District Court for the Parish of Tensas, Louisiana",
"Jackson": "Second Judicial District Court for the Parish of Jackson, Louisiana",
"Ouachita OWC District 1-E": "Office of Workers' Compensation District 1-E for the Parish of Ouachita, Louisiana",
"Caddo OWC District 1-W": "Office of Workers' Compensation District 1-W for the Parish of Caddo, Louisiana",
"Caldwell": "Thirty-Seventh Judicial District Court for the Parish of Caldwell, Louisiana",
"West Carroll": "Fifth Judicial District Court for the Parish of West Carroll, Louisiana",
"East Carroll": "Sixth Judicial District Court for the Parish of East Carroll, Louisiana",
"Caddo Juvenile Court": "Juvenile Court for the Parish of Caddo, Louisiana",
"Caddo Shreveport City Court": "Shreveport City Court for the Parish of Caddo, Louisiana",
"DeSoto OWC District 1-W": "Office of Workers' Compensation District 1-W for the Parish of DeSoto, Louisiana",
"Lincoln Ruston City Court": "Ruston City Court for the Parish of Lincoln, Louisiana",
"Ouachita West Monroe City Court": "West Monroe City Court for the Parish of Ouachita, Louisiana",
"OUACHITA Monroe City Court": "Monroe City Court for the Parish of Ouachita, Louisiana",
"Franklin OWC District 1-E": "Office of Workers' Compensation District 1-E for the Parish of Franklin, Louisiana",
"Minden City Court Webster": "Minden City Court for the Parish of Webster, Louisiana",
"Morehouse Bastrop City Court": "Bastrop City Court for the Parish of Morehouse, Louisiana",
"Morehouse OWC District 1-E": "Office of Workers' Compensation District 1-E for the Parish of Morehouse, Louisiana",
"Webster Minden City Court": "Minden City Court for the Parish of Webster, Louisiana",
"Winn OWC District 2": "Office of Workers' Compensation District 2 for the Parish of Winn, Louisiana",
}

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.court_id = self.__module__
self.base_url = "https://www.la2nd.org/opinions/"
self.year = datetime.now().year
self.url = f"{self.base_url}?opinion_year={self.year}"
Contributor: I've been trained by @grossir to use better tools for building URLs. What about something like this:

    params = {"opinion_year": self.year}
    self.url = urljoin(base_url, f"?{urlencode(params)}")
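For reference, a self-contained sketch of that suggestion using only the standard library (the year value below is just an example):

```python
from urllib.parse import urlencode, urljoin

base_url = "https://www.la2nd.org/opinions/"
params = {"opinion_year": 2024}  # example year

# urlencode builds the query string; urljoin attaches it to the base URL.
url = urljoin(base_url, f"?{urlencode(params)}")
print(url)  # https://www.la2nd.org/opinions/?opinion_year=2024
```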

self.status = "Published"
Contributor: Status is not always Published, so you can't just assign it; there are two opinions in 2024 that do not share that distinction. Thankfully, you can use get_row_column_text(row, 7) to get a status_str and do something like this:
        status_str = get_row_column_text(row, 7)
        status = "Published" if "Published" in status_str else "Unpublished"

to get status
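The suggested mapping is easy to check in isolation (the status strings below are hypothetical examples, not values confirmed from the site):

```python
def derive_status(status_str):
    # Per the suggestion: any string containing the substring "Published"
    # maps to Published; everything else to Unpublished. The check is
    # case-sensitive and substring-based.
    return "Published" if "Published" in status_str else "Unpublished"

print(derive_status("Published"))   # Published
print(derive_status("Per Curiam"))  # Unpublished
```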

self.target_date = None
self.make_backscrape_iterable(kwargs)

def _download(self):
html = super()._download()
# Currently there are no opinions for 2025, so we need to go back one year
if html is not None:
tables = html.cssselect("table#datatable")
if not tables or not tables[0].cssselect("tbody tr"):
self.year -= 1
self.url = f"{self.base_url}?opinion_year={self.year}"
return self._download()
return html
Comment on lines +62 to +71
Contributor: You should just delete this _download method here; if we don't capture anything, that's fine with me.


def _process_html(self):
if self.html is None:
return

Comment on lines +74 to +76
Contributor: I think this is redundant; if self.html is None, it won't run.

tables = self.html.cssselect("table#datatable")
if not tables or not tables[0].cssselect("tbody tr"):
return

logger.info(f"Processing cases for year: {self.year}")
Comment on lines +78 to +81
Contributor: Same here, I think. This could just be:

    def _process_html(self):
        tables = self.html.cssselect("table#datatable")
        if not tables:
            logger.info(f"No data found, {self.year}")
            return

        logger.info(f"Processing cases for year: {self.year}")
        for row in tables[0].cssselect("tbody tr"):

for row in tables[0].cssselect("tbody tr"):
case_date = datetime.strptime(
get_row_column_text(row, 1), "%m/%d/%Y"
).date()

if self.skip_row_by_date(case_date):
continue

author = get_row_column_text(row, 4)
clean_author = self.clean_judge_name(author)

Comment on lines +90 to +92
Contributor: I appreciate the clean_author method, but you should just chuck it; we have tools to help clean up judge strings. I think author_str = get_row_column_text(row, 4) followed by normalize_judge_string(author_str)[0] will most likely solve it, and it will make the judge name nice and neat.

Contributor: The corresponding import:
from juriscraper.lib.judge_parsers import (
normalize_judge_string,
)

# Get the lower court abbreviation
lower_court_abbr = get_row_column_text(row, 6)

# Replace abbreviation with full name
lower_court_full = self.abbreviation_to_lower_court.get(
lower_court_abbr, lower_court_abbr
)

self.cases.append(
{
"date": get_row_column_text(row, 1),
"docket": get_row_column_text(row, 2),
"name": get_row_column_text(row, 3),
"author": clean_author,
"disposition": get_row_column_text(row, 5),
"lower_court": lower_court_full,
"url": get_row_column_links(row, 8),
}
)

def skip_row_by_date(self, case_date):
"""Determine if a row should be skipped based on the case date."""
# Skip if before first opinion date
if case_date < self.first_opinion_date.date():
return True

def clean_judge_name(self, name):
"""Remove everything after a comma in the judge's name."""
return name.split(",")[0].strip()
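For illustration, the helper keeps only the portion of the name before the first comma (the judge name below is a hypothetical example):

```python
def clean_judge_name(name):
    # Drop everything after the first comma (e.g. a ", J." suffix) and trim.
    return name.split(",")[0].strip()

print(clean_judge_name("STONE, J."))  # STONE
print(clean_judge_name("COX"))        # COX
```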

def _download_backwards(self, target_year: int) -> None:
logger.info(f"Backscraping for date: {target_year}")
self.year = target_year
self.url = f"{self.base_url}?opinion_year={self.year}"

# Pagination not required, all the opinions data is sent in the first request
self.html = self._download()
Contributor: You should comment here that pagination is not required, since all data are sent on the first request.

self._process_html()

def make_backscrape_iterable(self, kwargs: dict) -> None:
"""Sets up the back scrape iterable using start and end year arguments.

:param kwargs: passed when initializing the scraper, may or
may not contain backscrape controlling arguments
:return: None
"""
start = kwargs.get("backscrape_start")
end = kwargs.get("backscrape_end")

# Convert start and end to integers, defaulting to the scraper's start and current year
start = int(start) if start else self.first_opinion_date.year
end = int(end) + 1 if end else datetime.now().year + 1

# Create a range of years for back scraping
self.back_scrape_iterable = range(start, end)
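The year-range logic above can be sketched standalone to show the defaults (the function name and example years below are illustrative, not part of the scraper):

```python
from datetime import datetime

FIRST_OPINION_YEAR = 2019  # from first_opinion_date above

def backscrape_years(backscrape_start=None, backscrape_end=None):
    # Default to the court's first opinion year and the current year;
    # the end year is inclusive, so add 1 for range().
    start = int(backscrape_start) if backscrape_start else FIRST_OPINION_YEAR
    end = int(backscrape_end) + 1 if backscrape_end else datetime.now().year + 1
    return range(start, end)

print(list(backscrape_years("2020", "2022")))  # [2020, 2021, 2022]
```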