
feat(lactapp_2): new scraper for Louisiana Court of Appeals Second Circuit #1299

Open · wants to merge 10 commits into main

Conversation

giancohs

Solves #1296

@grossir grossir self-requested a review January 11, 2025 23:14
@grossir (Contributor) left a comment

This is great, all parts are working properly.
There are still some comments to address; some of them have to do with our documentation not being good enough to define the content of each field.

The date filtering in the backscraper will be useful for filtering by month, to avoid creating duplicates from different hashes when we hit our data cutoff from a merger.

@@ -205,7 +205,9 @@ tests/fixtures/cassettes/
### Other ###
# File created by Mac OS X
.DS_Store
# Devcontainer folder
Contributor

What is this for, precisely? What extension or app creates this file?

Author

This folder contains a devcontainer.json file (used by Docker, WSL, and VS Code) to set up my development environment. I’ve added it to the .gitignore file since it’s not relevant to the project. Is there a better way to handle this?

juriscraper/opinions/united_states/state/lactapp_2.py (outdated)
"lower_courts": "Franklin",
"summaries": "Published.",
"case_name_shorts": "",
"authors": "COX, J"
Contributor

Judge names should be cleaned up.
In this case, "COX, J" stands for "Cox, Judge"; we should delete the "J".

In "PITMAN, C.J", "C.J" stands for "Chief Judge".

return

tables = self.html.cssselect("table#datatable")
if tables and tables[0].cssselect("tbody tr"):
Contributor

You can use an early return here to save one indentation level:

if not tables or not tables[0].cssselect("tbody tr"):
    return

Comment on lines 49 to 65
# Skip if before first opinion date
if case_date < self.first_opinion_date.date():
    continue

# Only apply date filtering during backscrape
if (
    hasattr(self, "back_scrape_iterable")
    and self.back_scrape_iterable
):
    if self.target_date:
        target_month = self.target_date.month
        target_year = self.target_date.year
        if (
            case_date.year != target_year
            or case_date.month != target_month
        ):
            continue
Contributor

I think all of this should be abstracted out into a function:

if skip_row_by_date(case_date):
    continue
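
A sketch of what that helper could look like, lifting the logic from the quoted block (only the method name comes from the suggestion; the body condenses the original checks):

def skip_row_by_date(self, case_date) -> bool:
    """Return True if the row should be skipped (sketch)."""
    # Skip if before first opinion date
    if case_date < self.first_opinion_date.date():
        return True
    # Only apply month/year filtering during a backscrape
    if getattr(self, "back_scrape_iterable", None) and self.target_date:
        if (case_date.year, case_date.month) != (
            self.target_date.year,
            self.target_date.month,
        ):
            return True
    return False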

"author": get_row_column_text(row, 4),
"disposition": get_row_column_text(row, 5),
"lower_court": get_row_column_text(row, 6),
"summary": get_row_column_text(row, 7),
Contributor

This is not what we consider a "summary"; I would just ignore that column, unless you can find a row with an actual summary.

You can check the definitions of the fields (most of the time the juriscraper name matches the CL name; sometimes it is not so straightforward) at https://github.com/freelawproject/courtlistener/blob/main/cl/search/models.py

Author

Ok, will check it, thanks!

"date_filed_is_approximate": false,
"dispositions": "Affirmed",
"docket_numbers": "55,974-CA",
"lower_courts": "Franklin",
Contributor

For lower_courts we usually want the full name. You can get it by looking into the PDFs themselves. In this case:
"Fifth Judicial District Court for the Parish of Franklin, Louisiana"

You could use a mapper from short name to full name. A similar idea, for the inverse, is used here:

lower_court_to_abbreviation = {
    "USBC - District of New Hampshire": "NH",
    "USBC - District of Massachusetts (Worcester)": "MW",
    "USBC - District of Puerto Rico": "PR",
    "USBC - District of Massachusetts (Boston)": "MB",
    "USBC - District of Maine (Portland)": "EP",
    "USBC - District of Rhode Island": "RI",
    "Bankrupcty Court of ME - Bangor": "EB",
    "Bankruptcy Court of MA - Boston": "MB",
    "Bankruptcy Court of MA - Springfield": "MS",
    "Bankruptcy Court of ME - Portland": "EP",
    "Bankruptcy Court - Rhode Island": "RI",
    "Bankruptcy Court - San Juan Puerto Rico": "PR",
    "Bankruptcy Court of MA - Worcester": "MW",
    "Bankruptcy Court - Ponce Puerto Rico": "PR",
    "Bankruptcy Court of NH, Concord": "NH",
}

Contributor

You could use such a mapper, but I have seen some irregularities, such as:

  • OUACHITA Monroe City Court
  • Ouachita OWC District 1-E

I think we could just keep the most common cases and ignore the weirder ones.
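
One way to keep the common cases and pass the weirder ones through unchanged is a lookup with a fallback (a sketch; variable names are illustrative, following the snippets elsewhere in this thread):

lower_court_str = get_row_column_text(row, 6)
# Use the full name for known parishes; fall back to the raw scraped
# value for irregular entries like "Ouachita OWC District 1-E"
lower_court = self.abbreviation_to_lower_court.get(lower_court_str, lower_court_str)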

Author

Ok, I’ll create a mapper. Also, I assumed that "Trial Court of Origin" on the court webpage is the same as "lower court". Is that correct @grossir?

Contributor

It is correct

self.target_date = target_date
self.year = target_date.year
self.url = f"{self.base_url}?opinion_year={self.year}"
self.html = self._download()
Contributor

You should add a comment here noting that pagination is not required, since all the data is sent on the first request.
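
Something like the following would capture that (wording of the comment is illustrative):

# No pagination needed: the site returns all opinions for the
# requested year in a single response
self.url = f"{self.base_url}?opinion_year={self.year}"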

    self.html = self._download()
    self._process_html()

def make_backscrape_iterable(self, kwargs):
@grossir (Contributor) Jan 11, 2025

Why not generate only unique years, while saving the filter dates and using those as filters, instead of a month? Since all the data for a year is sent in a single request, we are making X extra requests just to make the filtering easier for ourselves.

@giancohs (Author)

@grossir Made the updates based on your feedback; let me know if it looks good.

@grossir (Contributor) left a comment

Great, thanks @giancohs. I think we are just missing a few details, and then I will send this to Bill for a final review.

class Site(OpinionSiteLinear):
    first_opinion_date = datetime(2019, 7, 17)
    days_interval = 28  # Monthly interval
    abbreviation_to_lower_court = {
Contributor

[two screenshots]
Author

Thanks! I checked each district abbreviation against a sample PDF opinion to be sure; it should be OK.

Contributor

Take a moment to look at other scrapers; all or almost all of them follow a common format for the header comment.

I think it would be good practice to do the same for this new scraper.

"""
Scraper for Armed Services Board of Contract Appeals
CourtID: asbca
Court Short Name: ASBCA
Author: Jon Andersen
Reviewer: mlr
History:
2014-09-11: Created by Jon Andersen
2016-03-17: Website and phone are dead. Scraper disabled in init.py.
"""


def make_backscrape_iterable(self, kwargs):
    super().make_backscrape_iterable(kwargs)
    self.back_scrape_iterable = unique_year_month(
Contributor

On a second look, I think we should just use the year as a filter, like here

def make_backscrape_iterable(self, kwargs: dict) -> None:
    """Checks if backscrape start and end arguments have been passed
    by caller, and parses them accordingly
    :param kwargs: passed when initializing the scraper, may or
        may not contain backscrape controlling arguments
    :return None
    """
    start = kwargs.get("backscrape_start")
    end = kwargs.get("backscrape_end")
    start = int(start) if start else self.start_year
    end = int(end) + 1 if end else self.current_year
    self.back_scrape_iterable = range(start, end)

You can override make_backscrape_iterable as in the above example and it will work this way

python sample_caller.py -c juriscraper.opinions.united_states.state.lactapp_2 -vvv --backscrape --backscrape-start=2019 --backscrape-end=2023
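
With a years-only iterable, the matching _download_backwards would then take a year (a sketch; the body mirrors the __init__ logic quoted earlier in this review):

def _download_backwards(self, year: int) -> None:
    # One request per year; all of that year's rows come back at once
    self.year = year
    self.url = f"{self.base_url}?opinion_year={self.year}"
    self.html = self._download()
    self._process_html()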

Then, you can just keep using the hardcoded "start date", as it is already in place:

        # Skip if before first opinion date
        if case_date < self.first_opinion_date.date():
            return True

So, you could just delete this whole block

        # Only apply date filtering during backscrape
        if hasattr(self, "back_scrape_iterable") and self.back_scrape_iterable:
            if self.target_date:
                target_month = self.target_date.month
                target_year = self.target_date.year
                if (
                    case_date.year != target_year
                    or case_date.month != target_month
                ):
                    return True

I think this simplifies the code a lot.
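
After those deletions, the date check collapses to a one-liner (sketch, reusing the skip_row_by_date name suggested earlier):

def skip_row_by_date(self, case_date) -> bool:
    # Year filtering now happens through the opinion_year URL parameter,
    # so only the hardcoded start-date check remains
    return case_date < self.first_opinion_date.date()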

Contributor

Given the changes in parsing, you need to delete and recreate the file lactapp_2_example.compare.json. Delete it manually and recreate it by running:

python -m unittest -v tests.local.test_ScraperExampleTest

It is currently failing the tests.

@grossir grossir self-assigned this Jan 13, 2025
@giancohs (Author)

Made the last changes you asked for, @grossir. Tests are passing now; let me know if there's anything else to improve. Thanks!

grossir previously approved these changes Jan 13, 2025
@grossir (Contributor) left a comment

This looks great. Just a minor comment on unused imports.
I am sending this to @flooie for final approval

@@ -0,0 +1,148 @@
from datetime import date, datetime
Contributor

Unused imports

  • from datetime import date
  • from juriscraper.lib.date_utils import unique_year_month

@grossir grossir requested a review from flooie January 13, 2025 18:43
@grossir grossir removed their assignment Jan 13, 2025

# Swap files
*.swp
*~
*~
Contributor

What's happening here?

Comment on lines +15 to +27
"Caddo": "First Judicial District Court for the Parish of Caddo, Louisiana",
"Ouachita": "Fourth Judicial District Court for the Parish of Ouachita, Louisiana",
"Bossier": "Twenty-Sixth Judicial District Court for the Parish of Bossier, Louisiana",
"DeSoto": "Forty-Second Judicial District Court for the Parish of DeSoto, Louisiana",
"Lincoln": "Third Judicial District Court for the Parish of Lincoln, Louisiana",
"Webster": "Twenty-Sixth Judicial District Court for the Parish of Webster, Louisiana",
"Franklin": "Fifth Judicial District Court for the Parish of Franklin, Louisiana",
"Richland": "Fifth Judicial District Court for the Parish of Richland, Louisiana",
"Union": "Third Judicial District Court for the Parish of Union, Louisiana",
"Winn": "Eighth Judicial District Court for the Parish of Winn, Louisiana",
"Morehouse": "Fourth Judicial District Court for the Parish of Morehouse, Louisiana",
"Claiborne": "Second Judicial District Court for the Parish of Claiborne, Louisiana",
"Ouachita Monroe City Court": "Monroe City Court for the Parish of Ouachita, Louisiana",
Contributor

I appreciate you attempting to capture all of these, but this seems to be a fragile approach.

I would strip this out and use extract_from_text to capture lower_court_str.

Contributor

Something like this should work ...


    def extract_from_text(self, scraped_text):
        match = re.search(r"Appealed from the\s*(.*?\s*),\s*Louisiana", scraped_text, re.DOTALL)
        if match:
            result = match.group(1).replace("\n", " ")
            return {
                "Docket": {
                    "appeal_from_str": result
                },
            }
        return {}

It will require a corresponding addition to the extract_from_text tests.
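
The addition might look roughly like this, assuming the harness maps module paths to (input text, expected output) pairs; check the actual structure in tests/local before copying:

"juriscraper.opinions.united_states.state.lactapp_2": [
    (
        "Appealed from the\nFifth Judicial District Court for the\n"
        "Parish of Franklin, Louisiana",
        {
            "Docket": {
                "appeal_from_str": "Fifth Judicial District Court for the Parish of Franklin"
            }
        },
    ),
],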

self.base_url = "https://www.la2nd.org/opinions/"
self.year = datetime.now().year
self.url = f"{self.base_url}?opinion_year={self.year}"
self.status = "Published"
Contributor

Status is not always published, so you can't just assign it. There are two opinions in 2024 that do not share that distinction.

Thankfully you can just use status_str = get_row_column_text(row, 7) to get the status string, and do something like this:

        status_str = get_row_column_text(row, 7)
        status = "Published" if "Published" in status_str else "Unpublished"

to get the status.

Comment on lines +62 to +71
def _download(self):
    html = super()._download()
    # Currently there are no opinions for 2025, so we need to go back one year
    if html is not None:
        tables = html.cssselect("table#datatable")
        if not tables or not tables[0].cssselect("tbody tr"):
            self.year -= 1
            self.url = f"{self.base_url}?opinion_year={self.year}"
            return self._download()
    return html
Contributor

You should just delete this _download method here; if we don't capture anything, that's fine with me.

self.court_id = self.__module__
self.base_url = "https://www.la2nd.org/opinions/"
self.year = datetime.now().year
self.url = f"{self.base_url}?opinion_year={self.year}"
Contributor

I've been trained by @grossir to use better tools for building URLs.

What about something like this:

    from urllib.parse import urlencode, urljoin

    params = {"opinion_year": self.year}
    self.url = urljoin(self.base_url, f"?{urlencode(params)}")

Comment on lines +74 to +76
if self.html is None:
    return
Contributor

I think this is redundant; if self.html is None, it won't run.

Comment on lines +78 to +81
if not tables or not tables[0].cssselect("tbody tr"):
    return

logger.info(f"Processing cases for year: {self.year}")
Contributor

Same here, I think. This could just be:

    def _process_html(self):
        tables = self.html.cssselect("table#datatable")
        if not tables:
            logger.info(f"No data found, {self.year}")
            return

        logger.info(f"Processing cases for year: {self.year}")
        for row in tables[0].cssselect("tbody tr"):

Comment on lines +90 to +92
author = get_row_column_text(row, 4)
clean_author = self.clean_judge_name(author)

Contributor

I appreciate the clean_author method, but you should just chuck it; we have tools to help clean up judge strings. I think

author_str = get_row_column_text(row, 4)
normalize_judge_string(author_str)[0]

will most likely solve it, and it will make the judge name nice and neat.

Contributor

from juriscraper.lib.judge_parsers import (
    normalize_judge_string,
)

@flooie (Contributor) left a comment

@giancohs this is a great first effort. Thank you for submitting this. I haven't tested the back scraper, but I think we can get this close to the finish line with some of the changes.

@flooie flooie assigned giancohs and unassigned flooie Jan 16, 2025