feat(lactapp_2): new scraper for Louisiana Court of Appeals Second Circuit #1299
base: main
Conversation
This is great, all parts working properly.
There are still some comments to address. Some of them have to do with our documentation not being precise enough to define the content of each field.
The date filtering in the backscraper will be useful for filtering by month, to avoid creating duplicates from different hashes when we hit our data cutoff from a merger.
@@ -205,7 +205,9 @@ tests/fixtures/cassettes/
### Other ###
# File created by Mac OS X
.DS_Store
# Devcontainer folder
What is this precisely for? What extension or app creates this file?
This folder contains a devcontainer.json file (used by Docker, WSL, and VS Code) to set up my development environment. I’ve added it to the .gitignore file since it’s not relevant to the project. Is there a better way to handle this?
"lower_courts": "Franklin",
"summaries": "Published.",
"case_name_shorts": "",
"authors": "COX, J"
Judge names should be cleaned up.
In this case, "COX, J" stands for "Cox, Judge"; we should delete the "J".
In "PITMAN, C.J", "C.J" stands for "Chief Judge".
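A minimal sketch of such a cleanup, assuming the title abbreviation always trails the surname after a comma (the helper name and regex are illustrative, not the PR's code):

```python
import re

def clean_judge_name(raw: str) -> str:
    """Drop a trailing judicial-title abbreviation such as ', J',
    ', J.', ', C.J' or ', C.J.' and normalize the casing,
    e.g. 'COX, J' -> 'Cox' and 'PITMAN, C.J' -> 'Pitman'."""
    name = re.sub(r",\s*(C\.?\s*)?J\.?$", "", raw.strip())
    return name.title()

print(clean_judge_name("COX, J"))       # Cox
print(clean_judge_name("PITMAN, C.J"))  # Pitman
```

Note that `.title()` can mis-case names like "O'NEAL"; a real implementation might prefer juriscraper's judge-string helpers instead.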
return

tables = self.html.cssselect("table#datatable")
if tables and tables[0].cssselect("tbody tr"):
you can use an early return here to save 1 indentation level
if not tables or not tables[0].cssselect("tbody tr"):
return
# Skip if before first opinion date
if case_date < self.first_opinion_date.date():
    continue

# Only apply date filtering during backscrape
if (
    hasattr(self, "back_scrape_iterable")
    and self.back_scrape_iterable
):
    if self.target_date:
        target_month = self.target_date.month
        target_year = self.target_date.year
        if (
            case_date.year != target_year
            or case_date.month != target_month
        ):
            continue
I think all of this should be abstracted out into a function
if skip_row_by_date(case_date):
continue
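One way such a helper could look; this is a sketch, with `FIRST_OPINION_DATE` standing in for the scraper's `first_opinion_date` attribute and `target_date` for the backscrape target:

```python
from datetime import date
from typing import Optional

FIRST_OPINION_DATE = date(2019, 7, 17)  # mirrors the scraper's class attribute

def skip_row_by_date(case_date: date, target_date: Optional[date]) -> bool:
    """Return True if a row should be skipped: it predates the court's
    first opinion, or (during a backscrape) falls outside the target month."""
    if case_date < FIRST_OPINION_DATE:
        return True
    if target_date and (case_date.year, case_date.month) != (
        target_date.year,
        target_date.month,
    ):
        return True
    return False

print(skip_row_by_date(date(2018, 1, 1), None))              # True
print(skip_row_by_date(date(2024, 3, 5), date(2024, 3, 1)))  # False
```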
"author": get_row_column_text(row, 4),
"disposition": get_row_column_text(row, 5),
"lower_court": get_row_column_text(row, 6),
"summary": get_row_column_text(row, 7),
This is not what we consider a "summary"; I would just ignore that column, unless you can find a row with an actual summary.
You can check the definition of fields (most times the juriscraper name matches the CL name, sometimes it is not so straightforward) at https://github.com/freelawproject/courtlistener/blob/main/cl/search/models.py
Ok, will check it, thanks!
"date_filed_is_approximate": false,
"dispositions": "Affirmed",
"docket_numbers": "55,974-CA",
"lower_courts": "Franklin",
For lower_courts we usually want the full name. You can get it by looking into the PDFs themselves
In this case:
"Fifth Judicial District Court for the Parish of Franklin, Louisiana"
You could use a mapper from short name to full name. A similar idea, for the inverse, is used here:
lower_court_to_abbreviation = {
    "USBC - District of New Hampshire": "NH",
    "USBC - District of Massachusetts (Worcester)": "MW",
    "USBC - District of Puerto Rico": "PR",
    "USBC - District of Massachusetts (Boston)": "MB",
    "USBC - District of Maine (Portland)": "EP",
    "USBC - District of Rhode Island": "RI",
    "Bankrupcty Court of ME - Bangor": "EB",
    "Bankruptcy Court of MA - Boston": "MB",
    "Bankruptcy Court of MA - Springfield": "MS",
    "Bankruptcy Court of ME - Portland": "EP",
    "Bankruptcy Court - Rhode Island": "RI",
    "Bankruptcy Court - San Juan Puerto Rico": "PR",
    "Bankruptcy Court of MA - Worcester": "MW",
    "Bankruptcy Court - Ponce Puerto Rico": "PR",
    "Bankruptcy Court of NH, Concord": "NH",
}
You could use such a mapper; but I have seen some irregularities such as
- OUACHITA Monroe City Court
- Ouachita OWC District 1-E
I think we could just keep the most common cases, and ignore weirder ones
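A sketch of that compromise: map only the common short names and let unmapped (irregular) values pass through unchanged. The two entries shown come from the full names quoted in this thread; the rest would be filled in the same way.

```python
# Partial mapping; entries are assumed based on the full names quoted above
lower_court_map = {
    "Franklin": "Fifth Judicial District Court for the Parish of Franklin, Louisiana",
    "Caddo": "First Judicial District Court for the Parish of Caddo, Louisiana",
}

def expand_lower_court(short_name: str) -> str:
    # Irregular values like "OUACHITA Monroe City Court" fall through as-is
    return lower_court_map.get(short_name.strip(), short_name.strip())

print(expand_lower_court("Franklin"))
print(expand_lower_court("OUACHITA Monroe City Court"))  # unchanged
```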
Ok, I’ll create a mapper. Also, I assumed that "Trial Court of Origin" on the court webpage is the same as "lower court". Is that correct @grossir?
It is correct
self.target_date = target_date
self.year = target_date.year
self.url = f"{self.base_url}?opinion_year={self.year}"
self.html = self._download()
You should comment here that pagination is not required, since all data are sent on the first request.
self.html = self._download()
self._process_html()

def make_backscrape_iterable(self, kwargs):
Why not generate only unique years, saving the filter dates and using those dates as filters instead of a month? Since all data for a year is sent on a single request, we are making X extra requests just to make the filtering easier for ourselves.
@grossir Made the updates based on your feedback, let me know if it looks good
Great, thanks @giancohs, I think we are just missing a few details and I will send this for a final review to Bill
class Site(OpinionSiteLinear):
    first_opinion_date = datetime(2019, 7, 17)
    days_interval = 28  # Monthly interval
    abbreviation_to_lower_court = {
This looks good. The district numbering seems odd but can be checked here
https://www.lasc.org/MapsofJudicialDistricts?p=DistrictCourtMap
https://www.lasc.org/MapsofJudicialDistricts?p=COAMap
Thanks! I checked each district abbreviation against a sample PDF opinion to be sure, it should be ok
Take a moment to look at other scrapers: all, or nearly all, follow a common format for the module header comment.
I think it would be good practice to do the same for this new scraper.
"""
Scraper for Armed Services Board of Contract Appeals
CourtID: asbca
Court Short Name: ASBCA
Author: Jon Andersen
Reviewer: mlr
History:
2014-09-11: Created by Jon Andersen
2016-03-17: Website and phone are dead. Scraper disabled in init.py.
"""

def make_backscrape_iterable(self, kwargs):
    super().make_backscrape_iterable(kwargs)
    self.back_scrape_iterable = unique_year_month(
On a second look, I think we should just use the year as a filter, like here
def make_backscrape_iterable(self, kwargs: dict) -> None:
    """Checks if backscrape start and end arguments have been passed
    by caller, and parses them accordingly

    :param kwargs: passed when initializing the scraper, may or
        may not contain backscrape controlling arguments
    :return None
    """
    start = kwargs.get("backscrape_start")
    end = kwargs.get("backscrape_end")
    start = int(start) if start else self.start_year
    end = int(end) + 1 if end else self.current_year
    self.back_scrape_iterable = range(start, end)
You can override make_backscrape_iterable
as in the above example, and it will work this way:
python sample_caller.py -c juriscraper.opinions.united_states.state.lactapp_2 -vvv --backscrape --backscrape-start=2019 --backscrape-end=2023
Then, you can just keep using the hardcoded "start date", as it is already in place:
# Skip if before first opinion date
if case_date < self.first_opinion_date.date():
    return True
So, you could just delete this whole block
# Only apply date filtering during backscrape
if hasattr(self, "back_scrape_iterable") and self.back_scrape_iterable:
    if self.target_date:
        target_month = self.target_date.month
        target_year = self.target_date.year
        if (
            case_date.year != target_year
            or case_date.month != target_month
        ):
            return True
I think this simplifies the code a lot.
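Under that approach, the backscrape iterable reduces to a plain `range` of years; here is a standalone sketch of the logic from the snippet above (the default bounds are assumed values, not the scraper's actual attributes):

```python
def make_year_iterable(backscrape_start, backscrape_end,
                       start_year=2019, current_year=2025):
    """Build the backscrape iterable as a range of years, defaulting to
    the court's first year and the current year when no bounds are given."""
    start = int(backscrape_start) if backscrape_start else start_year
    end = int(backscrape_end) + 1 if backscrape_end else current_year
    return range(start, end)

print(list(make_year_iterable("2019", "2023")))
# [2019, 2020, 2021, 2022, 2023]
```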
Given the changes to parsing, you need to delete and recreate the file lactapp_2_example.compare.json.
Delete it manually and recreate it by running
python -m unittest -v tests.local.test_ScraperExampleTest
It is currently failing the tests
Made the last changes you asked for @grossir. Tests are passing now, let me know if there’s anything else to improve. Thanks!
This looks great. Just a minor comment on unused imports.
I am sending this to @flooie for final approval
@@ -0,0 +1,148 @@
from datetime import date, datetime
Unused imports
- from datetime import date
- from juriscraper.lib.date_utils import unique_year_month

# Swap files
*.swp
*~
*~
What’s happening here?
"Caddo": "First Judicial District Court for the Parish of Caddo, Louisiana",
"Ouachita": "Fourth Judicial District Court for the Parish of Ouachita, Louisiana",
"Bossier": "Twenty-Sixth Judicial District Court for the Parish of Bossier, Louisiana",
"DeSoto": "Forty-Second Judicial District Court for the Parish of DeSoto, Louisiana",
"Lincoln": "Third Judicial District Court for the Parish of Lincoln, Louisiana",
"Webster": "Twenty-Sixth Judicial District Court for the Parish of Webster, Louisiana",
"Franklin": "Fifth Judicial District Court for the Parish of Franklin, Louisiana",
"Richland": "Fifth Judicial District Court for the Parish of Richland, Louisiana",
"Union": "Third Judicial District Court for the Parish of Union, Louisiana",
"Winn": "Eighth Judicial District Court for the Parish of Winn, Louisiana",
"Morehouse": "Fourth Judicial District Court for the Parish of Morehouse, Louisiana",
"Claiborne": "Second Judicial District Court for the Parish of Claiborne, Louisiana",
"Ouachita Monroe City Court": "Monroe City Court for the Parish of Ouachita, Louisiana",
I appreciate you attempting to capture all of these, but this seems to be a fragile approach.
I would strip this out and use extract_from_text to capture lower_court_str
Something like this should work ...
def extract_from_text(self, scraped_text):
    match = re.search(
        r"Appealed from the\s*(.*?\s*),\s*Louisiana", scraped_text, re.DOTALL
    )
    if match:
        result = match.group(1).replace("\n", " ")
        return {
            "Docket": {
                "appeal_from_str": result,
            },
        }
    return {}
It will require a corresponding addition to the extract_from_text tests.
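The suggested regex can be exercised against a fabricated snippet of opinion text (illustrative input, not from an actual PDF):

```python
import re

scraped_text = """55,974-CA
Appealed from the
Fifth Judicial District Court for the
Parish of Franklin, Louisiana"""

# re.DOTALL lets the lazy group span line breaks inside the court name
match = re.search(
    r"Appealed from the\s*(.*?\s*),\s*Louisiana", scraped_text, re.DOTALL
)
result = match.group(1).replace("\n", " ")
print(result)  # Fifth Judicial District Court for the Parish of Franklin
```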
self.base_url = "https://www.la2nd.org/opinions/"
self.year = datetime.now().year
self.url = f"{self.base_url}?opinion_year={self.year}"
self.status = "Published"
Status is not always "Published", so you can't just assign it; there are two opinions in 2024 that do not share that distinction.
Thankfully you can just use
status_str = get_row_column_text(row, 7)
to get that status_str, and do something like this
status_str = get_row_column_text(row, 7)
status = "Published" if "Published" in status_str else "Unpublished"
to get the status.
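That suggestion as a self-contained sketch (the "Published." value comes from the example JSON earlier in this thread):

```python
def parse_status(status_str: str) -> str:
    # Any cell containing "Published" is treated as a published opinion
    return "Published" if "Published" in status_str else "Unpublished"

print(parse_status("Published."))    # Published
print(parse_status("On Rehearing"))  # Unpublished
```

Note that a substring check would also match a hypothetical value like "Not Published"; if the court ever uses such a label, a stricter comparison would be needed.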
def _download(self):
    html = super()._download()
    # Currently there are no opinions for 2025, so we need to go back one year
    if html is not None:
        tables = html.cssselect("table#datatable")
        if not tables or not tables[0].cssselect("tbody tr"):
            self.year -= 1
            self.url = f"{self.base_url}?opinion_year={self.year}"
            return self._download()
    return html
You should just delete this _download method here. If we don't capture anything, that's fine with me.
self.court_id = self.__module__
self.base_url = "https://www.la2nd.org/opinions/"
self.year = datetime.now().year
self.url = f"{self.base_url}?opinion_year={self.year}"
I've been trained by @grossir to use better tools around making URLs
What about something like this
params = {"opinion_year": self.year}
self.url = urljoin(base_url, f"?{urlencode(params)}")
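A runnable sketch of that URL construction, stdlib only:

```python
from urllib.parse import urljoin, urlencode

base_url = "https://www.la2nd.org/opinions/"
params = {"opinion_year": 2024}
# A relative reference consisting only of a query string keeps the
# base path and swaps in the new query
url = urljoin(base_url, f"?{urlencode(params)}")
print(url)  # https://www.la2nd.org/opinions/?opinion_year=2024
```

`urlencode` also takes care of percent-escaping should a parameter ever contain reserved characters.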
if self.html is None:
    return
I think this is redundant; if self.html is None, it won't run.
if not tables or not tables[0].cssselect("tbody tr"):
    return

logger.info(f"Processing cases for year: {self.year}")
same here I think
this could just be
def _process_html(self):
    tables = self.html.cssselect("table#datatable")
    if not tables:
        logger.info(f"No data found, {self.year}")
        return
    logger.info(f"Processing cases for year: {self.year}")
    for row in tables[0].cssselect("tbody tr"):
author = get_row_column_text(row, 4)
clean_author = self.clean_judge_name(author)
I appreciate the clean author method, but you should just chuck it; we have tools to help clean up judge strings. I think
author_str = get_row_column_text(row, 4)
normalize_judge_string(author_str)[0]
will most likely solve it, and it will make the judge name nice and neat.
from juriscraper.lib.judge_parsers import (
    normalize_judge_string,
)
@giancohs this is a great first effort. Thank you for submitting this. I haven't tested the back scraper, but I think we can get this close to the finish line with some of the changes.
Solves #1296