Get citations using scrapers #858

flooie opened this issue Jan 12, 2024 · 13 comments

flooie commented Jan 12, 2024

I'm going to workshop my thoughts on prioritization here, and I welcome feedback and ideas.

@flooie self-assigned this Jan 12, 2024
@flooie moved this from 🆕 New to 🏗 In progress in @flooie's backlog Jan 12, 2024

flooie commented Jul 19, 2024

@grossir, can you please add your suggestion for using backscrapers to collect citations or other material posted later?


grossir commented Jul 19, 2024

Sure @flooie

This would work for sources that:

  1. Have a "citation" column on their HTML pages
  2. Leave that column as a placeholder for some time, until the court populates it

An example is md; compare these two images from 2023 and 2024 (the current year), where citations are not yet populated.

[images of the md opinion listings for 2023 and 2024]

The approach is to run the backscraper with a custom caller. Here is some pseudocode

from juriscraper.opinions.united_states.state import md as scraper_module
from juriscraper.lib.importer import site_yielder
from cl.search.models import Opinion, OpinionCluster
from cl.scrapers.management.commands.cl_scrape_opinions import make_citation
import logging


logger = logging.getLogger(__name__)

class CitationCollector:
    def scrape_citations(self, start_date, end_date):
        for site in site_yielder(
            scraper_module.Site(
                backscrape_start=start_date,
                backscrape_end=end_date,
            ).back_scrape_iterable,
            scraper_module,
        ):
            # get case dicts by parsing HTML
            site.parse()
            
            court_id = scraper_module.court_id.split("/")[-1].split("_")[0]
            
            for record in site:
                citation = record['citations']
                if not citation:
                    continue
                
                # get cluster using download_url (or a hash of the document)
                try:
                    cluster = Opinion.objects.get(download_url=record['download_urls']).cluster
                except Opinion.DoesNotExist:
                    logger.warning("No opinion found for %s", record['download_urls'])
                    continue

                # check if citation exists
                if self.citation_exists(citation, cluster):
                    logger.info("Citation already exists '%s' for cluster %s", record['citations'], cluster.id)
                    continue

                citation = make_citation(citation, cluster, court_id)
                citation.save()
                
    def citation_exists(self, citation, cluster):
        """To implement"""
        return False
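
The citation_exists stub could be filled in along these lines. This is only a sketch: it reuses make_citation the same way the loop above does, and it assumes CourtListener's Citation model exposes volume/reporter/page fields through the cluster's citations related manager; treat those names as assumptions rather than confirmed API.

    def citation_exists(self, citation_str, cluster):
        """Return True if an equivalent citation is already attached to the cluster.

        Sketch only: assumes make_citation() returns an unsaved Citation
        (or None when parsing fails), and that Citation rows expose
        volume/reporter/page via cluster.citations.
        """
        parsed = make_citation(citation_str, cluster, cluster.docket.court_id)
        if not parsed:
            # could not parse the scraped string; treat as "not found"
            return False
        return cluster.citations.filter(
            volume=parsed.volume,
            reporter=parsed.reporter,
            page=parsed.page,
        ).exists()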
    

@mlissner

Simple enough. Is it a good idea to analyze this across all states to figure out:

  • Which have it
  • How delayed each is
  • How far back each goes
  • How difficult each is to scrape
  • ?

Thank you guys.

Also, should we spin this off into its own ticket and task? My hope was to use this issue to discuss the high-level architecture of a new Juriscraper system, not features we want to add.


flooie commented Jul 23, 2024

I have a spreadsheet that looks at each state and where these citations could be pulled from. In many cases the citations appear later on the scraped pages, and in others there is a second cite that could be scraped. The two marked probable are Lexis or West cites that could (maybe) be scraped.

https://docs.google.com/spreadsheets/d/1zYP_4ivL2XQF8mlrgdTmzXB57sTn6UYv8GrrRkq7X5Q/edit?usp=sharing

| State cites | Count |
| --- | --- |
| YES | 27 |
| PROBABLE | 2 |
| UNCLEAR | 6 |
| NO | 16 |

10 with neutral citations

@mlissner

That's not too bad! Let's keep filling this in with info about how far back each goes, and things like that.


flooie commented Jul 23, 2024

Yes, but I think many of these links are unrelated to the current scrapers, so it's more of a jumping-off point for this.


grossir commented Jul 24, 2024

Here is a draft list that answers the questions "Which have it?", "How delayed each is?", and "How difficult each is to scrape?", organized by difficulty in the sense of whether we already have the scraper implemented.

I haven't checked how far back each goes.

Sources that publish citations at the same URL we scrape

In other words, we just need to run (or implement) the backscraper with a custom caller

| Source | Time lag until citation is published | Example |
| --- | --- | --- |
| md | 1 year | See above; most recent citation is from August 2023 |
| scotus_slip | 1 month | Most recent citation is 602 U.S. 406 for 22-976 Garland v. Cargill, published on June 14, 2024 |
| colo | 3 months | Earliest non-neutral citation I could find is 545 P.3d 942 for a decision from 15 April 2024, which we do not have in the CL |
| minn | 3 months | Earliest citation I could find is 5 N.W.3d 680 for an opinion from May 1st 2024 (today is August 12th) |
| ohio | 6 months | Earliest citation I could find is 175 Ohio St.3d 155 for an opinion from April 4th 2024 (today is Sep 10th) |

Sources that have a neutral citation inside the opinion's document, but we didn't extract it

To collect past neutral citations, we would need to run the recently updated scraper with extract_from_text against older Opinion.plain_text already in the DB

| Source | Citation extractor implemented in | Status | Date since collected | Citations added |
| --- | --- | --- | --- | --- |
| vt | PR | Ran | 2017-01-01 | 658 |
| wis | Sep 3rd, 2024. PR | Ran | 2020-01-01 | 386 |
| wisctapp | ... | Ran | 2020-01-01 | 0 |
| pasuperct | PR | Pending | ? | ? |
| or and orctapp | TBD | Pending | ? | ? |

Sources that need a backscraper for a different URL than we scrape

In other words, the backscraper may need to go into the united_states_backscrapers folder, if not a different category folder; a rough skeleton is sketched after the table below.

| Source | Time lag until citation is published | Example | Modification required |
| --- | --- | --- | --- |
| okla | 2 months | Most recent citation 549 P.3d 1260 for case published in 05/21/2024, KNOX v. OKLAHOMA GAS AND ELECTRIC CO. We don't have the citation in CL | We just changed the target URL, but we have code in the Git history to scrape and parse the site where citations are published. |
| conn | 1.5 months | Most recent citation 349 Conn. 417 for case published in 06/25/2024 | We would have to scrape a different page and extract the data from PDFs, but they are nicely separated, 1 link per each opinion back to volume 326 from 2017. Before that, back to volume 320, it is a single PDF for all opinions. |
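
A rough skeleton of what such a citation backscraper could look like, following the pattern of existing juriscraper scrapers (OpinionSiteLinear, back_scrape_iterable, _download_backwards). Everything specific here is a placeholder: the URL, the table layout, and the case-dict keys are hypothetical, not the real okla or conn pages.

from juriscraper.OpinionSiteLinear import OpinionSiteLinear


class Site(OpinionSiteLinear):
    # hypothetical archive URL, one page per year
    base_url = "https://example-court.example/opinions/{year}"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.court_id = self.__module__
        self.url = self.base_url.format(year=2024)
        # one backscrape step per year of the archive
        self.back_scrape_iterable = list(range(2017, 2025))

    def _process_html(self):
        # placeholder parsing: pull the name, date, citation and PDF link
        # out of a hypothetical results table
        for row in self.html.xpath("//table[@id='opinions']//tr[td]"):
            self.cases.append({
                "name": row.xpath("string(td[1])").strip(),
                "date": row.xpath("string(td[2])").strip(),
                "citation": row.xpath("string(td[3])").strip(),
                "url": row.xpath("string(td[4]//a/@href)"),
                "status": "Published",
            })

    def _download_backwards(self, year):
        # re-point the scraper at the archive page for `year` and re-parse
        self.url = self.base_url.format(year=year)
        self.html = self._download()
        self._process_html()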

@mlissner

Thanks @grossir. Should we rename this issue to be about capturing citations, and make a new one to talk about Juriscraper 3.0 architecture?


grossir commented Aug 22, 2024

Happy to report that the citation backscraper is working; I just ran it in prod on md and will soon run it on scotus_slip.


Added 305 citations by running

manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.state.md --backscrape-start=2019 --backscrape-end=2023 --verbosity 3

Also added 89 opinions, some of which may be opinions we already had, for which the hash has changed due to corrections
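
A quick, hedged way to sanity-check a run like this (it only assumes the usual Citation/OpinionCluster/Docket relations in CourtListener) is to count the citations attached to that court's clusters before and after the backscrape:

from cl.search.models import Citation

# number of citations currently attached to Maryland clusters; compare
# this value before and after running cl_back_scrape_citations
print(Citation.objects.filter(cluster__docket__court_id="md").count())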


grossir commented Aug 23, 2024

We ran this for scotus_slip, only term 22, and it duplicated all records from that term. If the duplications are not too big of a problem, we could run it for all of scotus_slip and get all the citations we are missing.

Anyway, it would be very nice to address the duplication problem: freelawproject/courtlistener#3803

The command:

manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.federal_appellate.scotus_slip --backscrape-start=2023/01/01 --backscrape-end=2023/06/01 --verbosity 3
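
Until freelawproject/courtlistener#3803 is addressed, one rough way to surface the duplicates created by a run like this is sketched below; it assumes the duplicated opinions end up as extra clusters on the same docket with the same filing date, which may not hold for every case.

from django.db.models import Count
from cl.search.models import OpinionCluster

# dockets with more than one cluster filed on the same date are likely
# duplicates created by re-scraping opinions we already had
likely_dupes = (
    OpinionCluster.objects.filter(docket__court_id="scotus")
    .values("docket_id", "date_filed")
    .annotate(n_clusters=Count("id"))
    .filter(n_clusters__gt=1)
)
for row in likely_dupes:
    print(row["docket_id"], row["date_filed"], row["n_clusters"])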

@mlissner

Yikes, those duplicates aren't great, no. Let's clean that up somehow, and figure out how to avoid dups before we have 20M opinions. :)


grossir commented Sep 10, 2024

For sources where the citations are inside the document's text but we only recently implemented extract_from_text to get them, we can run a script like the following (currently, we can do this for vt, wis, and wisctapp):

from juriscraper.opinions.united_states.state.vt import Site
from cl.search.models import OpinionCluster, Citation
from django.db import transaction
import traceback

"""
Tested with the following clusters:

Already has a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 4335586

Recent document, Doesn't have a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 10099996

Is an order, doesn't have a neutral citation
python manage.py clone_from_cl --type search.OpinionCluster --id 10044928

Old document (2017), doesn't have a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 4489376
"""


site = Site()
# according to the citations search page, 
# latest VT neutral citations we have are from 2015
# https://www.courtlistener.com/c/vt/

# However, we can find neutral citations from 2017?
# https://www.courtlistener.com/opinion/4335586/representative-donald-turner-jr-and-senator-joseph-benning-v-governor/

query = """
SELECT *
FROM search_opinioncluster
WHERE 
        docket_id IN (SELECT id FROM search_docket sd WHERE court_id = 'vt') 
    AND
        id NOT IN (
            SELECT cluster_id 
            FROM search_citation
            WHERE reporter = 'VT'
        ) 
    AND
        precedential_status = 'Published'
    AND
        date_filed > '2018-01-01'::date
"""
# This query selects all 'vt' opinion clusters filed in 2018 or later
# which do not have a "VT" reporter neutral citation
# It queries over indexes

success, failure, iterated = 0, 0, 0
queryset = OpinionCluster.objects.raw(query).prefetch_related('sub_opinions')
for cluster in queryset:
    iterated += 1
    
    for opinion in cluster.sub_opinions.all():
        metadata = site.extract_from_text(opinion.plain_text)
        if not metadata:
            continue

        citation_kwargs = metadata['Citation']
        # attach the citation to this cluster before creating it
        citation_kwargs['cluster_id'] = cluster.id
        
        try:
            with transaction.atomic():
                Citation.objects.create(**citation_kwargs)
                print(f"Created citation {citation_kwargs}")
            success += 1
        except Exception:
            print(f"Failed creating citation for {citation_kwargs}")
            print(traceback.format_exc())
            failure += 1
        
print(f"Created {success}\nFailed {failure}\nIterated {iterated}")

grossir added a commit to freelawproject/courtlistener that referenced this issue Oct 1, 2024
Helps solve: freelawproject/juriscraper#858

- New command to re-run Site.extract_from_text over downloaded opinions
- Able to filter by Docket.court_id, OpinionCluster.date_filed, OpinionCluster.precedential_status
- Updates tasks.update_from_document_text to return information for logging purposes
- Updates test_opinion_scraper to get a Site.extract_from_text method
@grossir changed the title from "Big Picture Priorities" to "Get citations using scrapers" Nov 27, 2024
@rlfordon

I've noticed two citation gaps in Ohio, both documented in courtlistener issue #3882.

  1. Missing neutral citations in unpublished cases. I think this happened because Ohio added neutral citations at some point, possibly after we had already scraped. It's possible other states have done this. I haven't systematically tested the extent of this, but I think there are lots of these.
  2. Missing neutral citations in published cases. Again, some webcites have been added retroactively, so if we got print cases from Harvard (especially 1990s and early 2000s), we may not have the neutral citation parallel cite.

Both of these issues have increased urgency because, as I note in that issue, the Ohio Supreme Court has changed its style rules to require only neutral citations when they are available, so we're going to start seeing a lot of new published opinions that refer to prior cases only by neutral citation.
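
To get a rough sense of the size of these gaps, a query along these lines could work; it assumes Citation.type has a NEUTRAL constant and uses 'ohio' as an example court ID (the appellate courts would need their own IDs):

from cl.search.models import Citation, OpinionCluster

# Ohio Supreme Court clusters with no neutral citation attached at all
missing_neutral = (
    OpinionCluster.objects.filter(docket__court_id="ohio")
    .exclude(citations__type=Citation.NEUTRAL)
    .count()
)
print(f"ohio clusters without a neutral citation: {missing_neutral}")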
