Get citations using scrapers #858

flooie opened this issue Jan 12, 2024 · 13 comments

flooie commented Jan 12, 2024

I'm going to workshop my thoughts on prioritization here, and I welcome feedback and ideas.

@flooie self-assigned this Jan 12, 2024
@flooie moved this from 🆕 New to 🏗 In progress in @flooie's backlog Jan 12, 2024

flooie commented Jul 19, 2024

@grossir, can you please add your suggestion for using backscrapers to collect citations or other material posted later?


grossir commented Jul 19, 2024

Sure @flooie

This would work for sources that:

  1. Have a "citation" column on their HTML pages
  2. Leave that column as a placeholder for some time, until the court populates it

An example is md; compare these two images from 2023 and 2024 (the current year), where citations are not yet populated.

[images of the md opinion listings for 2023 and 2024]

The approach is to run the backscraper with a custom caller. Here is some pseudocode

from juriscraper.opinions.united_states.state import md as scraper_module
from juriscraper.lib.importer import site_yielder
from cl.search.models import Opinion, OpinionCluster
from cl.scrapers.management.commands.cl_scrape_opinions import make_citation
import logging


logger = logging.getLogger(__name__)

class CitationCollector:
    def scrape_citations(self, start_date, end_date):
        for site in site_yielder(
            scraper_module.Site(
                backscrape_start=start_date,
                backscrape_end=end_date,
            ).back_scrape_iterable,
            scraper_module,
        ):
            # get case dicts by parsing HTML
            site.parse()
            
            court_id = scraper_module.court_id.split("/")[-1].split("_")[0]
            
            for record in site:
                citation = record['citations']
                if not citation:
                    continue
                
                # get cluster using download_url (or a hash of the document)
                try:
                    cluster = Opinion.objects.get(download_url=record['download_urls']).cluster
                except Opinion.DoesNotExist:
                    logger.warning("No opinion found for %s", record['download_urls'])
                    continue

                # check if citation exists
                if self.citation_exists(citation, cluster):
                    logger.info("Citation already exists '%s' for cluster %s", record['citations'], cluster.id)
                    continue

                citation = make_citation(citation, cluster, court_id)
                citation.save()
                
    def citation_exists(self, citation, cluster):
        """To implement"""
        return False
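
The citation_exists stub could be filled in along these lines. This is only a sketch: it reuses make_citation the same way the loop above does, and it assumes CourtListener's Citation model exposes volume/reporter/page fields through the cluster's citations related manager; treat those names as assumptions rather than confirmed API.

    def citation_exists(self, citation_str, cluster):
        """Return True if an equivalent citation is already attached to the cluster.

        Sketch only: assumes make_citation() returns an unsaved Citation
        (or None when parsing fails), and that Citation rows expose
        volume/reporter/page via cluster.citations.
        """
        parsed = make_citation(citation_str, cluster, cluster.docket.court_id)
        if not parsed:
            # could not parse the scraped string; treat as "not found"
            return False
        return cluster.citations.filter(
            volume=parsed.volume,
            reporter=parsed.reporter,
            page=parsed.page,
        ).exists()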
    

@mlissner

Simple enough. Is it a good idea to analyze this across all states to figure out:

  • Which have it
  • How delayed each is
  • How far back each goes
  • How difficult each is to scrape
  • ?

Thank you guys.

Also, should we spin this off into its own ticket and task? My hope was to use this issue to discuss the high-level architecture of a new Juriscraper system, not features we want to add.


flooie commented Jul 23, 2024

I have a spreadsheet that looks at each state and where these citations could be pulled from. In many cases the citations appear later on the scraped pages, and in others there is a second cite that could be scraped. The two marked probable are Lexis or West cites that could (maybe) be scraped.

https://docs.google.com/spreadsheets/d/1zYP_4ivL2XQF8mlrgdTmzXB57sTn6UYv8GrrRkq7X5Q/edit?usp=sharing

| State cites | Count |
| --- | --- |
| YES | 27 |
| PROBABLE | 2 |
| UNCLEAR | 6 |
| NO | 16 |

10 with neutral citations

@mlissner

That's not too bad! Let's keep filling this in with info about how far back each goes, and things like that.


flooie commented Jul 23, 2024

Yes, but I think many of these links are unrelated to the current scrapers, so it's more of a jumping-off point for this.


grossir commented Jul 24, 2024

Here is a draft list that answers the questions "Which have it?", "How delayed each is?", and "How difficult each is to scrape?", organized by difficulty in the sense of whether we already have the scraper implemented.

I haven't checked how far back each goes.

Sources that publish citations at the same URL we scrape

In other words, we just need to run (or implement) the backscraper with a custom caller

| Source | Time lag until citation is published | Example |
| --- | --- | --- |
| md | 1 year | See above; most recent citation is from August 2023 |
| scotus_slip | 1 month | Most recent citation is 602 U.S. 406 for 22-976 Garland v. Cargill, published on June 14, 2024 |
| colo | 3 months | Earliest non-neutral citation I could find is 545 P.3d 942 for a decision from 15 April 2024, which we do not have in the CL |
| minn | 3 months | Earliest citation I could find is 5 N.W.3d 680 for an opinion from May 1st 2024 (today is August 12th) |
| ohio | 6 months | Earliest citation I could find is 175 Ohio St.3d 155 for an opinion from April 4th 2024 (today is Sep 10th) |

Sources that have a neutral citation inside the opinion's document, but we didn't extract it

To collect past neutral citations, we would need to run the recently updated scraper with extract_from_text against older Opinion.plain_text already in the DB

| Source | Citation extractor implemented in | Status | Date since collected | Citations added |
| --- | --- | --- | --- | --- |
| vt | PR | Ran | 2017-01-01 | 658 |
| wis | Sep 3rd, 2024. PR | Ran | 2020-01-01 | 386 |
| wisctapp | ... | Ran | 2020-01-01 | 0 |
| pasuperct | PR | Pending | ? | ? |
| or and orctapp | TBD | Pending | ? | ? |

Sources that need a backscraper for a different URL than we scrape

In other words, the backscraper may need to go into the united_states_backscrapers folder, if not a different category folder; a rough skeleton is sketched after the table below.

| Source | Time lag until citation is published | Example | Modification required |
| --- | --- | --- | --- |
| okla | 2 months | Most recent citation 549 P.3d 1260 for case published in 05/21/2024, KNOX v. OKLAHOMA GAS AND ELECTRIC CO. We don't have the citation in CL | We just changed the target URL, but we have code in the Git history to scrape and parse the site where citations are published. |
| conn | 1.5 months | Most recent citation 349 Conn. 417 for case published in 06/25/2024 | We would have to scrape a different page and extract the data from PDFs, but they are nicely separated, 1 link per each opinion back to volume 326 from 2017. Before that, back to volume 320, it is a single PDF for all opinions. |
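
A rough skeleton of what such a citation backscraper could look like, following the pattern of existing juriscraper scrapers (OpinionSiteLinear, back_scrape_iterable, _download_backwards). Everything specific here is a placeholder: the URL, the table layout, and the case-dict keys are hypothetical, not the real okla or conn pages.

from juriscraper.OpinionSiteLinear import OpinionSiteLinear


class Site(OpinionSiteLinear):
    # hypothetical archive URL, one page per year
    base_url = "https://example-court.example/opinions/{year}"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.court_id = self.__module__
        self.url = self.base_url.format(year=2024)
        # one backscrape step per year of the archive
        self.back_scrape_iterable = list(range(2017, 2025))

    def _process_html(self):
        # placeholder parsing: pull the name, date, citation and PDF link
        # out of a hypothetical results table
        for row in self.html.xpath("//table[@id='opinions']//tr[td]"):
            self.cases.append({
                "name": row.xpath("string(td[1])").strip(),
                "date": row.xpath("string(td[2])").strip(),
                "citation": row.xpath("string(td[3])").strip(),
                "url": row.xpath("string(td[4]//a/@href)"),
                "status": "Published",
            })

    def _download_backwards(self, year):
        # re-point the scraper at the archive page for `year` and re-parse
        self.url = self.base_url.format(year=year)
        self.html = self._download()
        self._process_html()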

@mlissner

Thanks @grossir. Should we rename this issue to be about capturing citations, and make a new one to talk about Juriscraper 3.0 architecture?


grossir commented Aug 22, 2024

Happy to report that the citation backscraper is working; I just ran it in prod on md and will soon run it on scotus_slip.


Added 305 citations by running

manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.state.md --backscrape-start=2019 --backscrape-end=2023 --verbosity 3

Also added 89 opinions, some of which may be opinions we already had, for which the hash has changed due to corrections
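
A quick, hedged way to sanity-check a run like this (it only assumes the usual Citation/OpinionCluster/Docket relations in CourtListener) is to count the citations attached to that court's clusters before and after the backscrape:

from cl.search.models import Citation

# number of citations currently attached to Maryland clusters; compare
# this value before and after running cl_back_scrape_citations
print(Citation.objects.filter(cluster__docket__court_id="md").count())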


grossir commented Aug 23, 2024

We ran this for scotus_slip, only term 22, and it duplicated all records from that term. If the duplications are not too big of a problem, we could run it for all of scotus_slip and get all the citations we are missing.

Anyway, it would be very nice to address the duplication problem: freelawproject/courtlistener#3803

The command:

manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.federal_appellate.scotus_slip --backscrape-start=2023/01/01 --backscrape-end=2023/06/01 --verbosity 3
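
Until freelawproject/courtlistener#3803 is addressed, one rough way to surface the duplicates created by a run like this is sketched below; it assumes the duplicated opinions end up as extra clusters on the same docket with the same filing date, which may not hold for every case.

from django.db.models import Count
from cl.search.models import OpinionCluster

# dockets with more than one cluster filed on the same date are likely
# duplicates created by re-scraping opinions we already had
likely_dupes = (
    OpinionCluster.objects.filter(docket__court_id="scotus")
    .values("docket_id", "date_filed")
    .annotate(n_clusters=Count("id"))
    .filter(n_clusters__gt=1)
)
for row in likely_dupes:
    print(row["docket_id"], row["date_filed"], row["n_clusters"])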

@mlissner

Yikes, those duplicates aren't great, no. Let's clean that up somehow, and figure out how to avoid dups before we have 20M opinions. :)


grossir commented Sep 10, 2024

For sources where the citations are inside the document's text but we only recently implemented extract_from_text to get them, we can run a script like the following (currently, we can do this for vt, wis, and wisctapp):

from juriscraper.opinions.united_states.state.vt import Site
from cl.search.models import OpinionCluster, Citation
from django.db import transaction
import traceback

"""
Tested with the following clusters:

Already has a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 4335586

Recent document, Doesn't have a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 10099996

Is an order, doesn't have a neutral citation
python manage.py clone_from_cl --type search.OpinionCluster --id 10044928

Old document (2017), doesn't have a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 4489376
"""


site = Site()
# according to the citations search page, 
# latest VT neutral citations we have are from 2015
# https://www.courtlistener.com/c/vt/

# However, we can find neutral citations from 2017?
# https://www.courtlistener.com/opinion/4335586/representative-donald-turner-jr-and-senator-joseph-benning-v-governor/

query = """
SELECT *
FROM search_opinioncluster
WHERE 
        docket_id IN (SELECT id FROM search_docket sd WHERE court_id = 'vt') 
    AND
        id NOT IN (
            SELECT cluster_id 
            FROM search_citation
            WHERE reporter = 'VT'
        ) 
    AND
        precedential_status = 'Published'
    AND
        date_filed > '2018-01-01'::date
"""
# This query selects all 'vt' opinion clusters filed in 2018 or later
# which do not have a "VT" reporter neutral citation
# It queries over indexes

success, failure, iterated = 0, 0, 0
queryset = OpinionCluster.objects.raw(query).prefetch_related('sub_opinions')
for cluster in queryset:
    iterated += 1
    
    for opinion in cluster.sub_opinions.all():
        metadata = site.extract_from_text(opinion.plain_text)
        if not metadata:
            continue

        citation_kwargs = metadata['Citation']
        # attach the citation to this cluster before creating it
        citation_kwargs['cluster_id'] = cluster.id
        
        try:
            with transaction.atomic():
                Citation.objects.create(**citation_kwargs)
                print(f"Created citation {citation_kwargs}")
            success += 1
        except Exception:
            print(f"Failed creating citation for {citation_kwargs}")
            print(traceback.format_exc())
            failure += 1
        
print(f"Created {success}\nFailed {failure}\nIterated {iterated}")

grossir added a commit to freelawproject/courtlistener that referenced this issue Oct 1, 2024
Helps solve: freelawproject/juriscraper#858

- New command to re-run Site.extract_from_text over downloaded opinions
- Able to filter by Docket.court_id, OpinionCluster.date_filed, OpinionCluster.precedential_status
- Updates tasks.update_from_document_text to return information for logging purposes
- Updates test_opinion_scraper to get a Site.extract_from_text method
@grossir changed the title from "Big Picture Priorities" to "Get citations using scrapers" Nov 27, 2024
@rlfordon

I've noticed two citation gaps in Ohio, both documented in courtlistener issue #3882.

  1. Missing neutral citations in unpublished cases. I think this happened because Ohio added neutral citations at some point, possibly after we had already scraped. It's possible other states have done this. I haven't systematically tested the extent of this, but I think there are lots of these.
  2. Missing neutral citations in published cases. Again, some webcites have been added retroactively, so if we got print cases from Harvard (especially 1990s and early 2000s), we may not have the neutral citation parallel cite.

Both of these issues have increased urgency because, as I note in that issue, the Ohio Supreme Court has changed its style rules to require only neutral citations when they are available, so we're going to start seeing a lot of new published opinions that refer to prior cases only by neutral citation.
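
To get a rough sense of the size of these gaps, a query along these lines could work; it assumes Citation.type has a NEUTRAL constant and uses 'ohio' as an example court ID (the appellate courts would need their own IDs):

from cl.search.models import Citation, OpinionCluster

# Ohio Supreme Court clusters with no neutral citation attached at all
missing_neutral = (
    OpinionCluster.objects.filter(docket__court_id="ohio")
    .exclude(citations__type=Citation.NEUTRAL)
    .count()
)
print(f"ohio clusters without a neutral citation: {missing_neutral}")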
