Get citations using scrapers #858
@grossir can you please add your suggestion for using backscrapers to collect citations or other material posted later?
Sure @flooie. This would work for sources that have
An example is The approach is to run the backscraper with a custom caller. Here is some pseudocode:

```python
from juriscraper.opinions.united_states.state import md as scraper_module
from juriscraper.lib.importer import site_yielder
from cl.search.models import Opinion, OpinionCluster
from cl.scrapers.management.commands.cl_scrape_opinions import make_citation
import logging

logger = logging.getLogger(__name__)


class CitationCollector:
    def scrape_citations(self, start_date, end_date):
        for site in site_yielder(
            scraper_module.Site(
                backscrape_start=start_date,
                backscrape_end=end_date,
            ).back_scrape_iterable,
            scraper_module,
        ):
            # get case dicts by parsing HTML
            site.parse()
            court_id = scraper_module.court_id.split("/")[-1].split("_")[0]
            for record in site:
                citation = record["citations"]
                if not citation:
                    continue
                # get cluster using download_url or hash of the document
                cluster = Opinion.objects.get(download_url=record["download_urls"]).cluster
                # check if citation exists
                if self.citation_exists(citation, cluster):
                    logger.info("Citation already exists '%s' for cluster %s", citation, cluster.id)
                    continue
                citation = make_citation(citation, cluster, court_id)
                citation.save()

    def citation_exists(self, citation, cluster):
        """To implement"""
        return False
```
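The `citation_exists` method is left as a stub in the pseudocode above. Outside Django, one way to sketch it is to parse the scraped citation string and compare it against the cluster's known citations. Everything below (the regex, the tuple comparison, the function names) is a hypothetical illustration, not actual Juriscraper or CourtListener code:

```python
import re

# Hypothetical parser for neutral-style citations such as "2023 VT 12":
# "<year-as-volume> <REPORTER> <page>". Real citation formats vary by state.
NEUTRAL_CITE = re.compile(
    r"^(?P<volume>\d{4})\s+(?P<reporter>[A-Z][A-Za-z. ]*?)\s+(?P<page>\d+)$"
)


def parse_neutral_citation(text):
    """Return a dict with volume/reporter/page, or None if it doesn't match."""
    match = NEUTRAL_CITE.match(text.strip())
    return match.groupdict() if match else None


def citation_exists(citation_str, existing):
    """Check a scraped citation string against an iterable of
    (volume, reporter, page) tuples already stored for the cluster."""
    parsed = parse_neutral_citation(citation_str)
    if not parsed:
        return False
    key = (int(parsed["volume"]), parsed["reporter"], int(parsed["page"]))
    return key in set(existing)
```

In the real implementation the `existing` tuples would come from the cluster's `Citation` rows; the comparison logic stays the same.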
Simple enough. Is it a good idea to analyze this across all states to figure out:
Thank you guys. Also, should we spin this off into its own ticket and task? My hope was to use this issue to discuss the high-level architecture of a new Juriscraper system, not features we want to add.
I have a spreadsheet that looks at each state and where these citations could be pulled from. In many cases the citations appear later on the scraped pages, and in others there is a second cite that could be scraped; those are probably Lexis or West cites that could be scraped (maybe). https://docs.google.com/spreadsheets/d/1zYP_4ivL2XQF8mlrgdTmzXB57sTn6UYv8GrrRkq7X5Q/edit?usp=sharing
10 with neutral citations
That's not too bad! Let's keep filling this in with info about how far back each goes, and things like that.
Yes, but I think many of these links are unrelated to the current scrapers, so it's more of a jumping-off point for this.
A draft list that answers the questions, organized by how difficult each is to scrape, in the sense of whether we already have the scraper implemented. I haven't checked how far back each goes.

1. Sources that publish citations in the same URL we scrape. In other words, we just need to run (or implement) the backscraper with a custom caller
2. Sources that have a neutral citation inside the opinion's document, but we didn't extract it. To collect past neutral citations, we would need to run the recently updated scraper with extract_from_text against older
3. Sources that need a backscraper for a different URL than we scrape. In other words, the backscraper may need to go into the
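For the second category, `extract_from_text` essentially means running a pattern over the opinion's plain text. A minimal sketch for a Vermont-style neutral citation such as `2020 VT 47` follows; the regex and the returned dict shape are assumptions modeled on how the scripts in this thread consume the result, not the real `vt` scraper code:

```python
import re


def extract_from_text(plain_text):
    """Sketch of an extract_from_text-style method: scan the start of the
    opinion's text for a neutral citation and return nested metadata."""
    # Assumed pattern: a four-digit year, the "VT" reporter, then a page number.
    match = re.search(
        r"\b(?P<volume>20\d{2})\s+(?P<reporter>VT)\s+(?P<page>\d+)\b",
        plain_text[:2000],  # the citation usually appears near the top
    )
    if not match:
        return {}
    return {
        "Citation": {
            "volume": match["volume"],
            "reporter": match["reporter"],
            "page": match["page"],
        }
    }
```

Running this over older documents is exactly the backfill described above: any hit yields kwargs that can be turned into a citation row.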
Thanks @grossir. Should we rename this issue to be about capturing citations, and make a new one to talk about Juriscraper 3.0 architecture?
To be used for freelawproject#858
Happy to report that the citation backscraper is working; just ran it in prod on
Added 305 citations by running
Also added 89 opinions, some of which may be opinions we already had, for which the hash has changed due to corrections.
We ran this for
Anyway, it would be very nice to address the duplication problem freelawproject/courtlistener#3803
The command:
Yikes, those duplicates aren't great, no. Let's clean that up somehow, and figure out how to avoid dups before we have 20M opinions. :)
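A likely source of those duplicates: when dedup keys off a hash of the downloaded document, a court re-posting a corrected file changes the hash, so the opinion looks new. A minimal illustration follows; sha1 over the raw bytes is an assumption about how the key is computed, not the scrapers' confirmed behavior:

```python
import hashlib


def content_hash(document_bytes):
    # Assumed dedup key: a digest of the raw document content.
    return hashlib.sha1(document_bytes).hexdigest()


original = b"2020 VT 47\nOpinion text with a typoo."
corrected = b"2020 VT 47\nOpinion text with a typo."

# Any correction, however small, yields a different hash, so a purely
# hash-based lookup will not find the existing opinion and will insert
# a duplicate. Matching on a stable key (e.g. download_url, or docket
# number plus filing date) avoids this.
assert content_hash(original) != content_hash(corrected)
```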
For sources where the citations are inside the document's text, but we just recently implemented

```python
from juriscraper.opinions.united_states.state.vt import Site
from cl.search.models import OpinionCluster, Citation
from django.db import transaction
import traceback

"""
Tested with the following clusters:

Already has a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 4335586

Recent document, doesn't have a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 10099996

Is an order, doesn't have a neutral citation
python manage.py clone_from_cl --type search.OpinionCluster --id 10044928

Old document (2017), doesn't have a neutral citation in the system
python manage.py clone_from_cl --type search.OpinionCluster --id 4489376
"""

site = Site()

# according to the citations search page,
# latest VT neutral citations we have are from 2015
# https://www.courtlistener.com/c/vt/
# However, we can find neutral citations from 2017?
# https://www.courtlistener.com/opinion/4335586/representative-donald-turner-jr-and-senator-joseph-benning-v-governor/
query = """
SELECT *
FROM search_opinioncluster
WHERE
    docket_id IN (SELECT id FROM search_docket sd WHERE court_id = 'vt')
    AND id NOT IN (
        SELECT cluster_id
        FROM search_citation
        WHERE reporter = 'VT'
    )
    AND precedential_status = 'Published'
    AND date_filed > '2018-01-01'::date
"""
# This query selects all 'vt' opinion clusters filed in 2018 or later
# which do not have a "VT" reporter neutral citation.
# It queries over indexes

success, failure, iterated = 0, 0, 0
queryset = OpinionCluster.objects.raw(query).prefetch_related("sub_opinions")

for cluster in queryset:
    iterated += 1
    for opinion in cluster.sub_opinions.all():
        metadata = site.extract_from_text(opinion.plain_text)
        if not metadata:
            continue
        citation_kwargs = metadata["Citation"]
        citation_kwargs["cluster_id"] = cluster.id  # link the citation to its cluster
        try:
            with transaction.atomic():
                Citation.objects.create(**citation_kwargs)
            print(f"Created citation {citation_kwargs}")
            success += 1
        except Exception:
            print(f"Failed creating citation for {citation_kwargs}")
            print(traceback.format_exc())
            failure += 1

print(f"Created {success}\nFailed {failure}\nIterated {iterated}")
```
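The `citation_kwargs` handoff in that script can be exercised without Django. The field names and the `cluster_id` foreign-key idiom below are assumptions based on how the script builds `Citation` rows; the cluster id is an example taken from this thread:

```python
# Assumed shape of what extract_from_text returns for a matched document;
# the field names are illustrative, not confirmed against CourtListener's models.
metadata = {"Citation": {"volume": "2020", "reporter": "VT", "page": "47"}}
cluster_id = 4335586  # example cluster id mentioned earlier in this thread

# Copy the citation fields and attach the cluster by id (Django lets you
# set a ForeignKey via the <field>_id attribute), rather than overwriting
# the kwargs dict with the id itself.
citation_kwargs = dict(metadata["Citation"])
citation_kwargs["cluster_id"] = cluster_id
```

With kwargs in this shape, `Citation.objects.create(**citation_kwargs)` has everything it needs in a single call.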
Helps solve: freelawproject/juriscraper#858
- New command to re-run Site.extract_from_text over downloaded opinions
- Able to filter by Docket.court_id, OpinionCluster.date_filed, OpinionCluster.precedential_status
- Updates tasks.update_from_document_text to return information for logging purposes
- Updates test_opinion_scraper to get a Site.extract_from_text method
I've noticed two citation gaps in Ohio, both documented in courtlistener issue #3882.
Both of these issues have increased urgency because, as I note in that issue, Ohio Supreme Court has changed style rules to only require neutral citations when they are available, so we're going to start to see a lot of new published opinions that only refer to prior cases by neutral citation.
I'm going to workshop my thoughts on prioritization here - and welcome feedback and thoughts.