Commit

Merge pull request #287 from georgetown-cset/nov-updates
Update data, improve query ordering and repo matching
jmelot authored Dec 5, 2023
2 parents cb6ffd9 + e0d766e commit 0a8de65
Showing 10 changed files with 15,631 additions and 14,777 deletions.
2 changes: 1 addition & 1 deletion github-metrics/src/data/config.json
@@ -1 +1 @@
-{"start_year": 2017, "end_year": 2023, "last_updated": "October 17, 2023"}
+{"start_year": 2017, "end_year": 2023, "last_updated": "November 17, 2023"}
2 changes: 1 addition & 1 deletion github-metrics/src/data/field_to_repos.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion github-metrics/src/data/fields.json
@@ -1 +1 @@
-["Nuclear magnetic resonance", "Natural language processing", "International economics", "Physical chemistry", "Computer security", "Zoology", "Econometrics", "Advertising", "Machine learning", "Quantum mechanics", "Molecular biology", "Earth Systems", "Emissions", "Algebra", "Ecology", "Climate and Earth Science", "Geometry", "Theoretical computer science", "Simulation", "Optics", "Energy Storage", "Cell biology", "riscv", "Seismology", "Reliability engineering", "Astrobiology", "Computer vision", "Astrophysics", "Atmospheric sciences", "Topology", "Visual arts", "Neuroscience", "Cardiology", "Agronomy", "Speech recognition", "Particle physics", "Medical education", "Chemical physics", "Computational biology", "ai_safety", "Information retrieval", "Sustainable Development", "Bioinformatics", "Distributed computing", "Data science", "Renewable Energy", "Organic chemistry", "Embedded system", "Genetics", "Natural Resources", "Computational chemistry", "Multimedia", "Computational physics", "Quantum electrodynamics", "Gender studies", "Food science", "Epistemology", "Virology", "Pharmacology", "Pattern recognition", "Operating system", "Parallel computing", "Automotive engineering", "Finance", "Meteorology", "Social science", "Linguistics", "World Wide Web", "Calculus", "Consumption of Energy and Resources", "Geophysics", "Control theory", "Energy Systems", "Nuclear physics", "Industrial Ecology", "Artificial intelligence", "Molecular physics", "Mathematical optimization", "Geomorphology", "Oceanography", "Pathology", "Radiology", "Hydrology", "Theoretical physics", "Surgery", "Social psychology", "Botany", "Knowledge management", "Classics", "Financial economics", "Software engineering", "Microbiology", "Remote sensing", "Financial system", "Immunology", "Water resource management", "Mathematical analysis", "Astronomy", "Computer graphics (images)", "Media studies", "Cognitive science", "Cancer research", "Acoustics", "Condensed matter physics", "Thermodynamics", "Anatomy", "Transport engineering", "Evolutionary biology", "weto"]
+["Knowledge management", "weto", "Particle physics", "Cardiology", "Mathematical optimization", "Computational physics", "Algebra", "Financial economics", "Pattern recognition", "Classics", "Financial system", "Embedded system", "Meteorology", "Control theory", "Medical education", "Consumption of Energy and Resources", "Virology", "Genetics", "Machine learning", "Ecology", "Condensed matter physics", "Optics", "Sustainable Development", "Emissions", "Immunology", "Paleontology", "Simulation", "Natural Resources", "Social psychology", "Geophysics", "Information retrieval", "Remote sensing", "Agronomy", "Atmospheric sciences", "Speech recognition", "Bioinformatics", "Advertising", "Theoretical computer science", "Gender studies", "Computer security", "Seismology", "Topology", "Computer graphics (images)", "Quantum mechanics", "Thermodynamics", "Nuclear magnetic resonance", "Astrobiology", "Cell biology", "Geomorphology", "Energy Storage", "Econometrics", "ai_safety", "Computational biology", "Calculus", "Industrial Ecology", "Earth Systems", "Astrophysics", "Chemical physics", "World Wide Web", "Artificial intelligence", "Data science", "Software engineering", "Oncology", "Climate and Earth Science", "Geometry", "Quantum electrodynamics", "Visual arts", "Multimedia", "Molecular physics", "Anatomy", "Astronomy", "Cognitive science", "Media studies", "Theoretical physics", "Oceanography", "Epistemology", "Pathology", "Botany", "Microbiology", "Pharmacology", "Radiology", "International economics", "Zoology", "Endocrinology", "Nuclear physics", "Automotive engineering", "Cancer research", "Neuroscience", "Mathematical analysis", "Food science", "Organic chemistry", "Linguistics", "Finance", "Operating system", "Water resource management", "Transport engineering", "Evolutionary biology", "Computer vision", "Molecular biology", "Acoustics", "Computational chemistry", "Hydrology", "Renewable Energy", "Natural language processing", "Parallel computing", "Surgery", "Distributed computing", "Physical chemistry", "Geotechnical engineering", "Reliability engineering", "Energy Systems", "Social science", "riscv"]
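
The two versions of fields.json list their entries in a different order, so the single-line diff above is hard to read by eye. A minimal sketch of an order-insensitive comparison, assuming each version is available as a JSON string; `diff_fields` and the toy inputs are hypothetical and not part of this commit:

```python
import json


def diff_fields(old_json: str, new_json: str) -> tuple[list[str], list[str]]:
    """Return (added, removed) field names, ignoring list order."""
    old = set(json.loads(old_json))
    new = set(json.loads(new_json))
    return sorted(new - old), sorted(old - new)


# Toy inputs for illustration only, not the real fields.json contents.
old_fields = '["Optics", "Algebra", "riscv"]'
new_fields = '["riscv", "Paleontology", "Optics", "Algebra"]'

added, removed = diff_fields(old_fields, new_fields)
print("added:", added)
print("removed:", removed)
```

Comparing as sets hides reordering noise and surfaces only the entries that were actually added or dropped.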
2 changes: 1 addition & 1 deletion github-metrics/src/data/id_to_repo.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion github-metrics/src/data/name_to_id.json

Large diffs are not rendered by default.

30,365 changes: 15,599 additions & 14,766 deletions github-metrics/static/orca_download.jsonl

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions schemas/repos_with_full_meta_raw.json
@@ -2388,6 +2388,18 @@
 "mode": "NULLABLE",
 "name": "dependabot_security_updates",
 "type": "RECORD"
+},
+{
+"fields": [
+{
+"mode": "NULLABLE",
+"name": "status",
+"type": "STRING"
+}
+],
+"mode": "NULLABLE",
+"name": "secret_scanning_validity_checks",
+"type": "RECORD"
 }
 ],
 "mode": "NULLABLE",
2 changes: 1 addition & 1 deletion scripts/retrieve_repos.py
@@ -81,7 +81,7 @@ def read_bq_repos(self) -> Generator:
 """
 client = bigquery.Client()
 query_job = client.query(
-"SELECT repo, datasets, merged_ids from staging_github_metrics.repos_in_papers"
+"SELECT repo, datasets, merged_ids from staging_orca.repos_in_papers"
 )
 results = query_job.result()
 for row in results:
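
The hunk above changes only the dataset named inside the query string. One way to keep such a rename to a single argument is sketched below, under the assumption that the caller supplies the dataset name; `build_repos_query` is a hypothetical helper and is not part of scripts/retrieve_repos.py:

```python
import re


def build_repos_query(dataset: str) -> str:
    """Build the repos_in_papers query for a given dataset name.

    The name is checked against a conservative pattern (letters, digits,
    underscores) because it is interpolated into the SQL text directly.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", dataset):
        raise ValueError(f"unexpected dataset name: {dataset!r}")
    return f"SELECT repo, datasets, merged_ids from {dataset}.repos_in_papers"


query = build_repos_query("staging_orca")
print(query)
```

The resulting string could then be passed to `client.query(...)` exactly as in the hunk.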
15 changes: 11 additions & 4 deletions sql/top_cited_repo_citers.sql
@@ -8,7 +8,7 @@ with citation_counts as (
 ),

 paper_relevance as (
-select
+select distinct
 papers.merged_id,
 id,
 papers.year,
@@ -18,10 +18,17 @@ paper_relevance as (
 from literature.papers
 inner join
 (select
-id,
-meta.merged_id as merged_id
-from {{ staging_dataset }}.website_stats cross join unnest(paper_meta) as meta)
+repo,
+merged_id
+from {{ staging_dataset }}.repos_in_papers cross join unnest(merged_ids) as merged_id)
 using (merged_id)
+inner join
+(select
+concat(owner_name, "/", repo_name) as repo,
+full_metadata.id
+from
+{{ staging_dataset }}.repos_with_full_meta_raw)
+using (repo)
 inner join citation_counts
 using (merged_id)
 left join (
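
The reworked join can be read as three steps: unnest each repo's `merged_ids`, look up the repo's numeric id by its `owner_name/repo_name` string, and deduplicate the result. A rough in-memory sketch of that logic follows. It assumes rows are plain dicts, flattens `full_metadata.id` to `id`, and keeps one id per repo string, which is a simplification of the SQL; `repo_ids_for_papers` and the toy rows are hypothetical:

```python
def repo_ids_for_papers(repos_in_papers, repos_meta):
    """Return sorted, distinct (merged_id, repo_id) pairs."""
    # Mirrors: concat(owner_name, "/", repo_name) as repo
    repo_to_id = {
        f'{m["owner_name"]}/{m["repo_name"]}': m["id"] for m in repos_meta
    }
    pairs = set()  # plays the role of `select distinct`
    for row in repos_in_papers:
        repo_id = repo_to_id.get(row["repo"])
        if repo_id is None:
            continue  # an inner join drops repos with no metadata row
        for merged_id in row["merged_ids"]:  # cross join unnest(merged_ids)
            pairs.add((merged_id, repo_id))
    return sorted(pairs)


# Toy rows for illustration only.
papers_rows = [
    {"repo": "a/x", "merged_ids": ["p1", "p2"]},
    {"repo": "b/y", "merged_ids": ["p1"]},
    {"repo": "a/x", "merged_ids": ["p2"]},
]
meta_rows = [{"owner_name": "a", "repo_name": "x", "id": 10}]

result = repo_ids_for_papers(papers_rows, meta_rows)
print(result)
```

As in the SQL `using (repo)` clause, the lookup here is an exact, case-sensitive string match.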
4 changes: 3 additions & 1 deletion sql/website_stats.sql
@@ -21,7 +21,9 @@ distinct_repo_papers AS ( -- noqa: L045
 LEFT JOIN
 {{ staging_dataset }}.repos_with_full_meta
 ON
-LOWER(CONCAT(matched_owner, "/", matched_name)) = LOWER(repo)
+(
+LOWER(CONCAT(matched_owner, "/", matched_name)) = LOWER(repo)
+) OR (LOWER(CONCAT(current_owner, "/", current_name)) = LOWER(repo))
 ),

 repo_paper_meta AS (
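
The new join condition accepts a repo reference if it equals either the matched or the current `owner/name` of a repository, ignoring case. A small sketch of that predicate, assuming metadata rows are dicts keyed by the column names in the hunk; `matches_repo` and the toy row are hypothetical:

```python
def _full_name(owner, name):
    """Lowercased "owner/name", or None if either part is missing.

    Returning None loosely mirrors SQL, where CONCAT with a NULL argument
    yields NULL and the comparison does not match.
    """
    if owner is None or name is None:
        return None
    return f"{owner}/{name}".lower()


def matches_repo(meta: dict, repo: str) -> bool:
    """True if repo equals the matched or current name, case-insensitively."""
    target = repo.lower()
    candidates = (
        _full_name(meta.get("matched_owner"), meta.get("matched_name")),
        _full_name(meta.get("current_owner"), meta.get("current_name")),
    )
    return any(c is not None and c == target for c in candidates)


# Toy row for illustration only: a repository that moved to a new owner.
meta = {
    "matched_owner": "old-org",
    "matched_name": "Tool",
    "current_owner": "new-org",
    "current_name": "tool",
}
print(matches_repo(meta, "Old-Org/tool"))
print(matches_repo(meta, "NEW-ORG/Tool"))
print(matches_repo(meta, "other/tool"))
```

Checking both names lets a reference to a renamed or transferred repository still resolve to the same metadata row.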
