Refactor Github Scraper Python to TypeScript (GSoC 2024 Mid-Term Evaluation) #458

Merged

Contributor

@dgparmar14 dgparmar14 commented Jun 6, 2024

Description

GSoC Mid-Term Evaluation: Refactoring Scraper from Python to TypeScript

This task involves refactoring the existing Python-based scraper into TypeScript. The transition aims to enhance the codebase by introducing type safety and leveraging the capabilities of Octokit for more efficient GitHub interactions.

Fixes: #212

Week 1:

  1. Refactor Github.py to Github.ts: Convert all scraper functionalities from the Python file to TypeScript.
  2. Define Types: Define all necessary types for the TypeScript conversion.

Week 2:

  1. Modularize Scraper: Break down the github.ts file into different modules for improved readability.

  2. File Structure and Features:

    • index.ts: Entry point of the scraper containing main() and scrapGithub() functions.
    • fetchEvents.ts: Fetches all GitHub events and filters out blacklisted users (configurable via the .env file).
    • parseEvents.ts: Parses the events fetched by fetchEvents.ts based on required GitHub event types.
    • fetchUserData.ts: Fetches user-related data using fetchOpenPulls() and fetchMergeEvents().
    • config.ts: Handles Octokit authentication using GITHUB_TOKEN.
    • saveData.ts: Contains the mergedData() function to merge scraped data with previous contributor data.
    • types.ts: Contains all required types.
    • utils.ts: Contains common functions like calculateTurnaroundTime(), resolveAutonomyResponsibility(), loadUserData(), and saveUserData().
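The blacklist filtering described for fetchEvents.ts can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the GitHubEvent shape, the function names parseBlacklist() and filterBlacklisted(), and the comma-separated format of the .env value are all assumptions.

```typescript
// Hypothetical minimal event shape; the real definitions live in types.ts
// and are not shown in this description.
interface GitHubEvent {
  id: string;
  type: string;
  actor: { login: string };
}

// Assumption: the blacklist is stored as a comma-separated list of logins.
// Entries are trimmed and lower-cased; empty entries are dropped.
function parseBlacklist(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((login) => login.trim().toLowerCase())
      .filter((login) => login.length > 0),
  );
}

// Keep only events whose actor is not on the blacklist (case-insensitive).
function filterBlacklisted(
  events: GitHubEvent[],
  blacklist: Set<string>,
): GitHubEvent[] {
  return events.filter(
    (event) => !blacklist.has(event.actor.login.toLowerCase()),
  );
}
```

In a Node.js entry point this might be called as `filterBlacklisted(events, parseBlacklist(process.env.BLACKLISTED_USERS))`, where BLACKLISTED_USERS is a placeholder variable name rather than the one the scraper necessarily uses.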

Week 3:

  1. GitHub Discussions: Fetch all GitHub discussions of the organization.
  2. Define Types: Define types for discussions.
  3. Create discussion.ts: Fetches discussions and stores them in the data/github/discussion directory as discussion.json.
  4. Update scraper-dry-run.yaml: Modify the dry-run file to work with a Node.js and npm environment.
  5. Testing: Create the test file github-discussion-schema.test() for discussions.
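As an illustration of what typing and checking discussion data could look like, here is a small runtime type guard. It is a sketch under assumptions: the Discussion fields (title, url, author, createdAt) and the helper names are hypothetical, and the actual schema in this PR lives in schemas/discussion-data.yaml, which is not reproduced here.

```typescript
// Hypothetical discussion shape; the field names are assumptions made
// for illustration and may not match the PR's types or schema.
interface Discussion {
  title: string;
  url: string;
  author: string;
  createdAt: string; // expected to be a parseable timestamp, e.g. ISO 8601
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Runtime check that an unknown value (e.g. parsed JSON) has the expected shape.
function isDiscussion(value: unknown): value is Discussion {
  if (!isRecord(value)) return false;
  const createdAt = value["createdAt"];
  return (
    typeof value["title"] === "string" &&
    typeof value["url"] === "string" &&
    typeof value["author"] === "string" &&
    typeof createdAt === "string" &&
    !Number.isNaN(Date.parse(createdAt))
  );
}

// Return the indexes of entries that fail the check, so a test can report them.
function invalidEntries(entries: unknown[]): number[] {
  return entries
    .map((entry, index) => (isDiscussion(entry) ? -1 : index))
    .filter((index) => index >= 0);
}
```

A schema test could parse the stored JSON file and assert that invalidEntries() returns an empty array.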

How Has This Been Tested?

The refactored scraper can be tested using the following commands:

  1. Build the project: pnpm build
  2. Start the project: pnpm start org_name data_dir date(format:YYYY-MM-DD) num_days
  3. Run all commands at once: pnpm dev org_name data_dir date(format:YYYY-MM-DD) num_days

(If date and num_days are not provided, they default to the current date and 1, respectively.)
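The argument handling described above could be sketched like this. It is not the PR's implementation: parseScraperArgs() and ScraperArgs are hypothetical names, treating org_name and data_dir as required is an assumption, and so is interpreting "current date" as the current UTC date.

```typescript
interface ScraperArgs {
  orgName: string;
  dataDir: string;
  date: string; // YYYY-MM-DD
  numDays: number;
}

const DATE_PATTERN = /^\d{4}-\d{2}-\d{2}$/;

// Parse positional arguments: org_name data_dir [date] [num_days].
// `now` is injectable so the default date can be tested deterministically.
function parseScraperArgs(argv: string[], now: Date = new Date()): ScraperArgs {
  const [orgName, dataDir, dateArg, numDaysArg] = argv;
  if (!orgName || !dataDir) {
    // Assumption: the first two arguments are required.
    throw new Error("Usage: org_name data_dir [date(YYYY-MM-DD)] [num_days]");
  }

  // Assumption: the default "current date" is taken in UTC.
  const date = dateArg ?? now.toISOString().slice(0, 10);
  if (!DATE_PATTERN.test(date)) {
    throw new Error(`Invalid date "${date}", expected YYYY-MM-DD`);
  }

  // num_days defaults to 1 and must be a positive integer when given.
  const numDays = numDaysArg === undefined ? 1 : Number(numDaysArg);
  if (!Number.isInteger(numDays) || numDays < 1) {
    throw new Error(`Invalid num_days "${numDaysArg}", expected a positive integer`);
  }

  return { orgName, dataDir, date, numDays };
}
```

In a Node.js entry point this would typically be called as `parseScraperArgs(process.argv.slice(2))`.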


vercel bot commented Jun 6, 2024

@dgparmar14 is attempting to deploy a commit to the Open Healthcare Network Team on Vercel.

A member of the Team first needs to authorize it.

Resolved review threads:

  • app/api/leaderboard/functions.ts (outdated)
  • app/feed/page.tsx (2 threads, outdated)
  • components/gh_events/GitHubEvents.tsx (outdated)
  • lib/gh_events.ts (outdated)
  • scraper/src/github-scraper/utils.ts (5 threads, 4 outdated)
  • tsconfig.json (outdated)
  • .github/workflows/scraper-dry-run.yaml (outdated)
  • data/github/discussions/discussions.json (outdated)
  • public/logo.png (outdated)
  • schemas/discussion-data.yaml (outdated)
  • schemas/github-data.yaml (2 threads, outdated)
  • scraper/src/github-scraper/types.ts (outdated)


vercel bot commented Jun 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name | Status | Preview | Comments | Updated (UTC)
leaderboard | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 8, 2024 8:01am

@dgparmar14 dgparmar14 changed the title from "GSoC_Week1_Refactor Github Scrapper" to "Refactor Github Scrapper Python to TypeScript (GSoC 2024 Mid-Term Evaluation)" Jun 28, 2024
@dgparmar14 dgparmar14 changed the title from "Refactor Github Scrapper Python to TypeScript (GSoC 2024 Mid-Term Evaluation)" to "Refactor Github Scraper Python to TypeScript (GSoC 2024 Mid-Term Evaluation)" Jun 28, 2024
@dgparmar14 dgparmar14 marked this pull request as ready for review June 28, 2024 11:32
Member

@rithviknishad rithviknishad left a comment

LGTM

@dgparmar14 dgparmar14 requested a review from rithviknishad July 2, 2024 09:55
@dgparmar14
Contributor Author

I believe there are no remaining type errors in the scraper. I checked multiple times; if any are still there, let me know.

@rithviknishad rithviknishad changed the base branch from main to gsoc/gh-discussions July 12, 2024 10:11
@rithviknishad rithviknishad merged commit 0141997 into ohcnetwork:gsoc/gh-discussions Jul 12, 2024
3 checks passed
2 participants