Flow Data Sources

This repository contains a Python script that crawls a list of Flow-related sites, GitHub repositories, and GitHub discussions daily and converts them into Markdown files. The resulting .md files are intended for AI ingestion, Retrieval-Augmented Generation (RAG) pipelines, chatbots, or any other knowledge base platform that benefits from structured text.


Purpose

We want a single repository that periodically crawls all relevant Flow ecosystem content (documentation, code examples, and community discussions) and stores it in a consolidated Markdown format. You can then feed these files into:

  • ChatGPT plugins (for enhanced Q&A)
  • Retrieval-Augmented Generation (indexing and searching them in a vector database)
  • Discord/Telegram bots that cite official doc sections
  • Any other knowledge base for advanced Q&A or search.

How It Works

The Python script performs a domain-limited breadth-first search (BFS) from each starting URL and applies scraping logic specific to the type of URL:

1. Normal Docs Sites (HTML → MD)

  • Non-GitHub URLs are treated as “normal” websites.
  • The script fetches each page and removes <script>, <style>, and <noscript> tags.
  • Then it uses markdownify to convert the remaining HTML into Markdown.
  • It recurses only within the same domain to avoid crawling unrelated pages.
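
The same-domain restriction described above can be sketched as follows. This is a minimal illustration, not the code in scraper.py: the function name `crawl_same_domain`, the `get_links` callback, and the `max_pages` limit are all hypothetical. The callback stands in for the fetch-and-parse step so the traversal logic can be read (and tried) without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_same_domain(start_url, get_links, max_pages=100):
    """Visit pages breadth-first without leaving start_url's domain.

    get_links(url) is a hypothetical callback that returns the raw href
    values found on a page; a real run would fetch and parse the HTML.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc != domain:
                continue  # skip links that leave the domain
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order
```

Using a queue (rather than recursion) keeps pages in discovery order and avoids deep call stacks on large sites.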

2. GitHub Repos (Raw Code)

  • For GitHub repo links like https://github.com/onflow/flow-ft/, the script visits:
    • The repo root
    • tree/(main|master)/... subdirectories
    • blob/(main|master)/... file pages
  • Files with selected extensions (such as .cdc, .md, and .json) and any README file are downloaded in raw form from raw.githubusercontent.com.
  • The file contents are saved in a .md file, wrapped in triple backticks for easy code parsing.
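
As a rough sketch of the blob-to-raw step, the snippet below maps a GitHub blob URL to its raw.githubusercontent.com counterpart and wraps the downloaded text in a fenced block. The names `blob_to_raw`, `wrap_as_markdown`, and `RAW_EXTENSIONS` are hypothetical, and the extension tuple only contains the examples named above; the actual list in scraper.py may be longer.

```python
import re
from typing import Optional

# Extensions named in the text above; the real list in scraper.py may differ.
RAW_EXTENSIONS = (".cdc", ".md", ".json")

BLOB_RE = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
    r"/blob/(?P<branch>main|master)/(?P<path>.+)$"
)


def blob_to_raw(url: str) -> Optional[str]:
    """Return the raw-content URL for a blob page, or None if it is skipped."""
    m = BLOB_RE.match(url)
    if m is None:
        return None
    name = m["path"].rsplit("/", 1)[-1].lower()
    if not (name.startswith("readme") or name.endswith(RAW_EXTENSIONS)):
        return None
    return "https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}".format(
        **m.groupdict()
    )


def wrap_as_markdown(source: str) -> str:
    """Wrap file contents in a triple-backtick fence."""
    fence = "`" * 3
    return f"{fence}\n{source.rstrip()}\n{fence}\n"
```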

3. GitHub Discussions (Q&A Text Only)

  • For https://github.com/orgs/onflow/discussions, the script:
    • Crawls the listing pages and discovers discussion links such as /orgs/onflow/discussions/1330.
    • Extracts, for each thread, only the text of user posts (skipping the GitHub UI) and converts it to Markdown.
  • This yields .md files containing the original question and comments/replies.
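
Link discovery on the listing pages could look roughly like this. The sketch assumes thread links appear in the page HTML in the /orgs/onflow/discussions/<id> form shown above; the helper names are hypothetical, and the real script may locate links through an HTML parser rather than a regular expression.

```python
import re

# Matches thread links such as /orgs/onflow/discussions/1330.
DISCUSSION_RE = re.compile(r"/orgs/onflow/discussions/(\d+)")


def discussion_ids(listing_html: str) -> list:
    """Return the unique discussion IDs on a listing page, in first-seen order."""
    seen = []
    for match in DISCUSSION_RE.finditer(listing_html):
        thread_id = int(match.group(1))
        if thread_id not in seen:
            seen.append(thread_id)
    return seen


def discussion_filename(thread_id: int) -> str:
    """Build the output file name used for one discussion thread."""
    return f"discussion_{thread_id}.md"
```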

Usage

Requirements

  1. Python 3.7+
  2. requests, beautifulsoup4, markdownify

Install all dependencies:

pip install requests beautifulsoup4 markdownify

Running the Scraper

Clone or download this repo locally.

In the repo directory, run:

python scraper.py

The script will crawl each site listed in SITES (inside scraper.py) and output the results under scraped_docs/.

Modifying the List of Sites

Inside scraper.py, near the top, you’ll see:

SITES = [
    "https://developers.flow.com/",
    "https://academy.ecdao.org/en/cadence-by-example",
    ...
    "https://github.com/onflow/flow-ft/",
    ...
    "https://github.com/orgs/onflow/discussions"
]

  • Add a docs site by appending its URL if it’s not on GitHub.
  • Add a GitHub repo by appending the base URL (e.g. "https://github.com/onflow/another-repo").
  • Add another GitHub Discussions page if needed.
  • Remove any site by deleting or commenting out its line.
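
To check how a new entry would be treated, the three cases from "How It Works" can be expressed as a small classifier. This is an illustrative sketch only: `classify_site` is a hypothetical name, and the real dispatch logic in scraper.py may use different rules.

```python
from urllib.parse import urlparse


def classify_site(url: str) -> str:
    """Return "docs", "repo", or "discussions" for an entry in SITES."""
    parsed = urlparse(url)
    if parsed.netloc.lower() != "github.com":
        return "docs"  # any non-GitHub URL is a normal website
    parts = [p for p in parsed.path.split("/") if p]
    # Organization discussions look like /orgs/<org>/discussions
    if len(parts) >= 3 and parts[0] == "orgs" and parts[2] == "discussions":
        return "discussions"
    return "repo"
```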

For private sites or repos, you may need authentication tokens/cookies to see content that’s not public.

Merging

You can merge all the .md files into a single file, or into a reduced file that contains only the essentials (with code blocks and similar content removed).
The merged output is convenient for indexing, searching, or feeding a chatbot. To merge, run:

python merge.py
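
The merge step can be pictured with the sketch below. It is not the contents of merge.py: the `merge_markdown` function and the "# Source:" heading per file are assumptions, and "essentials" is interpreted here only as "fenced code blocks removed", which may be narrower than what merge.py actually strips.

```python
import re
from pathlib import Path

FENCE = "`" * 3
# A fenced block: an opening fence line through the next closing fence line.
CODE_BLOCK_RE = re.compile(
    rf"^{FENCE}.*?^{FENCE}[^\n]*\n?", re.DOTALL | re.MULTILINE
)


def merge_markdown(src_dir, out_dir):
    """Write all_merged.md and essentials_merged.md; return both texts."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    sections = []
    for path in sorted(src.rglob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        # Hypothetical convention: label each section with its source path.
        sections.append(f"# Source: {path.relative_to(src).as_posix()}\n\n{body}\n")
    merged = "\n".join(sections)
    essentials = CODE_BLOCK_RE.sub("", merged)
    (out / "all_merged.md").write_text(merged, encoding="utf-8")
    (out / "essentials_merged.md").write_text(essentials, encoding="utf-8")
    return merged, essentials
```

Keeping the output directory separate from the input directory, as in the layout described in this README, prevents a second run from merging its own earlier output.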

Output Structure

After a successful run, you’ll see:

scraped_docs/
  ├─ developers_flow_com/
  │   ├─ index.md
  │   ├─ docs_tutorial_somepage.md
  │   └─ ...
  ├─ github_com_onflow_flow_ft/
  │   ├─ blob_main_contracts_exampletoken_cdc.md
  │   └─ ...
  ├─ github_com_orgs_onflow_discussions/
  │   ├─ discussion_1330.md
  │   ├─ discussion_1514.md
  │   └─ ...
  └─ ...
merged_docs/
  ├─ all_merged.md
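  └─ essentials_merged.md

This layout contains:

  • One directory per docs site, holding the converted pages.
  • One directory per GitHub repo, with each code file saved as .md and wrapped in a code block.
  • One directory for discussions, with each thread saved as discussion_<id>.md containing the Q&A text.
  • A merged_docs/ directory with the merged files (present once merge.py has been run).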
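
The directory and file names in the tree above appear to follow a "lowercase, replace everything else with underscores" pattern. The sketch below reproduces that pattern for docs and repo pages; it is inferred from the example names only, so the helper names and the exact rules are assumptions, and discussion threads (which use discussion_<id>.md) are not covered.

```python
import re
from urllib.parse import urlparse


def slugify(text: str) -> str:
    """Lowercase and collapse runs of other characters into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def site_directory(site_url: str) -> str:
    """Directory name for a site: host only, or host plus path for GitHub."""
    parsed = urlparse(site_url)
    if parsed.netloc == "github.com":
        return slugify(parsed.netloc + parsed.path)
    return slugify(parsed.netloc)


def page_filename(site_url: str, page_url: str) -> str:
    """File name for a page, relative to the site's starting path."""
    base = urlparse(site_url).path.rstrip("/")
    path = urlparse(page_url).path
    if path.startswith(base):
        path = path[len(base):]
    return (slugify(path) or "index") + ".md"
```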

Scheduling Automation

The script can be scheduled to run daily using GitHub Actions.