Flow Data Sources

This repository contains a Python script that crawls a list of Flow-related sites, GitHub repositories, and GitHub discussions every day and converts their content into Markdown files. The resulting .md files are intended for AI ingestion, Retrieval-Augmented Generation (RAG) pipelines, chatbots, or any other knowledge base platform that benefits from structured text.

Purpose

We want a single repository that periodically crawls all relevant Flow ecosystem content (documentation, code examples, and community discussions) and stores it in a consolidated Markdown format. You can then feed these files into:

ChatGPT plugins (for enhanced Q&A)
Retrieval-Augmented Generation pipelines (indexing and searching the files in a vector database)
Discord/Telegram bots that cite official doc sections
Any other knowledge base for advanced Q&A or search

How It Works

The Python script performs a domain-limited breadth-first search (BFS) from each starting URL and applies specialized scraping logic depending on the type of URL:

1. Normal Docs Sites (HTML → MD)

Non-GitHub URLs are treated as “normal” websites.
The script fetches each page and removes the <script>, <style>, and <noscript> tags.
It then uses markdownify to convert the remaining HTML into Markdown.
It follows links only within the same domain, so it never crawls unrelated pages.
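
The same-domain restriction can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in scraper.py: the names `crawl_same_domain`, `get_links`, and `max_pages` are hypothetical, and the `get_links` callback stands in for fetching a page and extracting its links.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_same_domain(start_url, get_links, max_pages=100):
    """Breadth-first crawl that never leaves the start URL's domain.

    get_links(url) is a hypothetical callback returning the href values
    found on a page; a real scraper would fetch and parse the HTML here.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO order makes the traversal breadth-first
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop any #fragment
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc != domain:
                continue  # link points to another domain: skip it
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Illustrative in-memory "site" so the sketch runs without network access
fake_site = {
    "https://developers.flow.com/": ["/docs", "https://example.com/x", "/docs#top"],
    "https://developers.flow.com/docs": ["/"],
}
pages = crawl_same_domain("https://developers.flow.com/", lambda u: fake_site.get(u, []))
```

Tracking visited URLs in a set keeps the crawl from looping when pages link back to each other, and the page cap is an assumed safety limit rather than something the script is known to enforce.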

2. GitHub Repos (Raw Code)

For GitHub repo links like https://github.com/onflow/flow-ft/, the script visits:
- The repo root
- tree/(main|master)/... subdirectories
- blob/(main|master)/... file pages
Files with certain extensions (such as .cdc, .md, and .json) or any README are downloaded in their raw form from raw.githubusercontent.com.
The file contents are saved in a .md file, wrapped in triple backticks for easy code parsing.
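
The mapping from a blob page to its raw file can be sketched as below. This is an illustration only: `blob_to_raw_url` and `wrap_in_code_fence` are hypothetical helper names, the example file path is assumed, and the script's real handling of branches and extensions may differ.

```python
import re

# Matches file pages such as https://github.com/<owner>/<repo>/blob/<branch>/<path>
BLOB_RE = re.compile(
    r"^https://github\.com/([^/]+)/([^/]+)/blob/(main|master)/(.+)$"
)


def blob_to_raw_url(blob_url):
    """Return the raw.githubusercontent.com URL for a blob page, or None."""
    match = BLOB_RE.match(blob_url)
    if match is None:
        return None  # not a main/master blob page
    owner, repo, branch, path = match.groups()
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"


def wrap_in_code_fence(text):
    """Wrap raw file contents in triple backticks for the .md output."""
    fence = "`" * 3
    return f"{fence}\n{text.rstrip()}\n{fence}\n"


# Assumed example path inside the flow-ft repository
raw_url = blob_to_raw_url(
    "https://github.com/onflow/flow-ft/blob/main/contracts/ExampleToken.cdc"
)
```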

3. GitHub Discussions (Q&A Text Only)

For https://github.com/orgs/onflow/discussions, the script:
- Crawls the listing pages and discovers discussion links like /orgs/onflow/discussions/1330
- Extracts, for each thread, only the text from user posts (skipping the GitHub UI) and converts it to Markdown.
This yields .md files containing the original question and comments/replies.
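
Link discovery on a listing page can be sketched with a regular expression. This is a simplified illustration: `find_discussion_ids` is a hypothetical name, and the assumption that links appear as plain `href="..."` attributes may not match the markup GitHub actually serves.

```python
import re

# Thread links look like /orgs/onflow/discussions/1330
DISCUSSION_LINK_RE = re.compile(r'href="/orgs/onflow/discussions/(\d+)"')


def find_discussion_ids(listing_html):
    """Return the unique discussion ids on a listing page, in page order."""
    ids = []
    for number in DISCUSSION_LINK_RE.findall(listing_html):
        if number not in ids:  # the same thread may be linked more than once
            ids.append(number)
    return ids


def discussion_file_name(discussion_id):
    """Build the output file name for one discussion thread."""
    return f"discussion_{discussion_id}.md"
```

Pagination links such as `/orgs/onflow/discussions?page=2` do not match the pattern, so only thread pages are collected.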

Usage

Requirements

Python 3.7+
requests, beautifulsoup4, markdownify

Install all dependencies:

pip install requests beautifulsoup4 markdownify

Running the Scraper

Clone or download this repo locally.

In the repo directory, run:

python scraper.py

The script will crawl each site listed in SITES (inside scraper.py) and output the results under scraped_docs/.

Modifying the List of Sites

Inside scraper.py, near the top, you’ll see:

SITES = [
    "https://developers.flow.com/",
    "https://academy.ecdao.org/en/cadence-by-example",
    ...
    "https://github.com/onflow/flow-ft/",
    ...
    "https://github.com/orgs/onflow/discussions"
]

Add a docs site by appending its URL if it’s not on GitHub.
Add a GitHub repo by appending the base URL (e.g. "https://github.com/onflow/another-repo").
Add another GitHub Discussions page if needed.
Remove any site by deleting or commenting out its line.

For private sites or repos, you may need authentication tokens/cookies to see content that’s not public.

Merging

You can merge all the .md files into a single file, or into a file containing only the essentials (with code blocks and similar content removed).
The merged output is useful for indexing, searching, or feeding a chatbot.

python merge.py
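
A merge step of this kind could look like the sketch below. It is not the contents of merge.py: the function names, the `# Source:` header placed before each document, and the rule that "essentials" means "fenced code blocks removed" are all assumptions made for illustration.

```python
import re
from pathlib import Path

# A fenced block: a line starting with three backticks up to the next such line
CODE_BLOCK_RE = re.compile(r"^`{3}.*?^`{3}[ \t]*\n?", re.DOTALL | re.MULTILINE)


def strip_code_blocks(markdown):
    """Remove fenced code blocks, keeping the surrounding prose."""
    return CODE_BLOCK_RE.sub("", markdown)


def merge_documents(docs, essentials_only=False):
    """Concatenate {name: markdown} entries in name order."""
    parts = []
    for name in sorted(docs):
        body = docs[name]
        if essentials_only:
            body = strip_code_blocks(body)
        # Assumed convention: label each document with its source file
        parts.append(f"# Source: {name}\n\n{body.strip()}\n")
    return "\n".join(parts)


def read_markdown_tree(root):
    """Load every .md file under root into a {relative_path: text} dict."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): path.read_text(encoding="utf-8")
        for path in sorted(root.rglob("*.md"))
    }
```

With these helpers, `merge_documents(read_markdown_tree("scraped_docs"))` would produce the full merge, and passing `essentials_only=True` would produce the reduced one.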

Output Structure

After a successful run, you’ll see:

scraped_docs/
  ├─ developers_flow_com/
  │   ├─ index.md
  │   ├─ docs_tutorial_somepage.md
  │   └─ ...
  ├─ github_com_onflow_flow_ft/
  │   ├─ blob_main_contracts_exampletoken_cdc.md
  │   └─ ...
  ├─ github_com_orgs_onflow_discussions/
  │   ├─ discussion_1330.md
  │   ├─ discussion_1514.md
  │   └─ ...
  └─ ...
merged_docs/
  ├─ all_merged.md
  └─ essentials_merged.md

One directory for each docs site
One directory for each repo, with every code file saved as .md (contents wrapped in a code block)
Discussions saved as discussion_<id>.md, each containing Q&A text
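
The directory and file names in the tree above are consistent with a simple naming rule: lowercase the URL and replace every run of non-alphanumeric characters with an underscore. The sketch below reproduces those names under that assumption; `slugify`, `site_dir_name`, and `page_file_name` are hypothetical helpers, and the rule itself is inferred from the example output rather than taken from scraper.py.

```python
import re
from urllib.parse import urlparse


def slugify(text):
    """Lowercase and collapse each run of non-alphanumerics into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def site_dir_name(site_url):
    """Directory name for one SITES entry, built from its host and path."""
    parsed = urlparse(site_url)
    return slugify(parsed.netloc + parsed.path)


def page_file_name(site_url, page_url):
    """File name for a page, built from its path relative to the site URL."""
    site_path = urlparse(site_url).path
    page_path = urlparse(page_url).path
    if page_path.startswith(site_path):
        page_path = page_path[len(site_path):]
    # The site root has an empty relative path and becomes index.md
    return (slugify(page_path) or "index") + ".md"
```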

Scheduling Automation

The script can be scheduled to run daily using GitHub Actions.
