This repository contains a Python script that runs daily to crawl a list of Flow-related sites, GitHub repositories, and GitHub discussions, converting them into Markdown files. The resulting `.md` files are intended for AI ingestion, Retrieval-Augmented Generation (RAG) pipelines, chatbots, or any other knowledge-base platform that benefits from structured text.
We want a single repository that periodically crawls all relevant Flow ecosystem content—documentation, code examples, and community discussions—and stores them in a consolidated Markdown format. You can then feed these files into:
- ChatGPT plugins (for enhanced Q&A)
- Retrieval-Augmented Generation (indexing and searching them in a vector database)
- Discord/Telegram bots that cite official doc sections
- Any other knowledge base for advanced Q&A or search.
The Python script performs a domain-limited BFS (Breadth-First Search) with specialized scraping logic for each kind of URL:

- Non-GitHub URLs are treated as "normal" websites.
  - The script fetches each page and removes `<script>`, `<style>`, and `<noscript>` tags.
  - Then it uses `markdownify` to convert the remaining HTML into Markdown.
  - It recurses only within the same domain to avoid crawling unrelated pages (see the sketch after this list).
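A minimal sketch of that crawl-and-convert step, assuming only the three libraries listed under the requirements below; the function name `crawl_site` and the `max_pages` cap are illustrative, not the actual internals of `scraper.py`:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def crawl_site(start_url, max_pages=100):
    """BFS within start_url's domain; returns {url: markdown} per page."""
    domain = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        resp = requests.get(url, timeout=30)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # skip binaries, images, PDFs, etc.
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()  # strip non-content tags before conversion
        pages[url] = md(str(soup))
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)  # recurse only within the same domain
                queue.append(link)
    return pages
```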
- For GitHub repo links like `https://github.com/onflow/flow-ft/`, the script visits:
  - the repo root,
  - `tree/(main|master)/...` subdirectories,
  - `blob/(main|master)/...` file pages.
- Files with certain extensions (like `.cdc`, `.md`, `.json`, etc.) or any `README` are downloaded in their raw form from `raw.githubusercontent.com`.
- The file contents are saved in a `.md` file, wrapped in triple backticks for easy code parsing (see the sketch after this list).
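For illustration, this is roughly how a `blob/` URL maps to its raw counterpart and how the file body ends up fenced inside a `.md` file; the function name `save_repo_file` is hypothetical, and the real script's extension whitelist may differ:

```python
import requests

def save_repo_file(blob_url, out_path):
    # e.g. https://github.com/onflow/flow-ft/blob/master/contracts/ExampleToken.cdc
    #  ->  https://raw.githubusercontent.com/onflow/flow-ft/master/contracts/ExampleToken.cdc
    raw_url = (blob_url
               .replace("github.com", "raw.githubusercontent.com", 1)
               .replace("/blob/", "/", 1))
    text = requests.get(raw_url, timeout=30).text
    with open(out_path, "w", encoding="utf-8") as f:
        # Wrap the raw file in triple backticks so downstream tools
        # treat it as a code block.
        f.write("# " + raw_url + "\n\n```\n" + text + "\n```\n")
```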
- For `https://github.com/orgs/onflow/discussions`, the script:
  - crawls the listing pages and discovers discussion links like `/orgs/onflow/discussions/1330`,
  - for each thread, extracts only the text from user posts (skipping the GitHub UI) and converts it to Markdown (see the sketch after this list).
- This yields `.md` files containing the original question and its comments/replies.
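A rough sketch of the thread-extraction step, assuming GitHub's server-rendered HTML marks each post body with a `comment-body` class (true at the time of writing, but GitHub's markup can change):

```python
import re

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def scrape_discussion(url):
    """Return one discussion thread (question + replies) as Markdown."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    posts = soup.find_all(class_=re.compile(r"comment-body"))
    return "\n\n---\n\n".join(md(str(p)) for p in posts)
```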
Requirements:

- Python 3.7+
- `requests`, `beautifulsoup4`, `markdownify`

Install all dependencies:

```bash
pip install requests beautifulsoup4 markdownify
```
Clone or download this repo locally. In the repo directory, run:

```bash
python scraper.py
```

The script will crawl each site listed in `SITES` (inside `scraper.py`) and output the results under `scraped_docs/`.
Inside `scraper.py`, near the top, you'll see:

```python
SITES = [
    "https://developers.flow.com/",
    "https://academy.ecdao.org/en/cadence-by-example",
    ...
    "https://github.com/onflow/flow-ft/",
    ...
    "https://github.com/orgs/onflow/discussions"
]
```
- Add a docs site by appending its URL if it’s not on GitHub.
- Add a GitHub repo by appending the base URL (e.g. "https://github.com/onflow/another-repo").
- Add another GitHub Discussions page if needed.
- Remove any site by deleting or commenting out its line.
For private sites or repos, you may need authentication tokens/cookies to see content that’s not public.
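For example, with `requests` you can attach a GitHub personal access token as a header; the `GITHUB_TOKEN` environment variable here is an assumption for illustration, not something `scraper.py` reads today:

```python
import os

import requests

headers = {}
token = os.environ.get("GITHUB_TOKEN")
if token:
    # GitHub accepts "token <PAT>" in the Authorization header.
    headers["Authorization"] = "token " + token

resp = requests.get(
    "https://raw.githubusercontent.com/onflow/flow-ft/master/README.md",
    headers=headers,
    timeout=30,
)
```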
You can merge all the `.md` files into a single file, or into a trimmed file containing only the essentials (code blocks removed, etc.). This is useful for indexing, searching, or feeding a chatbot:

```bash
python merge.py
```
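A minimal sketch of what such a merge can look like (the actual `merge.py` may differ): concatenate every `.md` under `scraped_docs/`, then write an "essentials" variant with fenced code blocks stripped:

```python
import pathlib
import re

docs = sorted(pathlib.Path("scraped_docs").rglob("*.md"))
merged = "\n\n".join(p.read_text(encoding="utf-8") for p in docs)

out = pathlib.Path("merged_docs")
out.mkdir(exist_ok=True)
(out / "all_merged.md").write_text(merged, encoding="utf-8")

# Drop ``` ... ``` blocks for a lighter, prose-only variant.
essentials = re.sub(r"```.*?```", "", merged, flags=re.DOTALL)
(out / "essentials_merged.md").write_text(essentials, encoding="utf-8")
```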
After a successful run, you'll see:

```text
scraped_docs/
├─ developers_flow_com/
│  ├─ index.md
│  ├─ docs_tutorial_somepage.md
│  └─ ...
├─ github_com_onflow_flow_ft/
│  ├─ blob_main_contracts_exampletoken_cdc.md
│  ├─ ...
├─ github_com_orgs_onflow_discussions/
│  ├─ discussion_1330.md
│  ├─ discussion_1514.md
│  └─ ...
└─ ...

merged_docs/
├─ all_merged.md
└─ essentials_merged.md
```
- Docs directories for each site
- Repos with code files in `.md` (wrapped code blocks)
- Discussions as `discussion_<id>.md`, each containing Q&A text
The script can be scheduled to run daily using GitHub Actions.
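One way to wire that up, sketched as a hypothetical `.github/workflows/scrape.yml` (the cron time, action versions, and commit step are all adjustable assumptions, not this repo's actual workflow):

```yaml
name: Daily scrape
on:
  schedule:
    - cron: "0 3 * * *"   # every day at 03:00 UTC
  workflow_dispatch:       # allow manual runs too
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - run: pip install requests beautifulsoup4 markdownify
      - run: python scraper.py
      - run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add scraped_docs
          git commit -m "Daily scrape" || echo "No changes to commit"
          git push
```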