Docs to Markdown

Docs to Markdown is a Python tool designed to scrape website content, specifically documentation, and save it as a single Markdown file. The scraped Markdown files can then be used as source data for custom GPTs, facilitating the creation of tailored language models with domain-specific knowledge.

Features

  • Scrapes website content starting from a given URL.
  • Extracts the main content from web pages using common CSS selectors, falling back to the OpenAI API when the content area can't be identified automatically (see the sketch after this list).
  • Converts HTML content to Markdown format.
  • Saves the Markdown content in a structured directory.
  • Compiles all individual Markdown files into a single Markdown file.
  • Configurable to ignore content after a specified string.
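
For reference, the selector-first extraction described above could work roughly as follows. This is a minimal sketch assuming BeautifulSoup; the selector list, function name, and fallback signal are illustrative and not taken from the project's code.

    from bs4 import BeautifulSoup

    # Selectors that commonly wrap documentation content (illustrative list).
    CANDIDATE_SELECTORS = ["main", "article", "div.content", "div.documentation", "#content"]

    def extract_main_content(html):
        """Return the HTML of the first matching content container, or None."""
        soup = BeautifulSoup(html, "html.parser")
        for selector in CANDIDATE_SELECTORS:
            element = soup.select_one(selector)
            if element and element.get_text(strip=True):
                return str(element)
        # None tells the caller to fall back to asking the OpenAI API where the content lives.
        return None

When no selector matches, the scraper hands the page over to the OpenAI API to locate the content, as noted in the feature list.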

Installation

  1. Clone the repository:

    git clone https://github.com/danmenzies/docs-to-markdown.git
    cd docs-to-markdown
  2. Create a virtual environment (optional but recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the dependencies:

    pip install .
  4. Set up environment variables:

    • Open the .env file and add your OpenAI API key:
      OPENAI_API_KEY=your_openai_api_key
      OPENAI_MODEL=gpt-4  # or any other model you prefer
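
At runtime these variables are presumably loaded from the .env file along the following lines. This is a sketch assuming the python-dotenv package; the actual loading code lives in the project's modules.

    import os
    from dotenv import load_dotenv  # provided by the python-dotenv package

    load_dotenv()  # reads key=value pairs from the .env file
    api_key = os.getenv("OPENAI_API_KEY")
    model = os.getenv("OPENAI_MODEL", "gpt-4")  # default used if OPENAI_MODEL is unset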

Usage

To use the Docs to Markdown tool, run the command-line interface directly:

python main.py --start <URL> [--ignore_after <STRING>] [--debug]

If you installed the package with pip install ., the examples below use the scrape-website console script (registered through the entry points in setup.py), which accepts the same arguments.

Arguments

  • --start: The starting URL for the scraper (required).
  • --ignore_after: A string after which content should be ignored (optional).
  • --debug: Enable debug mode with a visible browser (optional).
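
A minimal sketch of how main.py might wire these flags together with argparse is shown below. The flag names match the arguments above; the final call is a placeholder rather than the project's real scraping entry point.

    import argparse

    def build_parser():
        parser = argparse.ArgumentParser(description="Scrape documentation and save it as Markdown.")
        parser.add_argument("--start", required=True, help="Starting URL for the scraper")
        parser.add_argument("--ignore_after", default=None, help="Ignore content after this string")
        parser.add_argument("--debug", action="store_true", help="Run with a visible browser")
        return parser

    if __name__ == "__main__":
        args = build_parser().parse_args()
        # Placeholder: the real implementation hands these values to the scraper in src/.
        print(args.start, args.ignore_after, args.debug)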

Examples

  1. Basic Usage:

    scrape-website --start https://example.com/docs
  2. Ignoring Content After a Specific String:

    scrape-website --start https://example.com/docs --ignore_after "Footer"
  3. Enabling Debug Mode:

    scrape-website --start https://example.com/docs --debug

Output

The tool scrapes content starting from the specified URL and saves each page as a Markdown file in the downloaded directory. It then compiles all of the individual Markdown files into a single Markdown file in the same directory.
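
Conceptually, the compile step amounts to concatenating the per-page files. The sketch below illustrates the idea; the directory and output filenames are assumptions, and the real logic lives in src/utils.py.

    from pathlib import Path

    def compile_markdown(directory="downloaded", output_name="compiled.md"):
        """Concatenate every per-page .md file in the directory into a single file."""
        out_path = Path(directory) / output_name
        parts = []
        for md_file in sorted(Path(directory).glob("*.md")):
            if md_file.name != output_name:  # skip the compiled file itself on re-runs
                parts.append(md_file.read_text(encoding="utf-8"))
        out_path.write_text("\n\n".join(parts), encoding="utf-8")
        return out_path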

Project Structure

  • setup.py: Script for setting up the package, including dependencies and entry points.
  • main.py: Entry point for the command-line interface.
  • src/: Directory containing the core modules.
    • __init__.py: Indicates that src is a Python package.
    • utils.py: Utility functions for saving and compiling Markdown files, converting HTML to Markdown, and sanitizing filenames (sketched below).
    • scraper.py: Core scraper module for extracting and saving website content.
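
To make the utility layer concrete, here is a sketch of what the conversion and filename helpers in utils.py might look like. The converter library (markdownify) and the helper names are assumptions for illustration, not taken from the repository.

    import re
    from markdownify import markdownify  # one common HTML-to-Markdown converter

    def html_to_markdown(html):
        """Convert an HTML fragment to Markdown text."""
        return markdownify(html)

    def sanitize_filename(title):
        """Turn a page title into a safe filename, e.g. 'Getting Started!' -> 'getting-started.md'."""
        slug = re.sub(r"[^a-zA-Z0-9]+", "-", title).strip("-").lower()
        return f"{slug or 'page'}.md"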

License

This project is licensed under the MIT License.

Author

Dan Menzies - [email protected]

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue on GitHub.

By using Docs to Markdown, you can easily scrape and convert web-based documentation into Markdown format, providing a convenient way to compile and use this content for custom GPT models.

Invitation to Contributors

We invite developers and enthusiasts to contribute to Docs to Markdown. Whether you have suggestions for new features, improvements to existing functionalities, or bug fixes, your contributions are highly valued. Feel free to fork the repository, submit pull requests, or open issues with any ideas or problems you encounter. Together, we can enhance this tool and make documentation scraping and conversion even more efficient and effective.
