This repository provides tools and infrastructure for creating, sharing, and utilizing high-quality synthetic training data for fine-tuning language models through Ollama. Our primary focus is building a community-driven collection of specialized training examples for capabilities like ethical reasoning, with a structured methodology that anyone can follow.
Our goals are to:
- Enable community-driven data creation: Provide tools that let anyone generate valuable synthetic training data
- Build a shared resource: Collect diverse, high-quality examples that benefit the entire AI community
- Standardize data formats: Create consistent, well-structured training examples for specialized capabilities
- Foster collaboration: Make it easy to contribute new data and methodologies
- Simple Agent Chat Interface: Tools for interacting with Ollama models
- CommonCrawl Data Extraction: Utilities to download and process data from CommonCrawl
- Synthetic Training Data Generation: Scripts to create structured training examples
- Ethical Reasoning Agent: Implementation demonstrating specialized capabilities
- Python 3.6+
- Ollama installed and running locally (or remotely)
- Language models loaded in Ollama (e.g., deepseek-r1, llama3, etc.)
- MikeyBeez/Ollama_Agents (for some components - see installation instructions below)
1. Clone the repository:

   ```shell
   git clone https://github.com/MikeyBeez/Ollama_Experiments.git
   cd Ollama_Experiments
   ```

2. Create a virtual environment:

   ```shell
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```shell
   pip install -r requirements.txt
   ```

4. Set up Ollama_Agents:

   ```shell
   # Clone the Ollama_Agents repository
   git clone https://github.com/MikeyBeez/Ollama_Agents.git ../Ollama_Agents

   # Add to your Python path (you may want to add this to your .bashrc or .zshrc)
   export PYTHONPATH="../Ollama_Agents:$PYTHONPATH"

   # Install its dependencies
   pip install -r ../Ollama_Agents/requirements.txt
   ```

5. Configure the environment:

   ```shell
   cp .env.sample .env
   # Edit the .env file to set your preferred model and Ollama host
   ```
For a step-by-step walkthrough of the entire process from setup to contributing, see our Complete Tutorial.
This repository demonstrates creating synthetic training data for AI models with specific objective functions, such as ethical reasoning. Here's the general workflow:
CommonCrawl is a vast repository of web crawl data and a rich source of human-generated content for building training datasets.
```shell
# Download a sample from CommonCrawl
python download_cc_sample.py --size 100  # Download ~100MB of data
```
The script will:
- Download WARC files from CommonCrawl
- Extract and process the data into a more usable format
- Save the processed data to the `data/jsonl` directory
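Each line of the processed output is one JSON record, so the files can be streamed without loading everything into memory. A minimal sketch of reading them back (the field names in the usage comment are assumptions; check your actual output):

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Yield one record per line from a JSON Lines file, skipping blanks."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: count records across all processed files
# total = sum(1 for p in Path("data/jsonl").glob("*.jsonl")
#             for _ in load_jsonl(p))
```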
Once you have source data, you can generate synthetic training examples:
```shell
# Generate ethical reasoning training data
python generate_ethical_data.py --count 20  # Generate 20 examples
```
For quicker testing of the synthetic data generation:
```shell
# Test with a single example
python generate_single_example.py --output data/test_example.json
```
The repository includes an implementation of an ethical reasoning agent that can analyze scenarios and provide structured ethical analyses:
```shell
# Run with preset scenarios
python ethical_agent.py

# Analyze a specific scenario
python ethical_agent.py --scenario "Companies track user data without consent" --category privacy

# Interactive mode
python ethical_agent.py --interactive
```
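Under the hood, an agent like this needs little more than a prompt builder and a call to Ollama's HTTP API. A minimal sketch (the prompt wording is illustrative, not the repository's actual template; the endpoint and payload follow Ollama's standard `/api/generate` interface):

```python
import json
import urllib.request

def build_prompt(scenario, category):
    """Compose an analysis prompt; the wording here is illustrative only."""
    return (
        f"Analyze the following scenario for ethical issues of {category}.\n"
        f"Scenario: {scenario}\n"
        "Give a detailed analysis, then a clear conclusion."
    )

def query_ollama(prompt, model="llama3", host="http://localhost:11434"):
    """Send a single non-streaming generate request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

prompt = build_prompt("Companies track user data without consent", "privacy")
# answer = query_ollama(prompt)  # requires a running Ollama instance
```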
The synthetic training data is structured to train models with specialized capabilities:
```json
{
  "passage": "Text describing an ethically relevant situation",
  "category": "privacy|fairness|autonomy|harm|deception",
  "reasoning": "<|begin_of_thought|>\nDetailed ethical analysis...\n<|end_of_thought|>\n\n<|begin_of_solution|>\nEthical conclusion and recommendations...\n<|end_of_solution|>"
}
```
This format encourages models to:
- Perform thorough analysis in the "thought" section
- Provide clear conclusions in the "solution" section
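Because fine-tuning silently absorbs malformed records, it is worth checking each example against this schema before submitting it. A sketch of such a check (the exact rules are an assumption based on the format above):

```python
CATEGORIES = {"privacy", "fairness", "autonomy", "harm", "deception"}

REQUIRED_TAGS = (
    "<|begin_of_thought|>", "<|end_of_thought|>",
    "<|begin_of_solution|>", "<|end_of_solution|>",
)

def validate_example(example):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for key in ("passage", "category", "reasoning"):
        if not example.get(key):
            problems.append(f"missing or empty field: {key}")
    if example.get("category") not in CATEGORIES:
        problems.append(f"unknown category: {example.get('category')!r}")
    reasoning = example.get("reasoning", "")
    for tag in REQUIRED_TAGS:
        if tag not in reasoning:
            problems.append(f"reasoning is missing {tag}")
    return problems
```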
We welcome and encourage contributions of synthetic training data and improvements to the generation methodology. Here's how you can contribute:
- Generate synthetic examples using the tools provided in this repository
- Validate your examples for quality and effectiveness
- Submit a pull request with your new data in the appropriate format
- Document your contribution including any special considerations or insights
- Make sure your examples follow the structured format described in this README
- Include metadata about how the examples were generated
- Ensure your examples don't contain personally identifiable information (PII)
- Test your examples with the provided agent implementations
- Datasets for new capabilities beyond ethical reasoning
- Improvements to the generation methodology
- New prompt templates that produce better results
- Tools for validating or filtering generated examples
This framework can be adapted for other specialized AI capabilities:
- Create a data extraction script to obtain relevant source material
- Define the objective function (what capability you want the AI to learn)
- Design a prompt template that elicits the desired reasoning pattern
- Generate synthetic examples using existing models
- Create a specialized agent that demonstrates the capability
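Steps 3 and 4 usually reduce to a parameterized prompt fed to an existing model. A sketch of what such a template might look like for a hypothetical "causal reasoning" capability (the names and wording are invented for illustration):

```python
# Hypothetical template for a new capability; reuses the thought/solution
# delimiters from the ethical reasoning format above.
CAUSAL_TEMPLATE = """\
Read the passage below and identify the causal chain it describes.

Passage: {passage}

First reason step by step inside <|begin_of_thought|> ... <|end_of_thought|>,
then state the cause-effect pairs inside <|begin_of_solution|> ... <|end_of_solution|>.
"""

def render_prompt(passage):
    """Fill the template with one source passage."""
    return CAUSAL_TEMPLATE.format(passage=passage.strip())
```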
- `simple_agent.py` - Basic agent interface for Ollama models
- `download_cc_sample.py` - Tool for downloading data from CommonCrawl
- `generate_ethical_data.py` - Generate ethical reasoning training data
- `ethical_agent.py` - Specialized agent for ethical reasoning
- `test_ollama.py` - Diagnostic tool for Ollama API
- `generate_single_example.py` - Generate a single training example
- `.env.sample` - Example environment configuration
To see what models are available on your Ollama instance, run:
```shell
ollama list
```
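The same information is available programmatically from Ollama's `/api/tags` endpoint, which returns JSON containing a `models` array. A sketch of pulling out the model names (response shape per Ollama's API; verify against your version):

```python
import json
import urllib.request

def extract_model_names(tags_response):
    """Pull model names out of a decoded /api/tags response dict."""
    return [m["name"] for m in tags_response.get("models", [])]

def list_models(host="http://localhost:11434"):
    """Query a running Ollama server for its available models."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return extract_model_names(json.load(resp))

# models = list_models()  # requires a running Ollama instance
```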
If you encounter issues or have questions:
- Open an issue in this repository
- Provide detailed information about your environment and problem
- Share any error messages or unexpected behavior
We encourage you to share how you've used this framework:
- If you've created interesting examples, submit them through a pull request
- If you've extended the framework for a new capability, consider contributing your code
- Share your success stories in the discussions section
All data in this repository is available under the MIT License. You are free to:
- Use it for research purposes
- Include it in your own projects
- Build upon it for your own applications
We only ask that you:
- Cite this repository if you use it in academic work
- Consider contributing back improvements or extensions