This repository provides tools and infrastructure for creating, sharing, and utilizing high-quality synthetic training data for fine-tuning language models through Ollama. Our primary focus is building a community-driven collection of specialized training examples for capabilities like ethical reasoning, with a structured methodology that anyone can follow.
Our goals are to:
- Enable community-driven data creation: Provide tools that let anyone generate valuable synthetic training data
- Build a shared resource: Collect diverse, high-quality examples that benefit the entire AI community
- Standardize data formats: Create consistent, well-structured training examples for specialized capabilities
- Foster collaboration: Make it easy to contribute new data and methodologies
- Simple Agent Chat Interface: Tools for interacting with Ollama models
- CommonCrawl Data Extraction: Utilities to download and process data from CommonCrawl
- Synthetic Training Data Generation: Scripts to create structured training examples
- Ethical Reasoning Agent: Implementation demonstrating specialized capabilities
- Python 3.6+
- Ollama installed and running locally (or remotely)
- Language models loaded in Ollama (e.g., deepseek-r1, llama3, etc.)
- MikeyBeez/Ollama_Agents (for some components - see installation instructions below)
1. Clone the repository:

   ```shell
   git clone https://github.com/MikeyBeez/Ollama_Experiments.git
   cd Ollama_Experiments
   ```

2. Create a virtual environment:

   ```shell
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```shell
   pip install -r requirements.txt
   ```

4. Set up Ollama_Agents:

   ```shell
   # Clone the Ollama_Agents repository
   git clone https://github.com/MikeyBeez/Ollama_Agents.git ../Ollama_Agents

   # Add to your Python path (you may want to add this to your .bashrc or .zshrc)
   export PYTHONPATH="../Ollama_Agents:$PYTHONPATH"

   # Install its dependencies
   pip install -r ../Ollama_Agents/requirements.txt
   ```

5. Configure the environment:

   ```shell
   cp .env.sample .env
   # Edit the .env file to set your preferred model and Ollama host
   ```
For a step-by-step walkthrough of the entire process from setup to contributing, see our Complete Tutorial.
This repository demonstrates creating synthetic training data for AI models with specific objective functions, such as ethical reasoning. Here's the general workflow:
CommonCrawl is a vast repository of web crawl data and a rich source of human-generated content for building training datasets.
```shell
# Download a sample from CommonCrawl
python download_cc_sample.py --size 100  # Download ~100MB of data
```
The script will:
- Download WARC files from CommonCrawl
- Extract and process the data into a more usable format
- Save the processed data to the `data/jsonl` directory
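Each line of the processed output is one JSON record, so the files can be streamed without loading everything into memory. A minimal sketch of reading them back (the field names in the usage comment are assumptions; check your actual output):

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Yield one record per line from a JSON Lines file, skipping blanks."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: count records across all processed files
# total = sum(1 for p in Path("data/jsonl").glob("*.jsonl")
#             for _ in load_jsonl(p))
```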
Once you have source data, you can generate synthetic training examples:
```shell
# Generate ethical reasoning training data
python generate_ethical_data.py --count 20  # Generate 20 examples
```
For quicker testing of the synthetic data generation:
```shell
# Test with a single example
python generate_single_example.py --output data/test_example.json
```
The repository includes an implementation of an ethical reasoning agent that can analyze scenarios and provide structured ethical analyses:
```shell
# Run with preset scenarios
python ethical_agent.py

# Analyze a specific scenario
python ethical_agent.py --scenario "Companies track user data without consent" --category privacy

# Interactive mode
python ethical_agent.py --interactive
```
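Under the hood, an agent like this needs little more than a prompt builder and a call to Ollama's HTTP API. A minimal sketch (the prompt wording is illustrative, not the repository's actual template; the endpoint and payload follow Ollama's standard `/api/generate` interface):

```python
import json
import urllib.request

def build_prompt(scenario, category):
    """Compose an analysis prompt; the wording here is illustrative only."""
    return (
        f"Analyze the following scenario for ethical issues of {category}.\n"
        f"Scenario: {scenario}\n"
        "Give a detailed analysis, then a clear conclusion."
    )

def query_ollama(prompt, model="llama3", host="http://localhost:11434"):
    """Send a single non-streaming generate request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

prompt = build_prompt("Companies track user data without consent", "privacy")
# answer = query_ollama(prompt)  # requires a running Ollama instance
```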
The synthetic training data is structured to train models with specialized capabilities:
```json
{
  "passage": "Text describing an ethically relevant situation",
  "category": "privacy|fairness|autonomy|harm|deception",
  "reasoning": "<|begin_of_thought|>\nDetailed ethical analysis...\n<|end_of_thought|>\n\n<|begin_of_solution|>\nEthical conclusion and recommendations...\n<|end_of_solution|>"
}
```
This format encourages models to:
- Perform thorough analysis in the "thought" section
- Provide clear conclusions in the "solution" section
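Because fine-tuning silently absorbs malformed records, it is worth checking each example against this schema before submitting it. A sketch of such a check (the exact rules are an assumption based on the format above):

```python
CATEGORIES = {"privacy", "fairness", "autonomy", "harm", "deception"}

REQUIRED_TAGS = (
    "<|begin_of_thought|>", "<|end_of_thought|>",
    "<|begin_of_solution|>", "<|end_of_solution|>",
)

def validate_example(example):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for key in ("passage", "category", "reasoning"):
        if not example.get(key):
            problems.append(f"missing or empty field: {key}")
    if example.get("category") not in CATEGORIES:
        problems.append(f"unknown category: {example.get('category')!r}")
    reasoning = example.get("reasoning", "")
    for tag in REQUIRED_TAGS:
        if tag not in reasoning:
            problems.append(f"reasoning is missing {tag}")
    return problems
```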
We welcome and encourage contributions of synthetic training data and improvements to the generation methodology. Here's how you can contribute:
- Generate synthetic examples using the tools provided in this repository
- Validate your examples for quality and effectiveness
- Submit a pull request with your new data in the appropriate format
- Document your contribution including any special considerations or insights
- Make sure your examples follow the structured format described in this README
- Include metadata about how the examples were generated
- Ensure your examples don't contain personally identifiable information (PII)
- Test your examples with the provided agent implementations
- Datasets for new capabilities beyond ethical reasoning
- Improvements to the generation methodology
- New prompt templates that produce better results
- Tools for validating or filtering generated examples
This framework can be adapted for other specialized AI capabilities:
- Create a data extraction script to obtain relevant source material
- Define the objective function (what capability you want the AI to learn)
- Design a prompt template that elicits the desired reasoning pattern
- Generate synthetic examples using existing models
- Create a specialized agent that demonstrates the capability
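Steps 3 and 4 usually reduce to a parameterized prompt fed to an existing model. A sketch of what such a template might look like for a hypothetical "causal reasoning" capability (the names and wording are invented for illustration):

```python
# Hypothetical template for a new capability; reuses the thought/solution
# delimiters from the ethical reasoning format above.
CAUSAL_TEMPLATE = """\
Read the passage below and identify the causal chain it describes.

Passage: {passage}

First reason step by step inside <|begin_of_thought|> ... <|end_of_thought|>,
then state the cause-effect pairs inside <|begin_of_solution|> ... <|end_of_solution|>.
"""

def render_prompt(passage):
    """Fill the template with one source passage."""
    return CAUSAL_TEMPLATE.format(passage=passage.strip())
```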
- `simple_agent.py` - Basic agent interface for Ollama models
- `download_cc_sample.py` - Tool for downloading data from CommonCrawl
- `generate_ethical_data.py` - Generate ethical reasoning training data
- `ethical_agent.py` - Specialized agent for ethical reasoning
- `test_ollama.py` - Diagnostic tool for Ollama API
- `generate_single_example.py` - Generate a single training example
- `.env.sample` - Example environment configuration
To see what models are available on your Ollama instance, run:
```shell
ollama list
```
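The same information is available programmatically from Ollama's `/api/tags` endpoint, which returns JSON containing a `models` array. A sketch of pulling out the model names (response shape per Ollama's API; verify against your version):

```python
import json
import urllib.request

def extract_model_names(tags_response):
    """Pull model names out of a decoded /api/tags response dict."""
    return [m["name"] for m in tags_response.get("models", [])]

def list_models(host="http://localhost:11434"):
    """Query a running Ollama server for its available models."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return extract_model_names(json.load(resp))

# models = list_models()  # requires a running Ollama instance
```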
If you encounter issues or have questions:
- Open an issue in this repository
- Provide detailed information about your environment and problem
- Share any error messages or unexpected behavior
We encourage you to share how you've used this framework:
- If you've created interesting examples, submit them through a pull request
- If you've extended the framework for a new capability, consider contributing your code
- Share your success stories in the discussions section
All data in this repository is available under the MIT License. You are free to:
- Use it for research purposes
- Include it in your own projects
- Build upon it for your own applications
We only ask that you:
- Cite this repository if you use it in academic work
- Consider contributing back improvements or extensions