This tutorial will guide you through the entire process of contributing synthetic training data to this repository, from setting up your environment to submitting a pull request with your generated examples.
- Setting Up Your Environment
- Acquiring Data from CommonCrawl
- Generating Synthetic Training Examples
- Customizing the Generation Process
- Validating Your Generated Data
- Submitting Your Contribution
Before starting, make sure you have:
- Python 3.6 or higher installed
- Git installed
- Ollama installed (see Ollama installation guide)
- At least one language model loaded in Ollama (e.g., llama3, deepseek, etc.)
git clone https://github.com/MikeyBeez/Ollama_Experiments.git
cd Ollama_Experiments
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Some components of this project depend on the Ollama_Agents repository:
# Clone the Ollama_Agents repository (if you don't already have it)
git clone https://github.com/MikeyBeez/Ollama_Agents.git ../Ollama_Agents
# Add to your Python path
export PYTHONPATH="../Ollama_Agents:$PYTHONPATH"
# If you want to make this permanent, add to your shell configuration:
echo 'export PYTHONPATH="../Ollama_Agents:$PYTHONPATH"' >> ~/.bashrc # or ~/.zshrc
# Install its dependencies
pip install -r ../Ollama_Agents/requirements.txt
cp .env.sample .env
Open the .env file in your editor and configure it:
# Ollama Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3 # Or another model you have installed
# API Parameters
TEMPERATURE=0.7
TOP_P=0.9
MAX_TOKENS=2048
# Data Generation Settings
EXAMPLES_PER_BATCH=5
OUTPUT_DIR=./data
# Ethical Categories
ETHICAL_CATEGORIES=privacy,fairness,autonomy,harm,deception,general_ethics
# Logging
LOG_LEVEL=INFO
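These settings are loaded at runtime from the .env file. A minimal sketch of how that typically works, assuming the scripts use the python-dotenv package (the repository's exact loading code may differ):
# Sketch: reading .env settings at runtime (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3")
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "2048"))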
Finally, create a new branch for your contribution:
git checkout -b add-my-training-examples
CommonCrawl is a free repository of web crawl data that provides a great source of diverse text from the web.
CommonCrawl data is stored in WARC (Web ARChive) files, which are large archives of web content. Our tools simplify the process of downloading and extracting useful content from these files.
python download_cc_sample.py --size 100 # Download ~100MB of data
This will:
- Download WARC files from CommonCrawl
- Extract text content from web pages
- Filter for relevant content
- Save the processed data to the data/jsonl directory
# List the downloaded files
ls -la data/jsonl/
# View a sample of the extracted content
head -n 50 data/jsonl/CC-MAIN-*.jsonl | jq
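Under the hood, the extraction step boils down to iterating over WARC response records and stripping each page down to plain text. A rough sketch of that idea, assuming the warcio and beautifulsoup4 libraries (the repository's script may use different tooling and filters):
# Sketch: extract plain text from a local WARC file into JSONL
# (assumes warcio and beautifulsoup4; not the repository's exact implementation)
import json
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def warc_to_jsonl(warc_path, out_path, min_chars=500):
    with open(warc_path, "rb") as stream, open(out_path, "w") as out:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            html = record.content_stream().read()
            text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
            if len(text) < min_chars:
                continue  # drop pages with too little usable text
            url = record.rec_headers.get_header("WARC-Target-URI")
            out.write(json.dumps({"url": url, "text": text}) + "\n")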
Now that you have raw text data, you can generate synthetic training examples.
Our default example uses ethical reasoning as the objective function. Each example follows this structure:
{
"passage": "Text describing an ethically relevant situation",
"category": "privacy|fairness|autonomy|harm|deception",
"reasoning": "<|begin_of_thought|>\nDetailed ethical analysis...\n<|end_of_thought|>\n\n<|begin_of_solution|>\nEthical conclusion and recommendations...\n<|end_of_solution|>"
}
# Generate 20 ethical reasoning examples
python generate_ethical_data.py --count 20 --output data/my_ethical_examples.json
For quicker iteration, you can generate just one example:
python generate_single_example.py --output data/test_example.json
You can customize the generation process to create different types of examples or improve the quality.
Examine the prompt template in generate_ethical_data.py and customize it for your needs:
# Example of a prompt template section in the code
prompt = f"""
Given the following situation:
"{text}"
Analyze this situation from an ethical perspective related to {category}.
...
"""
Edit ETHICAL_CATEGORIES in your .env file, or pass categories directly via command-line arguments:
python generate_ethical_data.py --categories privacy,fairness,autonomy --count 10
Change parameters like temperature and max tokens in your .env file or via the command line:
python generate_ethical_data.py --temperature 0.8 --max_tokens 3000 --count 10
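These parameters are ultimately passed to Ollama's generation endpoint. A rough sketch of what such a call looks like, assuming the script talks to Ollama's HTTP API via the requests library (the actual implementation may use the official ollama client instead):
# Sketch: one generation request against Ollama's REST API (assumes the requests library)
import requests

prompt = "Analyze the following situation from an ethical perspective..."
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.8,   # higher values produce more varied examples
            "top_p": 0.9,
            "num_predict": 3000,  # Ollama's name for the max-tokens limit
        },
    },
    timeout=300,
)
generated_text = response.json()["response"]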
Before contributing, validate your generated examples to ensure they're high-quality.
# Run the ethical agent with your examples
python ethical_agent.py --examples data/my_ethical_examples.json
Open your generated JSON file and review the examples to ensure they:
- Have meaningful and diverse content
- Follow the correct format structure
- Contain thoughtful and useful analysis
- Don't contain problematic or harmful content
Edit your JSON file to remove or fix any examples that don't meet your quality standards.
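A quick scripted pass can catch format problems before the manual review. A minimal sketch of such a check, assuming the output file is a JSON array of examples with the fields shown earlier (the helper itself is hypothetical, not part of the repository):
# Sketch: flag examples that do not match the expected structure
import json

REQUIRED_KEYS = {"passage", "category", "reasoning"}
MARKERS = ("<|begin_of_thought|>", "<|end_of_thought|>",
           "<|begin_of_solution|>", "<|end_of_solution|>")

with open("data/my_ethical_examples.json") as f:
    examples = json.load(f)

for i, ex in enumerate(examples):
    problems = []
    if not REQUIRED_KEYS <= ex.keys():
        problems.append(f"missing keys: {sorted(REQUIRED_KEYS - ex.keys())}")
    for marker in MARKERS:
        if marker not in ex.get("reasoning", ""):
            problems.append(f"missing marker: {marker}")
    if problems:
        print(f"Example {i}: " + "; ".join(problems))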
Once you have generated and validated your training examples, you can contribute them back to the repository.
git add data/my_ethical_examples.json
git commit -m "Add new ethical reasoning examples focusing on [your focus area]"
First, fork the repository on GitHub, then:
git remote add fork https://github.com/[YOUR_USERNAME]/Ollama_Experiments.git
git push -u fork add-my-training-examples
- Go to the original repository: https://github.com/MikeyBeez/Ollama_Experiments
- Click "Pull Requests" and then "New Pull Request"
- Click "compare across forks"
- Select your fork and branch
- Fill out the PR template with information about your contribution:
- What type of examples did you create?
- How many examples are included?
- What methodology did you use?
- Any special considerations or insights?
The maintainers may provide feedback or ask for changes before accepting your contribution. Be ready to:
- Address any formatting issues
- Improve the quality of examples if needed
- Answer questions about your generation process
If you want to create a different type of training data beyond ethical reasoning:
Decide what capability you want to teach models:
- Mathematical reasoning
- Legal analysis
- Creative writing
- Scientific explanation
- etc.
Create a structured format that encourages the desired reasoning pattern, similar to our thought/solution format.
Copy and modify one of our existing scripts:
cp generate_ethical_data.py generate_my_objective_data.py
Then edit it to implement your objective function and prompt template.
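For instance, a mathematical-reasoning variant might only need a new category list and prompt template while keeping the same thought/solution markers. A hypothetical sketch (names and wording are illustrative, not taken from the repository):
# Hypothetical snippet for a math-reasoning objective; mirrors the ethical example format
MATH_CATEGORIES = ["algebra", "geometry", "probability", "number_theory"]

def build_prompt(text: str, category: str) -> str:
    return f"""
Given the following passage:
"{text}"
Pose a {category} problem inspired by the passage, then solve it.
Put your step-by-step working between <|begin_of_thought|> and <|end_of_thought|>,
and the final answer between <|begin_of_solution|> and <|end_of_solution|>.
"""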
Create documentation explaining:
- Your objective function
- The data format
- How to generate examples
- How to validate results
Congratulations! By following this tutorial, you've learned how to:
- Set up the environment for working with this repository
- Download and process data from CommonCrawl
- Generate synthetic training examples
- Customize the generation process
- Validate your examples
- Contribute your examples back to the repository
Your contribution helps build a valuable resource for the entire AI community. Thank you for participating in this collaborative effort!
If you encounter errors connecting to Ollama:
ERROR: Could not connect to Ollama server at http://localhost:11434
Solutions:
- Ensure Ollama is running:
ollama serve
- Check that the Ollama host in your .env file is correct
- Try a different port if you're running Ollama on a custom port
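A quick way to confirm the server is reachable from Python (Ollama's /api/tags endpoint simply lists the installed models):
# Quick connectivity check against the Ollama server (assumes the requests library)
import requests

try:
    r = requests.get("http://localhost:11434/api/tags", timeout=5)
    r.raise_for_status()
    print("Connected. Models:", [m["name"] for m in r.json().get("models", [])])
except requests.RequestException as exc:
    print("Could not reach Ollama:", exc)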
If you get an error that the model doesn't exist:
ERROR: Model [model_name] not found
Solutions:
- List available models:
ollama list
- Pull the model you want to use:
ollama pull llama3
- Update your .env file to use an available model
If you encounter memory errors during generation:
ERROR: CUDA out of memory
Solutions:
- Reduce the batch size in the .env file
- Use a smaller model
- Process fewer examples at once:
--count 5
If you see JSON parsing errors in your generated examples:
ERROR: Invalid JSON in generated example
Solutions:
- Adjust temperature to a lower value (e.g., 0.5)
- Use a different model that produces more consistent output
- Check and fix the format manually in problematic examples
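If only the occasional example is malformed, it can sometimes be salvaged by trimming the model's extra text around the JSON object. A rough sketch of that idea (a hypothetical helper, not part of the repository):
# Sketch: try to recover a JSON object embedded in noisy model output
import json

def extract_json(raw: str):
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # no object-like span found
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None  # still malformed; regenerate or fix by hand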