Create synthetic specialized datasets to fine-tune the LLM.

KazKozDev/dataset-creator

A synthetic data generation platform for creating and managing training datasets for LLM fine-tuning.

SynthGen. Data Reimagined

It uses foundation models to generate domain-specific examples through a web interface, helping ML engineers and organizations produce high-quality data for custom AI solutions.

Core Features

Foundation Model Integration

  • Multi-Provider Support: Unified API framework for seamless integration with Ollama, OpenAI, and other LLM providers
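One way to picture a unified provider layer is a small interface that every backend implements, plus a registry keyed by provider name. The sketch below is illustrative only: `Provider`, `EchoProvider`, and `ProviderRegistry` are hypothetical names, not classes taken from this repository, and the stand-in provider calls no real model.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Provider(ABC):
    """Hypothetical common interface; not the project's actual class."""

    name: str

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's completion for the given prompt."""


@dataclass
class EchoProvider(Provider):
    """Stand-in provider for local experiments; it calls no real model."""

    name: str = "echo"

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class ProviderRegistry:
    """Maps provider names to provider instances."""

    def __init__(self) -> None:
        self._providers: dict[str, Provider] = {}

    def register(self, provider: Provider) -> None:
        self._providers[provider.name] = provider

    def get(self, name: str) -> Provider:
        try:
            return self._providers[name]
        except KeyError:
            raise ValueError(f"unknown provider: {name!r}") from None

    def names(self) -> list[str]:
        return sorted(self._providers)
```

A real Ollama or OpenAI adapter would implement `generate` by calling that provider's own client; the calling code would stay the same.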

Training Data Engineering

  • Domain-Specific Generation Pipeline: Create datasets tailored to vertical applications with configurable quality parameters
  • Batch Processing Orchestration: Generate and process multiple entries with distributed task management
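The README does not describe how batch jobs are scheduled, so the following is only a minimal local sketch of the idea: run a generation function over many prompts concurrently and record a per-entry error instead of aborting the whole batch. `generate_batch` and its result keys are hypothetical, and a thread pool stands in for whatever task management the project actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def generate_batch(
    prompts: list[str],
    generate: Callable[[str], str],
    max_workers: int = 4,
) -> list[dict[str, Optional[str]]]:
    """Run generate() over prompts concurrently, preserving input order."""

    def run_one(prompt: str) -> dict[str, Optional[str]]:
        try:
            return {"prompt": prompt, "completion": generate(prompt), "error": None}
        except Exception as exc:  # keep going; report the failure per entry
            return {"prompt": prompt, "completion": None, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return list(pool.map(run_one, prompts))
```

Because failures are captured per entry, a single bad prompt or provider timeout leaves the rest of the batch usable.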

Quality Assurance System

  • Data Validation Framework: Ensure dataset quality through comprehensive validation protocols
  • Modern Interface Architecture: React/Chakra UI implementation with advanced state management
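The validation rules themselves are not listed in this README. As a rough illustration of what dataset validation can involve, the sketch below flags empty fields, very short completions, and exact duplicates. The `prompt`/`completion` field names, the `min_chars` threshold, and the function name are all assumptions.

```python
def validate_entries(entries: list[dict], min_chars: int = 10) -> list[dict]:
    """Return a list of issues found; an empty list means every check passed."""
    issues = []
    seen = set()
    for index, entry in enumerate(entries):
        prompt = str(entry.get("prompt") or "").strip()
        completion = str(entry.get("completion") or "").strip()
        if not prompt or not completion:
            issues.append({"index": index, "issue": "missing prompt or completion"})
            continue
        if len(completion) < min_chars:
            issues.append({"index": index, "issue": "completion too short"})
        key = (prompt.lower(), completion.lower())
        if key in seen:
            issues.append({"index": index, "issue": "duplicate entry"})
        seen.add(key)
    return issues
```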

Production Deployment

  1. Clone the repository:
     git clone https://github.com/KazKozDev/dataset-creator.git
  2. Change to the project directory:
     cd dataset-creator
  3. Start the containers with Docker Compose:
     docker-compose up -d

Access the application at: http://localhost:3000

Development Environment

Backend Services:

cd backend
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
uvicorn main:app --reload

Frontend Application:

cd frontend
npm install
npm start

Architecture

The application implements a cloud-native architecture with emphasis on scalability:

| Component | Technology | Implementation Details |
|-----------|------------|------------------------|
| Frontend | React, Chakra UI | Responsive SPA with comprehensive state management |
| API Layer | Python FastAPI | RESTful services with asynchronous processing capabilities |
| Database | PostgreSQL | Optimized schema for dataset versioning and metadata |
| Deployment | Docker | Containerized services with environment isolation |

Usage

Provider Configuration

  1. Settings → Select provider → Configure authentication parameters

Dataset Generation Workflow

  1. Generator → Select domain → Define quality parameters → Execute pipeline

Quality Management

  1. Open dataset → Run validation suite → Apply improvements → Export production-ready dataset
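The export format is not specified here. JSON Lines (one JSON object per line) is a common choice for fine-tuning data, so the sketch below assumes it; `export_jsonl` is a hypothetical helper, not part of the application.

```python
import json
from pathlib import Path


def export_jsonl(entries: list[dict], path: str) -> int:
    """Write one JSON object per line and return the number of entries written."""
    target = Path(path)
    with target.open("w", encoding="utf-8") as handle:
        for entry in entries:
            # ensure_ascii=False keeps non-English text readable in the file
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(entries)
```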

API Integration

Access comprehensive API documentation at http://localhost:8000/docs after deployment.

Key service endpoints:

  • GET /api/datasets - List datasets with quality metrics
  • POST /api/datasets - Create dataset with configuration parameters
  • GET /api/providers - List available LLM providers with capabilities
  • GET /api/tasks - Monitor task execution status
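The endpoints above can be called from any HTTP client. Below is a minimal standard-library sketch; it assumes the API is served from the same host and port as the documentation page (http://localhost:8000). Request and response schemas are not documented in this README, so check `/docs` for the real ones; `build_request` and `call` are hypothetical helper names.

```python
import json
from typing import Any, Optional
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: same host and port as /docs


def build_request(method: str, path: str, payload: Optional[dict] = None) -> Request:
    """Build a JSON request for one of the endpoints without sending it."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(request: Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the services running, `call(build_request("GET", "/api/providers"))` would return the decoded response for the provider list.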

License

MIT License. See LICENSE file for details.


If you like this project, please give it a star ⭐

For questions, feedback, or support, reach out to:

Artem KK | GitHub Issues
