SynthGen. Data Reimagined
A synthetic data generation platform for creating and managing training datasets for LLM fine-tuning.
It uses foundation models to generate domain-specific examples through a web interface, helping ML engineers and organizations produce high-quality data for custom AI solutions.
- Multi-Provider Support: Unified API framework for seamless integration with Ollama, OpenAI, and other LLM providers
- Domain-Specific Generation Pipeline: Create datasets tailored to vertical applications with configurable quality parameters
- Batch Processing Orchestration: Generate and process multiple entries with distributed task management
- Data Validation Framework: Ensure dataset quality through comprehensive validation protocols
- Modern Interface Architecture: React/Chakra UI implementation with advanced state management
- Clone the repository:
git clone https://github.com/KazKozDev/dataset-creator.git
- Change to the project directory:
cd dataset-creator
- Start the services with Docker Compose:
docker-compose up -d
Access the application at: http://localhost:3000
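Containers started with `docker-compose up -d` may take a few seconds before they accept connections. The following is a minimal sketch of a readiness check, not part of the project: the helper names `wait_until_up` and `_http_probe` and the timeout values are illustrative assumptions, and only the URL comes from the instructions above.

```python
import time
import urllib.error
import urllib.request


def _http_probe(url):
    """Return True if the server at `url` sends any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def wait_until_up(url, timeout_s=60.0, interval_s=2.0, probe=None):
    """Poll `url` until it responds or `timeout_s` elapses.

    `probe` is injectable so the loop can be exercised without a network.
    Hypothetical helper; the timeout and interval are arbitrary defaults.
    """
    probe = probe or _http_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_up("http://localhost:3000")` returns `True` once the frontend responds and `False` if it stays unreachable for the whole timeout.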
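To run without Docker, start the backend and the frontend separately.
- Start the backend:
cd backend
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
uvicorn main:app --reload
- Start the frontend from the project root, in a second terminal:
cd frontend
npm install
npm start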
The application is built from the following components, each running as a containerized service:
Component | Technology | Implementation Details |
---|---|---|
Frontend | React, Chakra UI | Responsive SPA with comprehensive state management |
API Layer | Python FastAPI | RESTful services with asynchronous processing capabilities |
Database | PostgreSQL | Optimized schema for dataset versioning and metadata |
Deployment | Docker | Containerized services with environment isolation |
- Settings → Select provider → Configure authentication parameters
- Generator → Select domain → Define quality parameters → Execute pipeline
- Open dataset → Run validation suite → Apply improvements → Export production-ready dataset
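To make the validation and export step more concrete, here is a rough sketch of the kind of checks a validation pass might run before export. It does not reflect the project's actual validation suite: the field names `instruction` and `output`, the duplicate rule, and the JSON Lines export format are all assumptions chosen for illustration.

```python
import json


def _text(example, key):
    """Return the field as stripped text, or '' if absent or None."""
    value = example.get(key)
    return "" if value is None else str(value).strip()


def validate_examples(examples, required=("instruction", "output")):
    """Split examples into (valid, issues) using simple structural checks.

    Hypothetical rules: every required field must be non-empty, and
    case-insensitive duplicates of an earlier example are rejected.
    """
    valid, issues, seen = [], [], set()
    for index, example in enumerate(examples):
        missing = [key for key in required if not _text(example, key)]
        if missing:
            issues.append((index, "missing or empty: " + ", ".join(missing)))
            continue
        fingerprint = tuple(_text(example, key).lower() for key in required)
        if fingerprint in seen:
            issues.append((index, "duplicate of an earlier example"))
            continue
        seen.add(fingerprint)
        valid.append(example)
    return valid, issues


def to_jsonl(examples):
    """Serialize examples as JSON Lines, one object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
```

The `issues` list pairs each rejected entry's index with a reason, so the rejected rows can be reviewed and fixed before the remaining rows are exported.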
Access comprehensive API documentation at http://localhost:8000/docs after deployment.
Key service endpoints:
- GET /api/datasets: List datasets with quality metrics
- POST /api/datasets: Create dataset with configuration parameters
- GET /api/providers: List available LLM providers with capabilities
- GET /api/tasks: Monitor task execution status
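The endpoints listed above can be called from a short script using only the Python standard library. This is a sketch under assumptions: the base URL is inferred from the documentation address, the helpers `build_request` and `call` are illustrative names, and the request and response schemas are not documented here, so any payload fields should be checked against http://localhost:8000/docs.

```python
import json
import urllib.request

# Assumption: the API is served on the same host and port as /docs.
BASE_URL = "http://localhost:8000"


def build_request(method, path, payload=None, base_url=BASE_URL):
    """Build a urllib Request for one of the endpoints listed above."""
    data = None
    headers = {"Accept": "application/json"}
    if payload is not None:
        # JSON-encode the body; the expected fields are not specified here.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def call(method, path, payload=None, base_url=BASE_URL):
    """Send the request and decode the JSON response body."""
    request = build_request(method, path, payload, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `call("GET", "/api/datasets")` and `call("GET", "/api/tasks")` return whatever JSON the service sends back; the shape of that JSON is defined by the service, not by this sketch.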
MIT License. See LICENSE file for details.
If you like this project, please give it a star ⭐
For questions, feedback, or support, reach out to: