
mstgnz · mstgnz · commit ac6cfeaacca3 · 2025-03-06T23:34:27.000+03:00
diff --git a/README.md b/README.md
@@ -2,9 +2,77 @@
 
 CleanGo is a library that performs data cleaning and transformation operations with the speed and efficiency of the Go language.
 
+## Project Purpose
+
+CleanGo aims to simplify and accelerate data cleaning processes, which are often the most time-consuming part of data analysis and machine learning workflows. By leveraging Go's performance and concurrency capabilities, CleanGo provides a robust toolkit for:
+
+- **Efficient Data Processing**: Handle large datasets with minimal memory footprint and maximum CPU utilization
+- **Format Flexibility**: Seamlessly work with various data formats (CSV, JSON, XML, YAML, Excel, Parquet) without needing multiple tools
+- **Consistent API**: Use the same interface for small and large datasets
+- **Automation-Ready**: Integrate into data pipelines via library, CLI, or microservice approaches
+- **Parallel Processing**: Utilize all available CPU cores for data cleaning tasks that can be parallelized
+
+Whether you're a data scientist cleaning datasets for analysis, a developer building ETL pipelines, or a data engineer maintaining data quality, CleanGo provides the tools to make your data cleaning processes faster and more efficient.
+
+## Use Cases
+
+CleanGo is designed to address a variety of data cleaning challenges:
+
+- **Data Preparation for Analysis**: Clean and transform raw data before feeding it into analysis tools or machine learning models
+- **ETL Processes**: Extract data from various sources, transform it with cleaning operations, and load it into target systems
+- **Data Migration**: Convert data between different formats while applying cleaning rules
+- **Data Quality Assurance**: Standardize dates, normalize text, handle missing values, and remove outliers
+- **Batch Processing**: Process large datasets efficiently with parallel execution
+- **Microservice Architecture**: Deploy as a standalone service for data cleaning operations in distributed systems
+
+## Target Audience
+
+- **Data Scientists**: Clean and prepare datasets for analysis and modeling
+- **Data Engineers**: Build robust data pipelines that require cleaning steps
+- **Backend Developers**: Integrate data cleaning capabilities into Go applications
+- **DevOps Engineers**: Add efficient tools to data processing workflows
+- **Analysts**: Standardize formats across data from multiple sources
+
+## Performance
+
+CleanGo is built with performance in mind:
+
+- **Parallel Processing**: Automatically distributes work across available CPU cores
+- **Memory Efficiency**: Processes data in chunks to minimize memory usage
+- **Optimized Algorithms**: Uses efficient algorithms for common cleaning operations
+- **Go's Speed**: Leverages the performance benefits of compiled Go code
+- **Benchmarked Operations**: Core operations are benchmarked to ensure optimal performance
+
+In benchmark tests, CleanGo can process millions of rows per second on modern hardware, making it suitable for both small datasets and large-scale data processing tasks.
+
+## Architecture
+
+CleanGo follows a modular architecture:
+
+- **Core DataFrame**: Central data structure that holds and manipulates tabular data
+- **Format Handlers**: Specialized modules for reading/writing different file formats (CSV, JSON, XML, YAML, Excel, Parquet)
+- **Cleaning Operations**: Individual functions for specific cleaning tasks
+- **Parallel Framework**: Infrastructure for executing operations in parallel
+- **CLI Layer**: Command-line interface for direct usage
+- **API Layer**: REST and gRPC interfaces for service-oriented usage
+
+This modular design allows for easy extension with new formats or cleaning operations while maintaining a consistent interface.
+
+## Future Plans
+
+The CleanGo project is actively developed with the following features planned for future releases:
+
+- **Additional Data Formats**: Support for more specialized formats like Avro, ORC, and database connections
+- **Advanced Cleaning Operations**: More sophisticated data cleaning algorithms including fuzzy matching and ML-based anomaly detection
+- **Streaming Support**: Process data in streaming mode for real-time applications
+- **Web UI**: A simple web interface for interactive data cleaning
+- **Plugin System**: Allow third-party extensions for custom formats and operations
+- **Cloud Integration**: Native connectors for cloud storage services (S3, GCS, Azure Blob)
+- **Performance Optimizations**: Continuous improvements to processing speed and memory usage
+
 ## Features
 
-- ✅ Reading and writing data in CSV, JSON, XML, Excel, and Parquet formats
+- ✅ Reading and writing data in CSV, JSON, XML, YAML, Excel, and Parquet formats
 - ✅ Powerful data cleaning functions
 - ✅ High performance with parallel processing support
 - ✅ Both library usage and CLI support
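Review note on the Performance section added above: the bullets describe distributing work across CPU cores and processing data in chunks, but the hunk does not show how. The following is a minimal sketch of that idea, assuming rows are held as `[][]string`; `trimRowsParallel` is a hypothetical helper written for illustration and is not taken from the CleanGo source.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// trimRowsParallel trims whitespace from every cell in place. The rows are
// split into contiguous chunks and each chunk is handled by its own goroutine.
// Hypothetical helper for illustration; not part of the CleanGo API.
func trimRowsParallel(rows [][]string, workers int) {
	if workers < 1 {
		workers = 1
	}
	chunk := (len(rows) + workers - 1) / workers // ceiling division
	if chunk == 0 {
		return // no rows, nothing to do
	}

	var wg sync.WaitGroup
	for start := 0; start < len(rows); start += chunk {
		end := start + chunk
		if end > len(rows) {
			end = len(rows)
		}
		wg.Add(1)
		// Each goroutine receives a disjoint slice of rows, so no locking is needed.
		go func(part [][]string) {
			defer wg.Done()
			for _, row := range part {
				for i, cell := range row {
					row[i] = strings.TrimSpace(cell)
				}
			}
		}(rows[start:end])
	}
	wg.Wait()
}

func main() {
	rows := [][]string{
		{" 1 ", " John Doe "},
		{"2", "Jane Smith  "},
		{" 3", "\tBob Johnson"},
	}
	trimRowsParallel(rows, runtime.NumCPU())
	fmt.Printf("%q\n", rows)
}
```

Splitting by contiguous row ranges keeps each goroutine on its own memory, which avoids data races without a mutex. Whether CleanGo chunks by rows, by columns, or by file segments is not visible in this diff.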
@@ -65,6 +133,7 @@ POST /clean
 - CSV
 - JSON
 - XML
+- YAML
 - Excel (xlsx, xls)
 - Parquet
 
diff --git a/examples/README.md b/examples/README.md
@@ -7,6 +7,7 @@ This directory contains example files demonstrating the usage of the CleanGo lib
 - `sample_data.csv`: Example data in CSV format
 - `sample_data.json`: Example data in JSON format
 - `sample_data.xml`: Example data in XML format
+- `sample_data.yaml`: Example data in YAML format
 - `sample_data.xlsx`: Example data in Excel format
 - `sample_data.parquet`: Example data in Parquet format
 - `api_request.json`: Example JSON request for API
@@ -38,6 +39,12 @@ XML file cleaning:
 cleango clean examples/sample_data.xml --trim --date-format="created_at:2006-01-02" --root-element="root" --item-element="item" --output=cleaned.xml
 ```
 
+YAML file cleaning:
+
+```bash
+cleango clean examples/sample_data.yaml --trim --date-format="created_at:2006-01-02" --output=cleaned.yaml
+```
+
 Excel file cleaning:
 
 ```bash
@@ -55,6 +62,7 @@ Reading CSV file and saving in different formats:
 ```bash
 cleango clean examples/sample_data.csv --trim --format=json --output=cleaned.json
 cleango clean examples/sample_data.csv --trim --format=xml --output=cleaned.xml
+cleango clean examples/sample_data.csv --trim --format=yaml --output=cleaned.yaml
 cleango clean examples/sample_data.csv --trim --format=excel --output=cleaned.xlsx
 cleango clean examples/sample_data.csv --trim --format=parquet --output=cleaned.parquet
 ```
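Review note: the commands above pass `--format` explicitly alongside an `--output` path whose extension already names the format. A small sketch of how a format name could be derived from the extension is shown below. `formatFromPath` is hypothetical, and accepting `.yml` as an alias for YAML is an assumption; this diff does not show whether the CLI does either.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// formatFromPath guesses a format name from a file extension.
// The returned names mirror the --format values used in the examples;
// the function itself is hypothetical and not part of CleanGo.
func formatFromPath(path string) (string, error) {
	ext := strings.ToLower(filepath.Ext(path))
	switch ext {
	case ".csv":
		return "csv", nil
	case ".json":
		return "json", nil
	case ".xml":
		return "xml", nil
	case ".yaml", ".yml": // ".yml" support is an assumption
		return "yaml", nil
	case ".xlsx", ".xls":
		return "excel", nil
	case ".parquet":
		return "parquet", nil
	default:
		return "", fmt.Errorf("unsupported file extension %q", ext)
	}
}

func main() {
	for _, p := range []string{"cleaned.yaml", "data.YML", "report.txt"} {
		format, err := formatFromPath(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", p, format)
	}
}
```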
@@ -166,6 +174,12 @@ func main() {
 		log.Fatalf("Error: %v", err)
 	}
 
+	// Read YAML file
+	yamlDf, err := cleaner.ReadYAML("examples/sample_data.yaml")
+	if err != nil {
+		log.Fatalf("Error: %v", err)
+	}
+
 	// Basic cleaning operations
 	df.TrimColumns()
 	df.CleanDates("created_at", "2006-01-02")
@@ -181,6 +195,7 @@ func main() {
 	df.WriteCSV("cleaned.csv")
 	df.WriteJSON("cleaned.json")
 	df.WriteXML("cleaned.xml", formats.WithXMLPretty(true), formats.WithXMLRootElement("users"), formats.WithXMLItemElement("user"))
+	df.WriteYAML("cleaned.yaml", formats.WithYAMLPretty(true))
 	df.WriteExcel("cleaned.xlsx", formats.WithSheetName("Cleaned Data"))
 	df.WriteParquet("cleaned.parquet", cleaner.WithParquetCompression(parquet.CompressionCodec_SNAPPY))
 
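Review note: the `"2006-01-02"` argument to `CleanDates` (and to `--date-format` in the CLI examples) is a Go time layout, written in terms of the reference time `Mon Jan 2 15:04:05 MST 2006`, not a literal date. The sketch below shows how such a layout behaves with the standard `time` package. `normalizeDate` and the list of input layouts are assumptions for illustration; how `CleanDates` detects input formats is not shown in this diff.

```go
package main

import (
	"fmt"
	"time"
)

// normalizeDate tries each candidate input layout in order and re-formats the
// first successful parse using the target layout.
// Hypothetical helper; not the CleanGo implementation of CleanDates.
func normalizeDate(value, target string, inputs []string) (string, error) {
	for _, layout := range inputs {
		if t, err := time.Parse(layout, value); err == nil {
			return t.Format(target), nil
		}
	}
	return "", fmt.Errorf("unrecognized date %q", value)
}

func main() {
	// Candidate layouts: ISO date, day/month/year, and "Jan 2, 2006" style.
	inputs := []string{"2006-01-02", "02/01/2006", "Jan 2, 2006"}
	values := []string{"2023-01-15", "20/02/2023", "Mar 10, 2023", "not a date"}

	for _, v := range values {
		out, err := normalizeDate(v, "2006-01-02", inputs)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", v, out)
	}
}
```

The order of the input layouts matters when a value is ambiguous, for example `03/04/2023` under day-first versus month-first layouts.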
diff --git a/examples/sample_data.yaml b/examples/sample_data.yaml
@@ -0,0 +1,79 @@
+- id: 1
+  name: John Doe
+  email: john@example.com
+  age: 30
+  created_at: 2023-01-15
+  active: true
+  score: 85.5
+
+- id: 2
+  name: Jane Smith
+  email: jane@example.com
+  age: 25
+  created_at: 2023-02-20
+  active: true
+  score: 92.0
+
+- id: 3
+  name: Bob Johnson
+  email: bob@example.com
+  age: 40
+  created_at: 2023-03-10
+  active: false
+  score: 78.3
+
+- id: 4
+  name: Alice Brown
+  email: alice@example.com
+  age: 35
+  created_at: 2023-04-05
+  active: true
+  score: 88.7
+
+- id: 5
+  name: Charlie Wilson
+  email: charlie@example.com
+  age: 28
+  created_at: 2023-05-12
+  active: false
+  score: 76.2
+
+- id: 6
+  name: Eva Davis
+  email: eva@example.com
+  age: 32
+  created_at: 2023-06-18
+  active: true
+  score: 90.1
+
+- id: 7
+  name: Frank Miller
+  email: frank@example.com
+  age: 45
+  created_at: 2023-07-22
+  active: true
+  score: 82.9
+
+- id: 8
+  name: Grace Taylor
+  email: grace@example.com
+  age: 27
+  created_at: 2023-08-30
+  active: false
+  score: 79.5
+
+- id: 9
+  name: Henry Clark
+  email: henry@example.com
+  age: 38
+  created_at: 2023-09-14
+  active: true
+  score: 87.3
+
+- id: 10
+  name: Ivy Robinson
+  email: ivy@example.com
+  age: 29
+  created_at: 2023-10-25
+  active: true
+  score: 91.8
diff --git a/go.mod b/go.mod
@@ -23,4 +23,5 @@ require (
 	golang.org/x/net v0.30.0 // indirect
 	golang.org/x/text v0.19.0 // indirect
 	golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
+	gopkg.in/yaml.v3 v3.0.1 // indirect
 )
diff --git a/pkg/cleaner/yaml.go b/pkg/cleaner/yaml.go
@@ -0,0 +1,20 @@
+package cleaner
+
+import (
+	"github.com/mstgnz/cleango/pkg/formats"
+)
+
+// ReadYAML reads a YAML file and returns a DataFrame
+func ReadYAML(filePath string, options ...formats.YAMLOption) (*DataFrame, error) {
+	headers, data, err := formats.ReadYAMLToRaw(filePath, options...)
+	if err != nil {
+		return nil, err
+	}
+
+	return NewDataFrame(headers, data)
+}
+
+// WriteYAML writes a DataFrame to a YAML file
+func (df *DataFrame) WriteYAML(filePath string, options ...formats.YAMLOption) error {
+	return formats.WriteYAML(df, filePath, options...)
+}
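Review note: `ReadYAML` relies on `formats.ReadYAMLToRaw` returning `headers` and `data`, but the body of `pkg/formats/yaml.go` is not included in this diff. As a rough sketch of the step that turns a decoded YAML list of mappings into tabular form, the code below assumes records arrive as `[]map[string]any` and rows are `[][]string`. Both types, the sorted header order, and the name `recordsToTable` are assumptions, not the library's actual behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// recordsToTable flattens decoded records into a header list and string rows.
// Headers are the union of all keys, sorted so the column order is stable
// (Go map iteration order is random). Missing or nil values become "".
// Hypothetical helper; not taken from pkg/formats.
func recordsToTable(records []map[string]any) ([]string, [][]string) {
	seen := make(map[string]struct{})
	for _, rec := range records {
		for key := range rec {
			seen[key] = struct{}{}
		}
	}

	headers := make([]string, 0, len(seen))
	for key := range seen {
		headers = append(headers, key)
	}
	sort.Strings(headers)

	rows := make([][]string, 0, len(records))
	for _, rec := range records {
		row := make([]string, len(headers))
		for i, h := range headers {
			if v, ok := rec[h]; ok && v != nil {
				row[i] = fmt.Sprint(v)
			}
		}
		rows = append(rows, row)
	}
	return headers, rows
}

func main() {
	records := []map[string]any{
		{"id": 1, "name": "John Doe", "active": true},
		{"id": 2, "name": "Jane Smith", "score": 92.0},
	}
	headers, rows := recordsToTable(records)
	fmt.Println(headers)
	for _, row := range rows {
		fmt.Printf("%q\n", row)
	}
}
```

Taking the union of keys is what makes a check such as "at least 7 columns" in the accompanying test meaningful when some records omit fields. A real implementation may prefer to preserve the key order of the first record instead of sorting.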
diff --git a/pkg/cleaner/yaml_test.go b/pkg/cleaner/yaml_test.go
@@ -0,0 +1,93 @@
+package cleaner
+
+import (
+	"os"
+	"testing"
+
+	"github.com/mstgnz/cleango/pkg/formats"
+)
+
+func TestReadWriteYAML(t *testing.T) {
+	// Create a temporary YAML file
+	tempFile, err := os.CreateTemp("", "test_*.yaml")
+	if err != nil {
+		t.Fatalf("Failed to create temp file: %v", err)
+	}
+	tempFileName := tempFile.Name()
+	tempFile.Close()
+	defer os.Remove(tempFileName)
+
+	// Create test data
+	yamlContent := `- id: 1
+  name: John Doe
+  email: john@example.com
+  age: 30
+  created_at: 2023-01-15
+  active: true
+  score: 85.5
+- id: 2
+  name: Jane Smith
+  email: jane@example.com
+  age: 25
+  created_at: 2023-02-20
+  active: true
+  score: 92.0
+- id: 3
+  name: Bob Johnson
+  email: bob@example.com
+  age: 40
+  created_at: 2023-03-10
+  active: false
+  score: 78.3`
+
+	// Write test data to file
+	err = os.WriteFile(tempFileName, []byte(yamlContent), 0644)
+	if err != nil {
+		t.Fatalf("Failed to write test data: %v", err)
+	}
+
+	// Read YAML file
+	df, err := ReadYAML(tempFileName)
+	if err != nil {
+		t.Fatalf("Failed to read YAML: %v", err)
+	}
+
+	// Check data
+	rows, cols := df.Shape()
+	if rows != 3 {
+		t.Errorf("Expected 3 rows, got %d", rows)
+	}
+	if cols < 7 {
+		t.Errorf("Expected at least 7 columns, got %d", cols)
+	}
+
+	// Create a temporary output file
+	outFile, err := os.CreateTemp("", "test_out_*.yaml")
+	if err != nil {
+		t.Fatalf("Failed to create output file: %v", err)
+	}
+	outFileName := outFile.Name()
+	outFile.Close()
+	defer os.Remove(outFileName)
+
+	// Write DataFrame to YAML
+	err = df.WriteYAML(outFileName, formats.WithYAMLPretty(true))
+	if err != nil {
+		t.Fatalf("Failed to write YAML: %v", err)
+	}
+
+	// Read the written file
+	df2, err := ReadYAML(outFileName)
+	if err != nil {
+		t.Fatalf("Failed to read written YAML: %v", err)
+	}
+
+	// Check data
+	rows2, cols2 := df2.Shape()
+	if rows2 != rows {
+		t.Errorf("Row count mismatch: got %d, want %d", rows2, rows)
+	}
+	if cols2 != cols {
+		t.Errorf("Column count mismatch: got %d, want %d", cols2, cols)
+	}
+}
diff --git a/pkg/formats/yaml.go b/pkg/formats/yaml.go
diff --git a/pkg/formats/yaml_test.go b/pkg/formats/yaml_test.go
