Commit ac6cfea

committed
yaml
1 parent 03fff2f commit ac6cfea

8 files changed

+610
-1
lines changed

README.md

+70-1
@@ -2,9 +2,77 @@
 
 CleanGo is a library that performs data cleaning and transformation operations with the speed and efficiency of the Go language.
 
+## Project Purpose
+
+CleanGo aims to simplify and accelerate data cleaning processes, which are often the most time-consuming part of data analysis and machine learning workflows. By leveraging Go's performance and concurrency capabilities, CleanGo provides a robust toolkit for:
+
+- **Efficient Data Processing**: Handle large datasets with minimal memory footprint and maximum CPU utilization
+- **Format Flexibility**: Seamlessly work with various data formats (CSV, JSON, XML, YAML, Excel, Parquet) without needing multiple tools
+- **Consistent API**: Use the same clean interface whether working with small or big data
+- **Automation-Ready**: Integrate into data pipelines via library, CLI, or microservice approaches
+- **Parallel Processing**: Utilize all available CPU cores for data cleaning tasks that can be parallelized
+
+Whether you're a data scientist cleaning datasets for analysis, a developer building ETL pipelines, or a data engineer maintaining data quality, CleanGo provides the tools to make your data cleaning processes faster and more efficient.
+
+## Use Cases
+
+CleanGo is designed to address a variety of data cleaning challenges:
+
+- **Data Preparation for Analysis**: Clean and transform raw data before feeding it into analysis tools or machine learning models
+- **ETL Processes**: Extract data from various sources, transform it with cleaning operations, and load it into target systems
+- **Data Migration**: Convert data between different formats while applying cleaning rules
+- **Data Quality Assurance**: Standardize dates, normalize text, handle missing values, and remove outliers
+- **Batch Processing**: Process large datasets efficiently with parallel execution
+- **Microservice Architecture**: Deploy as a standalone service for data cleaning operations in distributed systems
+
+## Target Audience
+
+- **Data Scientists**: Who need to clean and prepare datasets for analysis and modeling
+- **Data Engineers**: Building robust data pipelines that require cleaning steps
+- **Backend Developers**: Integrating data cleaning capabilities into Go applications
+- **DevOps Engineers**: Looking for efficient tools to include in data processing workflows
+- **Analysts**: Who work with data from multiple sources and need to standardize formats
+
+## Performance
+
+CleanGo is built with performance in mind:
+
+- **Parallel Processing**: Automatically distributes work across available CPU cores
+- **Memory Efficiency**: Processes data in chunks to minimize memory usage
+- **Optimized Algorithms**: Uses efficient algorithms for common cleaning operations
+- **Go's Speed**: Leverages the performance benefits of compiled Go code
+- **Benchmarked Operations**: Core operations are benchmarked to ensure optimal performance
+
+In benchmark tests, CleanGo can process millions of rows per second on modern hardware, making it suitable for both small datasets and large-scale data processing tasks.
+
+## Architecture
+
+CleanGo follows a modular architecture:
+
+- **Core DataFrame**: Central data structure that holds and manipulates tabular data
+- **Format Handlers**: Specialized modules for reading/writing different file formats (CSV, JSON, XML, Excel, Parquet)
+- **Cleaning Operations**: Individual functions for specific cleaning tasks
+- **Parallel Framework**: Infrastructure for executing operations in parallel
+- **CLI Layer**: Command-line interface for direct usage
+- **API Layer**: REST and gRPC interfaces for service-oriented usage
+
+This modular design allows for easy extension with new formats or cleaning operations while maintaining a consistent interface.
+
+## Future Plans
+
+The CleanGo project is actively developed with the following features planned for future releases:
+
+- **Additional Data Formats**: Support for more specialized formats like Avro, ORC, and database connections
+- **Advanced Cleaning Operations**: More sophisticated data cleaning algorithms including fuzzy matching and ML-based anomaly detection
+- **Streaming Support**: Process data in streaming mode for real-time applications
+- **Web UI**: A simple web interface for interactive data cleaning
+- **Plugin System**: Allow third-party extensions for custom formats and operations
+- **Cloud Integration**: Native connectors for cloud storage services (S3, GCS, Azure Blob)
+- **Performance Optimizations**: Continuous improvements to processing speed and memory usage
+
 ## Features
 
-- ✅ Reading and writing data in CSV, JSON, XML, Excel, and Parquet formats
+- ✅ Reading and writing data in CSV, JSON, XML, YAML, Excel, and Parquet formats
 - ✅ Powerful data cleaning functions
 - ✅ High performance with parallel processing support
 - ✅ Both library usage and CLI support
@@ -65,6 +133,7 @@ POST /clean
 - CSV
 - JSON
 - XML
+- YAML
 - Excel (xlsx, xls)
 - Parquet
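
Note on the README's Performance section: it states that work is distributed across CPU cores and that data is processed in chunks, but this commit does not show how. The following is a minimal sketch of that idea only, assuming rows are held as `[][]string`. `trimRowsParallel` is a hypothetical helper written for illustration and is not part of CleanGo's API.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// trimRowsParallel trims whitespace from every cell in place.
// Rows are split into contiguous chunks, one goroutine per chunk.
// Hypothetical helper: CleanGo's real parallel framework is not shown in this commit.
func trimRowsParallel(rows [][]string, workers int) {
	if workers < 1 {
		workers = 1
	}
	// Ceiling division so every row lands in exactly one chunk.
	chunk := (len(rows) + workers - 1) / workers
	if chunk == 0 {
		return // nothing to do for an empty table
	}

	var wg sync.WaitGroup
	for start := 0; start < len(rows); start += chunk {
		end := start + chunk
		if end > len(rows) {
			end = len(rows)
		}
		wg.Add(1)
		// Chunks never overlap, so no locking is needed.
		go func(part [][]string) {
			defer wg.Done()
			for _, row := range part {
				for i, cell := range row {
					row[i] = strings.TrimSpace(cell)
				}
			}
		}(rows[start:end])
	}
	wg.Wait()
}

func main() {
	rows := [][]string{
		{" 1 ", " John Doe "},
		{"2", "Jane Smith  "},
		{" 3", "Bob Johnson"},
	}
	trimRowsParallel(rows, runtime.NumCPU())
	fmt.Printf("%q\n", rows)
}
```

Splitting by contiguous row ranges keeps each goroutine on its own slice of the table, which avoids shared state at the cost of uneven load when rows differ greatly in width.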

examples/README.md

+15
@@ -7,6 +7,7 @@ This directory contains example files demonstrating the usage of the CleanGo lib
 - `sample_data.csv`: Example data in CSV format
 - `sample_data.json`: Example data in JSON format
 - `sample_data.xml`: Example data in XML format
+- `sample_data.yaml`: Example data in YAML format
 - `sample_data.xlsx`: Example data in Excel format
 - `sample_data.parquet`: Example data in Parquet format
 - `api_request.json`: Example JSON request for API
@@ -38,6 +39,12 @@ XML file cleaning:
 cleango clean examples/sample_data.xml --trim --date-format="created_at:2006-01-02" --root-element="root" --item-element="item" --output=cleaned.xml
 ```
 
+YAML file cleaning:
+
+```bash
+cleango clean examples/sample_data.yaml --trim --date-format="created_at:2006-01-02" --output=cleaned.yaml
+```
+
 Excel file cleaning:
 
 ```bash
@@ -55,6 +62,7 @@ Reading CSV file and saving in different formats:
 ```bash
 cleango clean examples/sample_data.csv --trim --format=json --output=cleaned.json
 cleango clean examples/sample_data.csv --trim --format=xml --output=cleaned.xml
+cleango clean examples/sample_data.csv --trim --format=yaml --output=cleaned.yaml
 cleango clean examples/sample_data.csv --trim --format=excel --output=cleaned.xlsx
 cleango clean examples/sample_data.csv --trim --format=parquet --output=cleaned.parquet
 ```
@@ -166,6 +174,12 @@ func main() {
 		log.Fatalf("Error: %v", err)
 	}
 
+	// Read YAML file
+	yamlDf, err := cleaner.ReadYAML("examples/sample_data.yaml")
+	if err != nil {
+		log.Fatalf("Error: %v", err)
+	}
+
 	// Basic cleaning operations
 	df.TrimColumns()
 	df.CleanDates("created_at", "2006-01-02")
@@ -181,6 +195,7 @@ func main() {
 	df.WriteCSV("cleaned.csv")
 	df.WriteJSON("cleaned.json")
 	df.WriteXML("cleaned.xml", formats.WithXMLPretty(true), formats.WithXMLRootElement("users"), formats.WithXMLItemElement("user"))
+	df.WriteYAML("cleaned.yaml", formats.WithYAMLPretty(true))
 	df.WriteExcel("cleaned.xlsx", formats.WithSheetName("Temizlenmiş Veri"))
 	df.WriteParquet("cleaned.parquet", cleaner.WithParquetCompression(parquet.CompressionCodec_SNAPPY))
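
Note on the date arguments above: the value `2006-01-02` passed to `--date-format` and `CleanDates` is a layout in the style of Go's `time` package, where the reference date itself spells out the pattern. How CleanGo detects the input format is not visible in this commit, so the sketch below is an assumption: it tries a caller-supplied list of layouts and rewrites the first match. `normalizeDate` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// normalizeDate tries each input layout in order and re-formats the first
// successful parse with the target layout. When nothing matches it returns
// the value unchanged together with false.
// Hypothetical helper: not CleanGo's actual CleanDates implementation.
func normalizeDate(value string, inputLayouts []string, target string) (string, bool) {
	for _, layout := range inputLayouts {
		if t, err := time.Parse(layout, value); err == nil {
			return t.Format(target), true
		}
	}
	return value, false
}

func main() {
	// Layouts use Go's reference time: Mon Jan 2 15:04:05 MST 2006.
	layouts := []string{"2006-01-02", "02/01/2006", "Jan 2, 2006"}
	for _, v := range []string{"2023-01-15", "15/01/2023", "Jan 15, 2023", "not a date"} {
		out, ok := normalizeDate(v, layouts, "2006-01-02")
		fmt.Println(out, ok)
	}
}
```

The order of the layouts matters for ambiguous inputs such as `01/02/2023`, so a real pipeline would normally fix one input layout per column.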

examples/sample_data.yaml

+79
@@ -0,0 +1,79 @@
+- id: 1
+  name: John Doe
+
+  age: 30
+  created_at: 2023-01-15
+  active: true
+  score: 85.5
+
+- id: 2
+  name: Jane Smith
+
+  age: 25
+  created_at: 2023-02-20
+  active: true
+  score: 92.0
+
+- id: 3
+  name: Bob Johnson
+
+  age: 40
+  created_at: 2023-03-10
+  active: false
+  score: 78.3
+
+- id: 4
+  name: Alice Brown
+
+  age: 35
+  created_at: 2023-04-05
+  active: true
+  score: 88.7
+
+- id: 5
+  name: Charlie Wilson
+
+  age: 28
+  created_at: 2023-05-12
+  active: false
+  score: 76.2
+
+- id: 6
+  name: Eva Davis
+
+  age: 32
+  created_at: 2023-06-18
+  active: true
+  score: 90.1
+
+- id: 7
+  name: Frank Miller
+
+  age: 45
+  created_at: 2023-07-22
+  active: true
+  score: 82.9
+
+- id: 8
+  name: Grace Taylor
+
+  age: 27
+  created_at: 2023-08-30
+  active: false
+  score: 79.5
+
+- id: 9
+  name: Henry Clark
+
+  age: 38
+  created_at: 2023-09-14
+  active: true
+  score: 87.3
+
+- id: 10
+  name: Ivy Robinson
+
+  age: 29
+  created_at: 2023-10-25
+  active: true
+  score: 91.8
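
Note on the sample file's shape: it is a YAML sequence of flat mappings, one mapping per record. The `pkg/formats` code that turns this into headers and rows is not shown on this page, so the following is only a sketch of that flattening step, assuming the YAML has already been decoded into `[]map[string]any`. `recordsToTable` is a hypothetical name, and sorting the headers is a choice made here for deterministic output.

```go
package main

import (
	"fmt"
	"sort"
)

// recordsToTable flattens decoded records (one map per list item) into a
// header slice and string rows. Headers are the sorted union of all keys;
// keys missing from a record become empty cells.
// Hypothetical helper: not the ReadYAMLToRaw implementation from this commit.
func recordsToTable(records []map[string]any) ([]string, [][]string) {
	seen := map[string]bool{}
	var headers []string
	for _, rec := range records {
		for key := range rec {
			if !seen[key] {
				seen[key] = true
				headers = append(headers, key)
			}
		}
	}
	// Map iteration order is random, so sort for a stable column order.
	sort.Strings(headers)

	rows := make([][]string, 0, len(records))
	for _, rec := range records {
		row := make([]string, len(headers))
		for i, h := range headers {
			if v, ok := rec[h]; ok && v != nil {
				row[i] = fmt.Sprint(v)
			}
		}
		rows = append(rows, row)
	}
	return headers, rows
}

func main() {
	records := []map[string]any{
		{"id": 1, "name": "John Doe", "active": true, "score": 85.5},
		{"id": 2, "name": "Jane Smith", "active": true},
	}
	headers, rows := recordsToTable(records)
	fmt.Println(headers)
	for _, row := range rows {
		fmt.Printf("%q\n", row)
	}
}
```

A reader that must preserve the key order of the source file would need to walk the YAML node tree instead of decoding into maps.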

go.mod

+1
@@ -23,4 +23,5 @@ require (
 	golang.org/x/net v0.30.0 // indirect
 	golang.org/x/text v0.19.0 // indirect
 	golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
+	gopkg.in/yaml.v3 v3.0.1 // indirect
 )

pkg/cleaner/yaml.go

+20
@@ -0,0 +1,20 @@
+package cleaner
+
+import (
+	"github.com/mstgnz/cleango/pkg/formats"
+)
+
+// ReadYAML reads a YAML file and returns a DataFrame
+func ReadYAML(filePath string, options ...formats.YAMLOption) (*DataFrame, error) {
+	headers, data, err := formats.ReadYAMLToRaw(filePath, options...)
+	if err != nil {
+		return nil, err
+	}
+
+	return NewDataFrame(headers, data)
+}
+
+// WriteYAML writes a DataFrame to a YAML file
+func (df *DataFrame) WriteYAML(filePath string, options ...formats.YAMLOption) error {
+	return formats.WriteYAML(df, filePath, options...)
+}
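
Note on the `options ...formats.YAMLOption` parameters above: the signature suggests the functional options pattern, but the definition of `YAMLOption` lives in `pkg/formats` and is not shown on this page. The sketch below illustrates the pattern in general terms. Every name in it (`yamlConfig`, `Option`, `WithPretty`, `WithIndent`, `newConfig`) and the default values are assumptions, not CleanGo's actual types.

```go
package main

import "fmt"

// yamlConfig collects settings applied by options.
// Field names and defaults are hypothetical.
type yamlConfig struct {
	pretty bool
	indent int
}

// Option mutates a yamlConfig. A variadic parameter of this type lets
// callers pass zero or more settings without changing the signature.
type Option func(*yamlConfig)

// WithPretty toggles human-friendly output.
func WithPretty(pretty bool) Option {
	return func(c *yamlConfig) { c.pretty = pretty }
}

// WithIndent sets the indentation width; non-positive values are ignored.
func WithIndent(n int) Option {
	return func(c *yamlConfig) {
		if n > 0 {
			c.indent = n
		}
	}
}

// newConfig starts from defaults and applies options in the order given.
func newConfig(opts ...Option) yamlConfig {
	cfg := yamlConfig{pretty: false, indent: 2}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", newConfig())
	fmt.Printf("%+v\n", newConfig(WithPretty(true), WithIndent(4)))
}
```

The benefit for a wrapper like `ReadYAML` is that it can forward `options...` untouched, so new settings can be added in one package without touching the other.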

pkg/cleaner/yaml_test.go

+93
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,93 @@
1+
package cleaner
2+
3+
import (
4+
"os"
5+
"testing"
6+
7+
"github.com/mstgnz/cleango/pkg/formats"
8+
)
9+
10+
func TestReadWriteYAML(t *testing.T) {
11+
// Create a temporary YAML file
12+
tempFile, err := os.CreateTemp("", "test_*.yaml")
13+
if err != nil {
14+
t.Fatalf("Failed to create temp file: %v", err)
15+
}
16+
tempFileName := tempFile.Name()
17+
tempFile.Close()
18+
defer os.Remove(tempFileName)
19+
20+
// Create test data
21+
yamlContent := `- id: 1
22+
name: John Doe
23+
24+
age: 30
25+
created_at: 2023-01-15
26+
active: true
27+
score: 85.5
28+
- id: 2
29+
name: Jane Smith
30+
31+
age: 25
32+
created_at: 2023-02-20
33+
active: true
34+
score: 92.0
35+
- id: 3
36+
name: Bob Johnson
37+
38+
age: 40
39+
created_at: 2023-03-10
40+
active: false
41+
score: 78.3`
42+
43+
// Write test data to file
44+
err = os.WriteFile(tempFileName, []byte(yamlContent), 0644)
45+
if err != nil {
46+
t.Fatalf("Failed to write test data: %v", err)
47+
}
48+
49+
// Read YAML file
50+
df, err := ReadYAML(tempFileName)
51+
if err != nil {
52+
t.Fatalf("Failed to read YAML: %v", err)
53+
}
54+
55+
// Check data
56+
rows, cols := df.Shape()
57+
if rows != 3 {
58+
t.Errorf("Expected 3 rows, got %d", rows)
59+
}
60+
if cols < 7 {
61+
t.Errorf("Expected at least 7 columns, got %d", cols)
62+
}
63+
64+
// Create a temporary output file
65+
outFile, err := os.CreateTemp("", "test_out_*.yaml")
66+
if err != nil {
67+
t.Fatalf("Failed to create output file: %v", err)
68+
}
69+
outFileName := outFile.Name()
70+
outFile.Close()
71+
defer os.Remove(outFileName)
72+
73+
// Write DataFrame to YAML
74+
err = df.WriteYAML(outFileName, formats.WithYAMLPretty(true))
75+
if err != nil {
76+
t.Fatalf("Failed to write YAML: %v", err)
77+
}
78+
79+
// Read the written file
80+
df2, err := ReadYAML(outFileName)
81+
if err != nil {
82+
t.Fatalf("Failed to read written YAML: %v", err)
83+
}
84+
85+
// Check data
86+
rows2, cols2 := df2.Shape()
87+
if rows2 != rows {
88+
t.Errorf("Row count mismatch: got %d, want %d", rows2, rows)
89+
}
90+
if cols2 != cols {
91+
t.Errorf("Column count mismatch: got %d, want %d", cols2, cols)
92+
}
93+
}
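
Note on the round-trip check above: the test compares only the shape of the two DataFrames, so a write path that scrambled cell values would still pass. One possible tightening is a cell-by-cell comparison. The sketch below assumes both tables can be obtained as a header slice plus `[][]string` rows. The DataFrame accessors for that are not shown in this commit, and `diffTables` is a hypothetical helper.

```go
package main

import "fmt"

// diffTables describes the first difference between two header/row tables,
// or returns "" when they match exactly. The first table is the expected one.
// Hypothetical helper: it does not rely on any CleanGo API.
func diffTables(wantHeaders []string, wantRows [][]string, gotHeaders []string, gotRows [][]string) string {
	if len(wantHeaders) != len(gotHeaders) {
		return fmt.Sprintf("column count: got %d, want %d", len(gotHeaders), len(wantHeaders))
	}
	for i := range wantHeaders {
		if wantHeaders[i] != gotHeaders[i] {
			return fmt.Sprintf("header %d: got %q, want %q", i, gotHeaders[i], wantHeaders[i])
		}
	}
	if len(wantRows) != len(gotRows) {
		return fmt.Sprintf("row count: got %d, want %d", len(gotRows), len(wantRows))
	}
	for i := range wantRows {
		if len(wantRows[i]) != len(gotRows[i]) {
			return fmt.Sprintf("row %d width: got %d, want %d", i, len(gotRows[i]), len(wantRows[i]))
		}
		for j := range wantRows[i] {
			if wantRows[i][j] != gotRows[i][j] {
				return fmt.Sprintf("row %d, column %d: got %q, want %q", i, j, gotRows[i][j], wantRows[i][j])
			}
		}
	}
	return ""
}

func main() {
	headers := []string{"id", "name"}
	before := [][]string{{"1", "John Doe"}, {"2", "Jane Smith"}}
	after := [][]string{{"1", "John Doe"}, {"2", "Jane Smyth"}}
	fmt.Printf("%q\n", diffTables(headers, before, headers, before))
	fmt.Printf("%q\n", diffTables(headers, before, headers, after))
}
```

In a test, a non-empty result would be reported with `t.Errorf`. Note that a strict comparison like this also fails when the writer reorders columns, which may or may not be desired.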
