columnify

Convert record-oriented data to columnar format.

Synopsis

Columnar data is efficient for analytics queries, lightweight, and easy to integrate with data warehouse middleware. Converting record-oriented data to a columnar format is usually done with a big data stack such as the Hadoop ecosystem, and there is no easy way to do it quickly with lightweight tooling.

columnify is a simple conversion tool that produces columnar data and runs as a single binary written in Go. It supports several input formats, such as JSONL (newline-delimited JSON) and Avro.

How to use

Installation

$ GO111MODULE=off go get github.com/reproio/columnify/cmd/columnify

Usage

$ ./columnify -h
Usage of columnify: columnify [-flags] [input files]
  -output string
        path to output file; default: stdout
  -recordType string
        data type, [avro|csv|jsonl|ltsv|msgpack|tsv] (default "jsonl")
  -schemaFile string
        path to schema file
  -schemaType string
        schema type, [avro|bigquery]

Example

$ cat examples/record/primitives.jsonl
{"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}
{"boolean": true, "int": 2, "long": 2, "float": 2.2, "double": 2.2, "bytes": "bar", "string": "bar"}

$ ./columnify -schemaType avro -schemaFile examples/primitives.avsc -recordType jsonl examples/record/primitives.jsonl > out.parquet

$ parquet-tools schema out.parquet
message Primitives {
  required boolean boolean;
  required int32 int;
  required int64 long;
  required float float;
  required double double;
  required binary bytes;
  required binary string (UTF8);
}

$ parquet-tools cat -json out.parquet
{"boolean":false,"int":1,"long":1,"float":1.1,"double":1.1,"bytes":"Zm9v","string":"foo"}
{"boolean":true,"int":2,"long":2,"float":2.2,"double":2.2,"bytes":"YmFy","string":"bar"}
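In the output above, the bytes column shows Zm9v and YmFy instead of foo and bar because the JSON printer renders binary values as base64 text. The following is a minimal sketch that reproduces those two strings with Go's standard library; the helper name encodeBinary is hypothetical and is not part of columnify.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeBinary is a hypothetical helper: it renders raw bytes as standard
// base64 text, which is how the "bytes" column appears in the JSON output.
func encodeBinary(b []byte) string {
	return base64.StdEncoding.EncodeToString(b)
}

func main() {
	for _, s := range []string{"foo", "bar"} {
		// Prints "foo -> Zm9v" and "bar -> YmFy".
		fmt.Printf("%s -> %s\n", s, encodeBinary([]byte(s)))
	}
}
```

Decoding with base64.StdEncoding.DecodeString recovers the original bytes, so no data is lost in this representation.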

Supported formats

Input

Apache Avro
CSV
JSONL (newline-delimited JSON)
LTSV
MessagePack
TSV

Output

Apache Parquet

Schema

Apache Avro
BigQuery

Integration example

fluent-plugin-s3 parquet compressor
- An example is in examples/fluent-plugin-s3.
- It works as a compressor for fluent-plugin-s3 that writes chunk data to a temporary Parquet file.

Limitations

Currently there are some limitations that depend on the schema and record types.

Some logical types, such as Decimal, are unsupported.
With -recordType avro, nested records that have only one subfield are not supported.
With -recordType avro, bytes fields are implicitly converted to base64-encoded values.
The range of supported values depends on the record type; for example, with jsonl, large values might not be handled correctly.
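One plausible reason for the jsonl limitation is how JSON numbers are decoded in Go. This is an assumption for illustration, not a statement about columnify's internals: by default, encoding/json decodes numbers into float64, which cannot represent every integer above 2^53. The sketch below compares the default decoding with json.Number; the function names decodeDefault and decodeExact are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// decodeDefault decodes the "long" field the default way. encoding/json
// stores JSON numbers as float64 when the target is interface{}.
func decodeDefault(line string) (float64, error) {
	var rec map[string]interface{}
	if err := json.Unmarshal([]byte(line), &rec); err != nil {
		return 0, err
	}
	f, ok := rec["long"].(float64)
	if !ok {
		return 0, fmt.Errorf("field long is not a number")
	}
	return f, nil
}

// decodeExact keeps the number as its original text via json.Number and
// parses it as int64, so no precision is lost for values that fit in int64.
func decodeExact(line string) (int64, error) {
	dec := json.NewDecoder(strings.NewReader(line))
	dec.UseNumber()
	var rec map[string]interface{}
	if err := dec.Decode(&rec); err != nil {
		return 0, err
	}
	n, ok := rec["long"].(json.Number)
	if !ok {
		return 0, fmt.Errorf("field long is not a number")
	}
	return n.Int64()
}

func main() {
	line := `{"long": 9007199254740993}` // 2^53 + 1

	f, err := decodeDefault(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	n, err := decodeExact(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("float64: %.0f\n", f) // float64: 9007199254740992
	fmt.Printf("int64:   %d\n", n)   // int64:   9007199254740993
}
```

If exact 64-bit integers matter for your data, a binary input format such as Avro avoids this text-to-number round trip.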

Development

Columnifier reads the input file(s), converts their format based on the given parameters, and finally writes the output. Format conversion is split into schema conversion and record conversion. Schema conversion takes the input schema and converts it to the target schema via Arrow's schema. Likewise, record conversion uses Arrow's Record as the intermediate data representation. columnify mostly depends on existing modules, but it also contains additional modules such as arrow, avro, and parquet to fill in missing features.
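The two-step schema conversion described above (input schema to an intermediate schema, then intermediate schema to the target) can be sketched roughly as follows. This is a toy illustration and not columnify's actual code: the types IntermediateType and Field and the functions fromAvro and toParquet are hypothetical stand-ins for Arrow's schema types, and the type mapping is taken only from the parquet-tools output in the Example section.

```go
package main

import (
	"fmt"
	"strings"
)

// IntermediateType is a hypothetical stand-in for an Arrow data type.
type IntermediateType int

const (
	Boolean IntermediateType = iota
	Int32
	Int64
	Float32
	Float64
	Binary
	String
)

// Field is one named column in the hypothetical intermediate schema.
type Field struct {
	Name string
	Type IntermediateType
}

// avroToIntermediate maps Avro primitive type names to intermediate types.
var avroToIntermediate = map[string]IntermediateType{
	"boolean": Boolean,
	"int":     Int32,
	"long":    Int64,
	"float":   Float32,
	"double":  Float64,
	"bytes":   Binary,
	"string":  String,
}

// intermediateToParquet maps intermediate types to Parquet physical types,
// following the parquet-tools schema output in the Example section.
var intermediateToParquet = map[IntermediateType]string{
	Boolean: "boolean",
	Int32:   "int32",
	Int64:   "int64",
	Float32: "float",
	Float64: "double",
	Binary:  "binary",
	String:  "binary",
}

// fromAvro converts ordered (name, avroType) pairs to the intermediate schema.
func fromAvro(fields [][2]string) ([]Field, error) {
	out := make([]Field, 0, len(fields))
	for _, f := range fields {
		t, ok := avroToIntermediate[f[1]]
		if !ok {
			return nil, fmt.Errorf("unsupported avro type %q for field %q", f[1], f[0])
		}
		out = append(out, Field{Name: f[0], Type: t})
	}
	return out, nil
}

// toParquet renders the intermediate schema as a Parquet message definition.
func toParquet(name string, fields []Field) string {
	var b strings.Builder
	fmt.Fprintf(&b, "message %s {\n", name)
	for _, f := range fields {
		suffix := ""
		if f.Type == String {
			suffix = " (UTF8)" // strings are binary with a UTF8 annotation
		}
		fmt.Fprintf(&b, "  required %s %s%s;\n", intermediateToParquet[f.Type], f.Name, suffix)
	}
	b.WriteString("}\n")
	return b.String()
}

func main() {
	fields, err := fromAvro([][2]string{{"boolean", "boolean"}, {"long", "long"}, {"string", "string"}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(toParquet("Primitives", fields))
}
```

Routing every conversion through one intermediate representation means each new input or output format needs only one mapping to and from that representation, rather than one per format pair.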

Release

goreleaser is integrated into GitHub Actions and is triggered when a new tag is created. Create a new release with a semver tag (vx.y.z) on this GitHub repo to get archives for several environments attached to the release.
