#45 Implement arrow -> json writer #47
base: master
Conversation
Oh, I missed the existing one. It's actually an internal package, but it's reusable for this use case, I guess.
- (a breaking change) Use signed int in Arrow intermediates
- Support some logical types
- Fix some test cases
I finally examined a mem pprof result. It shows lower usage than the current version's ( #44 (comment) ). The reduction effect is
Here's a quick performance test. I gave it the dummy Avro file below.
With the current version(
With the latest (this pull request) version:
The elapsed time increased by 1.5x ... The latest version's pipeline contains an additional (inputs) -> map -> arrow conversion, so that's not so strange, and we could possibly reduce the time if we remove the redundant (inputs) -> map conversion layer.
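The 1.5x elapsed-time comparison can be reproduced with the stdlib testing.Benchmark helper. A sketch with a stand-in convert function (in the real measurement this would call columnify's conversion pipeline):

```go
package main

import (
	"fmt"
	"testing"
)

// convert is a stand-in for the avro -> map -> arrow -> parquet pipeline.
func convert(records []int) int {
	sum := 0
	for _, r := range records {
		sum += r
	}
	return sum
}

func main() {
	records := make([]int, 1000)

	// testing.Benchmark runs the function enough times to get a stable
	// ns/op figure, the same number reported by `go test -bench`.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			convert(records)
		}
	})
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```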
Finally supported! If we have this schema:
And these record values (they partially match the schema's field names and types, but some values are null, which is not allowed by the schema):
Then columnify failed on the schema mismatch with this error message:
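As an illustration (a made-up minimal schema, not the one from this thread), a field declared as plain `"string"` rejects a null value:

```
{
  "name": "Example",
  "type": "record",
  "fields": [
    {"name": "id", "type": "string"}
  ]
}
```

A record like `{"id": null}` fails against this schema because `"string"` does not admit null; the field would need the union type `["null", "string"]` to accept it.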
btw the latest release version
I will profile CPU usage next.
Codecov Report
@@ Coverage Diff @@
## master #47 +/- ##
===========================================
- Coverage 70.05% 58.36% -11.70%
===========================================
Files 19 18 -1
Lines 875 1237 +362
===========================================
+ Hits 613 722 +109
- Misses 203 462 +259
+ Partials 59 53 -6
I added benchmarks and profiling to the CI job. The CPU profiling was here:
And it has some high cum% consumers:
It seems that we don't have many tuning opportunities on our side now. So what we can do next, I guess, is some parts related to dependencies, mainly parquet-go.
I cannot convert msgpack to parquet using columnify with this PR, so I haven't measured memory usage.
But v0.0.3 can convert it. I used the following schema and data:
{
"name": "RailsAccessLog",
"type": "record",
"fields": [
{
"name": "container_id",
"type": "string"
},
{
"name": "container_name",
"type": "string"
},
{
"name": "source",
"type": "string"
},
{
"name": "log",
"type": "string"
},
{
"name": "__fluentd_address__",
"type": "string"
},
{
"name": "__fluentd_host__",
"type": "string"
},
{
"name": "action",
"type": ["null", "string"]
},
{
"name": "controller",
"type": ["null", "string"]
},
{
"name": "role",
"type": "string"
},
{
"name": "host",
"type": "string"
},
{
"name": "location",
"type": ["null", "string"]
},
{
"name": "severity",
"type": ["null", "string"],
"default": "INFO"
},
{
"name": "status",
"type": "int"
},
{
"name": "db",
"type": ["null", "float"]
},
{
"name": "view",
"type": ["null", "float"]
},
{
"name": "duration",
"type": ["null", "float"]
},
{
"name": "method",
"type": "string"
},
{
"name": "path",
"type": "string"
},
{
"name": "format",
"type": ["null", "string"]
},
{
"name": "error",
"type": ["null", "string"]
},
{
"name": "remote_ip",
"type": ["null", "string"]
},
{
"name": "agent",
"type": ["null", "string"]
},
{
"name": "authenticated_user_id",
"type": ["null", "string"]
},
{
"name": "params",
"type": ["null", "string"]
},
{
"name": "tag",
"type": "string"
},
{
"name": "time",
"type": "string"
}
]
}
I can convert msgpack to parquet after I replaced from
I could reproduce that. Actually, RSS is still very high (but I found that the memprofile result is not so terrible, which is curious). Anyway, I would like to find another way to reduce it. Finally supporting streaming conversion ...? That's not an easy path, but it would be more effective.
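A streaming design would process one record at a time instead of materializing the whole input. A rough stdlib-only sketch of the shape (the names here are hypothetical, not columnify's API; real Parquet output would still need to buffer per row group):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// convertRecord is a hypothetical per-record conversion step.
func convertRecord(line string) string {
	return strings.ToUpper(line)
}

func main() {
	// In columnify this would be the input file, not an in-memory string.
	input := strings.NewReader("a\nb\nc\n")

	// Only one record is resident at a time, so RSS stays roughly
	// constant regardless of input size.
	sc := bufio.NewScanner(input)
	for sc.Scan() {
		fmt.Println(convertRecord(sc.Text()))
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```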
For using Arrow instead of naive
#45
TODO