
WIP: Data Compression Research #7

imotai opened this issue May 15, 2023 · 0 comments

imotai commented May 15, 2023

Background

The db3 network uses data-rollup technology to reduce Arweave costs by compressing structured data, and we expect a roughly 10x storage cost reduction.

Experiment

Best Case

We store data with the following schema, which comes from https://arweave.app/tx/rtstthXo8T8wG1odJPAto9vfMCYUqr6Grp6j8KfVtuM:

| title | type | example |
| --- | --- | --- |
| profileId | string | 0x016ac4 |
| contentURI | string | https://data.lens.phaver.com/api/lens/posts/c999701b-439a-4e76-8d86-14e648125490 |
| collectModule | string | 0xa31FF85E840ED117E172BC9Ad89E55128A999205 |
| referenceModule | string | 0xa31FF85E840ED117E172BC9Ad89E55128A999205 |
| collectModuleInitData | string | 0x |
| referenceModuleInitData | string | 0x0000000000000000000000000000000000000000 |
| nonce | int | 0 |
| deadline | int | 1684133321 |
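
For the later steps it helps to have the schema written down explicitly. Below is a minimal pyarrow sketch of it, with field names and types taken from the table above (the variable name lens_post_schema is just illustrative):

import pyarrow as pa

# explicit pyarrow schema for the table above; addresses and init data are
# hex strings, nonce and deadline are integers (deadline is a unix timestamp)
lens_post_schema = pa.schema([
    ("profileId", pa.string()),
    ("contentURI", pa.string()),
    ("collectModule", pa.string()),
    ("referenceModule", pa.string()),
    ("collectModuleInitData", pa.string()),
    ("referenceModuleInitData", pa.string()),
    ("nonce", pa.int64()),
    ("deadline", pa.int64()),
])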

Generate CSV Data

import calendar
import time

# every generated row reuses the current GMT time as its deadline value
current_GMT = time.gmtime()
time_stamp = calendar.timegm(current_GMT)

def generate():
    # write 20,000,000 synthetic rows that follow the lens_post schema above
    with open("lens_post.csv", "w+") as fd:
        for i in range(0, 20000000):
            rows = (i, str(i), i, i, i, i, str(i), str(time_stamp))
            fd.write('0x%0.2X,https://data.lens.phaver.com/api/lens/posts/%s,0xa31FF85E840ED117E172BC9Ad89E55128A999205%0.2X,0xa31FF85E840ED117E172BC9Ad89E55128A999205%0.2X,0xa31FF85E840ED117E172BC9Ad89E55128A999205%0.2X,0xa31FF85E840ED117E172BC9Ad89E55128A999205%0.2X,%s,%s\n' % rows)

if __name__ == "__main__":
    generate()
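
Each generated row is roughly 260 bytes, so 20,000,000 rows come out to about 5.2 GB, which matches the report below. A minimal sanity check on the output (assuming the file was generated as above):

# the schema has 8 fields, so every row should split into 8 comma-separated values
with open("lens_post.csv") as fd:
    first_row = fd.readline().strip()
    assert len(first_row.split(",")) == 8, first_row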

Compress the Data

from pyarrow import csv, parquet
from datetime import datetime


def file_to_data_frame_to_parquet(local_file: str, parquet_file: str) -> None:
    # the generated CSV has no header row, so let pyarrow autogenerate column names
    read_options = csv.ReadOptions(autogenerate_column_names=True)
    table = csv.read_csv(local_file, read_options=read_options)
    # write gzip-compressed parquet: much smaller output at the cost of write time
    parquet.write_table(table, parquet_file, compression="gzip")

if __name__ == "__main__":
    local_csv_file = "lens_post.csv"
    t1 = datetime.now()
    file_to_data_frame_to_parquet(local_csv_file, "lens_post.gz.parquet")
    t2 = datetime.now()
    took = t2 - t1
    print(f"it took {took} seconds to write csv to parquet.")
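
Gzip is just one codec pyarrow supports. A quick sketch like the following (not part of the original experiment) could compare output size and write time across the built-in codecs to see whether another one gives a better trade-off for this dataset:

import os
from datetime import datetime

from pyarrow import csv, parquet

# load the CSV once, then measure parquet size and write time per codec
read_options = csv.ReadOptions(autogenerate_column_names=True)
table = csv.read_csv("lens_post.csv", read_options=read_options)
for codec in ("gzip", "zstd", "snappy", "brotli"):
    out_file = f"lens_post.{codec}.parquet"
    t1 = datetime.now()
    parquet.write_table(table, out_file, compression=codec)
    took = datetime.now() - t1
    size_mib = os.path.getsize(out_file) / 2**20
    print(f"{codec}: {size_mib:.0f} MiB in {took}")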

Report

| file | rows | file size |
| --- | --- | --- |
| lens_post.csv | 20M | 5.2G |
| lens_post.gz.parquet | 20M | 320M |

Compressing lens_post.csv to lens_post.gz.parquet on a 4C8G (4 vCPU, 8 GB RAM) machine:

it took 0:00:46.190250 seconds to write csv to parquet.
| compression | storage | storage cost | computing cost | total |
| --- | --- | --- | --- | --- |
| N | 5.2GB | 5.2 * $5.1 ≈ $27 | 0 | $27 |
| Y | 320M | 0.32 * $5.1 ≈ $1.63 | $0.98 / 60 ≈ $0.016 | ≈ $1.65 |

The price of a 4C8G instance on AWS is $0.98/h.
The storage cost on Arweave is $5.1/GB.
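
The totals above follow directly from those two prices. A small worked example, using the quoted prices as assumptions:

# quoted assumptions from the report above
ARWEAVE_COST_PER_GB = 5.1     # $/GB stored on Arweave
COMPUTE_COST_PER_HOUR = 0.98  # $/h for a 4C8G AWS instance

# without compression we only pay for raw storage
raw_total = 5.2 * ARWEAVE_COST_PER_GB
# with compression we pay to store 0.32 GB plus ~1 minute of compute
compressed_total = 0.32 * ARWEAVE_COST_PER_GB + COMPUTE_COST_PER_HOUR / 60

print(f"no compression:   ${raw_total:.2f}")         # $26.52, ~ $27
print(f"with compression: ${compressed_total:.3f}")  # $1.648, ~ $1.65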

