
WIP: Data Compression Research #7

imotai opened this issue May 15, 2023 · 0 comments

imotai commented May 15, 2023

Background

The db3 network uses data-rollup technology to reduce Arweave costs by compressing structured data, and we expect a roughly 10x storage cost reduction.

Experiment

Best Case

We store data with the following schema, which comes from https://arweave.app/tx/rtstthXo8T8wG1odJPAto9vfMCYUqr6Grp6j8KfVtuM:

| title | type | example |
| --- | --- | --- |
| profileId | string | 0x016ac4 |
| contentURI | string | https://data.lens.phaver.com/api/lens/posts/c999701b-439a-4e76-8d86-14e648125490 |
| collectModule | string | 0xa31FF85E840ED117E172BC9Ad89E55128A999205 |
| referenceModule | string | 0xa31FF85E840ED117E172BC9Ad89E55128A999205 |
| collectModuleInitData | string | 0x |
| referenceModuleInitData | string | 0x0000000000000000000000000000000000000000 |
| nonce | int | 0 |
| deadline | int | 1684133321 |
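
For the later steps it helps to have the schema written down explicitly. Below is a minimal pyarrow sketch of it, with field names and types taken from the table above (the variable name lens_post_schema is just illustrative):

import pyarrow as pa

# explicit pyarrow schema for the table above; addresses and init data are
# hex strings, nonce and deadline are integers (deadline is a unix timestamp)
lens_post_schema = pa.schema([
    ("profileId", pa.string()),
    ("contentURI", pa.string()),
    ("collectModule", pa.string()),
    ("referenceModule", pa.string()),
    ("collectModuleInitData", pa.string()),
    ("referenceModuleInitData", pa.string()),
    ("nonce", pa.int64()),
    ("deadline", pa.int64()),
])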

Generate CSV Data

import calendar
import time

# every generated row reuses the current GMT time as its deadline value
current_GMT = time.gmtime()
time_stamp = calendar.timegm(current_GMT)

def generate():
    # write 20,000,000 synthetic rows that follow the lens_post schema above
    with open("lens_post.csv", "w+") as fd:
        for i in range(0, 20000000):
            rows = (i, str(i), i, i, i, i, str(i), str(time_stamp))
            fd.write('0x%0.2X,https://data.lens.phaver.com/api/lens/posts/%s,0xa31FF85E840ED117E172BC9Ad89E55128A999205%0.2X,0xa31FF85E840ED117E172BC9Ad89E55128A999205%0.2X,0xa31FF85E840ED117E172BC9Ad89E55128A999205%0.2X,0xa31FF85E840ED117E172BC9Ad89E55128A999205%0.2X,%s,%s\n' % rows)

if __name__ == "__main__":
    generate()
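
Each generated row is roughly 260 bytes, so 20,000,000 rows come out to about 5.2 GB, which matches the report below. A minimal sanity check on the output (assuming the file was generated as above):

# the schema has 8 fields, so every row should split into 8 comma-separated values
with open("lens_post.csv") as fd:
    first_row = fd.readline().strip()
    assert len(first_row.split(",")) == 8, first_row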

Compress the Data

from pyarrow import csv, parquet
from datetime import datetime


def file_to_data_frame_to_parquet(local_file: str, parquet_file: str) -> None:
    # the generated CSV has no header row, so let pyarrow autogenerate column names
    read_options = csv.ReadOptions(autogenerate_column_names=True)
    table = csv.read_csv(local_file, read_options=read_options)
    # write gzip-compressed parquet: much smaller output at the cost of write time
    parquet.write_table(table, parquet_file, compression="gzip")

if __name__ == "__main__":
    local_csv_file = "lens_post.csv"
    t1 = datetime.now()
    file_to_data_frame_to_parquet(local_csv_file, "lens_post.gz.parquet")
    t2 = datetime.now()
    took = t2 - t1
    print(f"it took {took} seconds to write csv to parquet.")
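
Gzip is just one codec pyarrow supports. A quick sketch like the following (not part of the original experiment) could compare output size and write time across the built-in codecs to see whether another one gives a better trade-off for this dataset:

import os
from datetime import datetime

from pyarrow import csv, parquet

# load the CSV once, then measure parquet size and write time per codec
read_options = csv.ReadOptions(autogenerate_column_names=True)
table = csv.read_csv("lens_post.csv", read_options=read_options)
for codec in ("gzip", "zstd", "snappy", "brotli"):
    out_file = f"lens_post.{codec}.parquet"
    t1 = datetime.now()
    parquet.write_table(table, out_file, compression=codec)
    took = datetime.now() - t1
    size_mib = os.path.getsize(out_file) / 2**20
    print(f"{codec}: {size_mib:.0f} MiB in {took}")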

Report

| file | rows | file size |
| --- | --- | --- |
| lens_post.csv | 20M | 5.2G |
| lens_post.gz.parquet | 20M | 320M |

Compressing lens_post.csv to lens_post.gz.parquet on a 4C8G (4 vCPU, 8 GB RAM) machine:

it took 0:00:46.190250 seconds to write csv to parquet.
| compression | storage | storage cost | computing cost | total |
| --- | --- | --- | --- | --- |
| N | 5.2GB | 5.2 * $5.1 ≈ $27 | 0 | $27 |
| Y | 320M | 0.32 * $5.1 ≈ $1.63 | $0.98 / 60 ≈ $0.016 | ≈ $1.65 |

The price of a 4C8G instance on AWS is $0.98/h.
The storage cost on Arweave is $5.1/GB.
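
The totals above follow directly from those two prices. A small worked example, using the quoted prices as assumptions:

# quoted assumptions from the report above
ARWEAVE_COST_PER_GB = 5.1     # $/GB stored on Arweave
COMPUTE_COST_PER_HOUR = 0.98  # $/h for a 4C8G AWS instance

# without compression we only pay for raw storage
raw_total = 5.2 * ARWEAVE_COST_PER_GB
# with compression we pay to store 0.32 GB plus ~1 minute of compute
compressed_total = 0.32 * ARWEAVE_COST_PER_GB + COMPUTE_COST_PER_HOUR / 60

print(f"no compression:   ${raw_total:.2f}")         # $26.52, ~ $27
print(f"with compression: ${compressed_total:.3f}")  # $1.648, ~ $1.65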

