feat: add ChunkedCompressor which compresses chunk n+1 like chunk n #996

Merged
merged 36 commits into from
Oct 10, 2024

@danking danking commented Oct 8, 2024

The primary idea is that chunk n+1 more often than not has a distribution of values similar to chunk n. We ought to reuse chunk n's compression scheme if the ratio is "good" before attempting a full sampling pass. This has the potential to both increase throughput and also permit us to invest in a more extensive search on the first chunk.
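
A minimal sketch of the reuse idea, not the PR's actual implementation: `Scheme`, `encoded_size`, `full_search`, `plan_chunks`, and the `good_ratio` threshold are hypothetical names, and the two-scheme cost model is a toy stand-in for the real sampling pass. Ratios here are compressed size over raw size, so lower is better.

```rust
/// Hypothetical encodings; a real compressor chooses among many more.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Scheme {
    Raw,
    RunLength,
}

/// Toy cost model: bytes that `scheme` would need for `chunk`.
fn encoded_size(scheme: Scheme, chunk: &[u64]) -> usize {
    match scheme {
        Scheme::Raw => chunk.len() * 8,
        Scheme::RunLength => {
            // One 16-byte (value, run length) pair per run of equal values.
            let mut runs = 0;
            let mut prev = None;
            for &v in chunk {
                if prev != Some(v) {
                    runs += 1;
                    prev = Some(v);
                }
            }
            runs * 16
        }
    }
}

/// Stand-in for the expensive sampling pass: tries every scheme, keeps the smallest.
fn full_search(chunk: &[u64]) -> Scheme {
    [Scheme::Raw, Scheme::RunLength]
        .iter()
        .copied()
        .min_by_key(|&s| encoded_size(s, chunk))
        .unwrap()
}

/// Picks a scheme per chunk and counts how many full searches were needed.
fn plan_chunks(chunks: &[Vec<u64>], good_ratio: f64) -> (Vec<Scheme>, usize) {
    let mut previous: Option<Scheme> = None;
    let mut searches = 0;
    let mut plan = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        let raw = encoded_size(Scheme::Raw, chunk) as f64;
        // Cheap path: keep chunk n's scheme if it still compresses chunk n+1 well.
        // An empty chunk yields NaN here, which fails the comparison and falls through.
        let reused = previous.filter(|&s| encoded_size(s, chunk) as f64 / raw <= good_ratio);
        let scheme = match reused {
            Some(s) => s,
            None => {
                // Expensive path: first chunk, or the previous scheme no longer fits.
                searches += 1;
                full_search(chunk)
            }
        };
        previous = Some(scheme);
        plan.push(scheme);
    }
    (plan, searches)
}
```

With the chunks `vec![7; 100]`, `vec![7; 100]`, and `(0..100).collect()` and `good_ratio = 0.5`, the full search runs only for the first and third chunks; the second chunk reuses the run-length scheme directly.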

This PR introduces ChunkedCompressor and StructCompressor. With both in place, a compression tree now fully represents an array. For example, given a Chunked(Struct(foo=Chunked(U64), ...)), the inner ChunkedCompressor attempts to compress all the U64 chunks the same way, then passes the ratio and encoding tree of the last chunk up to the StructCompressor. The outer ChunkedCompressor can then try to reuse, on the second outer chunk, the encodings of every field of the first outer chunk.
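
To illustrate how a tree can mirror the array shape, here is a rough sketch under stated assumptions. `Array`, `Tree`, `pick`, and `plan` are hypothetical names, not the Vortex API, and the ratio check is left out, so a hint is reused unconditionally. The point is only the plumbing: each chunked level hands the tree of its last chunk upward, and the next chunk receives it as a hint.

```rust
/// Hypothetical array shapes, just enough for Chunked(Struct(foo=Chunked(U64))).
enum Array {
    U64(Vec<u64>),
    Struct(Vec<(String, Array)>),
    Chunked(Vec<Array>),
}

/// Hypothetical compression tree with the same shape as the array.
#[derive(Clone, Debug, PartialEq)]
enum Tree {
    Leaf(&'static str),          // encoding chosen for a primitive array
    Struct(Vec<(String, Tree)>), // one subtree per field
    Chunked(Box<Tree>),          // tree of the last chunk
}

/// Toy stand-in for the expensive search over encodings.
fn pick(values: &[u64]) -> &'static str {
    if values.windows(2).all(|w| w[0] == w[1]) {
        "constant"
    } else {
        "bitpacked"
    }
}

/// Builds a tree for `array`, reusing `hint` where the shapes line up.
/// `searches` counts how often the expensive `pick` had to run.
fn plan(array: &Array, hint: Option<&Tree>, searches: &mut usize) -> Tree {
    match array {
        Array::U64(values) => match hint {
            Some(Tree::Leaf(name)) => Tree::Leaf(*name), // reuse the hinted encoding
            _ => {
                *searches += 1;
                Tree::Leaf(pick(values))
            }
        },
        Array::Struct(fields) => {
            let hints = match hint {
                Some(Tree::Struct(h)) => Some(h),
                _ => None,
            };
            let mut out = Vec::with_capacity(fields.len());
            for (i, (name, child)) in fields.iter().enumerate() {
                // Field i of the previous struct tree becomes the hint for field i.
                let child_hint = hints.and_then(|h| h.get(i)).map(|(_, t)| t);
                out.push((name.clone(), plan(child, child_hint, searches)));
            }
            Tree::Struct(out)
        }
        Array::Chunked(chunks) => {
            let mut last: Option<Tree> = match hint {
                Some(Tree::Chunked(t)) => Some((**t).clone()),
                _ => None,
            };
            for chunk in chunks {
                // Chunk n+1 is planned with the tree of chunk n as its hint.
                last = Some(plan(chunk, last.as_ref(), searches));
            }
            // Placeholder leaf when there are no chunks and no hint.
            Tree::Chunked(Box::new(last.unwrap_or(Tree::Leaf("empty"))))
        }
    }
}
```

For two outer chunks that each hold a struct whose `foo` field has two U64 chunks, this sketch runs the search once instead of four times: the first inner chunk is searched, and every later chunk inherits its leaf through the tree.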

This PR looks best with whitespace ignored.

The CompressionTree (particularly the metadata) is not so ergonomic, but I focused on throughput improvement rather than a refactor.

benchmarks

Any ratio outside (0.8, 1.2) is bolded.

Benchmark suite Current: 68faec3 Previous: 4aa30c0 Unit Ratio
taxi: compress time 1.4951 2.9452 s 0.51
taxi: compress throughput 300.31 152.45 MiB/s 1.97
taxi: vortex:parquetzstd size 0.94933 0.95669 0.99
taxi: compress ratio 0.10633 0.10605 1.00
taxi: compressed size 47.744 47.615 MiB 1.00
AirlineSentiment: compress time 0.00038491 0.00036394 s 1.06
AirlineSentiment: compress throughput 5.0049 5.2933 MiB/s 0.95
AirlineSentiment: vortex:parquetzstd size 6.3744 6.3744 1.00
AirlineSentiment: compress ratio 0.62079 0.62079 1.00
AirlineSentiment: compressed size 0.0011959 0.0011959 MiB 1.00
Arade: compress time 2.294 3.9502 s 0.58
Arade: compress throughput 327.19 190.01 MiB/s 1.72
Arade: vortex:parquetzstd size 0.47662 0.47901 1.00
Arade: compress ratio 0.17756 0.17816 1.00
Arade: compressed size 133.27 133.72 MiB 1.00
Bimbo: compress time 12.753 25.983 s 0.49
Bimbo: compress throughput 532.55 261.38 MiB/s 2.04
Bimbo: vortex:parquetzstd size 1.2573 1.1858 1.06
Bimbo: compress ratio 0.061503 0.057562 1.07
Bimbo: compressed size 417.69 390.93 MiB 1.07
CMSprovider: compress time 11.892 16.619 s 0.72
CMSprovider: compress throughput 412.91 295.48 MiB/s 1.40
CMSprovider: vortex:parquetzstd size 1.0742 1.0992 0.98
CMSprovider: compress ratio 0.15301 0.1575 0.97
CMSprovider: compressed size 751.38 773.42 MiB 0.97
Euro2016: compress time 1.7194 2.0275 s 0.85
Euro2016: compress throughput 218.12 184.97 MiB/s 1.18
Euro2016: vortex:parquetzstd size 1.3998 1.3737 1.02
Euro2016: compress ratio 0.4182 0.41015 1.02
Euro2016: compressed size 156.84 153.82 MiB 1.02
Food: compress time 1.0851 1.3049 s 0.83
Food: compress throughput 292.41 243.16 MiB/s 1.20
Food: vortex:parquetzstd size 1.2213 1.2548 0.97
Food: compress ratio 0.12602 0.13044 0.97
Food: compressed size 39.986 41.39 MiB 0.97
HashTags: compress time 2.3817 3.1473 s 0.76
HashTags: compress throughput 322.14 243.77 MiB/s 1.32
HashTags: vortex:parquetzstd size 1.5056 1.5142 0.99
HashTags: compress ratio 0.24665 0.2483 0.99
HashTags: compressed size 189.23 190.51 MiB 0.99
TPC-H l_comment: compress time 0.8073 1.2042 s 0.67
TPC-H l_comment: compress throughput 216.19 144.93 MiB/s 1.49
TPC-H l_comment: vortex:parquetzstd size 1.1701 1.1648 1.00
TPC-H l_comment: compress ratio 0.36161 0.35995 1.00
TPC-H l_comment: compressed size 63.113 62.822 MiB 1.00
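
As a reading aid for the table above, the Ratio column is the Current value divided by the Previous value, and throughput is uncompressed size over compress time. The helpers below are a small sketch with hypothetical names (`ratio`, `is_notable`, `throughput_mib_per_s`); the uncompressed taxi size is inferred from the table's compressed size and compress ratio, not taken from the benchmark code.

```rust
/// Current divided by Previous, as in the Ratio column.
fn ratio(current: f64, previous: f64) -> f64 {
    current / previous
}

/// True when a ratio falls outside the open interval (0.8, 1.2).
fn is_notable(r: f64) -> bool {
    !(r > 0.8 && r < 1.2)
}

/// Throughput in MiB/s from uncompressed size (MiB) and compress time (s).
fn throughput_mib_per_s(uncompressed_mib: f64, seconds: f64) -> f64 {
    uncompressed_mib / seconds
}
```

For the taxi rows, `ratio(1.4951, 2.9452)` is about 0.51 and counts as notable, while `ratio(47.744, 47.615)` for compressed size is about 1.00 and does not. Dividing the compressed size 47.744 MiB by the compress ratio 0.10633 gives roughly 449 MiB of input, which over 1.4951 s lands close to the reported 300.31 MiB/s.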

@danking danking added the benchmark Run benchmarks on this branch label Oct 8, 2024
@github-actions github-actions bot removed the benchmark Run benchmarks on this branch label Oct 8, 2024

@github-actions github-actions bot left a comment

Vortex bytes_at

Benchmark suite Current: 8bde547 Previous: 2e227c3 Ratio
bytes_at/array_data 590.1076021511925 ns (1.088358376172664) 589.7671824965236 ns (0.32772535470076036) 1.00
bytes_at/array_view 860.6663670956597 ns (0.7447788055580418) 872.994025821359 ns (2.1701489010094974) 0.99

This comment was automatically generated by workflow using github-action-benchmark.


@github-actions github-actions bot left a comment

DataFusion

Benchmark suite Current: 8bde547 Previous: 48982b8 Ratio
arrow/planning 800599.0077997915 ns (1064.2650901224115) 797594.0942498998 ns (639.0693948336993) 1.00
arrow/exec 1741433.6727374913 ns (4969.691193088307) 1732872.0781913844 ns (3710.234210194787) 1.00
vortex-pushdown-compressed/planning 503262.6144018927 ns (2237.3521829429956) 503102.8830239783 ns (634.6895160177664) 1.00
vortex-pushdown-compressed/exec 2437655.7523809513 ns (1728.250976189971) 2441262.2766666673 ns (3351.3878690474667) 1.00
vortex-pushdown-uncompressed/planning 506595.16015602107 ns (2763.4398337128514) 504164.5095690734 ns (1916.879407954315) 1.00
vortex-pushdown-uncompressed/exec 3356240.2475 ns (9659.251773437485) 3411699.438666667 ns (14241.643891666085) 0.98
vortex-nopushdown-compressed/planning 815244.3122822371 ns (776.4202180341235) 813406.607646144 ns (744.5407101008459) 1.00
vortex-nopushdown-compressed/exec 14028352.8 ns (54161.73068750091) 13242627.1675 ns (45222.493375000544) 1.06
vortex-nopushdown-uncompressed/planning 807976.5238915366 ns (627.2652247538208) 809567.8450173971 ns (665.6596324790735) 1.00
vortex-nopushdown-uncompressed/exec 1764410.3465024584 ns (744.2067217733711) 1757501.5737760141 ns (2017.416460580076) 1.00

This comment was automatically generated by workflow using github-action-benchmark.


@github-actions github-actions bot left a comment

Random Access

Benchmark suite Current: 8bde547 Previous: 48982b8 Ratio
random-access/vortex-tokio-local-disk 972858.5261898807 ns (3761.6191938016564) 979460.9233854066 ns (3539.2417611633427) 0.99
random-access/vortex-local-fs 1121842.1350636475 ns (8993.60766926629) 1111160.8693063792 ns (4993.151020677877) 1.01
random-access/parquet-tokio-local-disk 199474273.56666666 ns (2782111.415833339) 193613640.9 ns (1985554.7433333248) 1.03

This comment was automatically generated by workflow using github-action-benchmark.


@github-actions github-actions bot left a comment

Vortex Compression

Benchmark suite Current: 8bde547 Previous: 2e227c3 Ratio
Yellow Taxi Trip Data Compression Time/taxi compression 1261340341.2 ns (3074006.9850000143) 2485148812.6 ns (3108514.5662498474) 0.51
Yellow Taxi Trip Data Compression Time/taxi compression throughput 470808924 bytes 470808924 bytes 1
Yellow Taxi Trip Data Compression Time/taxi decompression 678197765.3 ns (13092907.293749988) 705486381.4 ns (18021002.982499957) 0.96
Yellow Taxi Trip Data Compression Time/taxi decompression throughput 470808924 bytes 470808924 bytes 1
Yellow Taxi Trip Data Vortex-to-ParquetZstd Ratio/taxi 0.9431026462983648 ratio 0.9352743107185189 ratio 1.01
Yellow Taxi Trip Data Vortex-to-ParquetUncompressed Ratio/taxi 0.6054206802747777 ratio 0.6003953139789968 ratio 1.01
Yellow Taxi Trip Data Compression Ratio/taxi 0.10768370227408859 ratio 0.10671793893184574 ratio 1.01
Yellow Taxi Trip Data Compression Size/taxi 50698448 bytes 50243758 bytes 1.01
Public BI Compression Time/AirlineSentiment compression 340276.6654419865 ns (137.1909899164748) 332157.12124522065 ns (231.8147614702757) 1.02
Public BI Compression Time/AirlineSentiment compression throughput 2020 bytes 2020 bytes 1
Public BI Compression Time/AirlineSentiment decompression 29046.34266082869 ns (23.412117846495676) 28876.6758236396 ns (31.08676957276839) 1.01
Public BI Compression Time/AirlineSentiment decompression throughput 2020 bytes 2020 bytes 1
Public BI Vortex-to-ParquetZstd Ratio/AirlineSentiment 5.183040330920372 ratio 5.183040330920372 ratio 1
Public BI Vortex-to-ParquetUncompressed Ratio/AirlineSentiment 3.5295774647887326 ratio 3.5295774647887326 ratio 1
Public BI Compression Ratio/AirlineSentiment 0.6316831683168317 ratio 0.6316831683168317 ratio 1
Public BI Compression Size/AirlineSentiment 1276 bytes 1276 bytes 1
Public BI Compression Time/Arade compression 1966288390.8 ns (1738860.0412501097) 3418131307 ns (1434399.0675001144) 0.58
Public BI Compression Time/Arade compression throughput 787023760 bytes 787023760 bytes 1
Public BI Compression Time/Arade decompression 842063289.5 ns (23017898.431249976) 1105020022 ns (33557330.98999995) 0.76
Public BI Compression Time/Arade decompression throughput 787023760 bytes 787023760 bytes 1
Public BI Vortex-to-ParquetZstd Ratio/Arade 0.47880751063090327 ratio 0.48093732922218235 ratio 1.00
Public BI Vortex-to-ParquetUncompressed Ratio/Arade 0.4273757600806135 ratio 0.4292768013530913 ratio 1.00
Public BI Compression Ratio/Arade 0.18234782517874681 ratio 0.181872502807285 ratio 1.00
Public BI Compression Size/Arade 143512071 bytes 143137981 bytes 1.00
Public BI Compression Time/Bimbo compression 10264941244.4 ns (9819225.873750687) 23901757627.6 ns (31364429.449998856) 0.43
Public BI Compression Time/Bimbo compression throughput 7121333608 bytes 7121333608 bytes 1
Public BI Compression Time/Bimbo decompression 8193031738.5 ns (185310175.4499998) 7031355738.9 ns (191661599.06124973) 1.17
Public BI Compression Time/Bimbo decompression throughput 7121333608 bytes 7121333608 bytes 1
Public BI Vortex-to-ParquetZstd Ratio/Bimbo 1.184289432537036 ratio 1.2383901140321767 ratio 0.96
Public BI Vortex-to-ParquetUncompressed Ratio/Bimbo 0.8030464268312414 ratio 0.8397311744699478 ratio 0.96
Public BI Compression Ratio/Bimbo 0.05907529349943635 ratio 0.06210353963802113 ratio 0.95
Public BI Compression Size/Bimbo 420694873 bytes 442260024 bytes 0.95
Public BI Compression Time/CMSprovider compression 10821541276 ns (23367884.44999981) 14116868181.9 ns (5918312.6000003815) 0.77
Public BI Compression Time/CMSprovider compression throughput 5149123964 bytes 5149123964 bytes 1
Public BI Compression Time/CMSprovider decompression 13570967305.6 ns (72663203.76500034) 11518712779.6 ns (21784165.73374939) 1.18
Public BI Compression Time/CMSprovider decompression throughput 5149123964 bytes 5149123964 bytes 1
Public BI Vortex-to-ParquetZstd Ratio/CMSprovider 1.1279400868720821 ratio 1.1276190532527797 ratio 1.00
Public BI Vortex-to-ParquetUncompressed Ratio/CMSprovider 0.7283176621973946 ratio 0.7281103688687698 ratio 1.00
Public BI Compression Ratio/CMSprovider 0.16423246418465912 ratio 0.16467576833036604 ratio 1.00
Public BI Compression Size/CMSprovider 845653317 bytes 847935945 bytes 1.00
Public BI Compression Time/Euro2016 compression 2257260382.6 ns (3826588.0499999523) 2277433363.6 ns (12927430.812500238) 0.99
Public BI Compression Time/Euro2016 compression throughput 393253221 bytes 393253221 bytes 1
Public BI Compression Time/Euro2016 decompression 552111958.3 ns (3071199.832499981) 560077686 ns (4033143.1399999857) 0.99
Public BI Compression Time/Euro2016 decompression throughput 393253221 bytes 393253221 bytes 1
Public BI Vortex-to-ParquetZstd Ratio/Euro2016 1.4358682833042142 ratio 1.4128101754384608 ratio 1.02
Public BI Vortex-to-ParquetUncompressed Ratio/Euro2016 0.6092113090837246 ratio 0.5994281971916199 ratio 1.02
Public BI Compression Ratio/Euro2016 0.4302218162886961 ratio 0.4231260676692588 ratio 1.02
Public BI Compression Size/Euro2016 169186115 bytes 166395689 bytes 1.02
Public BI Compression Time/Food compression 1055159433.9 ns (663161.6499999762) 1207370368.9 ns (824466.0887500048) 0.87
Public BI Compression Time/Food compression throughput 332718229 bytes 332718229 bytes 1
Public BI Compression Time/Food decompression 746950255.8 ns (2272630.3725000024) 467131772.3 ns (4595361.173749983) 1.60
Public BI Compression Time/Food decompression throughput 332718229 bytes 332718229 bytes 1
Public BI Vortex-to-ParquetZstd Ratio/Food 1.236669829514 ratio 1.2785727678320182 ratio 0.97
Public BI Vortex-to-ParquetUncompressed Ratio/Food 0.6992649944053574 ratio 0.7229586733722262 ratio 0.97
Public BI Compression Ratio/Food 0.12943194945895195 ratio 0.13463195609880454 ratio 0.96
Public BI Compression Size/Food 43064369 bytes 44794506 bytes 0.96
Public BI Compression Time/HashTags compression 2373049708.5 ns (7246496.75) 2997774033.6 ns (2281680.3499999046) 0.79
Public BI Compression Time/HashTags compression throughput 804495592 bytes 804495592 bytes 1
Public BI Compression Time/HashTags decompression 1258681251.5 ns (9903010.890000105) 1071558788.7 ns (2689239.6987499595) 1.17
Public BI Compression Time/HashTags decompression throughput 804495592 bytes 804495592 bytes 1
Public BI Vortex-to-ParquetZstd Ratio/HashTags 1.580631604116457 ratio 1.5580046438524044 ratio 1.01
Public BI Vortex-to-ParquetUncompressed Ratio/HashTags 0.44938505986447086 ratio 0.44295205051154446 ratio 1.01
Public BI Compression Ratio/HashTags 0.2601474291235147 ratio 0.2565506225918513 ratio 1.01
Public BI Compression Size/HashTags 209287460 bytes 206393845 bytes 1.01
TPC-H l_comment Compression Time/chunked-without-fsst compression 187048194.26845238 ns (246107.07552826405) 190880273.59718257 ns (385362.03238095343) 0.98
TPC-H l_comment Compression Time/chunked-without-fsst compression throughput 183010921 bytes 183010921 bytes 1
TPC-H l_comment Compression Time/chunked-without-fsst decompression 37466954.81672123 ns (140156.76451494172) 59977066.21305555 ns (69043.00287013873) 0.62
TPC-H l_comment Compression Time/chunked-without-fsst decompression throughput 183010921 bytes 183010921 bytes 1
TPC-H l_comment Vortex-to-ParquetZstd Ratio/chunked-without-fsst 3.2155813496160457 ratio 3.215581274971146 ratio 1.00
TPC-H l_comment Vortex-to-ParquetUncompressed Ratio/chunked-without-fsst 0.9983752622125646 ratio 0.9983844541826632 ratio 1.00
TPC-H l_comment Compression Ratio/chunked-without-fsst 0.999965750677797 ratio 0.999965750677797 ratio 1
TPC-H l_comment Compression Size/chunked-without-fsst 183004653 bytes 183004653 bytes 1
TPC-H l_comment Compression Time/chunked-with-fsst compression 775114728.1 ns (1831565.856249988) 1173074786.15 ns (1020173.8162499666) 0.66
TPC-H l_comment Compression Time/chunked-with-fsst compression throughput 183010921 bytes 183010921 bytes 1
TPC-H l_comment Compression Time/chunked-with-fsst decompression 96815261.37376985 ns (156820.81744047254) 115640037.75162697 ns (338498.6541284993) 0.84
TPC-H l_comment Compression Time/chunked-with-fsst decompression throughput 183010921 bytes 183010921 bytes 1
TPC-H l_comment Vortex-to-ParquetZstd Ratio/chunked-with-fsst 1.348305390366293 ratio 1.3490213532663975 ratio 1.00
TPC-H l_comment Vortex-to-ParquetUncompressed Ratio/chunked-with-fsst 0.41862251372066633 ratio 0.4188486722276101 ratio 1.00
TPC-H l_comment Compression Ratio/chunked-with-fsst 0.4174715562466351 ratio 0.41769830227781873 ratio 1.00
TPC-H l_comment Compression Size/chunked-with-fsst 76401854 bytes 76443351 bytes 1.00
TPC-H l_comment Compression Time/canonical-with-fsst compression 776325542.7 ns (776120.8637500405) 1176725398 ns (1345117.0943750143) 0.66
TPC-H l_comment Compression Time/canonical-with-fsst compression throughput 183010937 bytes 183010937 bytes 1
TPC-H l_comment Compression Time/canonical-with-fsst decompression 107811972.56408732 ns (155691.01287698746) 116820541.63132274 ns (148413.64523676038) 0.92
TPC-H l_comment Compression Time/canonical-with-fsst decompression throughput 183010937 bytes 183010937 bytes 1
TPC-H l_comment Vortex-to-ParquetZstd Ratio/canonical-with-fsst 1.348305982420875 ratio 1.3490190311794283 ratio 1.00
TPC-H l_comment Vortex-to-ParquetUncompressed Ratio/canonical-with-fsst 0.41862248632566124 ratio 0.4188486585225473 ratio 1.00
TPC-H l_comment Compression Ratio/canonical-with-fsst 0.4174635639398972 ratio 0.41769030995125717 ratio 1.00
TPC-H l_comment Compression Size/canonical-with-fsst 76400398 bytes 76441895 bytes 1.00

This comment was automatically generated by workflow using github-action-benchmark.


@github-actions github-actions bot left a comment

TPC-H

Benchmark suite Current: 8bde547 Previous: 48982b8 Ratio
tpch_q1/vortex-in-memory-no-pushdown 458661572.95 ns (1581287.449999988) 475637040 ns (1517252.974999994) 0.96
tpch_q1/vortex-in-memory-pushdown 506276111.6 ns (630704.0500000119) 521426212.5 ns (1729808.375) 0.97
tpch_q1/arrow 443220503.7 ns (1303618.6849999726) 450510118.55 ns (876897.474999994) 0.98
tpch_q1/parquet 654706321.3 ns (1265215.9749999642) 660605886.4 ns (1458236.300000012) 0.99
tpch_q1/vortex-file-compressed 646775214.4 ns (2412691.4287499785) 677695129.4 ns (2460168.9000000358) 0.95
tpch_q1/vortex-file-uncompressed 530053163.7 ns (1001407.65625) 540495825.3 ns (2568529.964999974) 0.98
tpch_q2/vortex-in-memory-no-pushdown 122027347.46948414 ns (372707.55017659813) 129933571.43746033 ns (351019.82209920883) 0.94
tpch_q2/vortex-in-memory-pushdown 121143814.67964284 ns (271388.7580952272) 130644405.99146824 ns (350026.14810565114) 0.93
tpch_q2/arrow 119434595.44738097 ns (138003.56400892884) 123568997.48119049 ns (433027.3109136969) 0.97
tpch_q2/parquet 154565265.10325396 ns (357009.600391835) 160081233.3619444 ns (657426.053562507) 0.97
tpch_q2/vortex-file-compressed 153142441.89940473 ns (588772.200297609) 157597771.2007143 ns (430226.5305148661) 0.97
tpch_q2/vortex-file-uncompressed 152957328.1438492 ns (461338.9486527592) 155789825.81273812 ns (642459.4571428299) 0.98
tpch_q3/vortex-in-memory-no-pushdown 152998606.62420636 ns (469900.1927241981) 157938413.6459921 ns (1591380.145059526) 0.97
tpch_q3/vortex-in-memory-pushdown 182745246.86666664 ns (851451.0300000161) 201503770.86666667 ns (923278.232916668) 0.91
tpch_q3/arrow 146315378.64757937 ns (1983486.99350892) 149360912.58079365 ns (380187.28815476596) 0.98
tpch_q3/parquet 333445719.5 ns (1552429.4306250215) 345195376.15 ns (1475443.75) 0.97
tpch_q3/vortex-file-compressed 320334931.4 ns (885575.3056249917) 334018087.55 ns (630083.4174999893) 0.96
tpch_q3/vortex-file-uncompressed 281047458.5 ns (903515.5731250048) 281323756 ns (796877.9249999821) 1.00
tpch_q4/vortex-in-memory-no-pushdown 109940850.85511903 ns (666701.3702202365) 112499341.55984128 ns (465144.08063491434) 0.98
tpch_q4/vortex-in-memory-pushdown 137449277.52083334 ns (333845.4873125106) 140301869.2661508 ns (755437.5224007815) 0.98
tpch_q4/arrow 100342903.2215476 ns (641422.7404761836) 104418114.41341269 ns (598286.1615872979) 0.96
tpch_q4/parquet 212400071.40000004 ns (343130.7729166746) 221724245.7 ns (457172.26291663945) 0.96
tpch_q4/vortex-file-compressed 280776711.1 ns (632521.8731249869) 300388040.65 ns (687584.0037499964) 0.93
tpch_q4/vortex-file-uncompressed 227314694.2 ns (1539251.0333333313) 230161104.6666667 ns (899370.8174999952) 0.99
tpch_q5/vortex-in-memory-no-pushdown 299491474.45 ns (1355037.125) 334475550.7 ns (2354253.521874994) 0.90
tpch_q5/vortex-in-memory-pushdown 304784864.45 ns (1300102.4381249845) 344987419 ns (1669842.3312499821) 0.88
tpch_q5/arrow 286750952.25 ns (1645150.6762500107) 303436601.5 ns (2075220.046875) 0.95
tpch_q5/parquet 448302046.85 ns (3916005.793124974) 476941171.15 ns (3691502.1168750226) 0.94
tpch_q5/vortex-file-compressed 342337952.65 ns (1326007.165625006) 363391593.65 ns (1584685.1056249738) 0.94
tpch_q5/vortex-file-uncompressed 363684551.95 ns (8812639.118124992) 353914915.6 ns (1053852.5387500226) 1.03
tpch_q6/vortex-in-memory-no-pushdown 41603456.42294974 ns (182772.11982903257) 44778494.074140206 ns (85811.4083483778) 0.93
tpch_q6/vortex-in-memory-pushdown 97716476.41551587 ns (187918.62841667235) 89586284.80684523 ns (443487.71927975863) 1.09
tpch_q6/arrow 34615588.17537037 ns (97717.27435185388) 37005146.847989425 ns (169664.4284649454) 0.94
tpch_q6/parquet 151655986.37984127 ns (573351.7103075385) 155052097.64424604 ns (376248.49499852955) 0.98
tpch_q6/vortex-file-compressed 68046295.29882936 ns (303663.2867395878) 66187226.59113095 ns (196660.5936093703) 1.03
tpch_q6/vortex-file-uncompressed 186519198.2 ns (621111.9166666567) 179031836.4965873 ns (688448.0650793612) 1.04
tpch_q7/vortex-in-memory-no-pushdown 553711537.8 ns (1901372.7999999523) 580161489.9 ns (1612201.949999988) 0.95
tpch_q7/vortex-in-memory-pushdown 596317589 ns (3179533.831250012) 621744961 ns (1415843.89624995) 0.96
tpch_q7/arrow 539411364.2 ns (2751725.216250032) 567546965.4 ns (2812628.199999988) 0.95
tpch_q7/parquet 705021551.5 ns (5662838.998749971) 731253638.6 ns (4841214.297499955) 0.96
tpch_q7/vortex-file-compressed 710152592.9 ns (4173051.5299999714) 735665612.8 ns (6358194.650000036) 0.97
tpch_q7/vortex-file-uncompressed 692615813 ns (7463894.910000026) 698880925.8 ns (6529443) 0.99
tpch_q8/vortex-in-memory-no-pushdown 224540604.6666667 ns (837648.8770833164) 243305453 ns (987143.1820833534) 0.92
tpch_q8/vortex-in-memory-pushdown 234711886.8333333 ns (1590926.732916668) 251473633.85 ns (1027957.9350000024) 0.93
tpch_q8/arrow 211288106.26666665 ns (523796.4208333492) 211835851 ns (389751.28333331645) 1.00
tpch_q8/parquet 475504996.65 ns (1827408.7493749857) 477367042.65 ns (1372305.4249999821) 1.00
tpch_q8/vortex-file-compressed 276769617.15 ns (725978.453125) 280260864.8 ns (540696.890625) 0.99
tpch_q8/vortex-file-uncompressed 312526895.5 ns (1461782.553124994) 269643866.5 ns (1133461.4831249714) 1.16
tpch_q9/vortex-in-memory-no-pushdown 407621152.45 ns (1957592.2999999821) 431026554.55 ns (1833737.4962500036) 0.95
tpch_q9/vortex-in-memory-pushdown 410616641.8 ns (1667903.0006250143) 426227276.9 ns (1292383.6550000012) 0.96
tpch_q9/arrow 392653816 ns (1416758.5799999833) 385214933.55 ns (1005303.2031249702) 1.02
tpch_q9/parquet 687487542.6 ns (3918905.801249981) 681882978.4 ns (2177764.949999988) 1.01
tpch_q9/vortex-file-compressed 490003252.55 ns (1297619.5256250203) 477064290.25 ns (1385514.0962499976) 1.03
tpch_q9/vortex-file-uncompressed 491157705.7 ns (9223984.131249994) 425118433.35 ns (1836270.4306250215) 1.16
tpch_q10/vortex-in-memory-no-pushdown 227867225.8666667 ns (715541.8333333284) 223931342.2333334 ns (741702.0920833349) 1.02
tpch_q10/vortex-in-memory-pushdown 261362941.75 ns (2983009.064374998) 257067734.5 ns (377787.1887500137) 1.02
tpch_q10/arrow 218820063.3333333 ns (1258935.745416686) 215985482.86666664 ns (831520.7520833313) 1.01
tpch_q10/parquet 475821709.5 ns (1321776.391874969) 471104058.9 ns (1256398.5506249964) 1.01
tpch_q10/vortex-file-compressed 448838731.85 ns (837384.4050000012) 457415718.85 ns (1118964.056250006) 0.98
tpch_q10/vortex-file-uncompressed 360419231.15 ns (1388645.3256249726) 351030059.8 ns (958430.275000006) 1.03
tpch_q11/vortex-in-memory-no-pushdown 179594004.06972224 ns (384873.2217847258) 177476560.44349208 ns (621545.1311220229) 1.01
tpch_q11/vortex-in-memory-pushdown 179580394.9488889 ns (612704.7365972549) 176887504.22222224 ns (509638.6070000082) 1.02
tpch_q11/arrow 178735344.94884923 ns (540426.9074553698) 173017429.78448415 ns (836755.5850307643) 1.03
tpch_q11/parquet 186316513.7 ns (1078456.3708333373) 184787052.26666662 ns (690430.183333382) 1.01
tpch_q11/vortex-file-compressed 225914565.2666667 ns (891167.8158333153) 226099285.1 ns (887589.3204166591) 1.00
tpch_q11/vortex-file-uncompressed 227205860.6 ns (1919262.7666666508) 223814317.33333334 ns (1064055.3733333647) 1.02
tpch_q12/vortex-in-memory-no-pushdown 199302546.86666664 ns (711640.3129166514) 195305898.49999997 ns (244922.04291664064) 1.02
tpch_q12/vortex-in-memory-pushdown 249151292.3 ns (243294.64999999106) 234062840.8666667 ns (461240.2487499863) 1.06
tpch_q12/arrow 166545971.28333333 ns (325576.8990416825) 164663743.2840873 ns (440381.75235167146) 1.01
tpch_q12/parquet 353498020.2 ns (849258.224999994) 354188285.6 ns (872954.5799999535) 1.00
tpch_q12/vortex-file-compressed 637903620.6 ns (2428782.3937499523) 656868583.3 ns (890714.2012499571) 0.97
tpch_q12/vortex-file-uncompressed 351174999.05 ns (739019.5075000226) 346862421.55 ns (1062543.5249999762) 1.01
tpch_q13/vortex-in-memory-no-pushdown 175377885.08099204 ns (5465737.433413714) 164371801.7034921 ns (895269.1245238036) 1.07
tpch_q13/vortex-in-memory-pushdown 166679356.15781745 ns (2186408.8559240997) 169881779.38027778 ns (3636627.2586250007) 0.98
tpch_q13/arrow 165012782.9222619 ns (2347232.5475237966) 167784218.5671032 ns (2187023.831388384) 0.98
tpch_q13/parquet 313941935 ns (2870156.2631250024) 317857778.2 ns (3799881.0493749976) 0.99
tpch_q13/vortex-file-compressed 206209097.23333335 ns (838473.7558333427) 208638770.66666666 ns (1607727.550000012) 0.99
tpch_q13/vortex-file-uncompressed 194094280.3 ns (990201.1250000298) 193641610.3 ns (1547785.7333333492) 1.00
tpch_q14/vortex-in-memory-no-pushdown 46562845.50263889 ns (184416.51152777672) 45386539.54400794 ns (216845.576000005) 1.03
tpch_q14/vortex-in-memory-pushdown 90035597.62347223 ns (261097.7120295167) 80187881.41924605 ns (602629.1283606067) 1.12
tpch_q14/arrow 37456439.3088492 ns (259786.67374652997) 38490941.49415344 ns (386777.6300767213) 0.97
tpch_q14/parquet 224115278.33333334 ns (2168462.5691666454) 225398861.5666667 ns (991819.9741666615) 0.99
tpch_q14/vortex-file-compressed 124657373.09436509 ns (427436.0742807463) 123005495.09623018 ns (289599.80898809433) 1.01
tpch_q14/vortex-file-uncompressed 159454532.5102778 ns (581623.9065763801) 153355184.1057143 ns (177446.5630684644) 1.04
tpch_q15/vortex-in-memory-no-pushdown 74183038.95841269 ns (208212.83235615492) 74140631.44396825 ns (956807.9174583331) 1.00
tpch_q15/vortex-in-memory-pushdown 126416712.25055556 ns (465003.9616041705) 104862534.42071429 ns (209020.78954464942) 1.21
tpch_q15/arrow 63349733.750972226 ns (206356.43922222778) 62083618.38583334 ns (119847.99166666344) 1.02
tpch_q15/parquet 294324466 ns (1295691.331250012) 296440986.25 ns (559746.1331249774) 0.99
tpch_q15/vortex-file-compressed 228159097.5333333 ns (1513795.9666666687) 215966946.20000005 ns (257837.98666667938) 1.06
tpch_q15/vortex-file-uncompressed 322348954.45 ns (1315276.153124988) 303869392.1 ns (451749.6999999881) 1.06
tpch_q16/vortex-in-memory-no-pushdown 106893224.62630951 ns (542444.4827633947) 102810830.13087301 ns (201740.30115079135) 1.04
tpch_q16/vortex-in-memory-pushdown 123375044.80079365 ns (199240.01614880562) 120718832.32777777 ns (133460.53209721297) 1.02
tpch_q16/arrow 105604173.86603174 ns (206021.0807936564) 101815612.55956349 ns (66089.36590326577) 1.04
tpch_q16/parquet 122527336.29091272 ns (372233.48891071975) 119217624.70896825 ns (103236.00382243097) 1.03
tpch_q16/vortex-file-compressed 135957588.44492063 ns (346797.41950298846) 134610219.3376984 ns (442894.0186170712) 1.01
tpch_q16/vortex-file-uncompressed 133221802.34329367 ns (1132464.9658467248) 128793331.99599203 ns (349453.52372122556) 1.03
tpch_q17/vortex-in-memory-no-pushdown 584686431.2 ns (8669903.080000043) 551557853.6 ns (3007991.949999988) 1.06
tpch_q17/vortex-in-memory-pushdown 728248210 ns (13343961.193750024) 630440861.1 ns (4312127.960000038) 1.16
tpch_q17/arrow 589499254.5 ns (7930660.876250029) 514949517.4 ns (3271836.3787499964) 1.14
tpch_q17/parquet 590765855.4 ns (1541587.3887500167) 583153129.7 ns (2322976.317499995) 1.01
tpch_q17/vortex-file-compressed 735124050.4 ns (3033209.2249999642) 683693123.5 ns (2372760.5374999642) 1.08
tpch_q17/vortex-file-uncompressed 654834332.2 ns (5551130.774999976) 595890466.8 ns (1193761.2975000143) 1.10
tpch_q18/vortex-in-memory-no-pushdown 1045330899.6 ns (15327446.090000033) 994400166.7 ns (4006744.261250019) 1.05
tpch_q18/vortex-in-memory-pushdown 1035943404.9 ns (8714938.77000004) 1001946644.7 ns (2342776.569999993) 1.03
tpch_q18/arrow 1031796352.4 ns (5868771.75) 971919056.2 ns (3115116.4450000525) 1.06
tpch_q18/parquet 1212295635.4 ns (5474788.442499995) 1149244274.9 ns (2721623.563750148) 1.05
tpch_q18/vortex-file-compressed 1156615178.2 ns (3855669.797499895) 1098965163.1 ns (3107538.589999914) 1.05
tpch_q18/vortex-file-uncompressed 1039551076.4 ns (5317063.349999964) 987141789.1 ns (2366368.807500005) 1.05
tpch_q19/vortex-in-memory-no-pushdown 160736092.91357142 ns (600137.599023819) 156585450.10876986 ns (769088.1493685395) 1.03
tpch_q19/vortex-in-memory-pushdown 258407110.6 ns (407397.8118749857) 236800835.2666667 ns (520555.35249997675) 1.09
tpch_q19/arrow 150465092.7822619 ns (477975.92738094926) 147184344.04031748 ns (870414.6070198417) 1.02
tpch_q19/parquet 472223197.2 ns (511820.47499999404) 468287932.35 ns (576167.2381249964) 1.01
tpch_q19/vortex-file-compressed 974279132.4 ns (3229831.4412499666) 931061925.3 ns (1844214.8612499833) 1.05
tpch_q19/vortex-file-uncompressed 331590247.95 ns (804831.5749999881) 311332555.6 ns (331283.7949999869) 1.07
tpch_q20/vortex-in-memory-no-pushdown 247910094.4333333 ns (2671678.03458333) 234512821.4666667 ns (323341.1741666347) 1.06
tpch_q20/vortex-in-memory-pushdown 276279322.15 ns (1287096.193749994) 249917827.26666665 ns (302344.134583354) 1.11
tpch_q20/arrow 250246731.8 ns (3035598.948333308) 231271449.53333336 ns (197476.71666666865) 1.08
tpch_q20/parquet 368941239.35 ns (3561709.7337500155) 345706455.85 ns (1255735.6424999833) 1.07
tpch_q20/vortex-file-compressed 409160771.75 ns (2758693.832499981) 375926813.65 ns (907056.8824999928) 1.09
tpch_q20/vortex-file-uncompressed 401264470.05 ns (3522521.599999994) 377629282.4 ns (631064.875) 1.06
tpch_q21/vortex-in-memory-no-pushdown 856134885.3 ns (9073509.764999986) 817282143.3 ns (2448491.350000024) 1.05
tpch_q21/vortex-in-memory-pushdown 915742677.4 ns (6607757.503750026) 872439065.5 ns (729677.8912500143) 1.05
tpch_q21/arrow 839057934.1 ns (5704750.399999976) 809272300 ns (1285844.3912499547) 1.04
tpch_q21/parquet 975227938.5 ns (4049811.751249969) 944800972.4 ns (1495333.6087499857) 1.03
tpch_q21/vortex-file-compressed 1221140078.9 ns (4768124.185000062) 1210333787.2 ns (2048530.6812499762) 1.01
tpch_q21/vortex-file-uncompressed 1084796720.8 ns (5227527.850000024) 1058461536.5 ns (3659947.0099999905) 1.02
tpch_q22/vortex-in-memory-no-pushdown 68379145.81097223 ns (513375.85478471965) 66225045.47083334 ns (287177.0767291635) 1.03
tpch_q22/vortex-in-memory-pushdown 67733655.44964285 ns (240337.75190030038) 65703235.51831349 ns (160617.2424503863) 1.03
tpch_q22/arrow 67933498.16742063 ns (340934.168194443) 65189679.325238094 ns (220482.54561904818) 1.04
tpch_q22/parquet 94709499.13654761 ns (489956.4975595251) 93157582.84702381 ns (563673.7252901718) 1.02
tpch_q22/vortex-file-compressed 106155791.11083335 ns (734560.3570833206) 102542645.06178573 ns (570395.9921428636) 1.04
tpch_q22/vortex-file-uncompressed 103406283.44623016 ns (306420.66084524244) 101254610.4936508 ns (283236.2650059536) 1.02

This comment was automatically generated by workflow using github-action-benchmark.

@danking danking added the benchmark Run benchmarks on this branch label Oct 8, 2024
@github-actions github-actions bot removed the benchmark Run benchmarks on this branch label Oct 8, 2024
@danking danking added the benchmark Run benchmarks on this branch label Oct 8, 2024
@github-actions github-actions bot removed the benchmark Run benchmarks on this branch label Oct 8, 2024
@danking
Member Author

danking commented Oct 8, 2024

cc: @lwwmanning this is a summary of where throughputs, speeds, and ratios are with the soup of changes I've been working on. I want to understand why this PR is slowing down Taxi, Bimbo, and Food; however, it does deliver notable reductions in time for string-heavy datasets like CMSprovider, HashTags, and l_comment. Sizes never change by more than 3% in either direction, and only one dataset (HashTags) increases in size.

cleaned up summary of compression benchmark

| Benchmark suite | Current: 44486f6 | Previous: ba7b31b (develop) | Current:Previous |
|---|---|---|---|
| Taxi compress time | 3.316 s | 2.822 s | 1.17 |
| Taxi vortex:parquet-zstd size | 0.951 | 0.948 | 1.00 |
| Taxi vortex:raw size | 0.106 | 0.105 | 1.00 |
| AirlineSentiment compress time | 493.8 us | 336.0 us | 1.47 |
| AirlineSentiment vortex:parquet-zstd size | 6.441 | 6.639 | 0.97 |
| AirlineSentiment vortex:raw size | 0.621 | 0.621 | 1 |
| Arade compress time | 3.845 s | 4.045 s | 0.95 |
| Arade vortex:parquet-zstd size | 0.476 | 0.477 | 1.00 |
| Arade vortex:raw size | 0.177 | 0.177 | 1.00 |
| Bimbo compress time | 31.54 s | 25.81 s | 1.22 |
| Bimbo vortex:parquet-zstd size | 1.186 | 1.187 | 1.00 |
| Bimbo vortex:raw size | 0.057 | 0.057 | 1.00 |
| CMSprovider compress time | 11.41 s | 16.85 s | 0.68 |
| CMSprovider vortex:parquet-zstd size | 1.076 | 1.110 | 0.97 |
| CMSprovider vortex:raw size | 0.153 | 0.159 | 0.96 |
| Euro2016 compress time | 1.996 s | 2.002 s | 1.00 |
| Euro2016 vortex:parquet-zstd size | 1.374 | 1.374 | 1.00 |
| Euro2016 vortex:raw size | 0.410 | 0.410 | 1.00 |
| Food compress time | 1.575 s | 1.267 s | 1.24 |
| Food vortex:parquet-zstd size | 1.236 | 1.249 | 0.99 |
| Food vortex:raw size | 0.128 | 0.130 | 0.99 |
| HashTags compress time | 2.431 s | 3.121 s | 0.78 |
| HashTags vortex:parquet-zstd size | 1.530 | 1.507 | 1.02 |
| HashTags vortex:raw size | 0.251 | 0.247 | 1.02 |
| TPC-H l_comment compress time | 1.091 s | 1.193 s | 0.91 |
| TPC-H l_comment vortex:parquet-zstd size | 1.165 | 1.164 | 1.00 |
| TPC-H l_comment vortex:raw size | 0.360 | 0.360 | 1.00 |

dataset sizes & throughputs

| dataset | canonical size (MiB) | write (MiB/s) |
|---|---|---|
| AirlineSentiment | 0.001 | 2.426 |
| Arade | 133.0 | 34.59 |
| Bimbo | 390.4 | 12.38 |
| CMSprovider | 753.0 | 65.99 |
| Euro2016 | 153.9 | 77.1 |
| Food | 40.61 | 25.79 |
| HashTags | 192.4 | 79.16 |

@danking danking added the benchmark Run benchmarks on this branch label Oct 9, 2024
@github-actions github-actions bot removed the benchmark Run benchmarks on this branch label Oct 9, 2024
@danking
Member Author

danking commented Oct 9, 2024

Updated summary for 68faec3

Any ratio outside (0.8, 1.2) is bolded.

| Benchmark suite | Current: 68faec3 | Previous: 4aa30c0 | Unit | Ratio |
|---|---|---|---|---|
| taxi: compress time | 1.4951 | 2.9452 | s | 0.51 |
| taxi: compress throughput | 300.31 | 152.45 | MiB/s | 1.97 |
| taxi: vortex:parquetzstd size | 0.94933 | 0.95669 | | 0.99 |
| taxi: compress ratio | 0.10633 | 0.10605 | | 1.00 |
| taxi: compressed size | 47.744 | 47.615 | MiB | 1.00 |
| AirlineSentiment: compress time | 0.00038491 | 0.00036394 | s | 1.06 |
| AirlineSentiment: compress throughput | 5.0049 | 5.2933 | MiB/s | 0.95 |
| AirlineSentiment: vortex:parquetzstd size | 6.3744 | 6.3744 | | 1.00 |
| AirlineSentiment: compress ratio | 0.62079 | 0.62079 | | 1.00 |
| AirlineSentiment: compressed size | 0.0011959 | 0.0011959 | MiB | 1.00 |
| Arade: compress time | 2.294 | 3.9502 | s | 0.58 |
| Arade: compress throughput | 327.19 | 190.01 | MiB/s | 1.72 |
| Arade: vortex:parquetzstd size | 0.47662 | 0.47901 | | 1.00 |
| Arade: compress ratio | 0.17756 | 0.17816 | | 1.00 |
| Arade: compressed size | 133.27 | 133.72 | MiB | 1.00 |
| Bimbo: compress time | 12.753 | 25.983 | s | 0.49 |
| Bimbo: compress throughput | 532.55 | 261.38 | MiB/s | 2.04 |
| Bimbo: vortex:parquetzstd size | 1.2573 | 1.1858 | | 1.06 |
| Bimbo: compress ratio | 0.061503 | 0.057562 | | 1.07 |
| Bimbo: compressed size | 417.69 | 390.93 | MiB | 1.07 |
| CMSprovider: compress time | 11.892 | 16.619 | s | 0.72 |
| CMSprovider: compress throughput | 412.91 | 295.48 | MiB/s | 1.40 |
| CMSprovider: vortex:parquetzstd size | 1.0742 | 1.0992 | | 0.98 |
| CMSprovider: compress ratio | 0.15301 | 0.1575 | | 0.97 |
| CMSprovider: compressed size | 751.38 | 773.42 | MiB | 0.97 |
| Euro2016: compress time | 1.7194 | 2.0275 | s | 0.85 |
| Euro2016: compress throughput | 218.12 | 184.97 | MiB/s | 1.18 |
| Euro2016: vortex:parquetzstd size | 1.3998 | 1.3737 | | 1.02 |
| Euro2016: compress ratio | 0.4182 | 0.41015 | | 1.02 |
| Euro2016: compressed size | 156.84 | 153.82 | MiB | 1.02 |
| Food: compress time | 1.0851 | 1.3049 | s | 0.83 |
| Food: compress throughput | 292.41 | 243.16 | MiB/s | 1.20 |
| Food: vortex:parquetzstd size | 1.2213 | 1.2548 | | 0.97 |
| Food: compress ratio | 0.12602 | 0.13044 | | 0.97 |
| Food: compressed size | 39.986 | 41.39 | MiB | 0.97 |
| HashTags: compress time | 2.3817 | 3.1473 | s | 0.76 |
| HashTags: compress throughput | 322.14 | 243.77 | MiB/s | 1.32 |
| HashTags: vortex:parquetzstd size | 1.5056 | 1.5142 | | 0.99 |
| HashTags: compress ratio | 0.24665 | 0.2483 | | 0.99 |
| HashTags: compressed size | 189.23 | 190.51 | MiB | 0.99 |
| TPC-H l_comment: compress time | 0.8073 | 1.2042 | s | 0.67 |
| TPC-H l_comment: compress throughput | 216.19 | 144.93 | MiB/s | 1.49 |
| TPC-H l_comment: vortex:parquetzstd size | 1.1701 | 1.1648 | | 1.00 |
| TPC-H l_comment: compress ratio | 0.36161 | 0.35995 | | 1.00 |
| TPC-H l_comment: compressed size | 63.113 | 62.822 | MiB | 1.00 |
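
For readers checking the numbers, here is a minimal sketch of how the Ratio column and the "outside (0.8, 1.2)" rule appear to be derived. It assumes Ratio is simply Current divided by Previous, which matches the rows above; the function names are hypothetical and not part of the benchmark tooling.

```rust
/// Ratio column: current measurement divided by the previous one.
/// (Assumption: this matches how the table was produced.)
fn ratio(current: f64, previous: f64) -> f64 {
    current / previous
}

/// True when a ratio falls outside the open interval (0.8, 1.2),
/// i.e. when the summary would call it out.
fn is_notable(ratio: f64) -> bool {
    !(ratio > 0.8 && ratio < 1.2)
}
```

For example, the taxi compress time row gives 1.4951 / 2.9452, roughly 0.51, which is flagged; the Euro2016 compress time row gives roughly 0.85, which is not.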

@@ -59,7 +59,7 @@ impl EncodingCompressor for FSSTCompressor {
// between 2-3x depending on the text quality.
//
// It's not worth running a full compression step unless the array is large enough.
- if array.nbytes() < 10 * FSST_SYMTAB_MAX_SIZE {
+ if array.nbytes() < 5 * FSST_SYMTAB_MAX_SIZE {
Member Author

I'm not sure why my changes encountered this, but I was getting samples that were hundreds of bytes too small for FSST, which caused this PR to compress poorly.

We may want to think more broadly about how to estimate FSST compression ratio on a tiny sample, but for now this seems reasonable unless we're directly calling compress on a small array (rather than a sample thereof).
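
To make the effect of the multiplier concrete, here is a minimal sketch of the size gate. The function and parameter names are hypothetical, and the symbol-table size used in the usage note is an illustrative number rather than the real value of `FSST_SYMTAB_MAX_SIZE`.

```rust
/// Returns true when an array of `nbytes` is large enough to justify
/// attempting FSST, given the symbol table's maximum size and a multiplier.
/// This mirrors the shape of the check in the diff above; the names are
/// assumptions made for illustration.
fn worth_trying_fsst(nbytes: usize, symtab_max_size: usize, multiplier: usize) -> bool {
    nbytes >= multiplier * symtab_max_size
}
```

With an assumed symbol-table size of 2,000 bytes, a 15,000-byte sample is rejected under the old 10x multiplier (the gate is 20,000 bytes) but accepted under the new 5x multiplier (the gate is 10,000 bytes).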

Member

@a10y any guidance on how you arrived at the 10x multiplier, or was it just a guess?

Contributor

Yea I had chatted 1:1 with Dan about this, it was fairly arbitrary

@danking danking changed the title Dk/carry forward chunk information 2 feat: add ChunkedCompressor which compresses its chunks like one another Oct 10, 2024
@danking danking added the benchmark Run benchmarks on this branch label Oct 10, 2024
@github-actions github-actions bot removed benchmark Run benchmarks on this branch labels Oct 10, 2024
@danking danking added the benchmark Run benchmarks on this branch label Oct 10, 2024
@github-actions github-actions bot removed the benchmark Run benchmarks on this branch label Oct 10, 2024
@danking danking added the benchmark Run benchmarks on this branch label Oct 10, 2024
@github-actions github-actions bot removed the benchmark Run benchmarks on this branch label Oct 10, 2024
@danking danking marked this pull request as ready for review October 10, 2024 16:31
@danking danking changed the title feat: add ChunkedCompressor which compresses its chunks like one another feat: add ChunkedCompressor which compresses chunk n+1 like chunk n Oct 10, 2024
@danking danking enabled auto-merge (squash) October 10, 2024 16:39
@danking danking force-pushed the dk/carry-forward-chunk-information-2 branch from 68faec3 to 8bde547 Compare October 10, 2024 17:20
@danking
Member Author

danking commented Oct 10, 2024

Will beat me to develop so I rebased. I'll re-run benchmarks.

@danking danking added the benchmark Run benchmarks on this branch label Oct 10, 2024
@github-actions github-actions bot removed the benchmark Run benchmarks on this branch label Oct 10, 2024
@robert3005
Member

@danking I wouldn't obsess about benchmarks before merging. Running them locally to spot-check is fine, and if some new trend develops we can investigate.


let (arrays, trees) = array
.children()
.zip(children_trees)
Member

nit: I prefer zip_eq from itertools personally, since it asserts that the two things being zipped are of equal length rather than truncating

Member Author

Done.
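
As an illustration of the concern in this thread: `Iterator::zip` stops at the shorter input, so a mismatch between children and their compression trees would go unnoticed. The sketch below is a std-only, hypothetical helper that surfaces the mismatch as an error; `itertools`' `zip_eq`, which the review suggests, panics instead.

```rust
/// Pairs items like `Iterator::zip`, but reports a length mismatch
/// instead of silently truncating to the shorter input.
/// Hypothetical helper for illustration; not part of Vortex or itertools.
fn zip_checked<A, B>(left: Vec<A>, right: Vec<B>) -> Result<Vec<(A, B)>, String> {
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    Ok(left.into_iter().zip(right).collect())
}
```

Zipping three children with two trees yields only two pairs under plain `zip`, whereas `zip_checked` returns an error for the same inputs.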

vortex-sampling-compressor/src/compressors/chunked.rs Outdated
let ratio = (compressed_chunk.nbytes() as f32) / (chunk.nbytes() as f32);
let exceeded_target_ratio = previous
.as_ref()
.map(|(_, target_ratio)| ratio > target_ratio * 1.2)
Member

this magic constant 1.2 should be a field of ChunkedCompressor, and we can have a default ChunkedCompressor (look at BitpackedCompressor or RunEndCompressor for examples)

Member Author

Done.
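
A minimal sketch of what the suggestion could look like: the 1.2 factor becomes a field with a `Default` implementation. The struct name, field name, and method are hypothetical and are not taken from the merged code; the `unwrap_or(false)` for the first chunk is also an assumption.

```rust
/// Hypothetical stand-in for the compressor; only the threshold is modeled.
#[derive(Debug, Clone, Copy)]
struct ChunkedCompressorSketch {
    /// How much worse than the previous chunk's ratio a chunk may compress
    /// before the previous chunk's scheme is no longer considered good enough.
    relative_diff_threshold: f32,
}

impl Default for ChunkedCompressorSketch {
    fn default() -> Self {
        // The constant from the snippet under review.
        Self { relative_diff_threshold: 1.2 }
    }
}

impl ChunkedCompressorSketch {
    /// `ratio` is compressed bytes over uncompressed bytes, so smaller is better.
    /// With no previous chunk there is no target to exceed (assumed behavior).
    fn exceeded_target_ratio(&self, ratio: f32, previous_target: Option<f32>) -> bool {
        previous_target
            .map(|target| ratio > target * self.relative_diff_threshold)
            .unwrap_or(false)
    }
}
```

With the default threshold and a previous ratio of 0.2, a chunk that compresses to 0.25 exceeds the target (the cutoff is about 0.24), while one that compresses to 0.22 does not.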

ChunkedArray::try_new(compressed_chunks, array.dtype().clone())?.into_array(),
Some(CompressionTree::new_with_metadata(
self,
vec![child],
Member

if child is None, should this just be empty...?

Member Author

FWIW, this is the diff. I don't find either of these particularly palatable but I think the fix is to sort out how compressors pass information from one invocation of compress to the next.

(docs) # g diff
diff --git a/vortex-sampling-compressor/src/compressors/chunked.rs b/vortex-sampling-compressor/src/compressors/chunked.rs
index c0f5aebe..9adf4ab4 100644
--- a/vortex-sampling-compressor/src/compressors/chunked.rs
+++ b/vortex-sampling-compressor/src/compressors/chunked.rs
@@ -78,16 +78,16 @@ fn like_into_parts(
         vortex_bail!("chunked array compression tree must be ChunkedCompressorMetadata")
     };
 
-    if children.len() != 1 {
-        vortex_bail!("chunked array compression tree must have one child")
+    if (children.len() == 1) != target_ratio.is_some() {
+        vortex_bail!("chunked array compression tree must have a child iff it has a ratio")
     }
 
-    let child = children.remove(0);
-
-    match (child, target_ratio) {
-        (None, None) => Ok(None),
-        (Some(child), Some(ratio)) => Ok(Some((child, *ratio))),
-        (..) => vortex_bail!("chunked array compression tree must have a child iff it has a ratio"),
+    if children.len() == 0 {
+        return Ok(None);
+    } else if children.len() == 1 {
+        return Ok(Some((children.remove(0).unwrap(), target_ratio.unwrap())));
+    } else {
+        vortex_bail!("chunked array compression tree must have at most one child")
     }
 }
 
@@ -141,16 +141,16 @@ impl ChunkedCompressor {
             }
         }
 
-        let (child, ratio) = match previous {
-            Some((child, ratio)) => (Some(child), Some(ratio)),
-            None => (None, None),
+        let (children, ratio) = match previous {
+            Some((child, ratio)) => (vec![Some(child)], Some(ratio)),
+            None => (vec![], None),
         };
 
         Ok(CompressedArray::new(
             ChunkedArray::try_new(compressed_chunks, array.dtype().clone())?.into_array(),
             Some(CompressionTree::new_with_metadata(
                 self,
-                vec![child],
+                children,
                 Arc::new(ChunkedCompressorMetadata(ratio)),
             )),
         ))
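
For comparison, the "before" side of this diff can be sketched in isolation as follows. The types are hypothetical stand-ins (a string-wrapping `ChildTree` and `String` errors instead of Vortex's tree and `vortex_bail!`); only the invariant is kept, namely exactly one child slot that is populated if and only if a ratio is present.

```rust
/// Hypothetical stand-in for a child compression tree.
#[derive(Debug, Clone, PartialEq)]
struct ChildTree(String);

/// Splits a chunked compression tree into the reusable (child, ratio) pair.
/// Mirrors the removed lines of the diff: one child slot is always present,
/// and it must be `Some` exactly when `target_ratio` is `Some`.
fn like_into_parts(
    mut children: Vec<Option<ChildTree>>,
    target_ratio: Option<f32>,
) -> Result<Option<(ChildTree, f32)>, String> {
    if children.len() != 1 {
        return Err("chunked array compression tree must have one child".to_string());
    }
    match (children.remove(0), target_ratio) {
        (None, None) => Ok(None),
        (Some(child), Some(ratio)) => Ok(Some((child, ratio))),
        _ => Err("chunked array compression tree must have a child iff it has a ratio".to_string()),
    }
}
```

The `match` on the pair makes the two valid states explicit and needs no `unwrap`, at the cost of carrying a `vec![None]` when there is nothing to reuse.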

@danking danking requested a review from lwwmanning October 10, 2024 19:29
}

fn can_compress(&self, array: &Array) -> Option<&dyn EncodingCompressor> {
ChunkedArray::try_from(array)
Member

array.is_encoding(&Chunked::ID).then_some(self)

Member Author

done.
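
The suggested one-liner relies on `bool::then_some`, which turns a boolean check into an `Option`. A minimal sketch with hypothetical stand-in types (none of these are Vortex's actual `Array`, encoding IDs, or `EncodingCompressor` trait):

```rust
/// Hypothetical encoding identifiers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EncodingId {
    Chunked,
    Constant,
}

/// Hypothetical array that only knows its encoding.
struct ArraySketch {
    encoding: EncodingId,
}

impl ArraySketch {
    fn is_encoding(&self, id: EncodingId) -> bool {
        self.encoding == id
    }
}

/// Hypothetical compressor trait.
trait CompressorSketch {
    fn id(&self) -> &'static str;
}

struct ChunkedSketch;

impl CompressorSketch for ChunkedSketch {
    fn id(&self) -> &'static str {
        "chunked"
    }
}

impl ChunkedSketch {
    /// Returns this compressor only when the array uses the chunked encoding.
    /// No conversion is attempted; the check is a plain ID comparison.
    fn can_compress(&self, array: &ArraySketch) -> Option<&dyn CompressorSketch> {
        array
            .is_encoding(EncodingId::Chunked)
            // Explicit cast to the trait object to keep inference unambiguous.
            .then_some(self as &dyn CompressorSketch)
    }
}
```

Compared with attempting a full conversion via `try_from` and discarding the result, this only compares encoding IDs.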

@danking danking merged commit 3d6dd50 into develop Oct 10, 2024
5 checks passed
@danking danking deleted the dk/carry-forward-chunk-information-2 branch October 10, 2024 20:44
fn compress_array(&self, array: &Array) -> VortexResult<CompressedArray<'a>> {
let mut rng = StdRng::seed_from_u64(self.options.rng_seed);

if array.encoding().id() == Constant::ID {
Member

array.is_encoding(&Constant::ID)?


4 participants