Fix compression rate degradation occurring after a dictionary overflow for some workloads #82
This PR addresses the progressive degradation of the compression rate observed over time in Lightstep's span-type data. To understand the cause, it adds instrumentation that tracks how the average compression rate responds to schema modifications.
It also introduces a new CLI command that extends the simulation capabilities to cover different OTel Arrow stream life cycles, including variations in batch size and in the number of batches per stream.
The root cause of the diminishing compression rate is a dictionary overflow. On overflow, the column automatically falls back to a representation without dictionary encoding, which is a standard and often appropriate response. In some scenarios, however, keeping dictionary encoding and resetting the dictionary is more beneficial. To choose between the two, a ratio is computed: the number of distinct values in the dictionary divided by the number of values inserted into the dictionary. This ratio determines when a dictionary reset is preferable to the default fallback. The default threshold for this ratio is 0.3, a value chosen empirically. Further research may refine this threshold with a more systematic method.
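To make the decision rule concrete, here is a minimal Go sketch of the ratio check described above. It is an illustration and not the code in this PR: the names `DictStats`, `OverflowAction`, `OnOverflow`, and `DefaultResetThreshold` are hypothetical. The direction of the comparison is also an assumption: the sketch resets the dictionary when the ratio is below the threshold, on the reasoning that few distinct values relative to the number of insertions means the values repeat often and dictionary encoding is still paying off.

```go
package main

// OverflowAction is a hypothetical name for the choice made when a
// dictionary overflows.
type OverflowAction int

const (
	// FallbackToPlain drops dictionary encoding for the column (the default response).
	FallbackToPlain OverflowAction = iota
	// ResetDictionary keeps dictionary encoding and starts a fresh dictionary.
	ResetDictionary
)

// DefaultResetThreshold mirrors the empirical default of 0.3 from the PR description.
const DefaultResetThreshold = 0.3

// DictStats holds the two counters the ratio is built from (hypothetical type).
type DictStats struct {
	Distinct uint64 // number of distinct values currently in the dictionary
	Inserted uint64 // total number of values inserted into the dictionary
}

// Ratio returns distinct / inserted. With no insertions there is no evidence
// of repetition, so it returns 1.0 (treated as maximal cardinality).
func (s DictStats) Ratio() float64 {
	if s.Inserted == 0 {
		return 1.0
	}
	return float64(s.Distinct) / float64(s.Inserted)
}

// OnOverflow picks the response to an overflow.
// ASSUMPTION: a ratio below the threshold selects a reset; the PR description
// does not state the direction of the comparison.
func OnOverflow(s DictStats, threshold float64) OverflowAction {
	if s.Ratio() < threshold {
		return ResetDictionary
	}
	return FallbackToPlain
}
```

Under these assumptions, a dictionary holding 256 distinct values after 10,000 insertions has a ratio of about 0.026 and would be reset, while 256 distinct values after 300 insertions gives a ratio of about 0.85 and would fall back to the non-dictionary column.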
The following chart compares compression efficiency before and after this optimization.