Fix compression rate degradation occurring after a dictionary overflow for some workloads #82

Merged 9 commits on Nov 6, 2023


@lquerel lquerel commented Nov 4, 2023

This PR addresses the progressive degradation of the compression rate observed over time in Lightstep's span-type data. To understand the dynamics at play, instrumentation has been added that tracks the average compression rate in response to various schema modifications.

A new CLI command has also been added. It extends the simulation capabilities to cover different OTel Arrow stream life cycles, including variations in batch size and in the number of batches per stream.

The root cause of the diminishing compression rate is a dictionary overflow event. On overflow, the column automatically falls back to a representation without dictionary encoding, which is a standard and often appropriate response. However, there are scenarios where keeping dictionary encoding and resetting the dictionary is more beneficial. To choose between the two responses, a ratio is used: the number of distinct values in the dictionary divided by the number of values inserted into the dictionary. This ratio indicates when a dictionary reset is preferable to the default fallback. Based on empirical analysis, the default threshold for this ratio is set to 0.3. Further research may refine this threshold with a more systematic approach.
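
As a rough illustration of the decision described above, here is a minimal Go sketch. It is not the code from this PR: the names `overflowAction`, `onDictionaryOverflow`, and `defaultResetThreshold` are hypothetical, and the direction of the comparison (reset when the ratio is below the threshold, meaning values repeat often) is an assumption inferred from the description rather than something stated in it.

```go
package main

import "fmt"

// overflowAction is a hypothetical type naming the two possible responses
// to a dictionary overflow described in the PR text.
type overflowAction int

const (
	// fallbackToPlain drops dictionary encoding for the column (the default response).
	fallbackToPlain overflowAction = iota
	// resetDictionary keeps dictionary encoding and starts over with an empty dictionary.
	resetDictionary
)

// defaultResetThreshold mirrors the 0.3 default mentioned in the PR description.
const defaultResetThreshold = 0.3

// onDictionaryOverflow picks a response from the ratio
// distinctValues / insertedValues.
// Assumption: a low ratio means values repeat often, so the dictionary is
// still worth keeping and a reset is chosen; otherwise the column falls back.
func onDictionaryOverflow(distinctValues, insertedValues uint64, threshold float64) overflowAction {
	if insertedValues == 0 {
		// No statistics available: keep the default behavior.
		return fallbackToPlain
	}
	ratio := float64(distinctValues) / float64(insertedValues)
	if ratio < threshold {
		return resetDictionary
	}
	return fallbackToPlain
}

func main() {
	// 65536 distinct values out of 1,000,000 inserted: ratio is about 0.07.
	fmt.Println(onDictionaryOverflow(65536, 1000000, defaultResetThreshold) == resetDictionary) // true
	// 65536 distinct values out of 70,000 inserted: ratio is about 0.94.
	fmt.Println(onDictionaryOverflow(65536, 70000, defaultResetThreshold) == resetDictionary) // false
}
```

Under this assumption, a column whose values are mostly repeated keeps its dictionary after an overflow, while a column whose values are mostly unique takes the default fallback.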

The following chart compares compression efficiency gains before and after the optimization.

[Chart: compression efficiency gain after optimization]

lquerel and others added 6 commits October 31, 2023 18:06
Events such as schema update, dictionary upgrade, dictionary overflow are now notified to the ProducerObserver.
# Conflicts:
#	collector/go.mod
#	collector/go.sum
@lquerel lquerel merged commit bb84a55 into open-telemetry:main Nov 6, 2023
1 check passed