New `large-logs-dataset` challenge in `elastic/logs` track #631

salvatore-campagna · 2024-07-25T10:41:06Z

We would like to run an experiment in Rally that uses a considerable amount of data. The idea is to fill the disk of an AWS instance that has 7.5 TB of storage. Indexing such a large amount of data poses at least two challenges, both of which result from the way the elastic/logs Rally track is designed:

- Data generation needs to be done before running each experiment. (We could generate the data once and then use it multiple times, but only if multiple experiments can share the same dataset.)
- Raw-to-JSON expansion means we need to generate far more JSON data than the raw volume we want to index into Elasticsearch. If we assume a raw-to-JSON expansion factor of 10, filling storage with 7.5 TB of raw data requires 75 TB of JSON data on the Rally load driver.

For our experiment, described in an internal Jira ticket, both challenges apply:

- We can't reuse the same dataset (that is, we can't generate the data just once), because at least the document @timestamp needs to change depending on how much data we need to index per day (raw_data_volume_per_day).
- Finding an AWS instance with such large storage is challenging and expensive. One candidate is is4gen.8xlarge, which has 4 x 7.5 TB = 30 TB of storage available. If we assume a x10 raw-to-JSON expansion, however, we would need 75 TB of JSON data to get 7.5 TB of raw data. This means that even the instance with the largest storage can't handle the amount of data we need.
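The sizing argument above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the x10 expansion factor is the assumption stated in this issue, not a measured value, and the variable and function names are made up for illustration.

```python
# Back-of-the-envelope sizing for the load driver.
# The x10 expansion factor is the assumption used in this issue,
# not a measured value.
RAW_TARGET_TB = 7.5      # raw data needed to fill the target disk
EXPANSION_FACTOR = 10    # assumed raw-to-JSON expansion
INSTANCE_DISKS = 4       # is4gen.8xlarge: 4 x 7.5 TB
DISK_SIZE_TB = 7.5


def json_volume_tb(raw_tb: float, expansion: float) -> float:
    """JSON data the load driver must hold for a given raw volume."""
    return raw_tb * expansion


json_needed_tb = json_volume_tb(RAW_TARGET_TB, EXPANSION_FACTOR)
instance_capacity_tb = INSTANCE_DISKS * DISK_SIZE_TB
shortfall_tb = json_needed_tb - instance_capacity_tb

print(f"JSON needed on load driver: {json_needed_tb} TB")
print(f"Instance capacity:          {instance_capacity_tb} TB")
print(f"Shortfall:                  {shortfall_tb} TB")
```

Under these assumptions the load driver would be short by 45 TB, which is why generating the full dataset up front is not an option.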

As a result, benchmarking this scenario is practically impossible, both because of resource constraints and because of the time that data generation and indexing would require.

So the idea is to adopt the following strategy, which we would like to implement as a new challenge in the elastic/logs track:

1. Index 100 GB of raw data, which means generating about 1 TB of JSON data on the load driver (ideally reusing raw_data_volume_per_day).
2. Create a snapshot of the indexed data.
3. Restore the snapshot multiple times (ideally controlled by a challenge parameter).
4. Execute the queries from the existing logging-querying challenge to collect query latencies.
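One way the "restore multiple times" step might look at the API level is sketched below. It only builds the request paths and bodies for the Elasticsearch restore snapshot API and sends nothing. Restoring the same snapshot into the same cluster more than once requires renaming what is restored, so the sketch uses the API's `rename_pattern` and `rename_replacement` options. The function name, the `logs-*` pattern, the `-copy-N` suffix, and the repository and snapshot names are all assumptions for illustration, not what the challenge actually does.

```python
import json


def build_restore_requests(repository, snapshot, copies, indices="logs-*"):
    """Build one restore request (path, body) per copy of the snapshot.

    Each copy gets a distinct suffix so that restored data does not
    collide with the original data or with earlier copies.
    """
    requests = []
    for copy in range(1, copies + 1):
        path = f"/_snapshot/{repository}/{snapshot}/_restore?wait_for_completion=true"
        body = {
            "indices": indices,
            "include_global_state": False,
            # Capture the whole name and append a per-copy suffix
            # (hypothetical naming scheme).
            "rename_pattern": "(.+)",
            "rename_replacement": f"$1-copy-{copy}",
        }
        requests.append((path, body))
    return requests


# Hypothetical repository and snapshot names.
restore_requests = build_restore_requests("logging", "logging-snapshot", copies=3)
for path, body in restore_requests:
    print("POST", path)
    print(json.dumps(body, indent=2))
```

Appending the suffix, rather than prepending a prefix, keeps the restored names matching a `logs-*` pattern, so queries that target that pattern would also hit the copies. Whether that holds for the track's queries is an assumption.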

For the use case above, where we need to fill the instance with 7.5 TB of raw data, this means restoring the snapshot 75 times. We expect:

- The time required to have the full dataset indexed to be far less than the time needed to generate and index the full dataset, as would normally happen with the elastic/logs track and the existing logging-querying challenge.
- Queries to crunch more documents because of data duplication. That is OK as long as we only compare query latencies with other setups using the same track and challenge (i.e. "standard" index mode versus LogsDB).
- A lower storage footprint due to data duplication and better data compression.

An experiment configured as described above mimics an environment where 75 hosts are logging exactly the same dataset.
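The figure of 75 restores follows from dividing the target raw volume by the raw volume captured in one snapshot. A minimal sketch, assuming decimal units (1 TB = 1000 GB) and a hypothetical helper name:

```python
def restores_needed(target_raw_gb: int, snapshot_raw_gb: int) -> int:
    """Snapshot copies needed to reach the target raw volume (rounded up)."""
    return -(-target_raw_gb // snapshot_raw_gb)  # integer ceiling division


# 7.5 TB target, 100 GB of raw data per snapshot.
restore_count = restores_needed(7_500, 100)
print(restore_count)
```

A challenge parameter for the restore count could be derived this way from the target volume instead of being hard-coded.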

Note that the snapshot API is only available in on-prem deployments, which means we need to run the benchmark on-prem.


salvatore-campagna · 2024-07-25T10:43:13Z

@elastic/es-perf I see that the existing elastic/logs track has a cross-clusters-search-and-snapshot challenge, which does something similar to what I described above but restores the snapshot to remote clusters (for CCS). Would it be possible to reuse that challenge and restore the snapshot to the original cluster multiple times instead of restoring it to remote clusters?

salvatore-campagna · 2024-07-25T12:28:20Z

Note that the new challenge needs to skip deleting one component template, .fleet_globals-1. If the delete operation is not skipped, the delete-all-component-templates task later fails, because this component template is still in use by index templates installed by Elasticsearch.

The error is:

esrally.exceptions.RallyError: Cannot run task [delete-all-component-templates]: Request returned an error. Error type: api, Description: illegal_argument_exception ({'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'component templates [.fleet_globals-1] cannot be removed as they are still in use by index templates [synthetics-browser.screenshot, synthetics-browser, synthetics-icmp, synthetics-http, synthetics-tcp, metrics-fleet_server.agent_status, metrics-fleet_server.agent_versions, synthetics-browser.network, logs-fleet_server.output_health]'}], 'type': 'illegal_argument_exception', 'reason': 'component templates [.fleet_globals-1] cannot be removed as they are still in use by index templates [synthetics-browser.screenshot, synthetics-browser, synthetics-icmp, synthetics-http, synthetics-tcp, metrics-fleet_server.agent_status, metrics-fleet_server.agent_versions, synthetics-browser.network, logs-fleet_server.output_health]'}, 'status': 400}), HTTP Status: 400

See 6251688
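The idea of skipping the protected template can be illustrated with a small filter. This is a sketch only and not the change made in the commit referenced above: the helper name and the example template names other than .fleet_globals-1 are hypothetical.

```python
# Component templates that must not be deleted because index templates
# installed by Elasticsearch still reference them.
PROTECTED_COMPONENT_TEMPLATES = frozenset({".fleet_globals-1"})


def deletable_component_templates(names, protected=PROTECTED_COMPONENT_TEMPLATES):
    """Return the names that are safe to delete, preserving input order."""
    return [name for name in names if name not in protected]


# Hypothetical list of component templates found in the cluster.
candidates = ["example-mappings", ".fleet_globals-1", "example-settings"]
to_delete = deletable_component_templates(candidates)
print(to_delete)
```

Filtering before issuing the delete requests avoids the HTTP 400 shown in the error above.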

salvatore-campagna added the enhancement label Jul 25, 2024

salvatore-campagna self-assigned this Jul 25, 2024

salvatore-campagna mentioned this issue Jul 25, 2024

LogsDB - Rally elastic/logs dataset generation elastic/elasticsearch#111009

Closed

salvatore-campagna linked a pull request Jul 25, 2024 that will close this issue

New large-logs-dataset challenge in elastic/logs #632

Open

kkrik-es mentioned this issue Aug 2, 2024

Create large-logs-dataset challenge #634

Closed
