Gemini CQL statement logging generates large files #441

Open
CodeLieutenant opened this issue Nov 30, 2024 · 2 comments

@CodeLieutenant
Contributor

When gemini is run with --test-statement-log-file and --oracle-statement-log-file for longer periods of time, e.g. the 3h or 10h cases, these files become extremely large, to the point that they can kill the loader instance in SCT. This sometimes happens after as little as 10 minutes. With both flags set, the two files contain identical data (same size), so twice the storage is needed.

To be able to see everything gemini runs, we need to discuss how to implement better CQL statement logging.

Solution #1 (easy solution)

After commit 7c5dda0, fileLogger was changed to accept io.Writer instead of *os.File. This allows us to pass any writer to fileLogger, which includes the compression writers in the stdlib.

As observed in Argus with some failed gemini runs (which sadly are lost), gzip compression works really well, condensing statement files to ~50MB or so; when extracted they are ~600MB to 1GB for a 5-10m run (gzip is what SCT applies to the log files by default). We can do the same thing, and the implementation would be really easy.

Problem with this solution: logging every statement will still be expensive, so performance would degrade.
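
A minimal sketch of what Solution #1 could look like, using only the stdlib. It wraps an arbitrary io.Writer in gzip, with a bufio layer in front so that many small statement writes are batched before they reach the compressor. The helper names (newCompressedLog, compressStatements, decompress) and the 64KiB buffer size are assumptions for illustration, not gemini code; how the resulting writer is handed to fileLogger is left out because its constructor is not shown here. In gemini the destination would be the opened log file rather than an in-memory buffer.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// newCompressedLog wraps dst in a buffered gzip writer (hypothetical helper).
// The returned function flushes the buffer and finishes the gzip stream;
// it does not close dst itself.
func newCompressedLog(dst io.Writer) (io.Writer, func() error) {
	zw := gzip.NewWriter(dst)
	bw := bufio.NewWriterSize(zw, 64*1024) // batch small statement writes
	closeFn := func() error {
		if err := bw.Flush(); err != nil {
			return err
		}
		return zw.Close()
	}
	return bw, closeFn
}

// compressStatements writes stmt n times through the compressed writer
// and returns the resulting gzip bytes.
func compressStatements(stmt string, n int) ([]byte, error) {
	var buf bytes.Buffer
	w, closeFn := newCompressedLog(&buf)
	for i := 0; i < n; i++ {
		if _, err := io.WriteString(w, stmt); err != nil {
			return nil, err
		}
	}
	if err := closeFn(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress turns gzip bytes back into the original statement text.
func decompress(data []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	stmt := "INSERT INTO tbl1(pk1, pk2, col1, col2) VALUES(1, 2, 3, 4)\n"
	data, err := compressStatements(stmt, 10000)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw=%d bytes, gzip=%d bytes\n", len(stmt)*10000, len(data))
}
```

The closer must run on shutdown; otherwise the tail of the log stays in the buffer and the gzip stream is left without its trailer. Real statements vary more than this repeated line, so the ratio printed here says nothing about production logs.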

Solution #2

Log only what's necessary.

Currently gemini statement logging looks like this:

SELECT col1,col2,col3 FROM tbl1 WHERE pk1=VALUE....
INSERT INTO tbl1(pk1, pk2, col1, col2) VALUES(VALUE1, VALUE2, VALUE3, VALUE4)
...
  1. Store only the query type, the columns, and their values in the file
  2. Store the query type and the seed needed to regenerate the same CQL statement

Both options require a subcommand in the gemini CLI to reconstruct all the queries, so that we can see them and run them if there is a need. Performance-wise this solution is really good, as it makes a smaller number of write syscalls, but it adds code complexity: a custom format and reconstruction of the CQL statements.

Solution #2 takes a lot longer to implement and validate, but in the end it will be needed. As a short-term solution, so that we have the CQL statements in the logs, Solution #1 should be implemented as a temporary measure until this issue is discussed further.

@CodeLieutenant CodeLieutenant added enhancement New feature or request good first issue Good for newcomers labels Nov 30, 2024
@CodeLieutenant CodeLieutenant added this to the Gemini 1.9 milestone Nov 30, 2024
@CodeLieutenant CodeLieutenant self-assigned this Nov 30, 2024
@fruch
Collaborator

fruch commented Dec 2, 2024

@roydahan FYI

@fruch
Collaborator

fruch commented Dec 2, 2024

@CodeLieutenant

I would suggest one step at a time.

First, option zero: give the tests larger disks so we can run 3h or 10h; it's a simple, straightforward configuration in SCT.
Then look into option 1, since it doesn't contradict option 2.
