
Fix fully read zipped files not being properly serialized in sincedb #286

Open
max-frank opened this issue Mar 26, 2021 · 3 comments · May be fixed by #287

Comments

@max-frank

Description

In read mode, gzipped files are currently always re-read in full, even when they have already been read completely.
This is due to a call to sincedb_collection.clear_watched_file(key) after the zip read completes.
That call removes the file information from the sincedb entry, making it impossible for Logstash to recognize the file as already read on a re-run.

The problem can be fixed by adding a call to sincedb_collection.reading_completed(key) before the clear_watched_file call, as is done in the plain-text handler.
The reading_completed call sets the @path_in_sincedb attribute, ensuring that the entry is serialized correctly even after the watched file is cleared.
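To illustrate why the call order matters, here is a minimal, self-contained Ruby sketch. These classes are hypothetical simplifications, not the actual logstash-input-file code; they only model how clearing the watched file before recording its path loses the information needed to serialize the entry:

```ruby
# Hypothetical, simplified model of the sincedb bookkeeping -- NOT the real
# logstash-input-file classes. It only illustrates the call-order bug.
class SincedbEntry
  def initialize(key, watched_file_path)
    @key = key
    @watched_file_path = watched_file_path # known only while file is watched
    @path_in_sincedb = nil                 # remembered by reading_completed
    @position = 0
  end

  def advance(bytes)
    @position += bytes
  end

  # Analogue of sincedb_collection.reading_completed(key): remember the
  # path so the entry can still be serialized after the file is cleared.
  def reading_completed
    @path_in_sincedb = @watched_file_path
  end

  # Analogue of sincedb_collection.clear_watched_file(key).
  def clear_watched_file
    @watched_file_path = nil
  end

  # Serialize roughly like a sincedb line: "<key> <position> <path>".
  def serialize
    "#{@key} #{@position} #{@path_in_sincedb || @watched_file_path}".strip
  end
end

# Buggy order (gzip handler before the fix): the path is lost.
buggy = SincedbEntry.new("262481-51713-116", "/tmp/test.log.gz")
buggy.advance(35)
buggy.clear_watched_file
puts buggy.serialize # no path -> file is re-read on the next run

# Fixed order: reading_completed before clear_watched_file keeps the path.
fixed = SincedbEntry.new("262481-51713-116", "/tmp/test.log.gz")
fixed.advance(35)
fixed.reading_completed
fixed.clear_watched_file
puts fixed.serialize # path survives -> file recognized as already read
```

In this model, the buggy ordering serializes an entry with no path, which is exactly the symptom described in the reproduction steps below.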

Information

  • Version: 4.2.4
  • Operating System: Ubuntu Bionic (18.04)
  • Config File:
# pipeline config
input {
  file {
    path => "/tmp/test.log.gz"
    mode => "read"
    sincedb_path => "/tmp/sincedb"
    file_completed_action => "log"
    file_completed_log_path => "/tmp/done.log"
    exit_after_read => true
  }
}

output {
  stdout {}
}
  • Sample Data:
    /tmp/test.log
Line 1
Line 2
Line 3
Line 4
Line 5
  • Steps to Reproduce:
    1. Zip /tmp/test.log:
       gzip /tmp/test.log
    2. Run the pipeline a first time (this creates the sincedb file).
    3. Run the pipeline again and see that the zip is read again.
    4. Inspect the sincedb file and see that the inode entry for /tmp/test.log.gz is missing the path.
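For reference, a sincedb line in recent plugin versions ends with the file's path (the format is, to my understanding: inode, major device, minor device, byte offset, last-active timestamp, path; the numeric values below are purely illustrative, not taken from a real run):

```
# healthy entry -- path present, so the file is recognized on re-run
262481 0 51713 35 1616760000.0 /tmp/test.log.gz

# broken entry produced for the gzip -- path missing, so it is re-read
262481 0 51713 35 1616760000.0
```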
@max-frank max-frank linked a pull request Mar 26, 2021 that will close this issue
@jbwl

jbwl commented May 3, 2022

Hi @max-frank

This is still broken: using the file input plugin with .gz files will break sincedb.
Did you find an elegant way to work around this issue? It seems the only way to keep sincedb functionality is to uncompress the files outside of Logstash, which I would of course like to avoid.

@max-frank
Author

@jbwl I have not needed this in a long time, so I did not look for solutions other than the tiny modification I made in #287. That change worked for me, but I did not test it thoroughly.

@Dukestep

Dukestep commented Apr 4, 2024

We have been hit with this issue while trying to read gzipped log files. Every time a pipeline restarted, Logstash would reingest files that had previously been fully read. This was causing us to have duplicate documents in Elasticsearch much like a fellow Logstash user described in this discuss.elastic.co discussion.

@max-frank's fix in #287 resolved our issue. Is there any reason this PR was never merged?
