Exclude hidden files from logstream regex discovery #448

Merged 2 commits on Nov 5, 2024

jfzunigac (Contributor)

This resolves a problem where Singer attempts to initialize logstreams for watermark files when the regex in the logstream configuration isn't left-bounded. For example, given the regex .*test.* and a directory containing the following files:

.singer.my_logstream.test_001  <---- the watermark file
test_001 <---- the logfile

Singer tries to initialize logstreams for both files, which produces a third file: a second watermark that tracks the watermark file itself:

.singer.my_logstream..singer.my_logstream.test_001 <---- unwanted watermark file
.singer.my_logstream.test_001  <---- watermark file
test_001 <---- logfile

When filenames are long, the processor may be unable to persist progress to the unwanted watermark file because of "filename too long" exceptions.
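
A rough sketch of why the nested watermark is the one that fails: each level of watermarking prepends the prefix again, so the name grows while the filesystem limit stays fixed. The 255-byte limit below is an assumption (it is the per-filename limit on many Linux filesystems, but the PR does not state which limit was hit), and the long log filename and the `fits` helper are hypothetical.

```python
NAME_MAX = 255  # assumed per-filename byte limit; not stated in the PR


def watermark_name(logstream: str, filename: str) -> str:
    # Naming pattern inferred from the example: .singer.<logstream>.<filename>
    return f".singer.{logstream}.{filename}"


def fits(name: str, limit: int = NAME_MAX) -> bool:
    # Filesystem limits are counted in bytes, not characters.
    return len(name.encode("utf-8")) <= limit


logstream = "my_logstream"
logfile = "a" * 220 + "_test_001"           # hypothetical long log filename
wm = watermark_name(logstream, logfile)     # the legitimate watermark
wm_of_wm = watermark_name(logstream, wm)    # the unwanted nested watermark

for name in (logfile, wm, wm_of_wm):
    print(len(name), fits(name))
```

In this example the log file and its watermark stay under the assumed limit, while the nested watermark exceeds it.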

Test Plan:

Added unit tests and tested in a dev environment as well as in Kubernetes, since the problem is bound to happen more often there.

jfzunigac requested a review from a team as a code owner on October 29, 2024 18:18
vahidhashemian (Contributor)

Looks like the build failed because of the added unit test?

jfzunigac merged commit 03a5062 into pinterest:master on Nov 5, 2024
1 check passed

2 participants