-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log number of montiored files #491
Conversation
a519997
to
8f7286c
Compare
e1ec189
to
620deca
Compare
internal/semaphore/semaphore.go
Outdated
if (timeout > 0) { | ||
timer := time.NewTimer(timeout) | ||
defer timer.Stop() | ||
select { | ||
case <-sem.slots: | ||
return true | ||
case <-timer.C: | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to have a timer here. We can just use time.After()
to fail through:
if (timeout > 0) { | |
timer := time.NewTimer(timeout) | |
defer timer.Stop() | |
select { | |
case <-sem.slots: | |
return true | |
case <-timer.C: | |
} | |
} | |
if (timeout > 0) { | |
select { | |
case <-sem.slots: | |
return true | |
case <-time.After(timeout): | |
} | |
} |
plugins/inputs/logfile/constant.go
Outdated
truncate_suffix = "[Truncated...]" | ||
` | ||
|
||
const defaultTimeoutToAcquire = 1 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One second is a long time. Are you sure this can't be like 100ms? I'm very concerned about blocking on a channel for this semaphore because we run the risk of putting the agent in a deadlock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This can be 100ms or even 1ms
since 1 * time.Second
is only an arbitrary value so good point on this one.
time.Sleep(500 * time.Millisecond) | ||
assert.Equal(t, 0, resources.numUsedFds.GetCount()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to assert and test for the fds counter in every test? It's redundant and doesn't increase test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To account for the case whether the semaphore would drop by 1 if it stops tailing the file. This can be addressed in tail_test too which I had done it. However, there is a strong correlation between tail
and tailsrc
(tailsrc has a attribute of tailer). Therefore, I would introduce this to avoid any regression between these two
internal/semaphore/semaphore.go
Outdated
) | ||
|
||
type Semaphore interface { | ||
//Acquire a slot in the semaphore with a timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth documenting that if timeout is <=0, then Acquire() will not block. I initially thought timeout=0 would behave as if timeout=infinity.
if ok := t.numUsedFds.Acquire(defaultTimeoutToAcquire); !ok { | ||
t.Log.Debugf("Cannot increase counter of used file descriptors") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semaphore code was really well done!
I don't think it needs to be used here.
It would be simpler and use less memory to have a single counter (integer instead of channel) of the open files (aka files being tailed). Then just log the counter's current value periodically.
It is undesirable to change behavior here and block for up to 1 second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem with just having a simple global counter is that the logfile plugin itself doesn't know/care when log files stop being monitored.. Though I guess rather than having a semaphore we could maybe just have an atomic counter that we lock/unlock before incrementing/decrementing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the information: there were others use semaphore as a counter (XSync, Marusama). However, after comparing the performance between them, I would say this version (or similar to DropBox), this would be the best option(for now) when decide the file descriptors as a upper bound.
BenchmarkSemaphore_AcquireReleaseUnderLimitSimple-12 19336089 62.26 ns/op 0 B/op 0 allocs/op
BenchmarkSemaphore_AcquireReleaseUnderLimit-12 1652306 726.5 ns/op 0 B/op 0 allocs/op
BenchmarkSemaphore_AcquireReleaseOverLimit-12 32737 37202 ns/op 0 B/op 0 allocs/op
BenchmarkMarusamaSemaphore_AcquireReleaseUnderLimitSimple-12 6038138 171.7 ns/op 96 B/op 1 allocs/op
BenchmarkMarusamaSemaphore_AcquireReleaseUnderLimit-12 479114 2268 ns/op 960 B/op 10 allocs/op
BenchmarkMarusamaSemaphore_AcquireReleaseOverLimit-12 50464 24728 ns/op 9600 B/op 100 allocs/op
BenchmarkAbiosoftSemaphore_AcquireReleaseUnderLimitSimple-12 8135151 145.0 ns/op 0 B/op 0 allocs/op
BenchmarkAbiosoftSemaphore_AcquireReleaseUnderLimit-12 432472 2731 ns/op 0 B/op 0 allocs/op
BenchmarkAbiosoftSemaphore_AcquireReleaseOverLimit-12 51364 27126 ns/op 0 B/op 0 allocs/op
BenchmarkPivotalGolangSemaphore_AcquireReleaseUnderLimitSimple-12 3358042 355.3 ns/op 128 B/op 2 allocs/op
BenchmarkPivotalGolangSemaphore_AcquireReleaseUnderLimit-12 149810 7881 ns/op 1200 B/op 20 allocs/op
BenchmarkPivotalGolangSemaphore_AcquireReleaseOverLimit-12 14558 83574 ns/op 12000 B/op 200 allocs/op
BenchmarkXSyncSemaphore_AcquireReleaseUnderLimitSimple-12 40687284 27.22 ns/op 0 B/op 0 allocs/op
BenchmarkXSyncSemaphore_AcquireReleaseUnderLimit-12 1000000 1060 ns/op 0 B/op 0 allocs/op
BenchmarkXSyncSemaphore_AcquireReleaseOverLimit-12 17275 69725 ns/op 15994 B/op 299 allocs/op
However, the 1 second
is only an arbitrary value and it could be more or less. But as long as it within acceptable range (e.g do customer cares about the show number of monitored files being late in 1 second
or not? Maybe its exchange between source of truth and the fastness a little bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could get by with an atomic counter. I think the code would look a little weird, but it's 100% feasible. Take a look at https://gobyexample.com/atomic-counters. Pretty sure since it considers the input a delta, you could pass in 1 when a log file is discovered, and pass in -1 when the tailer closes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm wrong but according to the syns/atomic
Package atomic provides low-level atomic memory primitives useful for implementing synchronization algorithms.
These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package.
So I'm still considering between the two. However, will compare performance before giving my final decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compare a little bit when doing atomic counter with mutex
BenchmarkSemaphore_AcquireReleaseUnderLimit-12 9621 129439 ns/op 0 B/op 0 allocs/op
BenchmarkAtomicCounterWithOutLock-12 20169 60657 ns/op 0 B/op 0 allocs/op
BenchmarkAtomicCounterWithLock-12 5130 256611 ns/op 13 B/op 0 allocs/op
So agree if we only care about the end results; then Atomic Counter would be a great idea. However, with a lock, I would stand by using semaphore instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to derail, I wasn't as concerned about the implementation of counter vs semaphore.
Agreed that an atomic counter (no lock needed) is simpler and would suffice here.
More concerned that you are introducing a blocking function call on Acquire()
.
I initially thought this PR would just add a variable to track the count of open files AND then periodically log it.
Not to check the counter value and change behavior based on the counter value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More concerned that you are introducing a blocking function call on Acquire().
Acquire()
would be not blocking the function since we are using the timing value
and it we don't provide the timing value, it would return false immediately. So there is a thing we need to consider: would we going with zero timeout value or some partial timeout value (e.g 100 Milisecond, I changed it from 1 second to 100 ms). As long as its within the acceptable range. I will use default time as 0
if it concerns the customer about the blocking part.
Not to check the counter value and change behavior based on the counter value.
if tail.numUsedFds.GetLimit() != 0 {
The reason why I used the previous line is to avoid CloudWatchAgent is not able to get the file descriptors from the linux and assign it to 0 as a default so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I know it does not block indefinitely, but I consider it "blocking" even if it is blocks for 100ms.
*** I don't think it should block at all. I propose setting defaultTimeoutToAcquire
to 0, or just removing the timeout parameter from Acquire()
.***
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The timeout stops it from blocking indefinitely but is still potentially paused for up to 100ms. That's not great
d7cd36c
to
4c00cfe
Compare
5335a6a
to
a933c40
Compare
case <-time.After(timeout): | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this return false because it failed to claim before the timeout
truncate_suffix = "[Truncated...]" | ||
` | ||
|
||
const defaultTimeoutToAcquire = 0 * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to multiply by millisecond?
Close since #506 has a better solution and better performance |
Close #405
Close #432
Description of the issue
If CloudWatchAgent runs in an intensive log environment, CloudWatchAgent will monitor all the files and it can exhaust all the file descriptors that being allowed by the OS for a single process. Therefore, CloudWatchAgent needs to show the number of monitored files and show the number of allowed to be monitored for customer to have corresponding actions.
Description of changes
file-descriptor
add-on for each OS and different OS can return different file descriptors limit.License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Reasons to choose buffered channel
Tests
There are two tests that I have done:
auto_removal
features:Sample app
Sample CloudWatchAgent JSON configuration:
After running these two tests, confirm the monitored file by comparing the monitored files
lsof -p {{CloudWatchAgent process id}}
with logs inCloudWatchAgent.logs
. An example of expected results:logrotate
Sample script
Sample Amazon CloudWatch Agent JSON configuration
Requirements
Before commit the code, please do the following steps.
make fmt
andmake fmt-sh
make linter
Performance benchmarking
Before(v353) and after adding tailer semaphore ( numUsedFds) with monitoring 15 log files (not sure why my version is more lighter)
Comparing between Semaphore