somehow ls-tree is being called with git-dir as an option #1
I've also seen this happen once on papyri.info, but can't figure out why it errors this way (or what circumstances cause it to arise). Most git commands take --git-dir as a global option before the subcommand, which is how gitwrapper appears to be using it in the two spots where it shells out to git.
I managed to get the crazy long process to abort using the updated version with the Executing statement printed in the exception, and it looks like the command is right:

Executing git --git-dir=/usr/local/gitrepos/communities/Divan of Hafiz/Divan of Hafiz.git ls-tree 8f1d30982099c0c0e31578b5153d2925824f876a | git --git-dir=/usr/local/gitrepos/users/Bridget Almas.git mktree --missing

The output of ls-tree when I run it manually looks right, and running the full command piped to mktree succeeds as well. So it's a bit hard to understand the exception in the context of the executed statement. I'm back to thinking this is some sort of buffer/memory overflow exposed by the nature of this particular publication, which is really huge, with lots of identifiers and commits on each identifier.
Also, I'm sure I must be doing something wrong here, since the Travis tests with the reverted gitwrapper jar are now passing, but in my local test environment the publication tests keep failing even with the reverted jar.
Have you tried cleaning your test environment with script/clean?
I can vouch for this, having been driven insane by it before I realized I had to clean before re-running when my tests errored out.
thanks both. script/clean did the trick for the tests at least. (glad to know I'm not the only one to have been driven insane by this ;) Back to the original problem, I'm still waiting to see whether my manual tweaking of the git tree of the target user enables the finalization step to complete, but I'm not too hopeful. I'll let you know how it goes. I may need to do some more detailed monitoring of the system to see exactly what is happening at the point of failure.
As I typed the last comment, it failed again. I was watching CPU and memory as it occurred, and based on this and the monitoring logs, it looks like I get up to almost 14% CPU and 25% memory while the process is running just before the failure, so the system isn't falling down, but it is working hard compared to an average of about 1% CPU (I don't have the memory stats, unfortunately). I see spikes in the log that likely match each failed attempt today, but historically I do have spikes that go higher than that.
But it actually is failing on a different commit this time... will try again to see if the commit changes with each attempt.
I'm pretty sure this is at least related to memory. I watched the memory consumption increase consistently as the process tried to complete the fetchLite call, until it eventually aborted with the git failure. It could be that there isn't enough left for git to complete the call, but there was over 2GB available, and other git calls on other files continue to succeed. Interestingly, after the aborted fetchLite call it seems the memory was never released.
To aid debugging, I've added a simple command-line interface for running this call directly.
I was actually thinking we could maybe shell out to such a command within SoSOL, in case fetchLite might be contributing to sosol/sosol#99 (since the memory should be released if it's in a Java subprocess that terminates, and it would also allow us to give e.g. fetchLite-specific JVM options).
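For context, a minimal sketch of what such a stand-alone entry point might look like on the gitwrapper side (the gitwrapper.core namespace, the fetchLite name and arity, and the arguments are assumptions here, not the actual gitwrapper API):

(ns gitwrapper.cli
  (:gen-class)
  (:require [gitwrapper.core :as gw])) ; hypothetical namespace

(defn -main
  "Hypothetical entry point: run fetchLite in its own JVM so that any
   memory it holds onto is reclaimed when the subprocess exits."
  [source-repo target-repo & _]
  (try
    (gw/fetchLite source-repo target-repo) ; assumed name and arity
    (System/exit 0)
    (catch Exception e
      (binding [*out* *err*]
        (println (.getMessage e)))
      (System/exit 1))))

SoSOL could then spawn this as a subprocess (java -jar plus the two repo paths, or similar), which is also where fetchLite-specific JVM flags could be applied.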
@ryanfb thanks so much for that. I get exactly the same error running from the command line as I do from within the app, which at least allows us to rule out Tomcat and the suckerpunch async as contributing factors. It may be a little while before I get back to debugging this. I do think there may be some sort of memory leak here, but there may be other factors as well. This particular publication that I'm having a problem with is definitely an edge case, or at least a scenario that wasn't initially envisioned for the app, as there are 500 individual small files with many commits in this one publication. There could also be something particular about this publication's git tree that is throwing things off. It's oddly consistent with the commit it balks at. Will try to provide more debugging details soon.
Ok, added some more debugging to gitwrapper and ran again. For this particular publication, it would seem that accumulate-diffs is accumulating memory as well. I'm not very informed about Clojure and functional programming, but I'm wondering if the problem stems from this line: https://github.com/sosol/gitwrapper/blob/master/src/gitwrapper/utils.clj#L104 which doesn't seem to be embedded in a loop/recur structure?
tagging @fbaumgardt to put him in the loop (no pun intended)
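For readers less familiar with Clojure, the question above is essentially whether the accumulation happens through a stack-consuming self-call or through loop/recur. A generic illustration of the difference (not gitwrapper's actual code):

;; Plain self-call: every pending step keeps a JVM stack frame alive, so a very
;; deep traversal can blow the stack (though that raises StackOverflowError
;; rather than an out-of-memory or git error).
(defn accumulate-naive [acc remaining]
  (if (empty? remaining)
    acc
    (accumulate-naive (conj acc (first remaining)) (rest remaining))))

;; loop/recur: the same accumulation in constant stack space. Either way, the
;; accumulated collection itself still grows on the heap as the history does.
(defn accumulate-looped [remaining]
  (loop [acc [] remaining remaining]
    (if (empty? remaining)
      acc
      (recur (conj acc (first remaining)) (rest remaining)))))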
I don't think that's the problem, but there's bound to be a lot of string-munging in this process, which will spike memory usage. Not sure there's a good way around that though. Both calls are in the loop; it's just possible that we've reached the file that's different, in which case the last object will be a blob. That said, it's making the assumption that each commit changes just one file in a directory. If that's not the case, I'm not sure it'll work properly. (It is the case in the PE.)

If it's not failing with some sort of out-of-memory error, then I don't think the many calls to accumulate-diffs can be the problem though. What happens if you run the failing git ls-tree command? It is possible that this just isn't a good match for the scenario the program is written for, which is to short-circuit the fetch process when we know we have a simple fast-forward situation. Maybe we need some safeguards so that it fails over to a regular fetch if there are too many commits.
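A safeguard of the kind suggested above might look roughly like the following sketch; count-commits-between, fetch-lite, and full-fetch are placeholder names, and the threshold is arbitrary:

;; Placeholders for the real operations; declared only so this sketch compiles.
(declare count-commits-between fetch-lite full-fetch)

(def ^:private max-lite-commits 200) ; arbitrary cutoff, for illustration only

(defn fetch-with-fallback
  "Hypothetical guard: take the fetchLite shortcut only for small fast-forward
   updates, and fall back to a regular fetch for anything bigger."
  [source-repo target-repo]
  (if (<= (count-commits-between source-repo target-repo) max-lite-commits)
    (fetch-lite source-repo target-repo)
    (full-fetch source-repo target-repo)))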
Running the failing command manually works. It's actually not failing in accumulate-diffs; it gets through that, it just fails when it moves on to use the results of accumulate-diffs. If I change the nature of the diffs (i.e. by running the failing ls-tree | mktree command manually), it just fails on a different commit. All commits should just be single-file commits though -- I haven't changed anything about that, at least not intentionally. Re the memory, I was just thinking that maybe so much memory has been consumed by the time it gets to the save-object calls that either the shell out to git or git itself falls down. It might be the shell process, because of the boggling nature of the error: printing the shell command before it happens confirmed that the command itself is right, so it must be either (1) the shell isn't passing the command to git intact or (2) git is crashing with a bogus error.
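On point (1): a pipe is shell syntax, so the whole command line has to be handed to a shell rather than exec'd directly. A minimal sketch of that hand-off, assuming something like clojure.java.shell (not necessarily what gitwrapper actually uses):

(require '[clojure.java.shell :as shell])

(defn run-piped
  "Run a full pipeline such as \"git ... ls-tree ... | git ... mktree --missing\"
   through /bin/sh, returning {:exit :out :err}. Any repository paths containing
   spaces have to be quoted before this point, or the shell will split them
   into separate arguments."
  [command-line]
  (shell/sh "sh" "-c" command-line))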
So I ran into this in the wild again with another publication. This time I tried manually fetching the board remote inside the user repo and noticed there were error messages during the fetch.
After fixing the board repo so that those error messages wouldn't occur during a fetch, a retry succeeded. Is it possible that if these sorts of errors occur in the repo being fetched, that's what triggers this failure?
hmm. I don't know. I believe in the situation I was experiencing it in, the manual fetch worked without error, but I could be wrong. It also might be that the error being thrown by git is completely useless and doesn't reflect the actual problem at all.
In case it's potentially related, the fix I made on the board repo to fix the fetch errors was to delete the invalid ref: sosol/sosol#123 |
If it's some sort of buffer overflow, fixing the busted ref might have just put it under the threshold.
I don't know if this is related, or a separate issue, but I just ran into another problem here: this line: https://github.com/sosol/gitwrapper/blob/master/src/gitwrapper/utils.clj#L99 is being called on a commit which is a blob instead of a tree. I'm struggling to understand the Clojure code here. Repeating the test for the blob that appears in the else clause of the seq test allowed me to complete the fetchLite call on the branch that exposed this, but I'm not sure if everything copied properly. I'm also a little confused about the circumstances that would cause the else clause to occur.
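One way to make that branch explicit is to ask git for the object's type before deciding how to treat it. A sketch using git cat-file -t (this is not how gitwrapper currently does it, just an illustration; the repo path in the example is hypothetical):

(require '[clojure.java.shell :as shell]
         '[clojure.string :as str])

(defn object-type
  "Return \"tree\", \"blob\", \"commit\", or \"tag\" for the given object,
   so callers can branch instead of assuming every entry is a tree."
  [git-dir sha]
  (str/trim (:out (shell/sh "git" (str "--git-dir=" git-dir) "cat-file" "-t" sha))))

;; e.g. (object-type "/usr/local/gitrepos/users/example.git" "8f1d3098...")
;; => "tree" or "blob"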
The new method uses java.lang.ProcessBuilder and pipes stderr to /dev/null. Hopefully this will avoid the weird behavior in #1.
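A rough sketch of how that might look from Clojure via interop (assuming a POSIX system where /dev/null exists; not necessarily the exact code in the commit):

(defn run-git
  "Run a git command with ProcessBuilder, sending stderr to /dev/null and
   returning stdout as a string."
  [git-dir & args]
  (let [cmd  (into ["git" (str "--git-dir=" git-dir)] args)
        pb   (doto (ProcessBuilder. ^java.util.List cmd)
               (.redirectError (java.lang.ProcessBuilder$Redirect/to
                                 (java.io.File. "/dev/null"))))
        proc (.start pb)
        out  (slurp (.getInputStream proc))]
    (.waitFor proc)
    out))

;; e.g. (run-git "/usr/local/gitrepos/users/example.git" "ls-tree" "HEAD")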
Java::JavaUtilConcurrent::ExecutionException: java.lang.Exception: error: unknown option `git-dir=/usr/local/gitrepos/users/srd20.git'
usage: git ls-tree [<options>] <tree-ish> [<path>...]
RUBY.transaction(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/transactions.rb:250)
RUBY.with_lock(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/locking/pessimistic.rb:70)
RUBY.perform(/var/lib/tomcat6/webapps/sosol/WEB-INF/app/jobs/send_to_finalizer_job.rb:10)
RUBY.with_connection(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/connection_adapters/abstract/connection_pool.rb:129)
RUBY.perform(/var/lib/tomcat6/webapps/sosol/WEB-INF/app/jobs/send_to_finalizer_job.rb:8)
org.jruby.RubyKernel.public_send(org/jruby/RubyKernel.java:1930)
RUBY.dispatch(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26)
RUBY.dispatch(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/calls.rb:63)
RUBY.invoke(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60)
RUBY.task(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71)
RUBY.task(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357)
RUBY.initialize(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57)
RUBY.create(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15)
java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:615)
java.lang.Thread.run(java/lang/Thread.java:745)