Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

somehow ls-tree is being called with git-dir as an option #1

Open
balmas opened this issue Aug 31, 2015 · 20 comments
Open

somehow ls-tree is being called with git-dir as an option #1

balmas opened this issue Aug 31, 2015 · 20 comments

Comments

@balmas
Copy link

balmas commented Aug 31, 2015

Java::JavaUtilConcurrent::ExecutionException: java.lang.Exception: error: unknown option `git-dir=/usr/local/gitrepos/users/srd20.git'
usage: git ls-tree [] [...]

-d                    only show trees
-r                    recurse into subtrees
-t                    show trees when recursing
-z                    terminate entries with NUL byte
-l, --long            include object size
--name-only           list only filenames
--name-status         list only filenames
--full-name           use full path names
--full-tree           list entire tree; not just current directory (implies --full-name)
--abbrev[=<n>]        use <n> digits to display SHA-1s


    java.util.concurrent.FutureTask.report(java/util/concurrent/FutureTask.java:122)
    java.util.concurrent.FutureTask.get(java/util/concurrent/FutureTask.java:188)
    clojure.core$deref_future.invoke(core.clj:2180)
    clojure.core$deref_future.invoke(clojure/core.clj:2180)
    clojure.core$future_call$reify__6320.deref(core.clj:6420)
    clojure.core$future_call$reify__6320.deref(clojure/core.clj:6420)
    clojure.core$deref.invoke(core.clj:2200)
    clojure.core$deref.invoke(clojure/core.clj:2200)
    clojure.core$pmap$step__6333$fn__6335.invoke(core.clj:6470)
    clojure.core$pmap$step__6333$fn__6335.invoke(clojure/core.clj:6470)
    clojure.lang.LazySeq.sval(clojure/lang/LazySeq.java:40)
    clojure.lang.LazySeq.seq(clojure/lang/LazySeq.java:49)
    clojure.lang.Cons.next(clojure/lang/Cons.java:39)
    clojure.lang.RT.next(clojure/lang/RT.java:598)
    clojure.core$next.invoke(core.clj:64)
    clojure.core$next.invoke(clojure/core.clj:64)
    clojure.core$dorun.invoke(core.clj:2856)
    clojure.core$dorun.invoke(clojure/core.clj:2856)
    gitwrapper.utils$_fetchLite.invoke(utils.clj:193)
    gitwrapper.utils$_fetchLite.invoke(gitwrapper/utils.clj:193)
    java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)
    RUBY.copy_branch_from_repo(/var/lib/tomcat6/webapps/sosol/WEB-INF/app/models/repository.rb:219)
    RUBY.copy_to_owner(/var/lib/tomcat6/webapps/sosol/WEB-INF/app/models/publication.rb:1324)
    RUBY.send_to_finalizer(/var/lib/tomcat6/webapps/sosol/WEB-INF/app/models/publication.rb:794)
    RUBY.perform(/var/lib/tomcat6/webapps/sosol/WEB-INF/app/jobs/send_to_finalizer_job.rb:12)
    RUBY.with_lock(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/locking/pessimistic.rb:72)
    RUBY.transaction(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/connection_adapters/abstract/database_statements.rb:192)
    RUBY.transaction(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/transactions.rb:208)

RUBY.transaction(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/transactions.rb:250)
RUBY.with_lock(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/locking/pessimistic.rb:70)
RUBY.perform(/var/lib/tomcat6/webapps/sosol/WEB-INF/app/jobs/send_to_finalizer_job.rb:10)
RUBY.with_connection(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/activerecord-3.2.21/lib/active_record/connection_adapters/abstract/connection_pool.rb:129)
RUBY.perform(/var/lib/tomcat6/webapps/sosol/WEB-INF/app/jobs/send_to_finalizer_job.rb:8)
org.jruby.RubyKernel.public_send(org/jruby/RubyKernel.java:1930)
RUBY.dispatch(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26)
RUBY.dispatch(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/calls.rb:63)
RUBY.invoke(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60)
RUBY.task(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71)
RUBY.task(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357)
RUBY.initialize(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57)
RUBY.create(/var/lib/tomcat6/webapps/sosol/WEB-INF/gems/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15)
java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:615)
java.lang.Thread.run(java/lang/Thread.java:745)

@ryanfb
Copy link
Member

ryanfb commented Aug 31, 2015

I've also seen this happen once on papyri.info, but can't figure out why it errors this way (or what circumstances cause it to arise). Most git commands take --git-dir as an argument before the command, e.g.:

$ git --git-dir=/srv/data/papyri.info/sosol/repo/canonical.git/ ls-tree 83feab020c7d038efa5f158b693a1208b924fd12
100644 blob 481c824f68a68a3ee79276e93a5065969891f67e    .gitignore
040000 tree 123df875814b76a2b5c618771b92845f8c2080f0    APD
040000 tree 3508857e0deb3fbd0d96b1706ef463e3694bf2d6    APIS
040000 tree b5dbd65e78981cc0e526382e29fdcad102ae236c    Biblio
040000 tree 960dba7acdd49a21d389b476c849b20096912070    DDB_EpiDoc_XML
040000 tree 1f32ac9af2ac07370ccf637b4cab935b47f260f1    HGV_meta_EpiDoc
040000 tree 658f8f7e0c1ad98459b69897508b9b0b57109cd2    HGV_metadata
040000 tree e662566f5b178790d75dac0e9d5901eaeb2353e3    HGV_trans_EpiDoc
040000 tree 01867873befdf6d16cade75b1f12cd31a92edd46    RDF

Which is how gitwrapper appears to be using it in the two spots where ls-tree is called.

balmas added a commit to sosol/sosol that referenced this issue Aug 31, 2015
balmas added a commit to sosol/sosol that referenced this issue Aug 31, 2015
@balmas
Copy link
Author

balmas commented Aug 31, 2015

I managed to get the crazy long process to abort using the updated version with the Executing statement printed in the exception, and it looks like the command is right:

Executing git --git-dir=/usr/local/gitrepos/communities/Divan of Hafiz/Divan of Hafiz.git ls-tree 8f1d30982099c0c0e31578b5153d2925824f876a | git --git-dir=/usr/local/gitrepos/users/Bridget Almas.git mktree --missing

The output of ls-tree when I run it manually is
100644 blob 4b2bceaa7b6c211ff88627154dccd008e93969e2 README
100644 blob 5075cdb3aec52e13778fb61da9e6c2ad47891ef4 epifacs.xml
040000 tree 1989255f516ef3e4fc587118422bda748d45ae7e epifacs
100644 blob 55af308cbe1f0e0433774f59852daf32622fef53 perseus.xml
040000 tree a8019262998461cd9219b898228013e81afa772e perseus
040000 tree 28396ae19485ea52bd37829698bb3b9deb360c0a pilots

And running the full command piped to mktree succeeds as well.

So it's a bit hard to understand the exception in the concept of the executed statement. I'm back to thinking this is some sort of buffer/memory overflow exposed by the nature of this particular publication which is really huge with lots of identifiers and commits on each identifier.

@balmas
Copy link
Author

balmas commented Aug 31, 2015

Also I'm sure I must be doing something wrong here, since the Travis tests with the revered gitwrapper jar are now passing, but in my local test environment the publication tests keep failing even with the reverted jar.

@ryanfb
Copy link
Member

ryanfb commented Aug 31, 2015

Have you tried cleaning your test environment with ./script/clean? Sometimes a failed test can wind up leaving junk behind in such a way that it causes subsequent tests to fail.

@hcayless
Copy link
Contributor

I can vouch for this, having been driven insane by it before I realized I had to clean before re-running when my tests errored out.

On Aug 31, 2015, at 16:22 , Ryan Baumann [email protected] wrote:

Have you tried cleaning your test environment with ./script/clean https://github.com/sosol/sosol/blob/master/script/clean? Sometimes a failed test can wind up leaving junk behind in such a way that it causes subsequent tests to fail.


Reply to this email directly or view it on GitHub #1 (comment).

@balmas
Copy link
Author

balmas commented Aug 31, 2015

thanks both. script/clean did the trick for the tests at least. (glad to know I'm not the only one to have been driven insane by this ;)

Back to the original problem, I'm still waiting to see whether my manual tweaking of the git tree of the target user enables the finalization step to complete but I'm not too hopeful. I'll let you know how it goes. I may need to do some more detailed monitoring of the system to see exactly what is happening at the point of failure.

@balmas
Copy link
Author

balmas commented Aug 31, 2015

As I typed the last comment, it failed again. I was watching CPU and memory as it occurred and based on this and the monitoring logs, it looks like I get up to almost 14% CUP and 25% memory while the process is running just before the failure, so the system isn't falling down, but working hard compared to an average of about 1% CPU (I don't have the memory stats unfortunately). I see those spikes in the log that likely match each failed attempt today but historically I do have spikes that go higher than that.

@balmas
Copy link
Author

balmas commented Aug 31, 2015

But, it actually is failing on a different commit this time... will try again to see if the commit changes with each attempt.

@balmas
Copy link
Author

balmas commented Sep 1, 2015

I'm pretty sure this at least related to memory. I watched the memory consumption increase consistently as the process tried to complete the fetchLite call, until it eventually aborted with the git failure. It could be that there isn't enough left for git to complete the call but there was over 2GB available to git and other git calls on other files continue to succeed. But interestingly, after the aborted fetchLite call it seems the memory was never released.

@ryanfb
Copy link
Member

ryanfb commented Sep 1, 2015

To aid debugging, I've added a simple -main entrypoint in 495a413 so you can more easily run the JAR from the command line with e.g.:

java -jar target/git-wrapper-0.1.0-SNAPSHOT-standalone.jar archived/2015/08/31/P.NYU_1_1 testfetchlite /Users/ryan/source/dc3/sosol/db/test/git/boards/nyu.git /Users/ryan/source/dc3/sosol/db/test/git/canonical.git

I was actually thinking we could maybe shell out to such a command within SoSOL in case fetchLite might be contributing to sosol/sosol#99 (since the memory should be released if it's in a Java subprocess that terminates, and would also allow us to give e.g. fetchLite-specific JAVA_OPTS), however in my local testing this seems to vastly slow down fetchLite calls (maybe due to the spinup/spindown cost of a JVM?).

@balmas
Copy link
Author

balmas commented Sep 2, 2015

@ryanfb thanks so much for that. I get exactly the same error running from the command line as I do from within the app, which at least allows us to rule out tomcat and the suckerpunch async as contributing factors. It may be a litte while before I get back to debugging this. I do think there may be some sort of memory leak here but there may be other factors as well. This particular publication that I'm having a problem with is definitely an edge case, or at least a scenario that wasn't initially envisioned for the app, as there are 500 individual small files with many commits in this one publication. There could also be something particular about this publication's git tree that is throwing things off. It's oddly consistent with the commit it balks at. Will try to provide more debugging details soon.

@balmas
Copy link
Author

balmas commented Sep 3, 2015

Ok, added some more debugging to gitwrapper and ran again. For this particular publication:
get-diffs is called 6140 times
resulting in 885834 (!) calls to diff-trees via the recursive accumulate-diffs
it doesn't crash until it starts trying to call save-object
the first save-object succeeds
the 2nd fails with the ls-tree call, which definitely appears to be a bogus error since the statement passed to the shell is correct

It would seem that accumulate-diffs is accumulating memory as well. I'm not very informed about clojure and functional programming, but I'm wondering if the problem stems from this line: https://github.com/sosol/gitwrapper/blob/master/src/gitwrapper/utils.clj#L104 which doesn't seem to be embedded in a loop/recur structure?

@balmas
Copy link
Author

balmas commented Sep 3, 2015

tagging @fbaumgardt to put him in the loop (no pun intended)

@hcayless
Copy link
Contributor

hcayless commented Sep 3, 2015

I don't think that's the problem, but there's bound to be a lot of string-munging in this process, which will spike memory usage. Not sure there's a good way around that though. Both calls are in the loop, it's just possible that we've reached the file that's different, in which case the last object will be a blob. That said, it's making the assumption that each commit changes just one file in a directory. If that's not the case, I'm not sure it'll work properly. (It is the case in the PE.)

If it's not failing with some sort of out-of-memory error, then I don't think the many calls to accumulate-diffs can be the problem though. What happens if you run the failing git ls-tree command?

It is possible that this just isn't a good match for the scenario the program is written for, which is to short-circuit the fetch process when we know we have a simple fast-forward situation. Maybe we need some safeguards so that it fails over to a regular fetch if there are too many commits.

@balmas
Copy link
Author

balmas commented Sep 3, 2015

running the failing command manually works. It's actually not failing in accumulate-diffs, it gets through that, it just fails when it moves on to use the results of accumulate-diffs. If I change the nature of the diffs (i.e. by running the failing ls-tree | mktree command manually) it just fails on a different commit.

All commits should just be single file commits though -- I haven't changed anything about that, at least not intentionally.

Re the memory, I was just thinking that maybe so much memory has been consumed by the time that it gets to the save-object calls that either the shell out to git or git itself falls down. It might be the shell process because of the boggling nature of the error - printing the shell command before it happens confirmed that the command itself is right, so it must be either (1) the shell isn't passing the command to git intact or (2) git is crashing with a bogus error

@ryanfb
Copy link
Member

ryanfb commented Oct 22, 2015

So I ran into this in the wild again, where a publication send_to_finalizer was failing in copy_branch_from_repo with the ls-tree message.

This time I tried manually fetching the board remote inside the user repo and noticed there were error messages during the fetch:

error: refs/heads/rsb331/P.Lond._Herm. does not point to a valid object!
remote: error: refs/heads/rsb331/P.Lond._Herm. does not point to a valid object!

After fixing the board repo so that those error messages wouldn't occur during a fetch, a retry with send_to_finalizer on the same publication succeeded (though it got sent to a different random user than the first failed attempt).

Is it possible that if these sorts of errors occur in the git command output, they're getting accidentally re-used as arguments to another git command, leading to this error?

@balmas
Copy link
Author

balmas commented Oct 22, 2015

hmm. I don't know. I believe in situation I was experiencing it in, the manual fetch worked without error, but I could be wrong. It also might be that the error being thrown by git is completely useless and doesn't reflect the actual problem at all

@ryanfb
Copy link
Member

ryanfb commented Oct 22, 2015

In case it's potentially related, the fix I made on the board repo to fix the fetch errors was to delete the invalid ref: sosol/sosol#123

@hcayless
Copy link
Contributor

If it's some sort of buffer overflow, fixing the busted ref might have just put it under the threshold.

Sent from my phone.

On Oct 22, 2015, at 12:51, Bridget Almas [email protected] wrote:

hmm. I don't know. I believe in situation I was experiencing it in, the manual fetch worked without error, but I could be wrong. It also might be that the error being thrown by git is completely useless and doesn't reflect the actual problem at all


Reply to this email directly or view it on GitHub.

@balmas
Copy link
Author

balmas commented Oct 26, 2015

I don't know if this is related, or a separate issue, but but I just ran into another problem here:

this line:

https://github.com/sosol/gitwrapper/blob/master/src/gitwrapper/utils.clj#L99

is being called on a commit which is a blob instead of a tree

I'm struggling to understand the clojure code here - repeating the test for the blob that appears in the else clause from the seq test allowed me to complete the fetchLite call on the branch that exposed this, but I'm not sure if everything copied properly. I'm also a little confused about the circumstances that would cause the else clause to occur

 (defn accumulate-diffs
   [newsha basesha *diffs]
     (let [diff (diff-trees newsha basesha)]
       (loop [d diff]
         (if (seq (rest d))
           (do
-            (accumulate-diffs (second (second (first d))) (nth (second (first d)) 2) (conj! *diffs (first d)))
+            (if (= "blob" (first (second (first d))))
+              (conj! *diffs (first d))
+              (accumulate-diffs (second (second (first d))) (nth (second (first d)) 2) (conj! *diffs (first d))))
             (recur (rest d)))
           (if-not (nil? (first (first d)))
             (if (= "blob" (first (second (first d))))

balmas added a commit that referenced this issue Oct 27, 2015
hcayless added a commit that referenced this issue May 31, 2016
The new method uses java.lang.ProcessBuilder and pipes stderr to
/dev/null. Hopefully this will avoid the weirtd behavior in #1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants