Optional Features Failure #829

Open
tahorst opened this issue Feb 27, 2020 · 5 comments
@tahorst
Member

tahorst commented Feb 27, 2020

@prismofeverything mind investigating why arrow failed?

Git hash: 9e9f3bb

Command:

DESC="Causality Network" BUILD_CAUSALITY_NETWORK=1 N_GENS=2 SEED=6691 \
  PARALLEL_PARCA=1 SINGLE_DAUGHTERS=1 COMPRESS_OUTPUT=1 RAISE_ON_TIME_LIMIT=1 \
  WC_ANALYZE_FAST=1 \
  python runscripts/fireworks/fw_queue.py

Trace:

 3848.91    470.53        1.443        1.445        1.459        1.431     1.462
Traceback (most recent call last):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/fireworks/firetasks/simulationDaughter.py", line 78, in run_task
    sim.run()
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 235, in run
    self.run_incremental(self._lengthSec + self.initialTime())
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 267, in run_incremental
    self._evolveState(processes)
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 326, in _evolveState
    process.calculateRequest()
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/processes/complexation.py", line 59, in calculateRequest
    result = self.system.evolve(self._sim.timeStepSec(), moleculeCounts)
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/arrow/arrow.py", line 134, in evolve
    steps, time, events, outcome = self.obsidian.evolve(duration, state)
TypeError: 'NoneType' object is not iterable
arrow.obsidian.evolve - failed to allocate memory: 12
Simulation finished:
 - Sim length: 0:24:07
 - Sim end time: 1:04:10
 - Runtime: 0:49:42
@prismofeverything
Member

Yeah, it failed to allocate memory, which means the Python process hit some kind of memory limit. Was this on sherlock? Do you know what the memory limit for processes is there?

Is this an intermittent issue, or does it happen the same way each time?

@tahorst
Member Author

tahorst commented Feb 28, 2020

> Was this on sherlock?

Yes, the Jenkins builds are on Sherlock.

> Do you know what the memory limit for processes is there?

It's currently 48 GB shared between all running jobs.

> Is this an intermittent issue, or does it happen the same way each time?

This is the first failure I've seen, so I was hoping you could explore to confirm it's reproducible and identify the problem.

@prismofeverything
Member

Sure, I'll run it outside of Sherlock and see if it fails in the same way. Based on the error, though, this is a memory failure, and arrow just happened to be the one trying to allocate memory when the process hit the limit. If the 48 GB is shared, I expect this error to be transient, unless it is hitting a separate per-process limit, in which case it should fail at the same place each time until the limit is raised.
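One way to check the per-process side of this is sketched below. It decodes the `12` from the `failed to allocate memory: 12` line as an errno value and reads the address-space limit the process itself sees. The helper names are hypothetical and not part of arrow or wcEcoli, and the sketch assumes a Unix host where the standard `resource` module is available. A limit enforced by the job scheduler may not show up in these values.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def format_limit(value):
    """Render an rlimit value, which may be the 'unlimited' sentinel."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return "%d bytes" % value


def address_space_limits():
    """Return the (soft, hard) address-space limits for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    return format_limit(soft), format_limit(hard)
```

Calling `describe_errno(12)` gives the symbolic name `ENOMEM`, and printing `address_space_limits()` from inside the job shows whether a per-process cap applies in addition to the shared 48 GB.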

@1fish2
Contributor

1fish2 commented Feb 29, 2020

It's terrific that the native code handled the allocation error gracefully and got the right error information out! That makes it easy to look into getting more memory for the process or optimizing the code to reduce memory usage.

Two details would help further: raise a MemoryError rather than a TypeError, and at the base of the stack catch any MemoryError and print additional memory stats, such as the total amount of process memory in use.
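A minimal sketch of both suggestions, done on the Python side. It assumes the native `evolve` call returns `None` on an allocation failure, which is what the `'NoneType' object is not iterable` line in the trace suggests but is not confirmed here. `evolve_checked`, `peak_memory_report`, and `run_with_memory_report` are hypothetical names, not arrow's API. The cleaner fix would probably live in the native code itself (for example via the C API's `PyErr_NoMemory()`); this wrapper only illustrates the behavior.

```python
import resource
import sys


def evolve_checked(evolve, duration, state):
    """Call a native evolve function and raise MemoryError if it returns None."""
    result = evolve(duration, state)
    if result is None:
        # Assumption: None signals that the native code failed to allocate.
        raise MemoryError("native evolve returned no result (allocation failure?)")
    steps, time, events, outcome = result
    return steps, time, events, outcome


def peak_memory_report():
    """Describe the peak resident set size of the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    return "peak resident set size: %d (KiB on Linux, bytes on macOS)" % usage.ru_maxrss


def run_with_memory_report(task, stream=sys.stderr):
    """Run task() at the base of the stack; log memory stats on MemoryError."""
    try:
        return task()
    except MemoryError:
        stream.write(peak_memory_report() + "\n")
        raise
```

The exception is re-raised after the report is written, so the failure still propagates to the caller with the memory stats attached to the log.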

@prismofeverything
Member

Status on this: it is a similar failure to CovertLab/arrow#39. The repeated multiplication of large numbers is overflowing the 64-bit floating point register. This happens when a reaction has a large number of simultaneous elements in its stoichiometry and also has large counts. A few possible improvements have come up in conversations with @tahorst:

  • Detect this condition and report back which reaction has overflowed along with the error. This would allow us to address the particular reaction.
  • Decompose the offending reaction into subreactions that are each tractable to compute.
  • Project the propensity calculations into log space to make them fit into 64-bit floating point registers (this requires research and could introduce performance penalties?).
  • Implement the automatic decomposition of stoichiometry inside Arrow so it is transparent to clients.
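The first and third ideas above can be sketched together. This assumes the textbook Gillespie mass-action form, where a propensity is the rate times a product of binomial coefficients C(count, stoichiometry) over the reactants. Arrow's actual formula may differ, and every function name here is hypothetical. The naive product overflows to `inf`, while the log-space version stays finite and can be used to report which reaction would overflow.

```python
import math
import sys


def naive_propensity(rate, counts, stoichiometry):
    """rate * prod C(n, k) by repeated float multiplication; may become inf."""
    value = float(rate)
    for n, k in zip(counts, stoichiometry):
        if n < k:
            return 0.0  # not enough molecules for this reaction
        for i in range(k):
            value *= (n - i) / (i + 1.0)
    return value


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), computed via lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_propensity(rate, counts, stoichiometry):
    """log(rate * prod C(n, k)); -inf when the reaction cannot fire."""
    if rate <= 0:
        return float("-inf")
    total = math.log(rate)
    for n, k in zip(counts, stoichiometry):
        if k == 0:
            continue
        if n < k:
            return float("-inf")
        total += log_choose(n, k)
    return total


def find_overflowing_reactions(rates, counts, stoichiometries):
    """Indexes of reactions whose propensity exceeds the largest finite float."""
    limit = math.log(sys.float_info.max)
    return [
        index
        for index, (rate, stoich) in enumerate(zip(rates, stoichiometries))
        if log_propensity(rate, counts, stoich) > limit
    ]
```

For example, with `counts = [10**6, 10**6, 5]`, a reaction with reactant stoichiometry `[40, 40, 0]` makes `naive_propensity` return `inf`, and `find_overflowing_reactions` flags its index. Choosing between reactions would still need the log values normalized before sampling, which is part of the open research question noted in the list.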

This is happening on generation 2 of seed 6691 if anyone wants to replicate. As code changes, this condition may no longer trigger. We haven't seen this with any other seed/generation combinations so far, but they are far from exhaustively tested.
