Optional Features Failure #829

tahorst · 2020-02-27T23:47:21Z

@prismofeverything mind investigating why arrow failed?

Git hash: 9e9f3bb

Command:

DESC="Causality Network" BUILD_CAUSALITY_NETWORK=1 N_GENS=2 SEED=6691 \
  PARALLEL_PARCA=1 SINGLE_DAUGHTERS=1 COMPRESS_OUTPUT=1 RAISE_ON_TIME_LIMIT=1 \
  WC_ANALYZE_FAST=1 \
  python runscripts/fireworks/fw_queue.py

Trace:

 3848.91    470.53        1.443        1.445        1.459        1.431     1.462
Traceback (most recent call last):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/fireworks/firetasks/simulationDaughter.py", line 78, in run_task
    sim.run()
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 235, in run
    self.run_incremental(self._lengthSec + self.initialTime())
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 267, in run_incremental
    self._evolveState(processes)
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 326, in _evolveState
    process.calculateRequest()
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/processes/complexation.py", line 59, in calculateRequest
    result = self.system.evolve(self._sim.timeStepSec(), moleculeCounts)
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/arrow/arrow.py", line 134, in evolve
    steps, time, events, outcome = self.obsidian.evolve(duration, state)
TypeError: 'NoneType' object is not iterable
arrow.obsidian.evolve - failed to allocate memory: 12
Simulation finished:
 - Sim length: 0:24:07
 - Sim end time: 1:04:10
 - Runtime: 0:49:42

prismofeverything · 2020-02-28T19:12:00Z

Yeah, it failed to allocate memory, which means the Python process hit some kind of memory limit. Was this on Sherlock? Do you know what the memory limit for processes is there?

Is this an intermittent issue, or does it happen the same way each time?

tahorst · 2020-02-28T19:40:19Z

> Was this on Sherlock?

Yes, the Jenkins builds are on Sherlock.

> Do you know what the memory limit for processes is there?

It's currently 48 GB shared between all running jobs.

> Is this an intermittent issue, or does it happen the same way each time?

This is the first failure I've seen so I was hoping you could explore to confirm it's reproducible and identify the problem.

prismofeverything · 2020-02-28T20:26:23Z

Sure, I'll run it outside of Sherlock and see if it fails in the same way. Based on the error, though, this is a memory failure, and arrow just happened to be the one trying to allocate memory when the process hit the limit. If the 48 GB are shared, I expect this error to be transient. If it is instead hitting a separate per-process limit, it should fail at the same place each time until that limit is raised.
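One quick way to see whether a per-process limit is in play is to print the limits the Python process itself reports. This is only a sketch using the standard `resource` module (Unix only); the helper name `describe_limit` is made up for illustration, and a scheduler may enforce memory limits by other means (for example cgroups) that will not show up here.

```python
import resource


def describe_limit(name, limit_id):
    """Return a readable soft/hard description of one resource limit."""
    soft, hard = resource.getrlimit(limit_id)

    def fmt(value):
        # RLIM_INFINITY means no limit is set at this level.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return "%d bytes" % value

    return "%s: soft=%s, hard=%s" % (name, fmt(soft), fmt(hard))


if __name__ == "__main__":
    print(describe_limit("address space (RLIMIT_AS)", resource.RLIMIT_AS))
    print(describe_limit("data segment (RLIMIT_DATA)", resource.RLIMIT_DATA))
```

If both report "unlimited", any limit being hit is more likely imposed from outside the process.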

1fish2 · 2020-02-29T00:56:48Z

It's terrific that the native code handled the allocation error gracefully and got the right error information out! That makes it easy to look into getting more memory for the process or optimizing the code to reduce memory usage.

Two details that would help further: raise a MemoryError rather than a TypeError, and, at the base of the stack, catch any MemoryError and print additional memory stats such as the total amount of process memory in use.
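A rough sketch of both suggestions follows. It assumes, based only on the traceback above, that the native evolve call returns `None` when allocation fails; `evolve_checked` and `run_with_memory_report` are hypothetical names, not part of arrow or wcEcoli.

```python
import resource
import sys


def evolve_checked(evolve, duration, state):
    """Call a native-style evolve function; raise MemoryError if it returns None.

    `evolve` stands in for a call such as obsidian.evolve. That it returns
    None on allocation failure is an assumption taken from the traceback.
    """
    result = evolve(duration, state)
    if result is None:
        raise MemoryError("native evolve returned no result (allocation failed?)")
    steps, time, events, outcome = result
    return steps, time, events, outcome


def run_with_memory_report(task, stream=sys.stderr):
    """Run task(); on MemoryError, write peak memory use and re-raise."""
    try:
        return task()
    except MemoryError:
        usage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
        stream.write("MemoryError: peak RSS = %d (platform units)\n" % usage.ru_maxrss)
        raise
```

Wrapping the top-level task call in `run_with_memory_report` keeps the original exception and traceback intact while adding the memory figure to the log.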

prismofeverything · 2020-03-02T19:38:48Z

Status on this: it is a similar failure to CovertLab/arrow#39. The repeated multiplication of large numbers is overflowing 64-bit floating point. This happens when a reaction has a large number of simultaneous elements in its stoichiometry and also has large counts. A few possible improvements have come up in conversations with @tahorst:

- Detect this condition and report which reaction has overflowed along with the error. This would allow us to address the particular reaction.
- Decompose the offending reaction into subreactions that are each tractable to compute.
- Project the propensity calculations into log space so that they fit into 64-bit floating point (this requires research and could introduce performance penalties).
- Implement the automatic decomposition of stoichiometry inside Arrow so it is transparent to clients.
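To make the detection and log-space ideas concrete, here is a minimal sketch. It assumes the common Gillespie-style form where a propensity is the rate times a product of binomial coefficients, one per reactant; whether Arrow computes propensities exactly this way is not confirmed here, and every function name and the reaction tuple layout below are illustrative only.

```python
import math
import sys

# Natural log of the largest finite 64-bit float (about 709.78).
LOG_FLOAT_MAX = math.log(sys.float_info.max)


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    if k < 0 or k > n:
        return float("-inf")  # C(n, k) is 0 here, so its log is -inf
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_propensity(rate, counts, stoichiometry):
    """log(rate * prod_i C(counts[i], stoichiometry[i])), summed in log space."""
    if rate <= 0:
        return float("-inf")
    return math.log(rate) + sum(
        log_choose(n, k) for n, k in zip(counts, stoichiometry))


def find_overflowing(reactions, counts):
    """Return names of reactions whose propensity would exceed float64 range.

    `reactions` is a list of (name, rate, {species_index: stoichiometry}).
    """
    overflowing = []
    for name, rate, reactants in reactions:
        species = sorted(reactants)
        value = log_propensity(
            rate,
            [counts[i] for i in species],
            [reactants[i] for i in species])
        if value > LOG_FLOAT_MAX:
            overflowing.append(name)
    return overflowing
```

For example, with three species at 10000 copies each and a stoichiometry of 50 for each, the log propensity stays finite while the direct product would not fit in a 64-bit float, so `find_overflowing` can name the reaction instead of failing. Sampling from propensities kept in log space is a separate problem that this sketch does not address.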

This is happening on generation 2 of seed 6691 if anyone wants to replicate. As code changes, this condition may no longer trigger. We haven't seen this with any other seed/generation combinations so far, but they are far from exhaustively tested.
