Memory leak #349

Open
mzouink opened this issue Nov 23, 2024 · 12 comments

Comments

mzouink (Member) commented Nov 23, 2024

After the new version there is a memory leak.
I think it is coming from funlib.persistence, because funlib.show.neuroglancer has become really slow and buggy.

[Screenshot 2024-11-23 at 1:42:27 PM: memory usage plot from the training job]

@pattonw

pattonw (Contributor) commented Nov 23, 2024

Interesting. You're just visualizing an array with funlib.show.neuroglancer and it's using 500 GB?

mzouink (Member, Author) commented Nov 24, 2024

No, that graph is from a dacapo train job; I only mentioned neuroglancer as a side example.

pattonw (Contributor) commented Nov 25, 2024

Can you provide some (ideally simplified) config combination that leads to a similar memory profile?

mzouink (Member, Author) commented Nov 25, 2024

I can't give you the exact setup because I was using 200+ nrs crops, but this is the code I was running:
(do you still have access to the cluster?)

# %%
import csv
import json
import os
import dacapo

# %%
datasplit_path = "datasplit_v2.csv"
classes_to_be_used_path = "to_be_used_v2.json"
#%%
with open(classes_to_be_used_path, 'r') as f:
    classes = ["bg"]+list(json.load(f).keys())
# %%
from dacapo.experiments.datasplits import DataSplitGenerator
from funlib.geometry import Coordinate
from dacapo.store.create_store import create_config_store
config_store = create_config_store()
# %%

input_resolution = Coordinate(8, 8, 8)
output_resolution = Coordinate(8,8,8)
datasplit_config = DataSplitGenerator.generate_from_csv(
    datasplit_path,
    input_resolution,
    output_resolution,
    # targets=classes,
    name="base_model_20241120_20_target_classes",
    # max_validation_volume_size = 400**3,
).compute()
# %%

datasplit = datasplit_config.datasplit_type(datasplit_config)

config_store.store_datasplit_config(datasplit_config)
# %%

from dacapo.experiments.tasks import OneHotTaskConfig

simple_one_hot = OneHotTaskConfig(
    name="one_hot_task",
    classes=classes,
    kernel_size=1,
)
config_store.store_task_config(simple_one_hot)
# %%
from dacapo.experiments.architectures import CNNectomeUNetConfig
architecture_config = CNNectomeUNetConfig(
    name="simple_unet",
    input_shape=(2, 132, 132),
    eval_shape_increase=(8, 32, 32),
    fmaps_in=1,
    num_fmaps=8,
    fmaps_out=8,
    fmap_inc_factor=2,
    downsample_factors=[(1, 4, 4), (1, 4, 4)],
    kernel_size_down=[[(1, 3, 3)] * 2] * 3,
    kernel_size_up=[[(1, 3, 3)] * 2] * 2,
    constant_upsample=True,
    padding="valid",
)
config_store.store_architecture_config(architecture_config)
# %%
import math

from dacapo.experiments.trainers import GunpowderTrainerConfig
# augment configs (import path may differ slightly between dacapo versions)
from dacapo.experiments.trainers.gp_augments import (
    ElasticAugmentConfig,
    GammaAugmentConfig,
    IntensityAugmentConfig,
    IntensityScaleShiftAugmentConfig,
)

trainer_config = GunpowderTrainerConfig(
    name="default_v3",
    batch_size=2,
    learning_rate=0.0001,
    num_data_fetchers=20,
    augments=[
        ElasticAugmentConfig(
            control_point_spacing=[100, 100, 100],
            control_point_displacement_sigma=[10.0, 10.0, 10.0],
            rotation_interval=(0, math.pi / 2.0),
            subsample=8,
            uniform_3d_rotation=True,
        ),
        IntensityAugmentConfig(
            scale=(0.25, 1.75),
            shift=(-0.5, 0.35),
            clip=True,
        ),
        GammaAugmentConfig(gamma_range=(0.5, 2.0)),
        IntensityScaleShiftAugmentConfig(scale=2, shift=-1),
    ],
    snapshot_interval=100000,
    clip_raw=False,
)
config_store.store_trainer_config(trainer_config)
# %%
from dacapo.experiments import RunConfig
from dacapo.experiments.run import Run

iterations = 1000000
validation_interval = 10000
run_config = RunConfig(
    name=f"simple_base_model",
    datasplit_config=datasplit_config,
    task_config=simple_one_hot,
    architecture_config=architecture_config,
    trainer_config=trainer_config,
    num_iterations=iterations,
    validation_interval=validation_interval,
)
config_store.store_run_config(run_config)
# %%
# I actually submitted this as a separate job: $ dacapo train run_name
from dacapo import train
train(run_config.name)

pattonw (Contributor) commented Nov 25, 2024

Ah, so this might not be a memory leak, just lots of data.
Do you get the same pattern if you train with just 1 dataset?
How many iterations did you train before memory became a problem?

mzouink (Member, Author) commented Nov 25, 2024

Usually this is not a problem even if there is a lot of data, because of the lazy loading.
The error happens after ~1000 iterations.
I will try multiple scenarios to narrow down the problem and report back.
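
A minimal sketch of the kind of per-iteration memory logging that could separate the scenarios, assuming psutil is available; the helper name and the training loop are placeholders, not dacapo API:

# hypothetical memory-logging helper (not part of dacapo); assumes psutil is installed
import os
import psutil

def log_rss(iteration, logfile="memory_log.csv"):
    """Append the current process's resident set size (in GB) to a CSV file."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    with open(logfile, "a") as f:
        f.write(f"{iteration},{rss_gb:.3f}\n")

# example: call it every few iterations inside whatever training loop is being profiled
for iteration in range(100):
    # ... one training step would go here ...
    if iteration % 10 == 0:
        log_rss(iteration)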

mzouink (Member, Author) commented Nov 25, 2024

I submitted:
datasplits: [1 crop, 100 crops]
trainers: [basic, with 3D augmentation]
each with 3 repetitions.
The 3 repetitions of the big datasplit hit out-of-memory at the same time (with both trainers), at ~2600 iterations.
The one-crop datasplit is still running,
so there is a memory leak related to holding crop info over time.

mzouink (Member, Author) commented Nov 25, 2024

@pattonw I think this is the problem:
https://github.com/funkelab/funlib.persistence/blob/3c0760e48edf1b287c4f75d7d11dc6b775332b2b/funlib/persistence/arrays/array.py#L73
After asking GPT I got:

When does self.data contain binary data?
Before computation: only metadata and references to the underlying storage (lazy evaluation).
After .compute(): the binary data is loaded into memory as a concrete array (e.g., NumPy).
After .persist(): the chunks of the Dask array are computed and stored in memory, allowing quick access but requiring memory proportional to the size of the computed data.
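
A small, illustrative sketch of the lazy / compute / persist distinction using plain dask.array (not funlib.persistence internals):

# illustrative only; shows where memory is actually allocated for a dask array
import dask.array as da

lazy = da.ones((1024, 1024, 1024), chunks=(128, 128, 128), dtype="uint8")
# `lazy` holds only the task graph and chunk metadata; essentially no data in memory yet

materialized = lazy.compute()
# `materialized` is a concrete numpy array: the full ~1 GiB now lives in memory,
# owned by this new object rather than cached inside `lazy`

pinned = lazy.persist()
# `pinned` is still a dask array, but all chunks have been computed and are kept
# in memory for the lifetime of the object (memory proportional to the array size)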

pattonw (Contributor) commented Nov 25, 2024

It could be the dask array, but we never call persist, and I'm pretty sure using compute doesn't cache data in memory

mzouink (Member, Author) commented Nov 26, 2024

Now it is clear that it is related to the high number of crops, but I don't know how to narrow down the cause of the bug any further.

pattonw (Contributor) commented Nov 26, 2024

My best guess is the masking.
Can you try replacing this line with dask.array.ones(dataset.gt.data.shape, dtype=dataset.gt.data.dtype) and see if that solves it?
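
A hypothetical sketch of the suggested experiment; `dataset` here is a stand-in for the real dacapo dataset object, since the exact line is only referenced by the link above:

# stand-in objects so the snippet runs on its own; not the real dacapo types
import dask.array as da
from types import SimpleNamespace

dataset = SimpleNamespace(gt=SimpleNamespace(data=da.zeros((64, 64, 64), dtype="uint8")))

# suggested experiment: use an all-ones mask with the gt array's shape/dtype, so the
# mask-building code is taken out of the picture while memory usage is profiled
mask_data = da.ones(dataset.gt.data.shape, dtype=dataset.gt.data.dtype)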

mzouink (Member, Author) commented Nov 26, 2024

didn't work :/
