LMDBs fully deprecated after #753? #868

Closed
cmclausen opened this issue Sep 24, 2024 · 6 comments
@cmclausen

Hi there,
I'm having trouble with my own LMDB datasets and BalancedBatchSampler after a recent update.
It now requires the dataset to pass .metadata_hasattr("natoms"); otherwise an UnsupportedDatasetError is thrown. I have previously made graph datasets for on-the-fly inference and have amended those to accommodate the change:

# Imports assumed for this snippet; DatasetMetadata's exact location is my guess
# (fairchem.core.datasets.base_dataset):
from torch.utils.data import Dataset
from fairchem.core.datasets.base_dataset import DatasetMetadata


class GraphsListDataset(Dataset):
    """
    Make a list of graphs to feed into OCP dataloader
    """

    def __init__(self, graphs_list):
        self.graphs_list = graphs_list
        self._metadata = DatasetMetadata([g.natoms for g in graphs_list])

    def __len__(self):
        return len(self.graphs_list)

    def __getitem__(self, idx):
        graph = self.graphs_list[idx]
        return graph

    def metadata_hasattr(self, attr) -> bool:
        if self._metadata is None:
            return False
        return hasattr(self._metadata, attr)

    def get_metadata(self, attr, idx):
        if self._metadata is not None:
            metadata_attr = getattr(self._metadata, attr)
            if isinstance(idx, list):
                return [metadata_attr[_idx] for _idx in idx]
            return metadata_attr[idx]
        return None
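
For reference, this is roughly how I use the wrapper; the toy Data objects and atom counts below are purely illustrative (my real graphs come from AtomsToGraphs), and it assumes DatasetMetadata accepts the natoms list positionally, as in the snippet above:

```python
import torch
from torch_geometric.data import Data

# Toy graphs standing in for AtomsToGraphs output; each carries a natoms attribute.
graphs_list = [Data(pos=torch.rand(n, 3), natoms=n) for n in (3, 5, 4)]

dataset = GraphsListDataset(graphs_list)
print(dataset.metadata_hasattr("natoms"))      # expected: True
print(dataset.get_metadata("natoms", [0, 2]))  # expected: [3, 4]
```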

However, the error also occurs when initializing training with LMDB datasets from main.py and a config file. Are LMDB datasets fully deprecated now, and if not, what is the new protocol for making them and passing them to the trainer?

Best regards
Christian

Traceback (most recent call last):
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/../../main.py", line 8, in <module>
    main()
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/_cli.py", line 127, in main
    runner_wrapper(config)
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/_cli.py", line 57, in runner_wrapper
    Runner()(config)
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/_cli.py", line 38, in __call__
    with new_trainer_context(config=config) as ctx:
  File "/groups/kemi/clausen/miniconda3/envs/fairchem/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/common/utils.py", line 1087, in new_trainer_context
    trainer = trainer_cls(**trainer_config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/trainers/ocp_trainer.py", line 99, in __init__
    super().__init__(
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/trainers/base_trainer.py", line 208, in __init__
    self.load()
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/trainers/base_trainer.py", line 230, in load
    self.load_datasets()
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/trainers/base_trainer.py", line 333, in load_datasets
    self.train_sampler = self.get_sampler(
                         ^^^^^^^^^^^^^^^^^
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/trainers/base_trainer.py", line 286, in get_sampler
    return BalancedBatchSampler(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/common/data_parallel.py", line 168, in __init__
    raise error
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/common/data_parallel.py", line 165, in __init__
    dataset = _ensure_supported(dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/hpc/kemi/clausen/fairchem/src/fairchem/core/common/data_parallel.py", line 110, in _ensure_supported
    raise UnsupportedDatasetError(
fairchem.core.datasets.base_dataset.UnsupportedDatasetError: BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.
@misko
Collaborator

misko commented Sep 24, 2024

hi @cmclausen! On this now! It's possible there is something wired up slightly wrong in our code for the default behavior to work :'(

I am working on resolving this and adding a test case.

In the meantime, if you are blocked, I think it might be possible for you to add

load_balancing_on_error: warn_and_no_balance

under the optim section to turn this error into just a warning.

@misko
Collaborator

misko commented Sep 24, 2024

Our code and tests look correct; I think what you are trying to do should be working. LMDB datasets are not deprecated, they are still used. There were changes made to batch balancing, plus some code cleanup/reconsolidation, which I think is causing your issues. I'm sorry :'(

In order to use batch balancing (to balance systems across multiple simultaneous GPUs) you need to have a valid dataset._metadata value that contains the 'natoms' field. Or you can just fully disable this by using the following,

optim:
  load_balancing: False

The implementation you have above looks like it should not trigger the error you are getting, because you clearly define ._metadata:

if not dataset.metadata_hasattr("natoms"):

Can you try adding some debug statements inside of BalancedBatchSampler._ensure_supported and getting the value of dataset._metadata, most importantly finding out why it does not seem to have a 'natoms' field?
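
Something along these lines is what I mean; this is only a rough sketch of where the prints would go, with the surrounding logic paraphrased from your traceback and the check quoted above rather than copied from the actual source:

```python
# Rough sketch: BalancedBatchSampler._ensure_supported lives in
# fairchem/core/common/data_parallel.py (~line 110 per the traceback above).
# The two print() lines are the debug statements to add; the rest is paraphrased.
from fairchem.core.datasets.base_dataset import UnsupportedDatasetError


def _ensure_supported(dataset):
    print("dataset type:", type(dataset).__name__)                    # debug
    print("dataset._metadata:", getattr(dataset, "_metadata", None))  # debug
    if not dataset.metadata_hasattr("natoms"):
        raise UnsupportedDatasetError(
            "BalancedBatchSampler requires a dataset that has a metadata "
            "attributed with number of atoms."
        )
    return dataset
```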

Hope this helps, please let us know how it goes!

@cmclausen
Author

Hi @misko,
Thanks for your attention on this.

I used load_balancing and load_balancing_on_error, but for some scenarios BalancedBatchSampler._ensure_supported throws an AttributeError if _metadata is completely missing, and that cannot be bypassed as the code currently stands.

Is it correct that the _metadata is supposed to originate from an .npz file as per core.datasets.base_dataset? Earlier I have just converted my structures using AtomsToGraphs from fairchem.core.preprocessing, added the relevant attributes, and saved them to an LMDB.
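
If it really is just an .npz file, would something like the sketch below be the intended way to produce it alongside the LMDB? This is only my guess at the format: the natoms key is taken from the error message above, and the file names and ASE input are placeholders.

```python
import numpy as np
from ase.io import read

# Placeholder input: the same structures I convert with AtomsToGraphs.
atoms_list = read("structures.traj", index=":")

# One atom count per entry, in the same order as the LMDB entries (my assumption).
natoms = np.array([len(atoms) for atoms in atoms_list], dtype=np.int64)
np.savez("metadata.npz", natoms=natoms)
```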

@xianghumeng

xianghumeng commented Oct 7, 2024

Hi @cmclausen, I have the same trouble when using AtomsToGraphs to build an LMDB; running main.py --mode predict gives the attribute error. Have you solved the problem? @misko, I tested load_balancing=false and it does not work.
This is my config.yml:
dataset:
  test:
    a2g_args:
      r_energy: false
      r_forces: false
    format: lmdb
    src: /home/train_test_data/graph_input/data.0000.lmdb
evaluation_metrics:
  metrics:
    energy:
    - mae
    forces:
    - forcesx_mae
    - forcesy_mae
    - forcesz_mae
    - mae
    - cosine_similarity
    - magnitude_error
    misc:
    - energy_forces_within_threshold
  primary_metric: forces_mae
gpus: 1
load_balancing_on_error: warn_and_no_balance
logger: tensorboard
loss_functions:
- energy:
    coefficient: 1
    fn: mae
- forces:
    coefficient: 1
    fn: l2mae
model:
  activation: silu
  atom_edge_interaction: true
  atom_interaction: true
  cbf:
    name: spherical_harmonics
  cutoff: 12.0
  cutoff_aeaint: 12.0
  cutoff_aint: 12.0
  cutoff_qint: 12.0
  direct_forces: true
  edge_atom_interaction: true
  emb_size_aint_in: 64
  emb_size_aint_out: 64
  emb_size_atom: 256
  emb_size_cbf: 16
  emb_size_edge: 512
  emb_size_quad_in: 32
  emb_size_quad_out: 32
  emb_size_rbf: 16
  emb_size_sbf: 32
  emb_size_trip_in: 64
  emb_size_trip_out: 64
  enforce_max_neighbors_strictly: false
  envelope:
    exponent: 5
    name: polynomial
  extensive: true
  forces_coupled: false
  max_neighbors: 30
  max_neighbors_aeaint: 20
  max_neighbors_aint: 1000
  max_neighbors_qint: 8
  name: gemnet_oc
  num_after_skip: 2
  num_atom: 3
  num_atom_emb_layers: 2
  num_before_skip: 2
  num_blocks: 4
  num_concat: 1
  num_global_out_layers: 2
  num_output_afteratom: 3
  num_radial: 128
  num_spherical: 7
  otf_graph: true
  output_init: HeOrthogonal
  qint_tags:
  - 1
  - 2
  quad_interaction: true
  rbf:
    name: gaussian
  regress_forces: true
  sbf:
    name: legendre_outer
noddp: false
optim:
  batch_size: 10
  clip_grad_norm: 10
  ema_decay: 0.999
  energy_coefficient: 1
  eval_batch_size: 10
  eval_every: 5000
  factor: 0.8
  force_coefficient: 1
  load_balancing: false
  loss_energy: mae
  loss_force: atomwisel2
  lr_initial: 0.0005
  max_epochs: 80
  mode: min
  num_workers: 2
  optimizer: AdamW
  optimizer_params:
    amsgrad: true
  patience: 3
  scheduler: ReduceLROnPlateau
  weight_decay: 0
outputs:
  energy:
    level: system
  forces:
    eval_on_free_atoms: true
    level: atom
    train_on_free_atoms: true
task:
  prediction_dtype: float32
test_dataset: null
trainer: forces
val_dataset:
  oc20_ref: /checkpoint/janlan/ocp/other_data/final_ref_energies_02_07_2021.pkl
  raw_energy_target: true
  src: /large_experiments/opencatalyst/data/oc22/2022_06_16/s2ef/val_id_30k
  train_on_oc20_total_energies: true

@xianghumeng

The version from pip install fairchem-core is not the latest, so load_balancing and load_balancing_on_error do not work with it.
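
To confirm which release pip actually installed, a quick check like this works (assuming the PyPI distribution is named fairchem-core):

```python
from importlib.metadata import version

# Print the installed fairchem-core version to compare against the latest release.
print(version("fairchem-core"))
```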


This issue has been marked as stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Nov 12, 2024
@github-actions github-actions bot closed this as not planned Nov 26, 2024