
Debug segmentation fault #344

Open
denadai2 opened this issue Aug 19, 2024 · 4 comments
@denadai2

🐛 Describe the bug

Dear pyg-lib team,

I encountered an error when I call:

out = torch.ops.pyg.merge_sampler_outputs(
            sampled_nodes_with_dupl,
            edge_ids,
            cumm_sampled_nbrs_per_node,
            partition_ids,
            partition_orders,
            partitions_num,
            one_hop_num,
            src_batch,
            self.disjoint,
        )
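A defensive pre-check of the arguments can help localize which input is malformed before the C++ op is entered. This is only a sketch: the invariants below (one list entry per partition, and each partition order indexing into the corresponding cumsum list) are assumptions inferred from the call above, not pyg-lib's documented contract.

```python
# Hypothetical sanity checks; names mirror the call above, and the
# invariants are assumptions rather than pyg-lib's documented contract.
def check_merge_inputs(sampled_nodes_with_dupl, edge_ids,
                       cumm_sampled_nbrs_per_node, partition_ids,
                       partition_orders, partitions_num):
    assert len(sampled_nodes_with_dupl) == partitions_num
    assert len(edge_ids) == partitions_num
    assert len(cumm_sampled_nbrs_per_node) == partitions_num
    assert len(partition_ids) == len(partition_orders)
    for p_id, p_order in zip(partition_ids, partition_orders):
        cumsum = cumm_sampled_nbrs_per_node[p_id]
        # An out-of-range order would make the kernel read past the list.
        assert p_order + 1 < len(cumsum), (p_id, p_order, len(cumsum))
```

Failing one of these assertions on the Python side is much easier to diagnose than a SIGSEGV inside the op.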

The error is:

(TrainerActor pid=15637) *** SIGSEGV received at time=1724074372 ***
(TrainerActor pid=15637) PC: @        0x110b277b0  (unknown)  pyg::sampler::(anonymous namespace)::merge_outputs<>()
(TrainerActor pid=15637)     @        0x1080ecde0  (unknown)  absl::lts_20230125::WriteFailureInfo()
(TrainerActor pid=15637)     @        0x1080ecb2c  (unknown)  absl::lts_20230125::AbslFailureSignalHandler()
(TrainerActor pid=15637)     @        0x190087584  (unknown)  _sigtramp
(TrainerActor pid=15637)     @        0x110b27778  (unknown)  pyg::sampler::(anonymous namespace)::merge_outputs<>()
(TrainerActor pid=15637)     @        0x110b29a80  (unknown)  c10::impl::call_functor_with_args_from_stack_<>()
(TrainerActor pid=15637)     @        0x110b29958  (unknown)  c10::impl::make_boxed_from_unboxed_functor<>::call()
(TrainerActor pid=15637)     @        0x383c8b4fc  (unknown)  torch::autograd::basicAutogradNotImplementedFallbackImpl()
(TrainerActor pid=15637)     @        0x38002e664  (unknown)  c10::Dispatcher::callBoxed()
(TrainerActor pid=15637)     @        0x10bb8af14  (unknown)  torch::jit::invokeOperatorFromPython()
(TrainerActor pid=15637)     @        0x10bb8b668  (unknown)  torch::jit::_get_operation_for_overload_or_packet()
(TrainerActor pid=15637)     @        0x10bad43a8  (unknown)  pybind11::detail::argument_loader<>::call<>()
(TrainerActor pid=15637)     @        0x10bad41f0  (unknown)  pybind11::cpp_function::initialize<>()::{lambda()#1}::__invoke()
(TrainerActor pid=15637)     @        0x10b4a9b7c  (unknown)  pybind11::cpp_function::dispatcher()
(TrainerActor pid=15637)     @        0x104f82d88  (unknown)  cfunction_call
(TrainerActor pid=15637)     @        0x104f32060  (unknown)  _PyObject_Call
(TrainerActor pid=15637)     @        0x105026f20  (unknown)  _PyEval_EvalFrameDefault
(TrainerActor pid=15637)     @        0x104f3116c  (unknown)  _PyObject_FastCallDictTstate
(TrainerActor pid=15637)     @        0x104f326a0  (unknown)  _PyObject_Call_Prepend
(TrainerActor pid=15637)     @        0x104fa63c0  (unknown)  slot_tp_call
(TrainerActor pid=15637)     @        0x104f31348  (unknown)  _PyObject_MakeTpCall
(TrainerActor pid=15637)     @        0x105025580  (unknown)  _PyEval_EvalFrameDefault
(TrainerActor pid=15637)     @        0x104f4af60  (unknown)  gen_send_ex2
(TrainerActor pid=15637)     @        0x104cbcfac  (unknown)  task_step_impl
(TrainerActor pid=15637)     @        0x104cbcd84  (unknown)  task_step
(TrainerActor pid=15637)     @        0x104f31348  (unknown)  _PyObject_MakeTpCall
(TrainerActor pid=15637)     @        0x105044c04  (unknown)  context_run
(TrainerActor pid=15637)     @        0x104f824b8  (unknown)  cfunction_vectorcall_FASTCALL_KEYWORDS
(TrainerActor pid=15637)     @        0x105026f20  (unknown)  _PyEval_EvalFrameDefault
(TrainerActor pid=15637)     @        0x104f34410  (unknown)  method_vectorcall
(TrainerActor pid=15637)     @        0x105026f20  (unknown)  _PyEval_EvalFrameDefault
(TrainerActor pid=15637)     @        0x104f34410  (unknown)  method_vectorcall
(TrainerActor pid=15637)     @        0x1050f6558  (unknown)  thread_run
(TrainerActor pid=15637)     @ ... and at least 3 more frames
(TrainerActor pid=15637) Fatal Python error: Segmentation fault
(TrainerActor pid=15637)
(TrainerActor pid=15637) Stack (most recent call first):
(TrainerActor pid=15637)   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/torch/_ops.py", line 854 in __call__
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 842 in _merge_sampler_outputs
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 948 in sample_one_hop
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 315 in node_sample
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 618 in edge_sample
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 193 in _sample_from
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/events.py", line 88 in _run
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 1987 in _run_once
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 641 in run_forever
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/event_loop.py", line 108 in _run_loop
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1010 in run
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1030 in _bootstrap
(TrainerActor pid=15637)
(TrainerActor pid=15637) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_osx, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy._lib._ccallback_c, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.cluster._vq, scipy.cluster._hierarchy, scipy.cluster._optimal_leaf_ordering, markupsafe._speedups, pyarrow.lib, pyarrow._json (total: 74)

Do you have a suggestion on how to debug this?

Thanks!
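For reference, the Python-level stack in the log above is printed by the standard library's faulthandler module; it can also be enabled explicitly in each worker process to make sure a crash always produces that trace:

```python
import faulthandler

# Dump the Python traceback of all threads on SIGSEGV, SIGFPE, SIGABRT, etc.
faulthandler.enable(all_threads=True)
```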

Environment

  • pyg-lib version:
  • PyTorch version:
  • OS:
  • Python version:
  • CUDA/cuDNN version:
  • How you installed PyTorch and pyg-lib (conda, pip, source):
  • Any other relevant information:
@denadai2 denadai2 added the bug label Aug 19, 2024
denadai2 commented Sep 11, 2024

I converted it into a non-optimized, plain Python version and partially fixed it (not unit tested):

from typing import List, Optional, Tuple

import torch

def merge_outputs(
        node_ids: List[torch.Tensor],
        edge_ids: List[torch.Tensor],
        cumsum_neighbors_per_node: List[List[int]],
        partition_ids: List[int],
        partition_orders: List[int],
        num_partitions: int,
        num_neighbors: int,
        batch: Optional[torch.Tensor] = None,
        disjoint: bool = False
) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor], List[int]]:

    if num_neighbors < 0:
        # Find maximum population
        population = [[] for _ in range(num_partitions)]
        max_populations = [0] * num_partitions

        for p_id in range(num_partitions):
            cumsum1 = cumsum_neighbors_per_node[p_id][1:]
            cumsum2 = cumsum_neighbors_per_node[p_id][:-1]
            population[p_id] = [abs(a - b) for a, b in zip(cumsum1, cumsum2)]
            # `default` guards against an empty cumsum list for a partition:
            max_populations[p_id] = max(population[p_id], default=0)

        offset = max(max_populations)
    else:
        offset = num_neighbors

    p_size = len(partition_ids)
    sampled_neighbors_per_node = [0] * p_size

    sampled_node_ids = torch.full((p_size * offset,), -1, dtype=node_ids[0].dtype)
    sampled_edge_ids = torch.full((p_size * offset,), -1, dtype=edge_ids[0].dtype)
    sampled_batch = torch.full((p_size * offset,), -1, dtype=batch.dtype) if disjoint else None

    sampled_node_ids_vec = [n.tolist() for n in node_ids]
    sampled_edge_ids_vec = [e.tolist() for e in edge_ids]

    for j in range(p_size):
        p_id = partition_ids[j]
        p_order = partition_orders[j]
        if not cumsum_neighbors_per_node[p_id] or len(cumsum_neighbors_per_node[p_id]) <= p_order + 1:
            continue

        begin_node = cumsum_neighbors_per_node[p_id][p_order]
        begin_edge = begin_node - cumsum_neighbors_per_node[p_id][0]

        end_node = cumsum_neighbors_per_node[p_id][p_order + 1]
        end_edge = end_node - cumsum_neighbors_per_node[p_id][0]

        sampled_node_ids[j * offset:(j * offset + end_node - begin_node)] = torch.tensor(sampled_node_ids_vec[p_id][begin_node:end_node])
        sampled_edge_ids[j * offset:(j * offset + end_edge - begin_edge)] = torch.tensor(sampled_edge_ids_vec[p_id][begin_edge:end_edge])

        if disjoint:
            sampled_batch[j * offset:(j * offset + end_node - begin_node)] = batch[j]

        sampled_neighbors_per_node[j] = end_node - begin_node

    # Remove auxiliary -1 numbers:
    valid_node_indices = sampled_node_ids != -1
    out_node_id = sampled_node_ids[valid_node_indices]

    valid_edge_indices = sampled_edge_ids != -1
    out_edge_id = sampled_edge_ids[valid_edge_indices]

    out_batch = sampled_batch[valid_node_indices] if disjoint else None

    return out_node_id, out_edge_id, out_batch, sampled_neighbors_per_node
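The core indexing this function performs can be illustrated with a toy input (hypothetical values): a single partition whose cumsum list `[0, 2, 5]` encodes two source nodes with 2 and 3 sampled neighbors, so the per-node slice boundaries come straight from adjacent cumsum entries.

```python
# Toy cumsum for one partition (hypothetical values):
# node order 0 owns neighbor slots [0:2], node order 1 owns slots [2:5].
cumsum = [0, 2, 5]
node_ids = [10, 11, 20, 21, 22]

p_order = 1                      # second node in this partition
begin = cumsum[p_order]          # 2
end = cumsum[p_order + 1]        # 5
neighbors = node_ids[begin:end]  # [20, 21, 22]
```

A `p_order` with no corresponding `cumsum[p_order + 1]` entry is exactly the case the `continue` guard above skips, and presumably what makes the C++ kernel read out of bounds.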


rusty1s commented Sep 12, 2024

Sorry for the late reply. This indeed looks wrong. Can you share the inputs that make it crash? Also pinging @kgajdamo for visibility.


denadai2 commented Sep 13, 2024

No worries! I used MovieLens with 4 partitions and the code as released, with no modifications.
