This repository was archived by the owner on Mar 14, 2024. It is now read-only.

Add support for distributed file system (HDFS) #195

Open · wants to merge 2 commits into base: main

Conversation

paramjitsingh006

Co-authored-by: sgatamex [email protected]

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Motivation and Context / Related issue

The current version doesn't support HDFS when running in multi-machine/clustered mode.
This enhancement adds that support and also demonstrates how Spark can be used to launch workers on multiple nodes, taking advantage of Spark's built-in capabilities.

Existing Issue #142

How Has This Been Tested (if it applies)

Spark jobs were launched using the code shared in the examples (torchbiggraph/examples/distributedCluster/start_spark.sh).
Training was launched on multiple nodes, with logs showing the file transfer between HDFS and local disk as the job progressed.
The final generated embeddings were then compared with the embeddings generated by the original code without these changes, and both matched.

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

@facebook-github-bot added the CLA Signed label on Mar 14, 2021
@paramjitsingh006 marked this pull request as ready for review on March 15, 2021
@sgatamex

@adamlerer @lw, kindly review

@lw
Contributor

lw commented Mar 22, 2021

Hi @sgatamex, thank you very much for contributing this feature and taking the time to upstream it! The PR as a whole is a rather large change, which will thus take a while to review, and unfortunately both @adamlerer and I are now working on other projects and are maintaining PBG at a "best effort" level, so I cannot guarantee a prompt review.

If you were able to split the PR up into smaller incremental chunks this could make our job easier and could thus speed up the review and merge. For example, my understanding at a high-level is that you are reorganizing the existing file-based handler, and then adding extra codepaths so that it can deal with "remote" HDFS data as well. These two steps could be split: you could first reorganize the existing code, without adding features or changing behavior, and then build on it. There are certainly more ways this can be split up even further.

I do in fact have one high-level comment to start: the reason PBG has plugin registries for checkpoint and edge storage managers is to allow users to plug in custom versions of these classes, such as this HDFS one. These new plugins could be kept in separate repositories but, as HDFS seems to be quite common, I think it makes sense to upstream this one. However, I was puzzled that in order to do so we're introducing a new registry (the CUSTOM_PATH one), which is used internally by the file-based manager. This is because in a sense we're "merging" the file-based and HDFS managers in the existing registry just to split them later on somewhere else. I'd prefer if we could keep the two managers separate in the registry, and avoid introducing the new CUSTOM_PATH registry.

I'm guessing you chose to do that because the file-based and the HDFS managers share a lot of code. If that's the case, it seems to me to be a clear indication that such code should be extracted into some helper functions, or perhaps a common base class?
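
To make that concrete, here is a minimal sketch of how the two managers could stay separate in the registry while sharing their common logic through a base class. All names below (the registry dict, register_as, the storage classes) are purely illustrative stand-ins, not PBG's actual API and not code from this PR.

from abc import ABC, abstractmethod
from typing import Callable, Dict, Type

# Hypothetical stand-in for PBG's plugin registry, keyed by URL scheme.
CHECKPOINT_STORAGES: Dict[str, Type["AbstractH5CheckpointStorage"]] = {}


def register_as(scheme: str) -> Callable[[type], type]:
    def decorator(cls: type) -> type:
        CHECKPOINT_STORAGES[scheme] = cls
        return cls
    return decorator


class AbstractH5CheckpointStorage(ABC):
    # Shared hdf5 read/write logic lives here, so no separate CUSTOM_PATH
    # registry is needed to dispatch between local and HDFS paths.
    def save(self, path: str, data: bytes) -> None:
        with self._open_for_write(path) as f:
            f.write(data)

    @abstractmethod
    def _open_for_write(self, path: str):
        """Return a writable binary file object for the given path."""


@register_as("file")
class PosixCheckpointStorage(AbstractH5CheckpointStorage):
    def _open_for_write(self, path: str):
        return open(path, "wb")


@register_as("hdfs")
class HdfsCheckpointStorage(AbstractH5CheckpointStorage):
    def _open_for_write(self, path: str):
        # Stage to a local temp file; a real implementation would copy the
        # staged file to HDFS once the write completes.
        import tempfile
        return tempfile.NamedTemporaryFile("wb", delete=False)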

@adamlerer
Contributor

Hi @paramjitsingh006 @sgatamex , sorry about the delay on this one. Are you still looking to merge this? I think it would be great if PBG could be run on spark/HDFS.

@sgatamex

sgatamex commented May 26, 2021

> Hi @paramjitsingh006 @sgatamex, sorry about the delay on this one. Are you still looking to merge this? I think it would be great if PBG could be run on spark/HDFS.

Yes, we are very much interested in merging this, but due to limited availability over the last 2 months we were not able to commit anything. We would now like to resume work on this PR; kindly guide us.

Contributor

@adamlerer left a comment


Okay, I took a look through the diff:

  1. IIUC, the integration of HDFS is complicated by the fact that we use hdf5 file API, and HDFS doesn't really support hdf5 files (at least not writing them). So to work around this, you're first copying everything to local disk and then operating on it there, correct? That's pretty unfortunate, but I guess the alternative is rewriting all the Storage abstractions from scratch without hdf5 :( If you're serious about PBG, that may be the way to go because it would probably be more elegant and efficient than this hack. But I'm open to considering this way.
  2. As @lw points out, we have our own set of Storage abstractions that would ideally be used more heavily rather than making up a new set. I do see that the Storage abstractions are a bit ill-suited for you because most of the code is the same with just a different file abstraction. Here's a proposal: instead of making these AbstractPath and LocalContextManager objects, add a method open_h5py_file(self) that returns just the h5py.File object in the normal case but a special context manager in the HDFS case (a sketch of this follows the list).
  3. A README in the HDFS example folder that describes how to set up and run the example scripts would be nice.
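
To illustrate point 2, here is a rough sketch of what such an open_h5py_file() hook could look like. The class names and the hdfs dfs -get/-put staging below are assumptions made for illustration only; they are not code from this PR.

import os
import subprocess
import tempfile
from contextlib import contextmanager

import h5py


class PosixStorage:
    def open_h5py_file(self, path, mode="r"):
        # Normal case: the checkpoint is on a local filesystem, so just
        # hand back the h5py.File object directly.
        return h5py.File(path, mode)


class HdfsStorage:
    @contextmanager
    def open_h5py_file(self, hdfs_path, mode="r"):
        # HDFS case: h5py cannot operate on HDFS directly, so stage the file
        # on local disk, let h5py work on the copy, then push it back.
        # (Assumes the file already exists on HDFS.)
        local_dir = tempfile.mkdtemp()
        local_path = os.path.join(local_dir, os.path.basename(hdfs_path))
        subprocess.run(["hdfs", "dfs", "-get", hdfs_path, local_path], check=True)
        f = h5py.File(local_path, mode)
        try:
            yield f
        finally:
            f.close()
            if mode != "r":
                subprocess.run(
                    ["hdfs", "dfs", "-put", "-f", local_path, hdfs_path], check=True
                )
            os.remove(local_path)
            os.rmdir(local_dir)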

logger = logging.getLogger("torchbiggraph")


class Constants:
Contributor


hadoop-specific stuff should not be in this top-level file

READ_MODE = 'r'


class LocalFileContextManager(AbstractContextManager):
Contributor


I think this should be called `PosixFileContextManager`.

Comment on lines +68 to +70
reload = True
if Constants.RELOAD in self.kwargs:
    reload = self.kwargs[Constants.RELOAD]
Contributor


Suggested change:
-reload = True
-if Constants.RELOAD in self.kwargs:
-    reload = self.kwargs[Constants.RELOAD]
+reload = self.kwargs.get(Constants.RELOAD, True)

@sgatamex

> Okay, I took a look through the diff:
>
>   1. IIUC, the integration of HDFS is complicated by the fact that we use hdf5 file API, and HDFS doesn't really support hdf5 files (at least not writing them). So to work around this, you're first copying everything to local disk and then operating on it there, correct? That's pretty unfortunate, but I guess the alternative is rewriting all the Storage abstractions from scratch without hdf5 :( If you're serious about PBG, that may be the way to go because it would probably be more elegant and efficient than this hack. But I'm open to considering this way.
>   2. As @lw points out, we have our own set of Storage abstractions that would ideally be used more heavily rather than making up a new set. I do see that the Storage abstractions are a bit ill-suited for you because most of the code is the same with just a different file abstraction. Here's a proposal: instead of making these AbstractPath and LocalContextManager objects, add a method open_h5py_file(self) that returns just the h5py.File object in the normal case but a special context manager in the HDFS case.
>   3. A README in the HDFS example folder that describes how to set up and run the example scripts would be nice.

  1. I agree with and am supportive of rewriting all the Storage abstractions for HDF5 and having them work for HDFS; we originally considered this option, but due to the limited resources at our disposal we came up with a workaround that does not disrupt the base Storage layer in PBG. One thing we have made sure of is that the partition currently being trained on is deleted from local disk after checkpointing rather than kept, to avoid disk-full failures. The exception is when the user has only a single partition, in which case the complete partition has to be copied and loaded, which is the worst case for this approach. (A small sketch of this cleanup follows below.)

  2. We can look into it.

  3. We can add it.
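
As a small, purely illustrative sketch of the cleanup described in point 1 (the function and argument names here are hypothetical, not the PR's code):

import os


def cleanup_after_checkpoint(local_partition_paths, num_partitions):
    # With a single partition, the whole dataset has to stay resident on
    # local disk (the worst case for this approach), so nothing is deleted.
    if num_partitions <= 1:
        return
    # Otherwise, drop the local copies once the checkpoint has been written
    # back to HDFS, so local disk usage stays bounded.
    for path in local_partition_paths:
        if os.path.exists(path):
            os.remove(path)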
