Add support for distributed file system (HDFS) #195
Conversation
Co-authored-by: sgatamex <[email protected]>
@adamlerer @lw, kindly review.
Hi @sgatamex, thank you very much for contributing this feature and taking the time to upstream it! The PR as a whole is a rather large change, which will thus take a while to review, and unfortunately both @adamlerer and I are now working on other projects and are maintaining PBG at a "best effort" level, so I cannot guarantee a prompt review.

If you were able to split the PR up into smaller incremental chunks, this could make our job easier and could thus speed up the review and merge. For example, my understanding at a high level is that you are reorganizing the existing file-based handler, and then adding extra codepaths so that it can deal with "remote" HDFS data as well. These two steps could be split: you could first reorganize the existing code, without adding features or changing behavior, and then build on it. There are certainly more ways this can be split up even further.

I do in fact have one high-level comment to start: the reason PBG has plugin registries for checkpoint and edge storage managers is to allow users to plug in custom versions of these classes, such as this HDFS one. These new plugins could be kept in separate repositories but, as HDFS seems to be quite common, I think it makes sense to upstream this one. However, I was puzzled that in order to do so we're introducing a new registry.

I'm guessing you chose to do that because the file-based and the HDFS managers share a lot of code. If that's the case, it seems to me to be a clear indication that such code should be extracted into some helper functions, or perhaps a common base class?
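To make the plugin route concrete, a registration could look roughly like the sketch below. The registry name `CHECKPOINT_STORAGES` and base class `AbstractCheckpointStorage` follow PBG's `torchbiggraph.checkpoint_storage` module, but treat them as assumptions to verify against the version in use; the class body here is a placeholder, not the PR's code:

```python
from torchbiggraph.checkpoint_storage import (
    AbstractCheckpointStorage,
    CHECKPOINT_STORAGES,
)


# Registering under the "hdfs" URL scheme would let configs point
# checkpoint_path at "hdfs://..." without introducing a new registry.
@CHECKPOINT_STORAGES.register_as("hdfs")
class HdfsCheckpointStorage(AbstractCheckpointStorage):
    def __init__(self, url: str) -> None:
        self.url = url

    # ... implement the abstract save/load methods by staging files
    # between HDFS and local disk ...
```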
Hi @paramjitsingh006 @sgatamex, sorry about the delay on this one. Are you still looking to merge this? I think it would be great if PBG could be run on Spark/HDFS.
Yes, we are very much interested in merging this, but due to limited availability over the last 2 months we were not able to commit anything. We would now like to resume work on this PR; kindly guide us.
Okay, I took a look through the diff:
- IIUC, the integration of HDFS is complicated by the fact that we use the hdf5 file API, and HDFS doesn't really support hdf5 files (at least not writing them). So to work around this, you're first copying everything to local disk and then operating on it there, correct? That's pretty unfortunate, but I guess the alternative is rewriting all the `Storage` abstractions from scratch without hdf5 :( If you're serious about PBG, that may be the way to go because it would probably be more elegant and efficient than this hack. But I'm open to considering this way.
- As @lw points out, we have our own set of `Storage` abstractions that would ideally be used more heavily rather than making up a new set. I do see that the `Storage` abstractions are a bit ill-suited for you because most of the code is the same with just a different file abstraction. Here's a proposal: instead of making these `AbstractPath` and `LocalContextManager` objects, add a method `open_h5py_file(self)` that returns just the `h5py.File` object in the normal case but a special context manager in the HDFS case (see the sketch after this list).
- A README in the HDFS example folder that describes how to set up and run the example scripts would be nice.
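For readers following along, here is a minimal sketch of how that `open_h5py_file` proposal could look. The class names and the `_copy_from_hdfs`/`_copy_to_hdfs` helpers are hypothetical, not part of the PR:

```python
import tempfile
from contextlib import contextmanager

import h5py


class PosixCheckpointStorage:
    def open_h5py_file(self, path: str, mode: str = "r") -> h5py.File:
        # Local case: h5py.File is itself a context manager, so callers can
        # write `with storage.open_h5py_file(p) as f:` in both cases.
        return h5py.File(path, mode)


class HdfsCheckpointStorage:
    def _copy_from_hdfs(self, hdfs_path: str, local_path: str) -> None:
        raise NotImplementedError  # stand-in for an actual HDFS client call

    def _copy_to_hdfs(self, local_path: str, hdfs_path: str) -> None:
        raise NotImplementedError  # stand-in for an actual HDFS client call

    @contextmanager
    def open_h5py_file(self, path: str, mode: str = "r"):
        # HDFS case: stage the file on local disk, let the caller work on it
        # through h5py, then push the result back to HDFS if it was written.
        with tempfile.TemporaryDirectory() as tmp:
            local_path = f"{tmp}/staged.h5"
            if mode != "w":  # reading or appending needs the current contents
                self._copy_from_hdfs(path, local_path)
            with h5py.File(local_path, mode) as f:
                yield f
            if mode != "r":  # anything that wrote must be uploaded back
                self._copy_to_hdfs(local_path, path)
```

Call sites stay identical in both cases, which is the point of the proposal: the branching between local and remote storage is confined to this one method.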
```python
logger = logging.getLogger("torchbiggraph")


class Constants:
```
Hadoop-specific stuff should not be in this top-level file.
```python
READ_MODE = 'r'


class LocalFileContextManager(AbstractContextManager):
```
I think this should be called `PosixFileContextManager`.
```python
reload = True
if Constants.RELOAD in self.kwargs:
    reload = self.kwargs[Constants.RELOAD]
```
Suggested change:

```diff
-reload = True
-if Constants.RELOAD in self.kwargs:
-    reload = self.kwargs[Constants.RELOAD]
+reload = self.kwargs.get(Constants.RELOAD, True)
```
1) I agree with and support rewriting all the storage abstractions so that HDFS is handled natively, and we originally considered this option, but due to the limited resources at our disposal we came up with a workaround that does not disrupt PBG's base storage layer. One thing we have made sure of: once the partition currently being trained on has been checkpointed, we do not keep it on local disk but delete it, to avoid disk-full failures. The exception is when the user is using only a single partition, in which case the complete partition needs to be copied and loaded, which is the worst case for this approach (see the sketch below).
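A minimal sketch of that staging-and-cleanup policy, under stated assumptions: the `copy_from_hdfs`/`copy_to_hdfs` helpers and function name are illustrative, not the PR's actual API:

```python
import os
import shutil
import tempfile


def train_partition_locally(hdfs_path, train_fn, copy_from_hdfs, copy_to_hdfs):
    # Stage one partition from HDFS onto local disk, train on it, write the
    # checkpoint back, and delete the local copy immediately so that local
    # disk only ever holds the partitions currently in use.
    local_dir = tempfile.mkdtemp(prefix="pbg_partition_")
    local_path = os.path.join(local_dir, os.path.basename(hdfs_path))
    try:
        copy_from_hdfs(hdfs_path, local_path)  # download the partition
        train_fn(local_path)                   # train on the local copy
        copy_to_hdfs(local_path, hdfs_path)    # upload the checkpoint
    finally:
        shutil.rmtree(local_dir)               # free local disk right away
```

With a single partition this degenerates to copying the complete dataset locally, which is the worst case described above.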
Types of changes
Motivation and Context / Related issue
The current version doesn't support HDFS when running in multi-machine/clustered mode.
This enhancement adds that support and also demonstrates how Spark can be used to initialize workers on multiple nodes, leveraging Spark's built-in cluster management (a rough sketch of the launch pattern follows the issue link below).
Existing Issue #142
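As a rough illustration only: one Spark task per machine could drive one PBG trainer, roughly as below. The config path is hypothetical, `parse_config` and `train` are PBG's public entry points as I understand them, and the actual wiring lives in this PR's start_spark.sh and may differ:

```python
from pyspark import SparkContext

NUM_MACHINES = 4  # assumption: should match num_machines in the PBG config


def run_trainer(rank: int) -> int:
    # Runs on an executor: each Spark task becomes one PBG trainer process.
    from torchbiggraph.config import parse_config
    from torchbiggraph.train import train

    config = parse_config("/path/to/config.py")  # hypothetical config path
    train(config, rank=rank)
    return rank


sc = SparkContext(appName="pbg-distributed-training")
# One slice per rank so each machine runs exactly one trainer.
sc.parallelize(range(NUM_MACHINES), numSlices=NUM_MACHINES).map(run_trainer).collect()
```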
How Has This Been Tested (if it applies)
Spark jobs were launched using the code shared in the examples (torchbiggraph/examples/distributedCluster/start_spark.sh).
Training launched on multiple nodes, with logs showing the file transfers between HDFS and local disk as the job progressed.
The final generated embeddings were then compared with the embeddings generated by the original code without these changes, and both matched (a sketch of such a comparison follows).
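A minimal sketch of how such a comparison can be done with h5py; the file names and the "embeddings" dataset key here follow PBG's usual checkpoint layout but are assumptions, not taken from this PR:

```python
import h5py
import numpy as np

# Hypothetical paths: the same checkpoint version produced by the baseline
# run and by the HDFS-enabled run.
with h5py.File("baseline/embeddings_user_0.v10.h5", "r") as a, \
     h5py.File("hdfs_run/embeddings_user_0.v10.h5", "r") as b:
    # "embeddings" is the assumed dataset name for the embedding table.
    assert np.allclose(a["embeddings"][...], b["embeddings"][...])
```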
Checklist