Add support for distributed file system (HDFS) #195
Conversation
Co-authored-by: sgatamex <[email protected]>
@adamlerer @lw, kindly review.
Hi @sgatamex, thank you very much for contributing this feature and taking the time to upstream it! The PR as a whole is a rather large change, which will thus take a while to review, and unfortunately both @adamlerer and I are now working on other projects and are maintaining PBG at a "best effort" level, so I cannot guarantee a prompt review.

If you were able to split the PR up into smaller incremental chunks, this could make our job easier and could thus speed up the review and merge. For example, my understanding at a high level is that you are reorganizing the existing file-based handler, and then adding extra codepaths so that it can deal with "remote" HDFS data as well. These two steps could be split: you could first reorganize the existing code, without adding features or changing behavior, and then build on it. There are certainly more ways this can be split up even further.

I do in fact have one high-level comment to start: the reason PBG has plugin registries for checkpoint and edge storage managers is to allow users to plug in custom versions of these classes, such as this HDFS one. These new plugins could be kept in separate repositories but, as HDFS seems to be quite common, I think it makes sense to upstream this one. However, I was puzzled that in order to do so we're introducing a new registry.

I'm guessing you chose to do that because the file-based and the HDFS managers share a lot of code. If that's the case, it seems to me to be a clear indication that such code should be extracted into some helper functions, or perhaps a common base class?
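To make the plugin route concrete, a registration could look roughly like the sketch below. The registry name `CHECKPOINT_STORAGES` and base class `AbstractCheckpointStorage` follow PBG's `torchbiggraph.checkpoint_storage` module, but treat them as assumptions to verify against the version in use; the class body here is a placeholder, not the PR's code:

```python
from torchbiggraph.checkpoint_storage import (
    AbstractCheckpointStorage,
    CHECKPOINT_STORAGES,
)


# Registering under the "hdfs" URL scheme would let configs point
# checkpoint_path at "hdfs://..." without introducing a new registry.
@CHECKPOINT_STORAGES.register_as("hdfs")
class HdfsCheckpointStorage(AbstractCheckpointStorage):
    def __init__(self, url: str) -> None:
        self.url = url

    # ... implement the abstract save/load methods by staging files
    # between HDFS and local disk ...
```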
Hi @paramjitsingh006 @sgatamex, sorry about the delay on this one. Are you still looking to merge this? I think it would be great if PBG could be run on Spark/HDFS.
Yes, we are very much interested in merging this, but due to limited availability over the last 2 months we were not able to commit anything. We would now like to resume work on this PR; kindly guide us.
Okay, I took a look through the diff:
- IIUC, the integration of HDFS is complicated by the fact that we use the hdf5 file API, and HDFS doesn't really support hdf5 files (at least not writing them). So to work around this, you're first copying everything to local disk and then operating on it there, correct? That's pretty unfortunate, but I guess the alternative is rewriting all the `Storage` abstractions from scratch without hdf5 :( If you're serious about PBG, that may be the way to go because it would probably be more elegant and efficient than this hack. But I'm open to considering this way.
- As @lw points out, we have our own set of `Storage` abstractions that would ideally be used more heavily rather than making up a new set. I do see that the `Storage` abstractions are a bit ill-suited for you because most of the code is the same with just a different file abstraction. Here's a proposal: instead of making these `AbstractPath` and `LocalContextManager` objects, add a method `open_h5py_file(self)` that returns just the `h5py.File` object in the normal case but a special context manager in the HDFS case (see the sketch after this list).
- A README in the HDFS example folder that describes how to set up and run the example scripts would be nice.
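For readers following along, here is a minimal sketch of how that `open_h5py_file` proposal could look. The class names and the `_copy_from_hdfs`/`_copy_to_hdfs` helpers are hypothetical, not part of the PR:

```python
import tempfile
from contextlib import contextmanager

import h5py


class PosixCheckpointStorage:
    def open_h5py_file(self, path: str, mode: str = "r") -> h5py.File:
        # Local case: h5py.File is itself a context manager, so callers can
        # write `with storage.open_h5py_file(p) as f:` in both cases.
        return h5py.File(path, mode)


class HdfsCheckpointStorage:
    def _copy_from_hdfs(self, hdfs_path: str, local_path: str) -> None:
        raise NotImplementedError  # stand-in for an actual HDFS client call

    def _copy_to_hdfs(self, local_path: str, hdfs_path: str) -> None:
        raise NotImplementedError  # stand-in for an actual HDFS client call

    @contextmanager
    def open_h5py_file(self, path: str, mode: str = "r"):
        # HDFS case: stage the file on local disk, let the caller work on it
        # through h5py, then push the result back to HDFS if it was written.
        with tempfile.TemporaryDirectory() as tmp:
            local_path = f"{tmp}/staged.h5"
            if mode != "w":  # reading or appending needs the current contents
                self._copy_from_hdfs(path, local_path)
            with h5py.File(local_path, mode) as f:
                yield f
            if mode != "r":  # anything that wrote must be uploaded back
                self._copy_to_hdfs(local_path, path)
```

Call sites stay identical in both cases, which is the point of the proposal: the branching between local and remote storage is confined to this one method.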
```python
logger = logging.getLogger("torchbiggraph")


class Constants:
```
Hadoop-specific stuff should not be in this top-level file.
```python
READ_MODE = 'r'


class LocalFileContextManager(AbstractContextManager):
```
I think this should be called `PosixFileContextManager`.
```python
reload = True
if Constants.RELOAD in self.kwargs:
    reload = self.kwargs[Constants.RELOAD]
```
Suggested change:

```diff
-reload = True
-if Constants.RELOAD in self.kwargs:
-    reload = self.kwargs[Constants.RELOAD]
+reload = self.kwargs.get(Constants.RELOAD, True)
```
1) I agree with and support rewriting all the storage abstractions so that HDFS is handled natively, and we originally considered this option, but due to the limited resources at our disposal we came up with a workaround that does not disrupt PBG's base storage layer. One thing we have made sure of: once the partition currently being trained on has been checkpointed, we do not keep it on local disk but delete it, to avoid disk-full failures. The exception is when the user is using only a single partition, in which case the complete partition needs to be copied and loaded, which is the worst case for this approach (see the sketch below).
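A minimal sketch of that staging-and-cleanup policy, under stated assumptions: the `copy_from_hdfs`/`copy_to_hdfs` helpers and function name are illustrative, not the PR's actual API:

```python
import os
import shutil
import tempfile


def train_partition_locally(hdfs_path, train_fn, copy_from_hdfs, copy_to_hdfs):
    # Stage one partition from HDFS onto local disk, train on it, write the
    # checkpoint back, and delete the local copy immediately so that local
    # disk only ever holds the partitions currently in use.
    local_dir = tempfile.mkdtemp(prefix="pbg_partition_")
    local_path = os.path.join(local_dir, os.path.basename(hdfs_path))
    try:
        copy_from_hdfs(hdfs_path, local_path)  # download the partition
        train_fn(local_path)                   # train on the local copy
        copy_to_hdfs(local_path, hdfs_path)    # upload the checkpoint
    finally:
        shutil.rmtree(local_dir)               # free local disk right away
```

With a single partition this degenerates to copying the complete dataset locally, which is the worst case described above.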
Types of changes
Motivation and Context / Related issue
The current version doesn't support HDFS when running in multi-machine/clustered mode.
This enhancement adds that support and also demonstrates how Spark can be used to initialize workers on multiple nodes, leveraging Spark's built-in cluster management (a rough sketch of the launch pattern follows the issue link below).
Existing Issue #142
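As a rough illustration only: one Spark task per machine could drive one PBG trainer, roughly as below. The config path is hypothetical, `parse_config` and `train` are PBG's public entry points as I understand them, and the actual wiring lives in this PR's start_spark.sh and may differ:

```python
from pyspark import SparkContext

NUM_MACHINES = 4  # assumption: should match num_machines in the PBG config


def run_trainer(rank: int) -> int:
    # Runs on an executor: each Spark task becomes one PBG trainer process.
    from torchbiggraph.config import parse_config
    from torchbiggraph.train import train

    config = parse_config("/path/to/config.py")  # hypothetical config path
    train(config, rank=rank)
    return rank


sc = SparkContext(appName="pbg-distributed-training")
# One slice per rank so each machine runs exactly one trainer.
sc.parallelize(range(NUM_MACHINES), numSlices=NUM_MACHINES).map(run_trainer).collect()
```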
How Has This Been Tested (if it applies)
Spark jobs were launched using the code shared in the examples (torchbiggraph/examples/distributedCluster/start_spark.sh).
Training launched on multiple nodes, with logs showing the file transfers between HDFS and local disk as the job progressed.
The final generated embeddings were then compared with the embeddings generated by the original code without these changes, and both matched (a sketch of such a comparison follows).
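A minimal sketch of how such a comparison can be done with h5py; the file names and the "embeddings" dataset key here follow PBG's usual checkpoint layout but are assumptions, not taken from this PR:

```python
import h5py
import numpy as np

# Hypothetical paths: the same checkpoint version produced by the baseline
# run and by the HDFS-enabled run.
with h5py.File("baseline/embeddings_user_0.v10.h5", "r") as a, \
     h5py.File("hdfs_run/embeddings_user_0.v10.h5", "r") as b:
    # "embeddings" is the assumed dataset name for the embedding table.
    assert np.allclose(a["embeddings"][...], b["embeddings"][...])
```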
Checklist