
hdfs-cleaner for non-pnda #34

Open
Raboo opened this issue May 25, 2018 · 2 comments
Raboo commented May 25, 2018

Hi,

Is it possible to get hdfs-cleaner working/packaged with non-PNDA HDFS?
We're running HDP and currently clean up old files over an NFS mount (via the HDFS NFS Gateway), using find to delete them. It's slow and buggy.
I haven't found any good solution for deleting old files in HDFS, so I'd like to give your cleaner a try.

@jeclarke
Collaborator

Hi @Raboo I believe this will work fine with generic HDFS.

Looking at the code, the part where it tries to clean up PNDA datasets should be skipped if there is no dataset table available to read in HBase, and that's the only PNDA-specific bit.

If you aren't using CDH or HDP as the Hadoop distro, you will have to do a bit of work, as the cleaner wants to connect to either Cloudera Manager (CDH) or Ambari (HDP) to discover the endpoints to use. If you don't have either of those, you would have to fill out some other implementation of endpoint discovery in endpoint.py, such as supplying the values directly in the config file.
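To illustrate what "supplying the values directly in the config file" could look like, here is a minimal sketch of a static endpoint discovery that just reads endpoints from a properties-style file instead of querying Cloudera Manager or Ambari. The function name, section name, and config keys here are hypothetical, not the actual hdfs-cleaner API:

```python
# Hypothetical sketch: static endpoint "discovery" that reads values
# straight from a config file, bypassing CM/Ambari. All names below
# (discover_endpoints_from_config, the [endpoints] section, the keys)
# are illustrative assumptions, not the real endpoint.py interface.
import configparser
import io

SAMPLE_CONFIG = """
[endpoints]
webhdfs_url = http://namenode.example.com:50070/webhdfs/v1
yarn_resource_manager = http://rm.example.com:8088
hbase_thrift = hbase.example.com:9090
"""

def discover_endpoints_from_config(config_text):
    """Return a dict of endpoint name -> address read from config text."""
    parser = configparser.ConfigParser()
    parser.read_file(io.StringIO(config_text))
    return dict(parser.items("endpoints"))

endpoints = discover_endpoints_from_config(SAMPLE_CONFIG)
```

The rest of the cleaner would then use these addresses wherever it would otherwise have used the endpoints returned by the CM/Ambari discovery code.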

There are a few different categories of files in HDFS that it cleans up:
spark_streaming_dirs_to_clean - checks that the files do not correspond to currently running YARN jobs before deleting them
general_dirs_to_clean - deletes all files from here
old_dirs_to_clean - deletes files from here if the last-modified time is older than a certain age
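The age-based rule behind old_dirs_to_clean can be sketched as follows: a file is a deletion candidate only when its last-modified time exceeds a retention threshold. The function and parameter names here are illustrative, not the real hdfs-cleaner code:

```python
# Sketch (assumed names) of the old_dirs_to_clean rule: select files
# whose age, computed from last-modified time, exceeds a threshold.
import time

def files_to_delete(entries, max_age_seconds, now=None):
    """entries: iterable of (path, mtime_epoch_seconds) pairs.
    Returns the paths whose age exceeds max_age_seconds."""
    now = time.time() if now is None else now
    return [path for path, mtime in entries
            if (now - mtime) > max_age_seconds]

# Example: 7-day retention; only the file older than a week is selected.
week = 7 * 24 * 3600
entries = [("/data/old.log", 1_000_000),
           ("/data/new.log", 1_000_000 + week)]
stale = files_to_delete(entries, week, now=1_000_000 + week + 10)
# stale == ["/data/old.log"]
```

general_dirs_to_clean would be the degenerate case of this with no age check, while spark_streaming_dirs_to_clean adds the extra guard of checking for running YARN jobs first.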

Let me know how you get on, and do submit a patch if you manage to extend it in a useful way.

Thanks,
James

@Raboo
Author

Raboo commented May 25, 2018

We are using HDP.
spark_streaming_dirs_to_clean, general_dirs_to_clean, old_dirs_to_clean - aren't those all just the same thing: folders where you look for older files that you can delete?

Or is spark_streaming_dirs referring to the Spark history server?

Do you need to fill all fields in the properties file?
