
HADOOP_CMD getting lost... #173

Open
bhermont opened this issue Aug 7, 2015 · 2 comments

Comments


bhermont commented Aug 7, 2015

Hi Antonio,

I've run into a scenario where I call a mapreduce (rmr) job from a shell script running inside a mapper (this is how Oozie launches a shell action).

Here is the flow:
Oozie Launcher Job -> Launcher map-only task, where the shell script (Rscript myscript.r) executes -> StreamJob -> Mappers / Reducers

myscript.r

Sys.setenv(JAVA_HOME = "/usr/jdk64/jdk1.7.0_67")
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")   # must be set before loading rhdfs
Sys.setenv(HADOOP_HOME = "/usr/hdp/2.2.6.0-2800/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar")
library("rhdfs")
library("rmr2")
hdfs.init()
library(Matrix)

Logs

Launcher Job (Job that launches Map only task)
In this map task, the shell script is executed as a system call (Rscript myscript.r). Here HADOOP_CMD is set correctly, but an error from the mr function is logged (probably caused by the streaming mapper failing when an hdfs function is called from inside the mr function). Launcher (mapper) log:

Loading required package: methods
Loading required package: rJava

HADOOP_CMD=/usr/bin/hadoop

Be sure to run hdfs.init()
Please review your hadoop settings. See help(hadoop.settings)
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  : 
  hadoop streaming failed with error code 15
Calls: getTags -> mapreduce -> mr

The rmr call then starts a StreamJob, which has its own mappers and reducers. StreamJob (mapper) log:

    Log Type: stderr
    Log Upload Time: Thu Aug 06 19:24:41 -0400 2015

    Log Length: 2722

    Loading objects:
    Loading objects:
      backend.parameters
      combine
    Please review your hadoop settings. See help(hadoop.settings)
      combine.file
      combine.line
      debug
      default.input.format
      default.output.format
      in.folder
      in.memory.combine
      input.format
      libs
      map
      map.file
      map.line
      out.folder
      output.format
      pkg.opts
      postamble
      preamble
      profile.nodes
      reduce
      reduce.file
      reduce.line
      rmr.global.env
      rmr.local.env
      save.env
      tempfile
      vectorized.reduce
      verbose
      work.dir
    Loading required package: methods
    Loading required package: rJava
    Loading required package: rhdfs
    Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
      call: fun(libname, pkgname)
      error: Environment variable HADOOP_CMD must be set before loading package rhdfs
    Warning in FUN(X[[i]], ...) : can't load rhdfs
    Loading required package: rmr2
    Loading required package: Matrix

However, calling the myscript.r from command line works fine.

Here is my question:
Should rmr propagate the environment variables in this case, or should it be the responsibility of the environment to provide the value of HADOOP_CMD?

bhermont (Author) commented Aug 7, 2015

One update:
Setting HADOOP_CMD in the environment (not from R) fixed the "Environment variable HADOOP_CMD must be set before loading package rhdfs" error, but the hadoop streaming error 15 persists.
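Since Sys.setenv() only affects the R process that calls it, one way to set the variables "in the environment" as described above is to export them in the wrapper script that the Oozie shell action runs, before invoking Rscript. A sketch, with the paths taken from this thread (adjust for your cluster layout); the Rscript guard is only there so the wrapper fails gracefully on nodes without R:

```shell
#!/bin/sh
# Export hadoop-related variables before starting R, so that rhdfs and rmr2
# (and any streaming task launched from R) inherit them from the process
# environment rather than relying on Sys.setenv().
export JAVA_HOME=/usr/jdk64/jdk1.7.0_67
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_HOME=/usr/hdp/2.2.6.0-2800/hadoop
export HADOOP_STREAMING=/usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar

# Sanity check: this line shows up in the launcher (mapper) stdout log.
echo "HADOOP_CMD=$HADOOP_CMD"

# Run the R driver only if Rscript is available on this node.
if command -v Rscript >/dev/null 2>&1; then
    Rscript myscript.r
fi
```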

piccolbo (Collaborator) commented Aug 7, 2015

On the env vars: rmr doesn't meddle with them, unless you reach for the low-level escape hatch that is backend.parameters, in which case you can do anything and its opposite. The reason it doesn't try to propagate them is that there is no reason to think that any of the settings in the user environment are appropriate for the cluster environment.

As for error 15, you are the first on the whole internet to report it, so it may require additional investigation. First, simplify your setup by detaching rhdfs and rJava before the first call to mapreduce. Second, show the logs from that latest experiment; it's important to see everything, not just know that the error is the same.

As for starting a mapreduce job from a mapper, that's beyond what is supported, not only by rmr but by mapreduce itself, now or ever: tasks are supposed to have no side effects, or only idempotent ones, because they can be retried for any or no reason. I don't know how Oozie works, but I know that much about mapreduce.
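The two suggestions above can be sketched in R. This is an illustration only: it assumes rmr2's backend.parameters argument takes a per-backend list of flags that are forwarded to the hadoop streaming command line, and that streaming's -cmdenv option sets an environment variable in the tasks. Verify both against your rmr2 version before relying on them.

```r
# Sketch only: simplify the session before the first mapreduce() call,
# as suggested above.
detach("package:rhdfs", unload = TRUE)
detach("package:rJava", unload = TRUE)

library(rmr2)

# Assumption: backend.parameters forwards "cmdenv" unchanged to hadoop
# streaming as "-cmdenv HADOOP_CMD=/usr/bin/hadoop", setting the variable
# inside the streaming tasks.
out <- mapreduce(
  input = to.dfs(1:10),
  map   = function(k, v) keyval(v, v^2),
  backend.parameters = list(
    hadoop = list(cmdenv = "HADOOP_CMD=/usr/bin/hadoop")
  )
)
```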
