Hi Antonio,

I've faced a scenario where I call a mapreduce (rmr) job from a shell script inside a Mapper (this is how Oozie launches a Shell action).

Here is the flow:
Oozie Launcher Job -> Launcher Map-only task, where the shell script (Rscript myscript.r) executes -> StreamJob -> Mappers / Reducers

myscript.r:
Sys.setenv(JAVA_HOME="/usr/jdk64/jdk1.7.0_67")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_HOME="/usr/hdp/2.2.6.0-2800/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar")
library("rhdfs")
library("rmr2")
hdfs.init()
library(Matrix)

Logs

Launcher Job (the job that launches the Map-only task)
In this map task, the shell script above is executed as a system call (Rscript myscript.r). Here HADOOP_CMD is set correctly, but an error from the mr function is logged (probably due to a Streaming Mapper error when an hdfs function is called from inside the mr function). Launcher (Mapper) log:
Loading required package: methods
Loading required package: rJava
HADOOP_CMD=/usr/bin/hadoop
Be sure to run hdfs.init()
Please review your hadoop settings. See help(hadoop.settings)
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 15
Calls: getTags -> mapreduce -> mr
The rmr streaming call starts a StreamJob, which can have Mappers and Reducers. StreamJob (Mapper) log:
However, calling myscript.r from the command line works fine.
Here is my question:
Should rmr propagate the environment variables in this case, or should it be the responsibility of the environment to provide the value of the HADOOP_CMD variable?
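For illustration, a minimal check like the following (the variable list mirrors myscript.r; the check itself is not part of the original script) would make a missing variable show up plainly in the Launcher (Mapper) log before rhdfs loads:

# Print the Hadoop-related variables visible to this R process, then fail fast
# if HADOOP_CMD is missing, so the problem is explicit in the task log.
for (v in c("HADOOP_CMD", "HADOOP_HOME", "HADOOP_STREAMING", "JAVA_HOME")) {
  cat(v, "=", Sys.getenv(v, unset = "<not set>"), "\n")
}
stopifnot(nzchar(Sys.getenv("HADOOP_CMD")))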
One update:
Setting HADOOP_CMD in the environment (not from R) solved the "Environment variable HADOOP_CMD must be set before loading package rhdfs" issue, but the hadoop streaming error 15 persists.
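In that spirit, myscript.r could treat its Sys.setenv() call as a fallback only, so a value exported by the launching environment always takes precedence; a sketch, reusing the path already shown in myscript.r:

# Only set HADOOP_CMD from R when the launching environment has not already
# exported it; an externally provided value then wins.
if (!nzchar(Sys.getenv("HADOOP_CMD"))) {
  Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
}
library(rhdfs)
hdfs.init()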
On the env vars: rmr doesn't meddle with them, unless you reach for the low-level escape hatch, backend.options, in which case you can do anything (and its opposite). The reason it doesn't try to is that there is no reason to think any of the settings in the user environment are appropriate for the cluster environment.

As for error 15, you are the first on the whole internet to report it, so it may require additional investigation. First, simplify your setup by detaching rhdfs and rJava before the first call to mapreduce. Second, show the logs from the latest experiment; it's important to see everything, not just to know that the error is the same.

As for starting a mapreduce job from a mapper, that's beyond what is supported, not only by rmr but by mapreduce itself, now or ever. Tasks are supposed to have no side effects, or only idempotent side effects, because they can be retried for any or no reason. I don't know how Oozie works, but I know that much about mapreduce.
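For reference, the simplification suggested above would look roughly like this in myscript.r; the input path and identity map are placeholders, not taken from the original script:

# Do any hdfs.* work first, then detach rhdfs and rJava so the streaming job
# is launched from a session that no longer has them attached.
library(rhdfs)
hdfs.init()
# ... hdfs.* calls needed up front ...
detach("package:rhdfs", unload = TRUE)
detach("package:rJava", unload = TRUE)

library(rmr2)
out <- mapreduce(
  input = "/user/example/input",          # hypothetical HDFS path
  map   = function(k, v) keyval(k, v)     # identity map, just to exercise streaming
)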