rmr2 mapreduce does not produce any output #179

Open
byu777 opened this issue Aug 7, 2016 · 6 comments

@byu777

byu777 commented Aug 7, 2016

I tried the following simple script with rmr2 on Cloudera Quickstart 5.7.0, but mapreduce does not generate any results. Here is the script:

small.ints <- to.dfs(1:10)
out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
from.dfs(out)

Here is the output:

> small.ints <- to.dfs(1:10)
16/08/07 20:14:42 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/08/07 20:14:42 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found 
> out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
16/08/07 20:14:48 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.7.0.jar] /tmp/streamjob543400947433267521.jar tmpDir=null
16/08/07 20:14:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/08/07 20:14:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/08/07 20:14:50 INFO mapred.FileInputFormat: Total input paths to process : 1
16/08/07 20:14:50 INFO mapreduce.JobSubmitter: number of splits:2
16/08/07 20:14:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470447912721_0016
16/08/07 20:14:50 INFO impl.YarnClientImpl: Submitted application application_1470447912721_0016
16/08/07 20:14:50 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1470447912721_0016/
16/08/07 20:14:50 INFO mapreduce.Job: Running job: job_1470447912721_0016
16/08/07 20:15:00 INFO mapreduce.Job: Job job_1470447912721_0016 running in uber mode : false
16/08/07 20:15:00 INFO mapreduce.Job:  map 0% reduce 0%
16/08/07 20:15:12 INFO mapreduce.Job:  map 50% reduce 0%
16/08/07 20:15:13 INFO mapreduce.Job:  map 100% reduce 0%
16/08/07 20:15:13 INFO mapreduce.Job: Job job_1470447912721_0016 completed successfully
16/08/07 20:15:13 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=236342
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1001
        HDFS: Number of bytes written=244
        HDFS: Number of read operations=14
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters 
        Launched map tasks=2
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=19917
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=19917
        Total vcore-seconds taken by all map tasks=19917
        Total megabyte-seconds taken by all map tasks=9958500
    Map-Reduce Framework
        Map input records=3
        Map output records=0
        Input split bytes=208
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=115
        CPU time spent (ms)=1300
        Physical memory (bytes) snapshot=239222784
        Virtual memory (bytes) snapshot=2127200256
        Total committed heap usage (bytes)=121503744
    File Input Format Counters 
        Bytes Read=793
    File Output Format Counters 
        Bytes Written=244
16/08/07 20:15:13 INFO streaming.StreamJob: Output directory: /tmp/file10106a0b36b6
> from.dfs(out)
$key
NULL

$val
NULL

to.dfs and from.dfs themselves do work, since I tried the following:

> small.ints <- to.dfs(1:10)
16/08/07 07:15:34 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/08/07 07:15:34 INFO compress.CodecPool: Got brand-new compressor [.deflate]
> out <- from.dfs(small.ints)
16/08/07 07:15:44 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/08/07 07:15:44 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
> out
$key
NULL

$val
 [1]  1  2  3  4  5  6  7  8  9 10
@byu777

byu777 commented Aug 8, 2016

I figured this out. I had installed rmr2 from within RStudio, and somehow the library was not available to the script even though the mapreduce function seemed to run successfully. I was surprised that one of the task logs said rmr2 was not found, yet the job still gave me a _SUCCESS!

I eventually installed rmr2 fresh in R (using sudo R), along with the required packages reshape2 and caTools, and everything seems to work fine now.
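
For reference, this is roughly what I ran from a root R session (sudo R). It is only a sketch: the tarball name matches the one mentioned later in this thread, and your dependency versions may differ.

# inside sudo R, so the packages land in the system-wide library
install.packages(c("reshape2", "caTools"))
# install the rmr2 tarball itself (local file, not on CRAN); adjust the path/version to your download
install.packages("rmr2.tar.gz", repos = NULL, type = "source")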

@thecodeflash

Hey @byu777, I am facing the same problem. Can you please help me out with this? I tried installing rmr2 using R CMD INSTALL, but the output is still the same.

@Surender1984

Hi byu777/VJ-Vikvy, any luck solving this issue? I am also facing the same issue.

@thecodeflash

Hey @Surender1984, please install the rmr2 package using:

sudo R CMD INSTALL rmr2.tar.gz

But before doing that, install all the required packages as the root user; this should solve your issue. Please let me know if you face any problems.
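
For example, something along these lines (only a sketch; rmr2's exact dependency list can vary by version, so adjust the package vector as needed):

# install the dependencies into the system-wide library first, then rmr2 itself
sudo Rscript -e 'install.packages(c("Rcpp", "RJSONIO", "digest", "functional", "reshape2", "stringr", "plyr", "caTools"), repos = "https://cloud.r-project.org")'
sudo R CMD INSTALL rmr2.tar.gz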

Also, I think rmr2 and rhdfs are dead and we need to switch to Spark. What is your opinion on this?

@sharwinbobde

@byu777 Problem still not solved for me :(

Code:

#########################################################################
#########################################################################
Sys.setenv(HADOOP_HOME="/home/sharwin/Programs/hadoop-2.7.5") 
Sys.setenv(HADOOP_CMD="/home/sharwin/Programs/hadoop-2.7.5/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/home/sharwin/Programs/hadoop-2.7.5/share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-9-oracle")

library(rJava)
library(rhdfs)
library(rmr2)
library(reshape2)
library(caTools)

hdfs.init() 
# Clear previous output
hdfs.rmr('/test/out')

#============================================================

map <- function(k, lines) {
     # split each input line into words and emit a (word, 1) pair per word
     words.list <- strsplit(lines, '\\s')
     words <- unlist(words.list)
     return( keyval(words, 1) )
}

reduce <- function(word, counts) {
     # sum the 1s emitted for each word to get its total count
     keyval(word, sum(counts))
}

wordcount <- function (input, output=NULL) {
     mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}

## read text files from HDFS folder /test/data
hdfs.root <- '/test'
hdfs.data <- file.path(hdfs.root, 'data')

## save result in HDFS folder /test/out
hdfs.out <- file.path(hdfs.root, 'out')

## Submit job
out <- wordcount(hdfs.data, hdfs.out) 

## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')

head(results.df)

@thecodeflash

@sharwinbobde Please install the rmr2 package using:

sudo R CMD INSTALL rmr2.tar.gz

But before doing that, install all the required packages as the root user; this should solve your issue. Please let me know if you face any problems.
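
One quick way to check whether you have the same root cause discussed above (rmr2 installed only into a per-user RStudio library) is to confirm that rmr2 loads in a plain, non-interactive R session, which is close to what the streaming tasks launch. A sketch, run as the user that submits the job:

# outside RStudio; if this errors with "there is no package called 'rmr2'",
# the streaming map tasks will fail the same way and the job reports 0 map output records
Rscript -e 'library(rmr2); cat(.libPaths(), sep = "\n")'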
