rmr2 mapreduce does not produce any output #179

Open
byu777 opened this issue Aug 7, 2016 · 6 comments

@byu777

byu777 commented Aug 7, 2016

I tried the following simple script with rmr2 on Cloudera Quickstart 5.7.0, but mapreduce does not generate any results. Here is the script:

small.ints <- to.dfs(1:10)
out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
from.dfs(out)

Here is the output:

> small.ints <- to.dfs(1:10)
16/08/07 20:14:42 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/08/07 20:14:42 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found 
> out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
16/08/07 20:14:48 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.7.0.jar] /tmp/streamjob543400947433267521.jar tmpDir=null
16/08/07 20:14:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/08/07 20:14:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/08/07 20:14:50 INFO mapred.FileInputFormat: Total input paths to process : 1
16/08/07 20:14:50 INFO mapreduce.JobSubmitter: number of splits:2
16/08/07 20:14:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470447912721_0016
16/08/07 20:14:50 INFO impl.YarnClientImpl: Submitted application application_1470447912721_0016
16/08/07 20:14:50 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1470447912721_0016/
16/08/07 20:14:50 INFO mapreduce.Job: Running job: job_1470447912721_0016
16/08/07 20:15:00 INFO mapreduce.Job: Job job_1470447912721_0016 running in uber mode : false
16/08/07 20:15:00 INFO mapreduce.Job:  map 0% reduce 0%
16/08/07 20:15:12 INFO mapreduce.Job:  map 50% reduce 0%
16/08/07 20:15:13 INFO mapreduce.Job:  map 100% reduce 0%
16/08/07 20:15:13 INFO mapreduce.Job: Job job_1470447912721_0016 completed successfully
16/08/07 20:15:13 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=236342
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1001
        HDFS: Number of bytes written=244
        HDFS: Number of read operations=14
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters 
        Launched map tasks=2
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=19917
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=19917
        Total vcore-seconds taken by all map tasks=19917
        Total megabyte-seconds taken by all map tasks=9958500
    Map-Reduce Framework
        Map input records=3
        Map output records=0
        Input split bytes=208
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=115
        CPU time spent (ms)=1300
        Physical memory (bytes) snapshot=239222784
        Virtual memory (bytes) snapshot=2127200256
        Total committed heap usage (bytes)=121503744
    File Input Format Counters 
        Bytes Read=793
    File Output Format Counters 
        Bytes Written=244
16/08/07 20:15:13 INFO streaming.StreamJob: Output directory: /tmp/file10106a0b36b6
> from.dfs(out)
$key
NULL

$val
NULL

to.dfs and from.dfs themselves do work, since I tried the following:

> small.ints <- to.dfs(1:10)
16/08/07 07:15:34 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/08/07 07:15:34 INFO compress.CodecPool: Got brand-new compressor [.deflate]
> out <- from.dfs(small.ints)
16/08/07 07:15:44 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/08/07 07:15:44 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
> out
$key
NULL

$val
 [1]  1  2  3  4  5  6  7  8  9 10
@byu777

byu777 commented Aug 8, 2016

I figured this out. I had installed rmr2 from within RStudio, and somehow the library was not available to the script even though the mapreduce function seemed to run successfully. I was surprised that one of the task logs said rmr2 was not found, yet the job still gave me a _SUCCESS!

I eventually installed rmr2 fresh in R (using sudo R), along with the required packages reshape2 and caTools, and everything seems to work fine now.
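
For reference, this is roughly what I ran from a root R session (sudo R). It is only a sketch: the tarball name matches the one mentioned later in this thread, and your dependency versions may differ.

# inside sudo R, so the packages land in the system-wide library
install.packages(c("reshape2", "caTools"))
# install the rmr2 tarball itself (local file, not on CRAN); adjust the path/version to your download
install.packages("rmr2.tar.gz", repos = NULL, type = "source")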

@thecodeflash

Hey @byu777, I am facing the same problem. Can you please help me out with this? I tried installing rmr2 using R CMD INSTALL, but the output is still the same.

@Surender1984

Hi byu777/VJ-Vikvy, any luck solving this issue? I am also facing the same issue.

@thecodeflash

Hey @Surender1984, please install the rmr2 package using:

sudo R CMD INSTALL rmr2.tar.gz

But before doing that, install all the required packages as the root user; this should solve your issue. Please let me know if you face any problems.
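
For example, something along these lines (only a sketch; rmr2's exact dependency list can vary by version, so adjust the package vector as needed):

# install the dependencies into the system-wide library first, then rmr2 itself
sudo Rscript -e 'install.packages(c("Rcpp", "RJSONIO", "digest", "functional", "reshape2", "stringr", "plyr", "caTools"), repos = "https://cloud.r-project.org")'
sudo R CMD INSTALL rmr2.tar.gz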

Also, I think rmr2 and rhdfs are dead and we need to switch to Spark. What is your opinion on this?

@sharwinbobde

@byu777 Problem still not solved for me :(

Code:

#########################################################################
#########################################################################
Sys.setenv(HADOOP_HOME="/home/sharwin/Programs/hadoop-2.7.5") 
Sys.setenv(HADOOP_CMD="/home/sharwin/Programs/hadoop-2.7.5/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/home/sharwin/Programs/hadoop-2.7.5/share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-9-oracle")

library(rJava)
library(rhdfs)
library(rmr2)
library(reshape2)
library(caTools)

hdfs.init() 
# Clear previous output
hdfs.rmr('/test/out')

#============================================================

map <- function(k, lines) {
     # split each input line into words and emit a (word, 1) pair per word
     words.list <- strsplit(lines, '\\s')
     words <- unlist(words.list)
     return( keyval(words, 1) )
}

reduce <- function(word, counts) {
     # sum the 1s emitted for each word to get its total count
     keyval(word, sum(counts))
}

wordcount <- function (input, output=NULL) {
     mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}

## read text files from HDFS folder /test/data
hdfs.root <- '/test'
hdfs.data <- file.path(hdfs.root, 'data')

## save result in HDFS folder /test/out
hdfs.out <- file.path(hdfs.root, 'out')

## Submit job
out <- wordcount(hdfs.data, hdfs.out) 

## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')

head(results.df)

@thecodeflash

@sharwinbobde Please install the rmr2 package using:

sudo R CMD INSTALL rmr2.tar.gz

But before doing that, install all the required packages as the root user; this should solve your issue. Please let me know if you face any problems.
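
One quick way to check whether you have the same root cause discussed above (rmr2 installed only into a per-user RStudio library) is to confirm that rmr2 loads in a plain, non-interactive R session, which is close to what the streaming tasks launch. A sketch, run as the user that submits the job:

# outside RStudio; if this errors with "there is no package called 'rmr2'",
# the streaming map tasks will fail the same way and the job reports 0 map output records
Rscript -e 'library(rmr2); cat(.libPaths(), sep = "\n")'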
