# -*- coding: utf-8 -*-
"""Wordcount in Spark.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1aJ4z1STVSQgiwKcaHFstU6NPLJYpilGg
# Wordcount in Spark
### Setup
Let's set up Spark in the Colab environment. Run the cell below!
"""
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
"""Lets authenticate a Google Drive client to download the file we will be processing in on Spark job."""
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# download the Shakespeare text (pg100.txt) from Google Drive by file id
file_id = '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('pg100.txt')
"""By executed the cells above, we can able to see the file *pg100.txt* under the "Files" tab on the left panel.
### Wordcount
After running the setup stage successfully, I am ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.
I am going to write a Spark application that outputs the number of words starting with each letter, i.e., for every letter, the total number of (non-unique) words that begin with it. For example, both "The" and "tempest" count toward the letter *t*. In my implementation, I am **ignoring letter case**, i.e., treating all words as lower case, and I am ignoring all words **starting** with a non-alphabetic character.
"""
from pyspark.sql import SparkSession

# create the Spark session
spark = SparkSession.builder.getOrCreate()
# get the Spark context from the session
sc = spark.sparkContext
!head -5 pg100.txt
# load the text file as an RDD of lines
pg100 = sc.textFile('pg100.txt')

counts = (
    pg100.flatMap(lambda line: line.split(" "))   # split each line into words
         .filter(lambda word: len(word) > 0)      # drop empty strings
         .filter(lambda word: ord(word.lower()[0]) in range(ord('a'), ord('z') + 1))  # keep only words starting with a letter
         .map(lambda word: (word.lower()[0], 1))  # key each word by its lower-cased first letter
         .reduceByKey(lambda a, b: a + b)         # sum the counts per letter
)
# trigger the computation and return the (letter, count) pairs to the driver
counts.collect()
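"""The list returned by collect() is in no particular order; one way to get an alphabetical view of the per-letter counts is to sort the pair RDD by key first."""
# optional: alphabetically ordered (letter, count) pairs
counts.sortByKey().collect()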
# persist the result; saveAsTextFile writes a directory named char_count.txt
counts.saveAsTextFile("char_count.txt")
sc.stop()
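"""saveAsTextFile writes its output as a directory of part files. Listing the directory and peeking at the first part file (assuming the default part-00000 naming) shows the saved (letter, count) pairs."""
!ls char_count.txt
!head -5 char_count.txt/part-00000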