Eluvio_DS_Challenge

Problem Statement

The dataset is tabular and the features involved should be self-explanatory. We would like for you to come up with a specific problem yourself and solve it properly. This is an “open challenge,” mainly focusing on natural language processing. The problem could be either about predictive modeling or providing analytical insights for some business use cases. Note the problem should be treated as large-scale, as the dataset is large (e.g., >100GB) and will not fit into the RAM of your machine. Python is strongly recommended in terms of the coding language.

Overview

The dataset is assumed to be far larger than the available RAM (the problem statement gives >100 GB as an example), so pandas, which loads the data fully into memory, is a poor fit. To overcome this we use the Dask library, whose DataFrame API is quite similar to pandas.

To illustrate how Dask works, suppose we have 100 GB of data. For a row-wise operation, a Dask DataFrame breaks the data into, say, 100 chunks (partitions). It brings one chunk into RAM, performs the computation, and moves the result out of the way (to disk if necessary). It then repeats this for the other 99 chunks. If your machine has 4 cores and enough RAM to hold 4 chunks at once, the 4 chunks are processed in parallel and the operation finishes in roughly a quarter of the time. Dask handles this scheduling in the background, so you rarely need to manage cores or memory by hand.
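The chunk-at-a-time idea can be sketched in plain Python. This is only an illustration of the principle, not how Dask is implemented; the function name `chunked_sum`, the column name `up_votes`, and the toy data are assumptions made for the example.

```python
import csv
import io
from itertools import islice


def chunked_sum(lines, column, chunk_rows):
    """Sum one numeric column while holding at most `chunk_rows` rows in memory."""
    reader = csv.DictReader(lines)
    total = 0
    chunks = 0
    while True:
        # Load the next chunk only; earlier chunks are already discarded.
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            break
        # Compute a partial result for this chunk and fold it into the total.
        total += sum(int(row[column]) for row in chunk)
        chunks += 1
    return total, chunks


# Toy stand-in for a CSV file that would not fit in RAM (hypothetical data).
sample = io.StringIO("title,up_votes\na,1\nb,2\nc,3\nd,4\ne,5\n")
total, chunks = chunked_sum(sample, "up_votes", chunk_rows=2)
print(total, chunks)
```

Only one chunk is alive at any time, which is why the peak memory depends on the chunk size rather than on the file size. Dask additionally runs several such chunks in parallel across workers.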

Prerequisites & Importing libraries

dask[complete]
nltk
seaborn
wordcloud

Install all the dependencies with pip inside a Colab notebook:

!pip3 install "dask[complete]" nltk seaborn wordcloud

Import the necessary Dask and NLTK libraries with these commands:

import nltk
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
from wordcloud import WordCloud

nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('stopwords')  # stop-word lists

client = Client(n_workers=4)  # local cluster with 4 worker processes

Data Preprocessing

The following code loads the data into a Dask DataFrame. The blocksize argument (the number of bytes by which to cut up larger files) sets the size of each partition and therefore how much RAM each task needs.

from dask import dataframe as dd
df = dd.read_csv(
    '/home/aditya/euvio challenge/Eluvio_DS_Challenge.csv', 
    delimiter=',',
    blocksize=64000000,  # ~64 MB per partition
)
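As a rough back-of-the-envelope check, the number of partitions is about the file size divided by the blocksize. The sketch below assumes a single 100 GB file and the 4 workers used earlier; the helper name `estimate_partitions` is hypothetical, and the parsed in-memory size of a partition usually differs from its size on disk.

```python
def estimate_partitions(file_size_bytes, blocksize_bytes):
    """Approximate partition count: ceiling of file size over blocksize."""
    return -(-file_size_bytes // blocksize_bytes)  # integer ceiling division


file_size = 100 * 10**9   # assumed 100 GB CSV file
blocksize = 64_000_000    # same blocksize as in the read_csv call
n_workers = 4             # same as Client(n_workers=4)

n_partitions = estimate_partitions(file_size, blocksize)
# Raw bytes read at the same time if every worker handles one partition.
bytes_in_flight = n_workers * blocksize

print(n_partitions, bytes_in_flight)
```

A smaller blocksize lowers the memory needed per task but creates more partitions, and therefore more scheduling overhead.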

Next, we compute summary statistics for the whole dataset using this code:

df.describe(include="all").compute()

From this we can see that the "category" and "down_votes" features are redundant, since they hold the same value for every data point, so we can drop them. We then modify the DataFrame further using NLTK and basic math operations.
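The check for redundant columns can also be written as a single streaming pass. The sketch below is a plain-Python illustration on toy data, not the notebook's code; `constant_columns` is a hypothetical helper and the sample values are made up. On a Dask DataFrame, the columns it reports could then be removed with `df.drop(columns=[...])`.

```python
import csv
import io


def constant_columns(lines):
    """Return the columns that hold a single distinct value across all rows."""
    reader = csv.DictReader(lines)
    seen = {}  # column name -> set of distinct values, capped at 2 entries
    for row in reader:
        for col, value in row.items():
            values = seen.setdefault(col, set())
            if len(values) < 2:  # two distinct values already prove "not constant"
                values.add(value)
    return sorted(col for col, values in seen.items() if len(values) == 1)


# Toy stand-in for the dataset (hypothetical values).
sample = io.StringIO(
    "title,up_votes,down_votes,category\n"
    "first headline,3,0,news\n"
    "second headline,8,0,news\n"
    "third headline,1,0,news\n"
)
redundant = constant_columns(sample)
print(redundant)
```

Capping each set at two values keeps the memory use small even for columns with millions of distinct entries.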

Inference

Top words for which upvotes>500

We can analyse which words are most common in the "title" of posts with more than 500 upvotes.
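A minimal sketch of the counting step, assuming each row is a (title, up_votes) pair. The small `STOP_WORDS` set is a stand-in for the NLTK stop-word list downloaded earlier, and the rows are invented for illustration.

```python
import re
from collections import Counter

# Stand-in for the NLTK English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "on", "is"}


def top_words(rows, min_up_votes=500, n=3):
    """Count non-stop-words in titles whose upvotes exceed `min_up_votes`."""
    counts = Counter()
    for title, up_votes in rows:
        if up_votes > min_up_votes:
            words = re.findall(r"[a-z']+", title.lower())
            counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical sample rows: (title, up_votes)
rows = [
    ("Election results announced in the capital", 900),
    ("Capital hosts climate summit", 750),
    ("Local bakery wins award", 12),
    ("Climate talks stall in the capital", 1200),
]
result = top_words(rows, n=2)
print(result)
```

Because `Counter` objects can be added together, the same function could be run on each partition separately and the partial counts merged afterwards.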

Effect of time on total up votes

This answers the question of at what time of day a headline should be posted.
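One way to express this aggregation is to total the upvotes per hour of day. The sketch assumes each row carries a Unix timestamp and an upvote count, and that hours are taken in UTC; both are assumptions, as are the helper name `up_votes_by_hour` and the toy rows.

```python
from collections import defaultdict
from datetime import datetime, timezone


def up_votes_by_hour(rows):
    """Total upvotes per hour of day (UTC) from (unix_timestamp, up_votes) rows."""
    totals = defaultdict(int)
    for timestamp, up_votes in rows:
        hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
        totals[hour] += up_votes
    return dict(sorted(totals.items()))


# Hypothetical sample rows: (unix_timestamp, up_votes)
rows = [
    (0, 10),          # 00:00 UTC
    (1800, 5),        # 00:30 UTC
    (3600 * 13, 40),  # 13:00 UTC
    (3600 * 37, 7),   # 13:00 UTC on the following day
]
totals = up_votes_by_hour(rows)
best_hour = max(totals, key=totals.get)
print(totals, best_hour)
```

On a Dask DataFrame the same idea would be a groupby on the extracted hour followed by a sum and `.compute()`.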

Effect of over 18 posts on up votes

By comparing the graphs we can see whether over 18 posts have any effect on getting more votes.
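The comparison behind such a graph can be sketched as a per-group summary. The sketch assumes each row is an (over_18, up_votes) pair; the helper name `summarize_by_flag` and the sample values are hypothetical. Reporting the median next to the mean is a design choice made here, since a few very popular posts can dominate the mean.

```python
from statistics import mean, median


def summarize_by_flag(rows):
    """Group upvotes by the over_18 flag and report count, mean and median."""
    groups = {True: [], False: []}
    for over_18, up_votes in rows:
        groups[bool(over_18)].append(up_votes)
    return {
        flag: {"count": len(votes), "mean": mean(votes), "median": median(votes)}
        for flag, votes in groups.items()
        if votes  # skip a group that has no posts
    }


# Hypothetical sample rows: (over_18, up_votes)
rows = [(False, 10), (False, 30), (True, 100), (False, 20), (True, 300)]
summary = summarize_by_flag(rows)
print(summary)
```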

adijindal30/Big-Data-Analysis-Dask-df
