
[WIP] Issue #22 - Data Wrangling #77

Closed
wants to merge 3 commits into from

Conversation

aliamcami
Collaborator

@aliamcami aliamcami commented Mar 22, 2019

Pull request related to issue #22

Goal

Identify the "huge" values. But how big is "huge"?
Since I cannot answer that exactly, my solution was to find the most common and largest values and start analysing from there.

Approach

Given my machine and knowledge limitations, I had to find a way to drastically reduce the data into something I could actually work with.

The first thing I did was to look at the biggest values in the raw data. I had quite a bit of trouble getting this because of memory and processing power, but I did.

I noticed that even when the whole value was different, the beginning of it was usually something common, like the name of a service.

I decided to use this "first" word as the key to group by, and I named it "domain". By doing so I could count the occurrences, see which ones are the most common, and analyse those.
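
A minimal sketch of that grouping step, assuming the data is in a pandas DataFrame with a string column named value; the file path, column names, and the exact "first word" delimiter are my assumptions, not the actual notebook code:

```python
import pandas as pd

# Illustrative only: load a sample of the crawl data (path is made up).
df = pd.read_parquet("crawl_sample.parquet", columns=["value"])

# Use the first whitespace-delimited token of each value as a rough
# grouping key ("domain"), and record each value's length in characters.
df["value_domain"] = df["value"].str.split().str[0]
df["value_len"] = df["value"].str.len()

# Count occurrences per "domain" to see which prefixes are most common.
print(df["value_domain"].value_counts().head(10))
```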

Filtering

Since the problem is the big values, the smaller ones are not interesting here, so we can filter them out.

Even before the group by, I filtered out every row whose value was 2000 characters or fewer. That considerably reduced the data I had to work with in the group-by step.

I was still left with too many rows after the group by, so I decided that any "domain" without at least one occurrence of at least 5000 characters was not worth my time for this specific task, and filtered those out as well.
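
Continuing the same sketch, the two filters described above could look like this (the 2000 and 5000 character thresholds come from the text; everything else is assumed):

```python
# Drop every row whose value is 2000 characters or fewer.
long_values = df[df["value_len"] > 2000]

# Keep only "domains" that have at least one occurrence of 5000+ characters.
max_len = long_values.groupby("value_domain")["value_len"].max()
keep_domains = max_len[max_len >= 5000].index
filtered = long_values[long_values["value_domain"].isin(keep_domains)]
```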

Data Wrangling

At this point I was left with 220 groups, keyed by 'domain', with new statistics for each:

  • standard deviation of value_len;
  • mean value_len in the group;
  • min value_len found in that group;
  • max value_len found in that group;
  • count of how many occurrences make up that group.

With this new data I could sort by count and see which domains had the most occurrences. Since the smaller values were already filtered out, the top ones are the ones that interest me most for this specific task.
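
The per-group statistics and the sort by count might be produced like this (again a sketch on top of the assumed filtered frame from above, not the actual code):

```python
# Per-domain summary statistics over value_len, ordered by group size.
stats = (
    filtered.groupby("value_domain")["value_len"]
    .agg(["mean", "std", "min", "max", "count"])
    .sort_values("count", ascending=False)
)

# The ten largest groups, i.e. the "Top 10" table below.
print(stats.head(10))
```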

Top 10

| value_domain | mean | std | min | max | count |
| --- | --- | --- | --- | --- | --- |
| {"ScribeTransport". | 4128.59 | 1406.46 | 2001 | 7211 | 93409 |
| {"ins-today-sId". | 5037.69 | 14446.52 | 2002 | 87748 | 60426 |
| {"criteo_pt_cdb_metrics_expires". | 9529.66 | 53326.72 | 2003 | 692032 | 47543 |
| font-face{font-family. | 162363.28 | 172503.75 | 2634 | 648067 | 45059 |
| {"CLOUDFLARE. | 514484.07 | 634151.12 | 4356 | 3253324 | 42660 |
| {"__qubitUACategorisation". | 64927.71 | 105887.48 | 2018 | 368966 | 40003 |
| Na9BL8mAQgqyMAy1zxOlJg$0. | 2236.68 | 178.84 | 2001 | 3312 | 37945 |
| 935971. | 3726.06 | 396.41 | 3248 | 4695 | 33010 |
| {"insdrSV". | 4026.30 | 12823.05 | 2002 | 191041 | 32981 |
| 834540. | 2218.71 | 216.20 | 2001 | 2864 | 32117 |

Cloudflare

The one I found most interesting was the 'cloudflare' group, because out of the top 10 it is the one with by far the biggest min and max.

I decided to take a look at the raw value for the biggest "cloudflare" appearances and found that they are JSONs for (probably) some kind of configuration, or just organized scraped data. For example, most have some kind of font definition, or a Google Analytics script.

For example:

And many more. The "value" field also contains the response for each of these requests, which is probably why it is so big.
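
A sketch of how the biggest "cloudflare" values could be pulled out for inspection (the {"CLOUDFLARE prefix comes from the table above; the rest is assumed, building on the filtered frame from the earlier sketches):

```python
# Select the CLOUDFLARE group and look at its largest raw values.
cloudflare = filtered[filtered["value_domain"].str.startswith('{"CLOUDFLARE')]
for raw in cloudflare.nlargest(5, "value_len")["value"]:
    print(raw[:500])  # print only the first 500 characters of each blob
```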

Next steps

I think it would be good to verify what the other top ones are. But from a very initial and superficial look, I'm guessing that they are also some sort of configuration JSON.

@aliamcami aliamcami changed the title Issue22 to Issue22 - Data Wrangling Mar 22, 2019
@aliamcami
Collaborator Author

My (personal) Limitations

I had some (lots of) limitations and difficulties in dealing with this large data set.

  • First off, I'm a complete beginner at data analysis and I had absolutely no idea how to deal with it (I had to study a lot and I'm pretty sure I'm doing some weird and stupid things, but how else can I learn, right?).
  • Second, the computer I have right now could not handle data this big; even the samples were too big.
  • Third (the solution), I got a virtual machine from Vultr, but due to budget limitations I could not get one "that" good, just good enough to do some basic stuff. The machine spec is 2 vCPUs, 4 GB RAM, and an 80 GB SSD.

@birdsarah
Contributor

Review done in chat.

@aliamcami aliamcami changed the title Issue22 - Data Wrangling to [WIP] Issue #22 - Data Wrangling Mar 22, 2019
@aliamcami aliamcami closed this Mar 22, 2019