Pull request related to issue #22
Goal
Identify the "huge" values. But how big counts as "huge"?
Since I cannot answer that exactly, my solution was to find the most common and biggest values and start analysing from there.
Approach
Given the limitations of my machine and my knowledge, I had to find a way to drastically reduce the data into something I could actually work with.
The first thing I did was look at the biggest values in the raw data. I had quite a bit of trouble doing this because of memory and processing power, but I managed it.
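One possible way to do that first pass, assuming the raw data sits in a CSV with a "value" column (both the file name and the column name here are assumptions, since the storage format isn't described), is to scan it in chunks so memory stays bounded:

```python
import pandas as pd

top_rows = []

# Hypothetical file/column names; the real dataset layout isn't described here.
for chunk in pd.read_csv("data.csv", usecols=["value"], chunksize=200_000):
    chunk = chunk.assign(length=chunk["value"].str.len())
    # Keep only each chunk's longest values so only a tiny slice stays in memory.
    top_rows.append(chunk.nlargest(10, "length"))

biggest = pd.concat(top_rows).nlargest(10, "length")
print(biggest[["length", "value"]])
```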
I noticed that even when the whole value was different, the beginning of it was usually something common, like the name of the service.
I decided to use this first word as a key to group by, and I named it "domain". By doing so I could now count the occurrences to see which ones are the most common and analyse those.
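A minimal sketch of that grouping key, assuming a pandas DataFrame with a "value" column and a whitespace split for the "first word" (both are assumptions on my part):

```python
import pandas as pd

# Toy stand-in for the real table; the real data is much larger.
df = pd.DataFrame({"value": ["cloudflare {...long json...}", "cloudflare {...}", "foo bar"]})

# Use the first whitespace-separated token of each value as the "domain" key.
df["domain"] = df["value"].str.split().str[0]

# Count occurrences per domain, most common first.
print(df["domain"].value_counts())
```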
Filtering
Since the problem is the big values, the smaller ones are not interesting here, so we can filter them out.
Even before the group by, I filtered out every row whose value is 2000 characters or shorter. That reduced quite a bit of the data I had to work with in the group by step.
I was still left with too many rows after the group by, so I decided that any "domain" without at least one occurrence of at least 5000 characters was not worth my time for this specific task, and I filtered those out as well.
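Both filtering steps, continuing with the hypothetical DataFrame from the sketch above, could look roughly like this:

```python
# 1) Drop every row whose value is 2000 characters or shorter.
df["length"] = df["value"].str.len()
df = df[df["length"] > 2000]

# 2) Keep only domains that have at least one value of 5000+ characters.
max_len_per_domain = df.groupby("domain")["length"].transform("max")
df = df[max_len_per_domain >= 5000]
```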
Data Wrangling
At this point I was left with 220 groups, categorized by "domain", each with new aggregate data such as the occurrence count and the minimum and maximum value length.
With this new data I could sort by count and see which domains had the most occurrences. Since the smaller values were already filtered out, the top ones are the ones that interest me most for this specific task.
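Roughly, the per-domain statistics and the ordering could be produced like this (same hypothetical DataFrame as above; the exact set of aggregates in the real analysis may differ):

```python
# Per-domain stats: occurrence count plus smallest and largest value length.
stats = (
    df.groupby("domain")["length"]
      .agg(count="count", min="min", max="max")
      .sort_values("count", ascending=False)
)
print(stats.head(10))  # the top 10 discussed below
```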
Top 10
Cloudflare
The one I found most interesting was the "cloudflare" group because, out of the top 10, it is the one with by far the biggest min and max.
I decided to take a look at the raw values for the biggest "cloudflare" appearances and identified that they are JSONs for (probably) some kind of configuration, or just organized scraped data. For example, most have some kind of font definition or Google Analytics script.
For example:
And many more. The "value" field also contains the response for each of these requests, which is probably why it is so big.
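For reference, a sketch of how the biggest "cloudflare" rows could be pulled out for that kind of manual inspection (same hypothetical DataFrame and columns as above):

```python
# Grab the longest values in the "cloudflare" group and eyeball them.
cloudflare = df[df["domain"] == "cloudflare"]
for raw in cloudflare.nlargest(3, "length")["value"]:
    print(raw[:500], "...")  # print only a prefix; the full JSONs are huge
```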
Next steps
I think it would be good to verify what the other top ones are. But from a very initial and superficial look, I'm guessing that they are also some sort of configuration JSON.