Comparing results of Ghostery Study and Sample [WIP] #82
Conversation
No rush @PetalsOnWind, but whenever you'd like a review, go ahead and remove WIP from the title.
@birdsarah I removed the WIP. Can you please check? (There might be some repetitions from other ipynbs; I don't know if I should keep them.)
Can you clarify what you mean by "repetitions from other ipynbs"?
If you mean you have used other people's code in your analysis, that's fine, but you must add a citation. It will reflect poorly on your notebook if you haven't done so. Basic things like opening a file and filtering it are obviously the same for everyone, so use your judgement. If you know you copy-pasted something, cite it. I may be misinterpreting your meaning, and if so I apologize. But I want to give you the opportunity to fix citation omissions before I review.
Can you have a look now? @birdsarah
Yes!
You should run your work on a larger chunk of the dataset to get a result that can make a claim about the dataset.
`inplace` is not recommended and will be deprecated (pandas-dev/pandas#16529)
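For example, assigning the result back is a drop-in replacement (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"script_url": ["a.js", None, "b.js"]})

# instead of df.dropna(inplace=True), assign the result back
df = df.dropna()
```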
The columns 'argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4','argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments' are either empty or have unrecognizable values. No conclusion could be derived from the values in the current form.
Without a mission statement (what you're trying to accomplish and how) it's hard to know whether this is a valid statement.
googleanalytics is the most common script run, 1317 times.
Nice. You could expand this thought further. What proportion of scripts? What proportion of locations is it on? What proportion of the dataset? How does this relate to other references' claims about the prevalence of googleanalytics?
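A sketch of how those proportions could be computed (the frame and the column names `script_url` / `location` are stand-ins for whatever the notebook actually uses):

```python
import pandas as pd

# tiny illustrative frame; the notebook's real dataframe will differ
df = pd.DataFrame({
    "script_url": ["https://www.google-analytics.com/analytics.js",
                   "https://a.com/x.js",
                   "https://www.google-analytics.com/ga.js"],
    "location": ["site1.com", "site1.com", "site2.com"],
})

ga_mask = df["script_url"].str.contains("google-analytics", na=False)
share_of_calls = ga_mask.mean()  # proportion of all script calls
share_of_locations = df.loc[ga_mask, "location"].nunique() / df["location"].nunique()
print(f"{share_of_calls:.1%} of calls, on {share_of_locations:.1%} of locations")
```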
The way you are running through every column and using pandas' "unique" or "nunique" functions makes me think that you don't understand the original issue. Perhaps you do, but you have not made it clear how this work relates to the issue.
Everything up to the heading "Observations from Ghostery Study and Sample": it is good to see you exploring the data like this - it's very important for your understanding. But it doesn't seem like it's part of your analysis, so perhaps put it in a separate notebook, so I can get to the good stuff.
Don't render massive chunks of data in the notebook. They're a lot to scroll past and don't contribute to understanding as a reader. I totally understand that you would be looking at all that output while analyzing, but remember to clean it up and just output one or two rows when submitting.
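For example, something as simple as this at the end of a cell keeps the output readable (assuming `df` is the frame being inspected):

```python
df.head(2)  # render just two rows rather than the whole frame
```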
According to the observations in "Tracking the Trackers", Google Analytics, followed by Facebook Connect, are the most prevalent scripts. However, from the above observations, we can see that "gap", followed by Officeworks, are the most common script urls in this sample.
This is not what your code has shown. In fact, earlier in your document you note that googleanalytics is the most prevalent.
A search reveals that GAP stands for "Google Analytics Painless" which supports the article.
No. gap.com is the website you get when you go to gap.com (a clothing store).
The complete study Tracking the Trackers compares usage across different regions through IP addresses.
This surprised me (in a good way). There is another study by the same people also called Tracking The Trackers. So I was initially very confused. I hadn't read this paper so I appreciate the reference. Good job for linking because it made it easy for me to figure out what was going on. I think we should all now use the long title names of their studies to reduce further confusion. I also realize that I had linked to the wrong paper in the reading list. That said, there's no problem. The paper you've read and are citing is a follow on from the paper I was thinking of. I've now updated the reading list.
Your notebook doesn't run. At cell 68 you start referring to `ddf`, which hasn't been defined before. Clearly you've started working with dask at some point.
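One way to make those cells runnable would be to convert the pandas frame loaded earlier, though this is only a guess at the intent (`df` below is a stand-in for whatever frame the notebook builds):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"script_url": ["a.js", "b.js"]})  # stand-in for the frame loaded earlier
ddf = dd.from_pandas(df, npartitions=2)  # now later cells that reference `ddf` can run
```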
Your notebook doesn't show your geoip code finishing / running.
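For reference, a country lookup along these lines would let that step run to completion - assuming the `geoip2` package and a local GeoLite2 database file, since the notebook's actual approach isn't shown:

```python
import socket
import geoip2.database

# "example.com" and the .mmdb path are placeholders
ip = socket.gethostbyname("example.com")
with geoip2.database.Reader("GeoLite2-Country.mmdb") as reader:
    country = reader.country(ip).country.name
print(country)
```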
The United States accounts for the most page loads, matching the results of the Ghostery study. However, after the United States, the study found the most page loads in Russia, France and Germany, whereas here they are in Australia, China and Singapore. The differences could be due to the way the data was collected, as the study's data was collected from German users.
Good to recognize the difference in data collection. Explore this idea further. Ghostery used browsing data from their user base; we used crawl data based on the Alexa top 10k. (You can see some discussion of this under section 3.2 in the paper - although they're not referring to Overscripted, they're referring to a similar methodology.)
You are getting the ip address and then the country of the location (the page that was visited), which tallies with what the "Tracking the Trackers: Analysing the tracking landscape with GhostRank" study does in Figure 1. So showing a comparison of that is an interesting result. However, the IP address that you see when you run this code now is likely different from what I would see if I ran the same code now, and is almost certainly different from what was served to the crawl in 2017. So this difference needs to be explored and acknowledged. Moreover, the point of this initial country analysis is so that in Figure 4 the prevalence of trackers across different geographies can be compared. You haven't got there yet.
That said, their tracker database is available here: https://github.com/cliqz-oss/whotracks.me - I can see a path to labeling our dataset against their dataset and comparing, but I haven't thought through what that would or wouldn't tell us.
Also, if you can get your notebook into a clean and running format, I could run it here in the USA and we could see how different our results are.
Facebook has the most page loads in the United States and runs multiple Google Analytics scripts.
Nice. But it comes out of nowhere and doesn't relate to the cells before.
The sample follows some observations from the Ghostery study but deviates in some cases, which may be due to differences in the way data is collected. The rest of the study "Tracking the Trackers" seems to be based on the data that was collected as part of the Ghostery extension and may not be easily implemented using the current dataset. The methodology is also not mentioned in the paper, which adds to the difficulty of recreating it, especially relating to safe and unsafe values.
I'm not saying you should continue with this analysis, but I think you could squeeze more out of this if you wanted. I think you could think about the topics covered in 3.2 and 3.3 and how they relate to our dataset. That said, I think it's worth thinking about the data they use to detect trackers and the data we have, which is about JavaScript execution.
The call_id values are unique, which shows that unique values are being created. Further work would be required to determine how these unique ids are created.
The `call_id` was generated by me to give every row a unique id. It is generated by taking the filename and adding a counter suffix.
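Roughly along these lines (a sketch of the scheme described, not the actual generation code):

```python
import pandas as pd

filename = "crawl_chunk_01"  # placeholder filename
df = pd.DataFrame({"script_url": ["a.js", "b.js", "c.js"]})  # stand-in rows

# filename plus a counter suffix gives every row a unique id
df["call_id"] = [f"{filename}__{i}" for i in range(len(df))]
```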
But since all other columns have repeated values, it suggests that a unique identifier is not easily identified.
Just because some values are repeated does not mean that unique values are not present. They are different things. If you want to work on the original issue (#35), please re-read the issue and ask clarifying questions. Or state very precisely how your analysis relates to the question.
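A tiny illustration of the distinction (purely made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    "call_id": ["f1__0", "f1__1", "f1__2"],      # unique per row
    "script_url": ["ga.js", "ga.js", "fb.js"],   # repeated values
})
print(df["call_id"].is_unique)     # True: a unique identifier exists
print(df["script_url"].is_unique)  # False: repetition elsewhere doesn't change that
```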
Dask tip: Don't read in all the columns and then drop some. Use the `columns` argument to read only the columns you need. Things will be faster.
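For instance (the path and column names here are placeholders):

```python
import dask.dataframe as dd

# read only the columns the analysis needs, rather than loading everything and dropping
ddf = dd.read_parquet("sample.parquet", columns=["script_url", "location", "call_id"])
```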
OK. I see a lot of good stuff in your notebook and good capabilities with pandas. But there are two big things:
- your notebook is not clearly organized
- the questions you're going after aren't clear.
I think you can fix both of these if you clean-up and re-organize your notebook in a way where I can see the question you wanted to answer, an explanation of the approach you were taking, executing the code, the results of that approach, the pros and cons of that approach, and remaining questions / gaps etc.
You made a bold move trying to do a geographical analysis. It's hard. It was also one of my first thoughts when I started working with the dataset. I tried to use "whois" records to find out about the country of ownership for the domains involved. It was very unsuccessful!
I hope my comments didn't come across as too negative; they weren't intended as such. I've just been through and cleaned up a little of the language from the comments I made as I was scrolling through the notebook. Just to emphasize again: there's a lot of good stuff in there, and I look forward to seeing the next iteration. And congratulations - there's a LOT to get to grips with when getting started with this dataset, and you are well on your way.
Closing this PR due to lack of activity; please feel free to reopen.
Comparing the observations in Ghostery Study "Tracking the Trackers" (https://www.ghostery.com/lp/study/) and seeing how the results from the sample conform to that analysis.
Will close #35 after completion.