Analyses issue #22 and TLD [WIP] #98

Soumya0803 · 2019-04-01T21:03:34Z

No description provided.

Soumya0803 · 2019-04-01T21:10:05Z

@birdsarah I have submitted an initial analysis on #22 and small analyses on the TLDs. I will add more to this. For the TLD folder, tld_analysis is the main notebook, in which the others are linked.
Please review the work done so far.

birdsarah · 2019-04-06T01:16:37Z

Your issue_22 notebook has a merge conflict and I had to manually edit it to get it to run.

birdsarah

Notes as I go:

Issue 22 notebook:

Good intro. You could go further and explain more about how you're going to show the "relation between symbol and value".
Avoid outputting large amounts of output into the notebook - hundreds or thousands of rows - it makes the notebook unnecessarily large and is a lot to scroll past. Of course you will want to look at these values when exploring, but clean up before
"The value corresponding to window.document.cookie symbol for HubSpot contains the name of cookies set in a visitor's browser by Hubspot along with some alphanumeric value assigned to it." Maybe. The 'operation' column will tell you whether it's a get or set. Why focus on just cookies? What about localStorage?
"Look at the value columns above and you can find these names in it." It's better to provide a filter and some kind of extraction to help show the points you're trying to make. Just scanning through values isn't rigorous evidence.
"This cookie is used to determine and save whether the chat widget is open for future visits" and follow-on claims. Very interesting! But how did you know all this? It's not evident from the data.
There's lots of good exploration here, but I'm not sure where you're going. I see your work in progress statement at the end, but it doesn't indicate to me how you're going to accomplish things.
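
A rough sketch of what "a filter and some kind of extraction" could look like, in plain Python so it runs on its own. Everything specific here is an assumption for illustration: the rows are invented, the column names `symbol`, `operation` and `value` are guesses at the dataset's layout, and the cookie names are placeholders rather than values taken from the notebook. It also assumes that a 'get' returns a "name=value; name=value" string while a 'set' carries one cookie followed by attributes such as path.

```python
from collections import Counter

# Hypothetical rows; in the notebook these would come from the crawl data.
rows = [
    {"symbol": "window.document.cookie", "operation": "get",
     "value": "hubspotutk=abc123; __hstc=xyz.1.2"},
    {"symbol": "window.document.cookie", "operation": "set",
     "value": "messagesUtk=def456; path=/"},
    {"symbol": "window.localStorage", "operation": "get",
     "value": "{}"},
    {"symbol": "window.document.cookie", "operation": "get",
     "value": "hubspotutk=abc123"},
]


def cookie_names(value, operation):
    """Extract cookie names from a document.cookie value string."""
    parts = value.split(";")
    if operation == "set":
        # Assumption: on a set, only the first pair is the cookie itself;
        # the remaining pairs are attributes like path or expires.
        parts = parts[:1]
    names = []
    for part in parts:
        name, sep, _ = part.strip().partition("=")
        if sep and name:
            names.append(name)
    return names


# Filter: keep only the cookie rows.
cookie_rows = [r for r in rows if r["symbol"] == "window.document.cookie"]

# Extraction: count each cookie name, split by get/set.
names_by_operation = Counter()
for r in cookie_rows:
    for name in cookie_names(r["value"], r["operation"]):
        names_by_operation[(r["operation"], name)] += 1

for (operation, name), count in sorted(names_by_operation.items()):
    print(operation, name, count)
```

A small table like this is easier to check than scrolling through raw values, and the same filter can be pointed at the localStorage rows.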

TLD analysis:

This is a really interesting idea.
Instead of transcribing you could use counter.most_common(10) (or however many you want)
To help dask you can do dff.script_netloc.apply(get_end_of_net_loc, meta='O'). O is the object dtype, which is what pandas uses for strings.
You are counting by number of calls, what does that tell you? What are potential biases with these numbers? Would a metric like number of scripts change things?
You write .net(network) which to me implies that you're saying net is equivalent to network. Is that what you mean?
Nice summary of Acar and canvas fingerprinting
To answer your question "Is the script contributing to fingerprinting everytime it is called or there are specific instances?" I would say the answer is yes because you've used fairly precise heuristics to generate those lists and are more likely to have missed some candidates than got too many false positives.
This is a really interesting take. I'd like to see more comparison, pulling the data together and looking perhaps at relative frequency. How do the distributions compare to the distribution of TLDs for script_urls overall and the distribution of TLDs for locations?
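
To make the calls-versus-scripts question and the relative-frequency comparison concrete, here is a minimal sketch using only the standard library. The URLs are made up, and the body of `get_end_of_net_loc` is a naive stand-in (last dot-separated label of the host), since the notebook's own implementation isn't shown in this thread.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up script URLs, one entry per recorded call.
calls = [
    "https://cdn.example.com/a.js",
    "https://cdn.example.com/a.js",
    "https://cdn.example.com/a.js",
    "https://static.example.net/b.js",
    "https://tracker.example.org/c.js",
    "https://widgets.example.org/d.js",
]


def get_end_of_net_loc(netloc):
    """Naive stand-in: last dot-separated label of the host, port dropped."""
    host = netloc.split(":")[0]
    return host.rsplit(".", 1)[-1]


def tld_counts(urls):
    """Count TLDs over an iterable of URLs."""
    return Counter(get_end_of_net_loc(urlparse(u).netloc) for u in urls)


def relative_frequency(counts):
    """Turn raw counts into shares that sum to 1."""
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}


# Metric 1: every call counts, so one chatty script can dominate.
calls_per_tld = tld_counts(calls)

# Metric 2: each distinct script counts once.
scripts_per_tld = tld_counts(set(calls))

print(calls_per_tld.most_common(1))
print(scripts_per_tld.most_common(1))
print(relative_frequency(calls_per_tld))
```

In this toy data the top TLD differs between the two metrics, which is the kind of bias the question about number of calls is getting at. Relative frequencies also make it possible to compare lists of different sizes side by side.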

Overall. Really great work. Seems like the #22 analysis didn't pan out. That's okay. If there are specific answers in there, let's clean up the notebook to demonstrate that. All the copying and pasting of numbers around is very prone to errors. I would suggest saving your results from running the fingerprinting analysis and TLD extraction to, for example, JSON. Those JSON files will be small, so it's okay to commit them to the repo. While committing data is generally frowned upon, small derived datasets are okay. Then you can read them in directly to the TLD_analysis notebook. I'd like to see you lay out a question more clearly at the start of the TLD analysis and then, with the work you've already got and perhaps a little more, clearly answer it. You're very close. This is a very nice idea and is well put together.
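
One possible shape for the save-and-reload step, as a sketch only. The counts, the file name `canvas_tld_counts.json` and the helper names are all hypothetical, and a temporary directory stands in for wherever the files would live in the repo.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Invented numbers standing in for the output of one fingerprinting notebook.
canvas_tld_counts = Counter({"com": 120, "net": 15, "ru": 4})


def save_counts(counts, path):
    """Write a Counter to a small, diff-friendly JSON file."""
    text = json.dumps(dict(counts), indent=2, sort_keys=True)
    Path(path).write_text(text)


def load_counts(path):
    """Read the JSON file back into a Counter."""
    return Counter(json.loads(Path(path).read_text()))


# A temp dir keeps the sketch self-contained; in the repo this could be
# a small data directory next to the notebooks (an assumption).
out_file = Path(tempfile.mkdtemp()) / "canvas_tld_counts.json"

# In the notebook that produces the results:
save_counts(canvas_tld_counts, out_file)

# In the TLD analysis notebook:
reloaded = load_counts(out_file)
print(reloaded.most_common(2))
```

Reading the numbers from a file removes the manual transcription step, so the summary notebook cannot drift from the results it is summarising.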

Soumya0803 · 2019-04-07T02:42:33Z

Thanks @birdsarah.
I will work on all the things you mentioned.
I'll get into the practice of keeping my notebook clean by not adding large amounts of data.
About localStorage: as I mentioned, it is a WIP. The localStorage values are what I planned to understand next and find some meaning in.

"This cookie is used to determine and save whether the chat widget is open for future visits" and follow-on claims. Very interesting! But how did you know all this? It's not evident from the data.

I found this on their website, where the cookies being used are listed. I'll try to look further and find evidence for it in the data.

Soumya0803 · 2019-04-07T03:15:03Z

Instead of transcribing you could use counter.most_common(10)

To help dask you can do dff.script_netloc.apply(get_end_of_net_loc, meta='O'). O is the object dtype, which is what pandas uses for strings.

Thanks for mentioning these, I will do these changes.

To answer your question "Is the script contributing to fingerprinting everytime it is called or there are specific instances?" I would say the answer is yes because you've used fairly precise heuristics to generate those lists and are more likely to have missed some candidates than got too many false positives.

Thank you for answering this I'll update it in the notebook.

Overall. Really great work.
Thanks a lot.

I will work more on issue 22, as the value columns have a lot more information and I'll have to dig deeper. I'll add what more I'm planning to work on at the end of the notebook to indicate how I'm going to accomplish things.

birdsarah · 2019-04-07T04:57:14Z

I'll add what more I'm planning to work on at the end of the notebook to indicate how I'm going to accomplish things.

I look forward to that. I'm eager to see your response / thoughts on this question:

You are counting by number of calls, what does that tell you? What are potential biases with these numbers? Would a metric like number of scripts change things?

aliamcami · 2019-10-24T20:38:49Z

Hi @Soumya0803, is this ready for review?

aliamcami · 2019-10-29T21:45:18Z

Closing this PR due to lack of activity, please feel free to reopen.

Soumya0803 · 2019-10-30T03:47:53Z

Hi @aliamcami, I have worked on some of the points mentioned in the review. I look forward to continuing to work on this PR.

birdsarah · 2019-10-30T16:23:55Z

Thanks @Soumya0803, I'm sorry for the stagnation. We'll take a look at this and your other PR.

Soumya0803 added 5 commits April 2, 2019 00:30

issue22_analysis

1ffb9f3

auio_tld

f80ea32

canvas_tld

769ca3d

font_tld

2a1d239

webRTC_tld

25ade4e

tld_analysis

9771f53

birdsarah suggested changes Apr 6, 2019

Soumya0803 changed the title ~~Analyses issue #22 and TLD~~ Analyses issue #22 and TLD [WIP] Apr 10, 2019

Soumya0803 added 2 commits April 16, 2019 23:32

location_analysis

117f962

location_analysis_addition

0821b35

Soumya0803 force-pushed the analyses_outreachy branch from 573e61c to 0821b35 Compare April 17, 2019 12:47

location_analysis_TLD_fingerprinting

33f8fd9

aliamcami closed this Oct 29, 2019

birdsarah reopened this Oct 30, 2019
