This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Analyses issue #22 and TLD [WIP] #98

Open
wants to merge 9 commits into
base: master
from

Conversation

Soumya0803
Contributor

No description provided.

@Soumya0803
Contributor Author

Soumya0803 commented Apr 1, 2019

@birdsarah I have submitted an initial analysis on #22 and small analyses on the TLDs. I will add more to this. For the TLD folder, tld_analysis is the main notebook, in which the others are linked.
Please review the work done so far.

@birdsarah
Contributor

Your issue_22 notebook has a merge conflict and I had to manually edit it to get it to run.

Contributor

@birdsarah birdsarah left a comment


Notes as I go:

Issue 22 notebook:

  • Good intro. You could go further and explain more about how you're going to show the "relation between symbol and value".
  • Avoid writing large amounts of output into the notebook - hundreds or thousands of rows - it makes the notebook unnecessarily large and is a lot to scroll past. Of course you will want to look at these values when exploring, but clean up before
  • "The value corresponding to window.document.cookie symbol for HubSpot contains the name of cookies set in a visitor's browser by Hubspot along with some alphanumeric value assigned to it." Maybe - the 'operation' column will tell you whether it's a get or set. Why focus on just cookies? What about localStorage?
  • "Look at the value columns above and you can find these names in it." It's better to provide a filter and some kind of extraction to help show the points you're trying to make. Just scanning through values isn't rigorous evidence.
  • "This cookie is used to determine and save whether the chat widget is open for future visits" and follow-on claims. Very interesting! But how did you know all this? It's not evident from the data.
  • There's lots of good exploration here, but I'm not sure where you're going. I see your work in progress statement at the end, but it doesn't indicate to me how you're going to accomplish things.
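
A minimal sketch of the "filter and some kind of extraction" idea from the notes above, using only the standard library. The rows, the column names other than 'operation' and 'value', and the cookie names are illustrative assumptions, not values taken from the notebook. It assumes a 'set' value carries one cookie followed by attributes such as path, while a 'get' value lists every cookie visible to the script.

```python
from collections import Counter

# Made-up sample rows; the real dataset's columns and values may differ.
rows = [
    {"symbol": "window.document.cookie", "operation": "set",
     "value": "hs-messages-is-open=true; path=/"},
    {"symbol": "window.document.cookie", "operation": "get",
     "value": "__hstc=abc123; hubspotutk=def456"},
    {"symbol": "window.localStorage", "operation": "get", "value": "{}"},
]


def cookie_names(value, operation):
    """Extract cookie names from a document.cookie style string."""
    parts = [p.strip() for p in value.split(";") if "=" in p]
    if operation == "set":
        # On a set, only the first pair is the cookie; the rest are attributes.
        parts = parts[:1]
    return [p.split("=", 1)[0] for p in parts]


# Count (operation, cookie name) pairs, restricted to cookie accesses.
counts = Counter()
for row in rows:
    if row["symbol"] == "window.document.cookie":
        for name in cookie_names(row["value"], row["operation"]):
            counts[(row["operation"], name)] += 1

for (operation, name), n in sorted(counts.items()):
    print(operation, name, n)
```

A table like this makes the claim checkable: a reader can see which names appear, how often, and whether they were read or written, instead of scanning raw values. The same pattern would apply to localStorage rows with a different extraction function.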

TLD analysis:

  • This is a really interesting idea.
  • Instead of transcribing you could use counter.most_common(10) (or however many you want)
  • To help dask you can do dff.script_netloc.apply(get_end_of_net_loc, meta='O') O is the object type which is what is available in pandas for strings.
  • You are counting by number of calls, what does that tell you? What are potential biases with these numbers? Would a metric like number of scripts change things?
  • You write .net(network) which to me implies that you're saying net is equivalent to network. Is that what you mean?
  • Nice summary of Acar and canvas fingerprinting
  • To answer your question "Is the script contributing to fingerprinting everytime it is called or there are specific instances?" I would say the answer is yes because you've used fairly precise heuristics to generate those lists and are more likely to have missed some candidates than got too many false positives.
  • This is a really interesting take. I'd like to see more comparison, pulling the data together and looking perhaps at relative frequency. How does the distribution compare to the distribution of TLDs for script_urls overall and the distribution of TLDs for locations?
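
A rough sketch of three of the suggestions above: counter.most_common, counting by scripts instead of calls, and relative frequency. The URLs are made up, and get_end_of_net_loc here is a hypothetical stand-in for the notebook's helper (it just takes the last dot-separated label of the host), so treat this as an illustration under those assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def get_end_of_net_loc(netloc):
    """Hypothetical stand-in: return the last label of the host, lowercased."""
    host = netloc.split(":")[0]  # drop a port if present
    return host.rsplit(".", 1)[-1].lower()


def relative_frequency(counter):
    """Convert raw counts into shares of the total."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


# Made-up data: one script URL per recorded call.
script_url_per_call = [
    "https://cdn.example.com/a.js",
    "https://cdn.example.com/a.js",
    "https://cdn.example.com/a.js",
    "https://tracker.example.net/t.js",
    "https://static.example.org/lib.js",
    "https://other.example.org/x.js",
]

# Metric 1: every call counts, so one chatty script can dominate.
tld_by_calls = Counter(
    get_end_of_net_loc(urlparse(url).netloc) for url in script_url_per_call
)

# Metric 2: each distinct script counts once.
tld_by_scripts = Counter(
    get_end_of_net_loc(urlparse(url).netloc) for url in set(script_url_per_call)
)

print(tld_by_calls.most_common(1))
print(tld_by_scripts.most_common(1))
print(relative_frequency(tld_by_calls))
print(relative_frequency(tld_by_scripts))
```

In this toy data the top TLD differs between the two metrics, which is the kind of bias the question about call counts is pointing at. Relative frequencies also make it possible to compare a small fingerprinting list against the much larger overall distribution.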

Overall: really great work. It seems like the #22 analysis didn't pan out. That's okay. If there are specific answers in there, let's clean up the notebook to demonstrate that. All the copying and pasting of numbers around is very prone to errors. I would suggest saving your results from running the fingerprinting analysis and TLD extraction to, for example, JSON. Those JSON files will be small, so it's okay to commit them to the repo. While committing data is generally frowned upon, small derived datasets are okay. Then you can read them in directly to the TLD_analysis notebook. I'd like to see you lay out a question more clearly at the start of the TLD analysis and then, with the work you've already got and perhaps a little more, clearly answer it. You're very close. This is a very nice idea and is well put together.
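
One possible shape for the JSON round trip suggested above, as a sketch only. The counts and the file name tld_counts.json are made up, and a temporary directory stands in for wherever the notebooks would actually keep derived data.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Made-up result of a TLD extraction run.
tld_counts = Counter({"com": 120, "net": 30, "org": 12})

# In the producing notebook: write the small derived dataset to disk.
out_dir = Path(tempfile.mkdtemp())  # stand-in for a directory in the repo
out_path = out_dir / "tld_counts.json"  # hypothetical file name
out_path.write_text(json.dumps(dict(tld_counts), indent=2, sort_keys=True))

# In the consuming notebook: read the numbers back instead of retyping them.
loaded = Counter(json.loads(out_path.read_text()))
print(loaded.most_common(2))
```

Sorting the keys keeps the committed file stable between runs, so diffs only show real changes in the numbers.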

@Soumya0803
Contributor Author

Soumya0803 commented Apr 7, 2019

Thanks @birdsarah.
I will work on all the things you mentioned.
I'll get into the practice of keeping my notebook clean by not adding large amounts of data.
About localStorage: as I mentioned, it is a WIP. The localStorage values are what I planned to understand next and find some meaning in.

"This cookie is used to determine and save whether the chat widget is open for future visits" and follow-on claims. Very interesting! But how did you know all this? It's not evident from the data.

I found this on their website, where the cookies being used were mentioned. I'll try to look further and find evidence for it in the data.

@Soumya0803
Contributor Author

Soumya0803 commented Apr 7, 2019

Instead of transcribing you could use counter.most_common(10)

To help dask you can do dff.script_netloc.apply(get_end_of_net_loc, meta='O') O is the object type which is what is available in pandas for strings.

Thanks for mentioning these, I will make these changes.

To answer your question "Is the script contributing to fingerprinting everytime it is called or there are specific instances?" I would say the answer is yes because you've used fairly precise heuristics to generate those lists and are more likely to have missed some candidates than got too many false positives.

Thank you for answering this. I'll update it in the notebook.

Overall. Really great work.
Thanks a lot.

I will work more on issue #22, as the value column has a lot more information and I'll have to dig deeper. I'll add what more I'm planning to work on at the end of the notebook to indicate how I'm going to accomplish things.

@birdsarah
Contributor

I'll add what more I'm planning to work on at the end of the notebook to indicate how I'm going to accomplish things.

I look forward to that. I'm eager to see your response / thoughts on this question:

You are counting by number of calls, what does that tell you? What are potential biases with these numbers? Would a metric like number of scripts change things?

@Soumya0803 Soumya0803 changed the title from "Analyses issue #22 and TLD" to "Analyses issue #22 and TLD [WIP]" Apr 10, 2019
@aliamcami
Collaborator

Hi @Soumya0803, is this ready for review?

@aliamcami
Collaborator

Closing this PR due to lack of activity, please feel free to reopen.

@aliamcami aliamcami closed this Oct 29, 2019
@Soumya0803
Contributor Author

Hi @aliamcami, I have worked on some of the points mentioned in the review. I look forward to continuing work on this PR.

@birdsarah birdsarah reopened this Oct 30, 2019
@birdsarah
Contributor

Thanks @Soumya0803, I'm sorry for the stagnation. We'll take a look at this and your other PR.


3 participants