Analyses issue #22 and TLD [WIP] #98
@birdsarah I have submitted an initial analysis on #22 and small analyses on the TLDs. I will add more to this. For the TLD folder, tld_analysis is the main notebook, in which the others are linked.
Your issue_22 notebook has a merge conflict and I had to manually edit it to get it to run.
Notes as I go:
Issue 22 notebook:
- Good intro. You could go further and explain more about how you're going to show the "relation between symbol and value".
- Avoid printing long outputs (hundreds or thousands of rows) into the notebook. It makes the notebook unnecessarily large and is a lot to scroll past. Of course you will want to look at these values when exploring, but clean up before
- "The value corresponding to window.document.cookie symbol for HubSpot contains the name of cookies set in a visitor's browser by Hubspot along with some alphanumeric value asaignted to it." Maybe. The 'operation' column will tell you whether it's a get or a set. Why focus on just cookies? What about localStorage?
- "Look at the value columns above and you can find these names in it." It's better to provide a filter and some kind of extraction to help show the points you're trying to make. Just scanning through values isn't rigorous evidence.
- "This cookie is used to determine and save whether the chat widget is open for future visits" and follow-on claims. Very interesting! But how did you know all this? It's not evident from the data.
- There's lots of good exploration here, but I'm not sure where you're going. I see your work in progress statement at the end, but it doesn't indicate to me how you're going to accomplish things.
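A minimal sketch of the kind of filter and extraction meant above, using only the standard library. The row layout (dicts with `symbol`, `operation` and `value` keys) and the sample rows are assumptions made up for illustration, and `extract_cookie_names` and `count_cookie_names` are hypothetical helpers, not functions from the notebook.

```python
from collections import Counter

# Hypothetical sample rows; the real rows would come from the crawl dataset.
rows = [
    {"symbol": "window.document.cookie", "operation": "get",
     "value": "visitor_id=abc123; chat_open=true"},
    {"symbol": "window.document.cookie", "operation": "set",
     "value": "chat_open=false; path=/"},
    {"symbol": "window.localStorage", "operation": "get", "value": "{}"},
]

# Attribute names that can appear in a Set-Cookie style string but are not cookies.
COOKIE_ATTRIBUTES = {"path", "domain", "expires", "max-age",
                     "secure", "httponly", "samesite"}


def extract_cookie_names(value):
    """Return the cookie names found in a 'name=value; name=value' string."""
    names = []
    for part in value.split(";"):
        name = part.split("=", 1)[0].strip()
        if name and name.lower() not in COOKIE_ATTRIBUTES:
            names.append(name)
    return names


def count_cookie_names(rows):
    """Count (operation, cookie name) pairs for document.cookie rows only."""
    counts = Counter()
    for row in rows:
        if row["symbol"] != "window.document.cookie":
            continue  # skip localStorage and every other symbol
        for name in extract_cookie_names(row["value"]):
            counts[(row["operation"], name)] += 1
    return counts


cookie_counts = count_cookie_names(rows)
```

A table of counts like `cookie_counts` can be shown in the notebook in place of the raw value column, and the same pattern could be repeated for localStorage rows.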
TLD analysis:
- This is a really interesting idea.
- Instead of transcribing you could use `counter.most_common(10)` (or however many you want).
- To help dask you can do `dff.script_netloc.apply(get_end_of_net_loc, meta='O')`. `O` is the object type, which is what is available in pandas for strings.
- You are counting by number of calls. What does that tell you? What are potential biases with these numbers? Would a metric like number of scripts change things?
- You write `.net(network)`, which to me implies that you're saying net is equivalent to network. Is that what you mean?
- Nice summary of Acar and canvas fingerprinting.
- To answer your question "Is the script contributing to fingerprinting everytime it is called or there are specific instances?" I would say the answer is yes because you've used fairly precise heuristics to generate those lists and are more likely to have missed some candidates than to have picked up too many false positives.
- This is a really interesting take. I'd like to see more comparison, pulling the data together and looking perhaps at relative frequency. How does this distribution compare to the distribution of TLDs for script_urls overall and the distribution of TLDs for locations?
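To make the calls-versus-scripts question concrete, here is a rough sketch with made-up URLs. `end_of_netloc` is a hypothetical stand-in for the notebook's `get_end_of_net_loc` (whose real implementation may differ), and it naively takes the last dot-separated label of the host, so `example.co.uk` would count as `uk`.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample: one script URL per recorded call.
call_script_urls = [
    "https://cdn.example.com/a.js",
    "https://cdn.example.com/a.js",
    "https://cdn.example.com/a.js",
    "https://tracker.example.net/t.js",
    "https://static.example.org/s.js",
    "https://other.example.org/o.js",
]


def end_of_netloc(url):
    """Return the last dot-separated label of the host (a naive TLD guess)."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def relative_frequency(counter):
    """Convert raw counts into shares that sum to 1."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


# Metric 1: every call counts, so one chatty script can dominate.
calls_per_tld = Counter(end_of_netloc(u) for u in call_script_urls)
# Metric 2: each distinct script counts once, regardless of how often it runs.
scripts_per_tld = Counter(end_of_netloc(u) for u in set(call_script_urls))

top_by_calls = calls_per_tld.most_common(2)
top_by_scripts = scripts_per_tld.most_common(1)
call_shares = relative_frequency(calls_per_tld)
```

In this toy data `com` leads by calls but `org` leads by distinct scripts, which is the kind of bias the call count can hide. Shares from `relative_frequency` make distributions of different sizes comparable.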
Overall: really great work. Seems like the #22 analysis didn't pan out. That's okay. If there are specific answers in there, let's clean up the notebook to demonstrate that. All the copying and pasting of numbers around is very prone to errors. I would suggest saving your results from running the fingerprinting analysis and TLD extraction to, for example, JSON. Those JSON files will be small so it's okay to commit them to the repo. While committing data is generally frowned upon, small derived datasets are okay. Then you can read them in directly to the TLD_analysis notebook. I'd like to see you lay out a question more clearly at the start of the TLD analysis and then, with the work you've already got and perhaps a little more, clearly answer it. You're very close. This is a very nice idea and is well put together.
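Something along these lines would replace the copy and paste step. It is only a sketch: the file name `tld_counts.json`, the helper names and the counts are placeholders, and a temporary directory is used here so the example does not touch the repo.

```python
import json
import tempfile
from pathlib import Path


def save_results(results, path):
    """Write a small derived result (for example TLD counts) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, sort_keys=True)


def load_results(path):
    """Read the derived result back, e.g. from the TLD analysis notebook."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Illustrative numbers only, not results from the actual analysis.
tld_counts = {"com": 3, "org": 2, "net": 1}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tld_counts.json"  # placeholder file name
    save_results(tld_counts, path)
    reloaded = load_results(path)
```

Using `sort_keys=True` keeps the committed file stable between runs, so diffs only show real changes in the numbers.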
Thanks @birdsarah.
I found this on their website, where the cookies being used were mentioned. I'll try to look more and find evidence for it in the data.
Thanks for mentioning these, I will do these changes.
Thank you for answering this I'll update it in the notebook.
I will work more on issue 22, as the value columns have a lot more information and I'll have to dig deeper. I'll add what more I'm planning to work on at the end of the notebook to indicate how I'm going to accomplish things.
I look forward to that. I'm eager to see your response / thoughts on this question:
573e61c to 0821b35
Hi @Soumya0803, is this ready for review?
Closing this PR due to lack of activity, please feel free to reopen.
Hi @aliamcami, I had worked on some of the points mentioned in the review. I look forward to continuing work on this PR.
Thanks @Soumya0803, I'm sorry for the stagnation. We'll take a look at this and your other PR.