This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Analysis on issue #35 #88

Closed
wants to merge 10 commits

Conversation

Aimaanhasan

No description provided.

ah02887 and others added 7 commits March 19, 2019 01:23
Analysis on efficiency and usage of extension arrays in dask

Issue mozilla#36
Added link to the fletcher docs and gave an example for usability. 
Relocated the analyses for readability

Issue mozilla#36
Analysis on JavaScript API symbols storing unique IDs

Issue mozilla#35
@Aimaanhasan
Author

Analysis can be found in analyses\2019_03_Aimaanhasan__issue#35_unique_IDs.ipynb

Issue #35

@birdsarah birdsarah self-assigned this Mar 29, 2019
Contributor

@birdsarah birdsarah left a comment


Great start!


You should run your work on a larger chunk of the dataset to get a result that can make a claim about the dataset.


I appreciate the write-up at the start. I was slightly confused by the headings, so think about how you could tweak them / add more sections to help guide the reader's understanding of what relates to what.


The unique values are found out by checking keywords such as " = , id, ID, iD, Id, =, ;, : " in values_1000 column, because = are separated by " =, ;, :".

... are the API symbols which are estimated to store unique values.

This is an interesting approach and I think there's value in it. You have convinced me that this is a way to find values, but nothing you've shown demonstrates that they are unique.

Every approach has pros and cons, and I completely understand that you do not have time to try all the different approaches and compare. But some discussion of the range of possibilities you considered and the pros and cons of this approach would be appropriate.


window.document.cookie is estimated to store most values

This is the conclusion you have reached for every approach. But https://github.com/mozilla/overscripted/blob/master/data_prep/symbol_counts.csv shows that window.document.cookie is by far the most common symbol in the dataset. Have you just discovered the most common rows? Think about what you're trying to measure.
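A rough sketch of what I mean, in pandas - I'm assuming the df / symbol / value_1000 names from your notebook, and the per-symbol totals could equally come from symbol_counts.csv:

```python
# Normalise "rows whose value looks id-like" by each symbol's overall call count,
# so a symbol isn't flagged just because it is the most common one in the dataset.
symbol_totals = df.groupby('symbol').size()
id_hits = df[df.value_1000.str.contains('id', case=False, na=False)]
id_rate_per_symbol = (id_hits.groupby('symbol').size() / symbol_totals).dropna()
print(id_rate_per_symbol.sort_values(ascending=False).head(10))
```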


Add a title to the plot itself. It helps to keep it up to date and to be clearer about what is referring to what.


I learned something reading your analysis. value_1000.str.contains('ID', case=False, regex=True). I haven't used case=False. Thanks for that!


Don't render massive chunks of data in the notebook. They're a lot to scroll past and don't contribute to understanding as a reader. I totally understand that you would be looking at all that output while analyzing, but remember to clean it up and just output one or two rows when submitting.


So this is off to a great start. I think the work on arguments and cookie values looks promising too. Think about how to make the case about unique values, or be explicit that you've decided to look for things that store a lot of values. Are you interested in scripts storing lots of values per call, or in scripts making lots of calls? Think about the bigger picture here: we're looking for scripts that are storing unique values, so how would you collect / summarize / compare against that question? Have you looked at the operation column, where you will see get / set? Is there a difference between get and set? Are you double counting values because they're being set and then get? These questions are all jumping-off points for you to round out your analysis. Of course you will have many more ideas than you have time for - document that: write up the questions / ideas / plans that you have after working on the data.
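For example, something along these lines (column names assumed from the dataset: operation, script_url, value_1000) would separate stored from retrieved values and count per script rather than per call:

```python
# Split stored (set) from retrieved (get) values so they aren't double counted,
# then count distinct values per script instead of counting rows / calls.
sets = df[df.operation == 'set']
gets = df[df.operation == 'get']
distinct_set_per_script = sets.groupby('script_url').value_1000.nunique()
distinct_get_per_script = gets.groupby('script_url').value_1000.nunique()
print(distinct_set_per_script.sort_values(ascending=False).head())
```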

@birdsarah birdsarah changed the title from "Analysis on issue #35" to "Analysis on issue #35 [WIP]" Mar 30, 2019
@Aimaanhasan
Copy link
Author

Thank you so much @birdsarah. I will definitely work more on it according to your feedback. I hope I will provide you with a valuable analysis.

@Aimaanhasan
Copy link
Author

@birdsarah I have worked to show the involvement of values in script_url. Plus, I have made changes to the write-up according to your feedback. Moreover, I have also worked on dropping the duplicates corresponding to the value column to analyse the data meaningfully. I have also added an analysis based on the relative count for each symbol that is expected to store unique IDs.

@Aimaanhasan Aimaanhasan changed the title from "Analysis on issue #35 [WIP]" to "Analysis on issue #35" Apr 2, 2019
Contributor

@birdsarah birdsarah left a comment


Notes as I go:

  • You added to the intro sections - they're very nice - good job
  • Nice digging up of the "Detecting and Defending" paper. I read it a long time ago but have not had a chance to re-read it. Can you point me to a specific section which discusses the methodological ideas which you drew from?
  • You're still only using a tiny tiny fraction of the dataset - you need to run your analysis with the full 10% sample dataset, probably using dask, to be able to draw a meaningful conclusion (a rough sketch of what that might look like is below). If that doesn't make sense, let me know. I will say that to make this happen you're going to need to re-write some things so they run efficiently.
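For illustration only - the path and column list here are placeholders, not the real location of the sample - reading with dask and only computing at the end could look like:

```python
import dask.dataframe as dd

# Load only the columns the analysis needs; dask keeps this lazy until .compute()
df = dd.read_parquet(
    'path/to/10_percent_sample.parquet',  # placeholder path
    columns=['script_url', 'symbol', 'operation', 'value_1000'],
)
id_counts = (
    df[df.value_1000.str.contains('id', case=False, na=False)]
    .groupby('symbol')
    .size()
    .compute()  # the only step that actually runs the computation
)
```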

Analysis 1:

  • The plot "plot showing symbols found which have returned values containing keyword 'id' relative to the total count of each subsequent symbol", if I'm understanding your code, is, for a given symbol, the count of rows that had id in value_1000 if they were unique values / the count of rows that had unique values. Is that right? Regardless, you're still counting rows / calls. It seems to me more relevant to do this on a per-script basis - although I'm happy to be convinced otherwise.
  • Finding keyword of "id" in values shows that "RTCPeerConnection.localDescription" is relatively expected to store unique values - are you sure it's storing (set), not retrieving (get)? I mentioned this in my previous review - you should look at the operation column to understand this.
  • For 4.2 I don't think your analysis backs up your conclusion that HTMLCanvasElement.style, RTCPeerConnection.localDescription, window.document.cookie, window.localStorage, window.navigator.userAgent and window.sessionStorage tend to store most of the unique values. Firstly, you haven't isolated stored values. Secondly, you have used 'drop_duplicates', but drop_duplicates keeps one of every duplicated value, so you still have at least one of every non-unique value (see the sketch after this list).
  • For 4.3 my previous comments apply.
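To make the "unique" claim concrete, something like the following (pandas, column names assumed from your notebook) keeps only rows where the value is being set and only values that occur exactly once:

```python
# Isolate stored values, then keep only those that occur exactly once in the sample -
# unlike drop_duplicates, this discards every value that appears more than once.
stored = df[df.operation == 'set']
value_freq = stored.value_1000.value_counts()
truly_unique = value_freq[value_freq == 1].index
unique_per_symbol = (
    stored[stored.value_1000.isin(truly_unique)]
    .groupby('symbol')
    .size()
    .sort_values(ascending=False)
)
```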

Analysis 2:

  • Why did you change approach from contains? You didn't explain this and it seemed like it was working to me.
  • You will have less of your own code to write, and probably more robust results, if you use the columns that have already been derived in the dataset for you: argument_x (x = 0-8). A sketch of what that could look like is after this list.
  • I'm REALLY interested by what you've started in the very last piece where you're looking at items from values and whether they appear in script_url.
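Using the derived columns could be as simple as reshaping them into one long column (a sketch, assuming the argument_0 ... argument_8 columns are present in your dataframe):

```python
# Reshape the pre-derived argument_0..argument_8 columns into one long 'argument'
# column, instead of re-parsing the raw arguments string by hand.
arg_cols = [f'argument_{i}' for i in range(9)]
args_long = (
    df.melt(id_vars=['script_url', 'symbol'], value_vars=arg_cols, value_name='argument')
      .dropna(subset=['argument'])
)
print(args_long.head(2))
```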

Code tips:

The way you're making df2 is inefficient and won't scale well. An improvement is this:

```python
import re  # needed for re.escape

# Escape entries so they are matched literally, then build one alternation pattern
cleaned_entries_escaped = [re.escape(x) for x in cleaned_entries]
cleaned_entries_regex = '|'.join(cleaned_entries_escaped)
# na=False so rows with missing arguments don't break the boolean mask
df3 = df[df.arguments.str.contains(cleaned_entries_regex, regex=True, na=False)]
```

Although this will only scale as well as you're able to store all the cleaned entries in memory, which may become infeasible.

  • cleaned_entries = [j.split('=')[1] for e in name_value if len(e) > 0 for j in e if (len(j.split('='))>1 and (len(j.split('=')[1]) > 2))] - that's a mouthful! Think about how to re-write your code so it's easy for others to see what you did; one possible rewrite is sketched below.
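One possible rewrite (my reading of what the comprehension is doing - check it against your intent):

```python
# Keep the part after '=' for each item, but only when there is something after '='
# and it is more than two characters long.
cleaned_entries = []
for entry_list in name_value:
    for item in entry_list:  # an empty entry_list simply contributes nothing
        parts = item.split('=')
        if len(parts) > 1 and len(parts[1]) > 2:
            cleaned_entries.append(parts[1])
```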

Overall: this is really coming along great. I'd like to think through what the concrete set of to-dos is to get this to a place where it's a mergeable analysis. I think there's a really great and concrete contribution in your examination of values that were stored / retrieved in cookies and also appear in URLs. I think I'm not properly understanding Analysis 1: I understand what you're looking for and how you've scoped it, but I don't understand what you're presenting in terms of results and what we're learning from it. I'm interested in @mlopatka's take on getting to a mergeable analysis.

@aliamcami
Collaborator

Closing this PR due to lack of activity, please feel free to reopen.

@aliamcami aliamcami closed this Oct 24, 2019