This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Analysis on issue #35 #88

Closed
wants to merge 10 commits

Conversation

Aimaanhasan

No description provided.

ah02887 and others added 7 commits March 19, 2019 01:23
Analysis on efficiency and usage of extension arrays in dask

Issue mozilla#36
Added link to the fletcher docs and gave an example for usability. 
Relocated the analyses for readability

Issue mozilla#36
Analysis on JavaScript API symbols storing unique IDs

Issue mozilla#35
@Aimaanhasan
Author

Analysis can be found in analyses\2019_03_Aimaanhasan__issue#35_unique_IDs.ipynb

Issue #35

@birdsarah birdsarah self-assigned this Mar 29, 2019
Contributor

@birdsarah birdsarah left a comment


Great start!


You should run your work on a larger chunk of the dataset to get a result that can make a claim about the dataset.


I appreciate the write-up at the start. I was slightly confused by the headings, so think about how you could tweak them / add more sections to help guide the reader's understanding of what relates to what.


The unique values are found out by checking keywords such as " = , id, ID, iD, Id, =, ;, : " in values_1000 column, because = are separated by " =, ;, :".

... are the API symbols which are estimated to store unique values.

This is an interesting approach and I think there's value in it. You have convinced me that this is a way to find values, but nothing you've shown demonstrates that they are unique.

Every approach has pros and cons, and I completely understand that you do not have time to try all the different approaches and compare. But some discussion of the range of possibilities you considered and the pros and cons of this approach would be appropriate.


window.document.cookie is estimated to store most values

This is the conclusion you have reached for every approach. But https://github.com/mozilla/overscripted/blob/master/data_prep/symbol_counts.csv shows that window.document.cookie is by far the most common symbol in the dataset. Have you just discovered the most common rows? Think about what you're trying to measure.
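A rough sketch of what I mean, in pandas - I'm assuming the df / symbol / value_1000 names from your notebook, and the per-symbol totals could equally come from symbol_counts.csv:

```python
# Normalise "rows whose value looks id-like" by each symbol's overall call count,
# so a symbol isn't flagged just because it is the most common one in the dataset.
symbol_totals = df.groupby('symbol').size()
id_hits = df[df.value_1000.str.contains('id', case=False, na=False)]
id_rate_per_symbol = (id_hits.groupby('symbol').size() / symbol_totals).dropna()
print(id_rate_per_symbol.sort_values(ascending=False).head(10))
```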


Add a title to the plot itself. It helps to keep it up to date and to be clearer about what is referring to what.


I learned something reading your analysis. value_1000.str.contains('ID', case=False, regex=True). I haven't used case=False. Thanks for that!


Don't render massive chunks of data in the notebook. They're a lot to scroll past and don't contribute to understanding as a reader. I totally understand that you would be looking at all that output while analyzing, but remember to clean it up and just output one or two rows when submitting.


So this is off to a great start. I think the work on arguments and cookie values looks promising too. Think about how to make the case about unique values, or be explicit that you've decided to look for things that store a lot of values. Are you interested in scripts storing lots of values per call, or in scripts making lots of calls? Think about the bigger picture here: we're looking for scripts that are storing unique values, so how would you collect / summarize / compare against that question? Have you looked at the operation column, where you will see get / set? Is there a difference between get and set? Are you double counting values because they're being set and then get? These questions are all jumping-off points for you to round out your analysis. Of course you will have many more ideas than you have time for - document that: write up the questions / ideas / plans that you have after working on the data.
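For example, something along these lines (column names assumed from the dataset: operation, script_url, value_1000) would separate stored from retrieved values and count per script rather than per call:

```python
# Split stored (set) from retrieved (get) values so they aren't double counted,
# then count distinct values per script instead of counting rows / calls.
sets = df[df.operation == 'set']
gets = df[df.operation == 'get']
distinct_set_per_script = sets.groupby('script_url').value_1000.nunique()
distinct_get_per_script = gets.groupby('script_url').value_1000.nunique()
print(distinct_set_per_script.sort_values(ascending=False).head())
```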

@birdsarah birdsarah changed the title from "Analysis on issue #35" to "Analysis on issue #35 [WIP]" Mar 30, 2019
@Aimaanhasan
Copy link
Author

Thank you so much @birdsarah. I will definitely work more on it according to your feedback. I hope I will provide you with a valuable analysis.

@Aimaanhasan
Copy link
Author

@birdsarah I have worked to show the involvement of values in script_url. Plus, I have made changes to the write-up according to your feedback. Moreover, I have also worked on dropping the duplicates corresponding to the value column to analyse the data meaningfully. I have also added an analysis based on the relative count for each symbol that is expected to store unique IDs.

@Aimaanhasan Aimaanhasan changed the title from "Analysis on issue #35 [WIP]" to "Analysis on issue #35" Apr 2, 2019
Contributor

@birdsarah birdsarah left a comment


Notes as I go:

  • You added to the intro sections - they're very nice - good job
  • Nice digging up of the "Detecting and Defending" paper. I read it a long time ago but have not had a chance to re-read it. Can you point me to a specific section which discusses the methodological ideas which you drew from?
  • You're still only using a tiny tiny fraction of the dataset - you need to run your analysis with the full 10% sample dataset, probably using dask, to be able to draw a meaningful conclusion (a rough sketch of what that might look like is below). If that doesn't make sense, let me know. I will say that to make this happen you're going to need to re-write some things so they run efficiently.
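For illustration only - the path and column list here are placeholders, not the real location of the sample - reading with dask and only computing at the end could look like:

```python
import dask.dataframe as dd

# Load only the columns the analysis needs; dask keeps this lazy until .compute()
df = dd.read_parquet(
    'path/to/10_percent_sample.parquet',  # placeholder path
    columns=['script_url', 'symbol', 'operation', 'value_1000'],
)
id_counts = (
    df[df.value_1000.str.contains('id', case=False, na=False)]
    .groupby('symbol')
    .size()
    .compute()  # the only step that actually runs the computation
)
```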

Analysis 1:

  • The plot "plot showing symbols found which have returned values containing keyword 'id' relative to the total count of each subsequent symbol", if I'm understanding your code, is, for a given symbol, the count of rows that had id in value_1000 if they were unique values / the count of rows that had unique values. Is that right? Regardless, you're still counting rows / calls. It seems to me more relevant to do this on a per-script basis - although I'm happy to be convinced otherwise.
  • Finding keyword of "id" in values shows that "RTCPeerConnection.localDescription" is relatively expected to store unique values - are you sure it's storing (set), not retrieving (get)? I mentioned this in my previous review - you should look at the operation column to understand this.
  • For 4.2 I don't think your analysis backs up your conclusion that HTMLCanvasElement.style, RTCPeerConnection.localDescription, window.document.cookie, window.localStorage, window.navigator.userAgent and window.sessionStorage tend to store most of the unique values. Firstly, you haven't isolated stored values. Secondly, you have used 'drop_duplicates', but drop_duplicates keeps one of every duplicated value, so you still have at least one of every non-unique value (see the sketch after this list).
  • For 4.3 my previous comments apply.
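To make the "unique" claim concrete, something like the following (pandas, column names assumed from your notebook) keeps only rows where the value is being set and only values that occur exactly once:

```python
# Isolate stored values, then keep only those that occur exactly once in the sample -
# unlike drop_duplicates, this discards every value that appears more than once.
stored = df[df.operation == 'set']
value_freq = stored.value_1000.value_counts()
truly_unique = value_freq[value_freq == 1].index
unique_per_symbol = (
    stored[stored.value_1000.isin(truly_unique)]
    .groupby('symbol')
    .size()
    .sort_values(ascending=False)
)
```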

Analysis 2:

  • Why did you change approach from contains? You didn't explain this and it seemed like it was working to me.
  • You will have less of your own code to write, and probably more robust results, if you use the columns that have already been derived in the dataset for you: argument_x (x = 0-8). A sketch of what that could look like is after this list.
  • I'm REALLY interested by what you've started in the very last piece where you're looking at items from values and whether they appear in script_url.
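Using the derived columns could be as simple as reshaping them into one long column (a sketch, assuming the argument_0 ... argument_8 columns are present in your dataframe):

```python
# Reshape the pre-derived argument_0..argument_8 columns into one long 'argument'
# column, instead of re-parsing the raw arguments string by hand.
arg_cols = [f'argument_{i}' for i in range(9)]
args_long = (
    df.melt(id_vars=['script_url', 'symbol'], value_vars=arg_cols, value_name='argument')
      .dropna(subset=['argument'])
)
print(args_long.head(2))
```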

Code tips:

The way you're making df2 is inefficient and won't scale well. An improvement is this:

```python
import re  # needed for re.escape

# Escape entries so they are matched literally, then build one alternation pattern
cleaned_entries_escaped = [re.escape(x) for x in cleaned_entries]
cleaned_entries_regex = '|'.join(cleaned_entries_escaped)
# na=False so rows with missing arguments don't break the boolean mask
df3 = df[df.arguments.str.contains(cleaned_entries_regex, regex=True, na=False)]
```

Although this will only scale as well as you're able to store all the cleaned entries in memory, which may become infeasible.

  • cleaned_entries = [j.split('=')[1] for e in name_value if len(e) > 0 for j in e if (len(j.split('='))>1 and (len(j.split('=')[1]) > 2))] - that's a mouthful! Think about how to re-write your code so it's easy for others to see what you did; one possible rewrite is sketched below.
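One possible rewrite (my reading of what the comprehension is doing - check it against your intent):

```python
# Keep the part after '=' for each item, but only when there is something after '='
# and it is more than two characters long.
cleaned_entries = []
for entry_list in name_value:
    for item in entry_list:  # an empty entry_list simply contributes nothing
        parts = item.split('=')
        if len(parts) > 1 and len(parts[1]) > 2:
            cleaned_entries.append(parts[1])
```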

Overall: this is really coming along great. I'd like to think through what the concrete set of to-dos is to get this to a place where it's a mergeable analysis. I think there's a really great and concrete contribution in your examination of values that were stored / retrieved in cookies and also appear in URLs. I think I'm not properly understanding Analysis 1: I understand what you're looking for and how you've scoped it, but I don't understand what you're presenting in terms of results and what we're learning from it. I'm interested in @mlopatka's take on getting to a mergeable analysis.

@aliamcami
Collaborator

Closing this PR due to lack of activity, please feel free to reopen.

@aliamcami aliamcami closed this Oct 24, 2019