ores_bias_todo.org

Make “punch-list”

follow up on ideas from Clark Bernier

Nate, since I’m standing at your desk today, I figured it’s only fair that I finally get back to you.

The first thing that comes to mind is social identity theory, which focuses on how comparisons with out-groups are central to defining and valuing group identities. Tajfel and Turner (1986) is the canonical piece; Hogg and Terry (2000) extend it into organizational contexts. Brubaker and Cooper (2000) place social identity theory in the context of other ways people talk about identity, and their piece is a helpful companion to any dive into identity theories. This would require specifying, though, why intensified monitoring and enforcement leads to increased attention to social identities at all. Perhaps social identity is a tool supporting the monitoring and enforcement regimes you have in mind?

Another potential mechanism is that intensifying monitoring and enforcement leads to a greater need to coordinate decisions with others. For instance, enforcement actions that carry heavier sanctions may make it more important to any given enforcer that their peers approve of their enforcement decisions. A moderator may not worry so much about their peers’ perceptions when enforcement is limited to rolling back changes, but much more so when banning someone. Correll et al. (2017) argue that the need to coordinate decisions will lead people to rely on conventional indicators of quality, as these give the best guess at what their peers are likely to believe. Whereas they focus on status symbols becoming more important, I could see an argument for in-group/out-group distinctions becoming more relevant under such conditions. That is, I have stronger enforcement options, so I worry more about what my peers think, so I am more likely to incorporate obvious signals, like group membership, into my decisions because I believe my peers will, too.

Finally, maybe group membership becomes more important because increased monitoring/enforcement gives the impression that group members have already been vetted by trusted others? Group membership under more stringent regimes might act as a specific status characteristic that signals quality to monitors. Under this set of mechanisms, it’s not that enforcers become biased against out-groups but that they become biased towards in-group members. I’m not sure how you would distinguish this empirically. If you can get a copy, Cecilia Ridgeway’s new book Status is the best statement of the status theory underlying the argument. Otherwise, the canonical statement of expectation states theory, and of the role played by diffuse and specific status characteristics, is Berger et al.’s (1980) review piece.

Hope that helps! Happy to chat more if it’d be helpful.

Cheers! -Clark

Berger, Joseph, Susan J. Rosenholtz, and Morris Zelditch. 1980. “Status Organizing Processes.” Annual Review of Sociology 6: 479–508.
Brubaker, Rogers, and Frederick Cooper. 2000. “Beyond ‘Identity.’” Theory and Society 29 (1): 1–47.
Correll, Shelley J., Cecilia L. Ridgeway, Ezra W. Zuckerman, Sharon Jank, Sara Jordan-Bloch, and Sandra Nakagawa. 2017. “It’s the Conventional Thought That Counts: How Third-Order Inference Produces Status Advantage.” American Sociological Review, 0003122417691503.
Hogg, Michael A., and Deborah J. Terry. 2000. “Social Identity and Self-Categorization Processes in Organizational Contexts.” The Academy of Management Review 25 (1): 121–40. https://doi.org/10.2307/259266.
Tajfel, Henri, and John C. Turner. 1986. “The Social Identity Theory of Intergroup Behavior.” In Psychology of Intergroup Relations, edited by Stephen Worchel and William G. Austin. Chicago: Nelson-Hall.

look at Morey’s paper on the fallacy of confidence intervals

Report the sample size by wiki for each of our models.

look again for citations suggesting “nudging” may lead to errors

is there an alternative to “identity-based” as a category for signals like registration status?

If we argue that “identity” matters because of in-group membership (i.e., being a Wikipedian), then this may not be so important.

improve high level takeaways for designers / builders

improve citations to types of regulation / governance systems

use social identity theory to argue that measures like anonymity and having a user page constitute “identity based signals”

reorg background to introduce concepts in as intuitive and compact a manner as possible

report summary statistics and the dates RCfilters was introduced for each wiki.

make perma.cc links

do placebo tests

consider other kinds of hypothesis tests based on the adoption check.

e.g. if only “very likely bad” is significantly adopted, then we can just test our hypotheses by comparing coefficients at that threshold.
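
A minimal sketch of that comparison, assuming we end up with a point estimate and standard error for the coefficient of interest from two independently fit models (the beta/se names below are placeholders, not anything in our codebase): a two-sided z-test on the difference.

  # Hypothetical helper: z-test for the difference between two coefficients
  # estimated on independent samples (e.g. anon vs. newcomer models at the
  # "very likely bad" threshold). Inputs are point estimates and SEs.
  from scipy import stats

  def compare_coefficients(beta_a, se_a, beta_b, se_b):
      diff = beta_a - beta_b
      se_diff = (se_a ** 2 + se_b ** 2) ** 0.5   # assumes independent estimates
      z = diff / se_diff
      p = 2 * (1 - stats.norm.cdf(abs(z)))
      return diff, z, p

  # usage: compare_coefficients(0.08, 0.02, 0.03, 0.02)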

Find another solution to the problem of low N in H2 and H3

  • is there another measure we can use for controversial sanctioning?
  • is there another measure we can use for has user page?
  • maybe raising the 20k strata limit will help.
  • also double check for bugs.

rewrite abstract with better transition and results

use social identity theory to argue for taste-based discrimination on the basis of group membership as a plausible intuition for null findings

proofread bibliography

draft limitations section

Knit remaining pieces of data in data section (date ranges, sample sizes)

Write discussion and results sections

Why do “likely bad” flags have a negative effect?

Fix rounding of x axis in plots

Increase the sample size so we have more non-reverted edits around the very likely damaging threshold.

Re-run archaeologist until missingness rate is low, alternatively, debug the missing data issue.

Fit models with shorter bandwidth

Run analysis on more targeted measures of reversion.

WONT DO Use the same date range for all wikis and exclude those without data for the entire range?

I don’t think this matters very much. We’re just trying to get the broadest sample possible.

Ask Halfak if we can log the scores DB automatically since this lets us stratify by score which is more convenient.

Consider measuring warnings as sanctions.

cite Grimmelmann virtues of moderation for definition of moderation

cite nora’s chi paper on anonymity

cite haiyi’s work on algorithms.

cite any other CSCW about algorithms.

figure out my subjective / normative take on this. Is this a good thing or a bad thing?

Why would we show algorithmic flags and identity-based signals in the same interface?

think of a better term than “conservative” or “liberal” to describe strictness of moderation.

build more intuition that moderation actions can be in error / controversial and why this is bad.

cite all the papers about the importance of studying Wikipedia in many languages. Then we can cite the reading time paper, maybe.

build more of an argument that moderation is fast-paced and stressful to help with the above; it may be as easy as citing Sarah Roberts and Seering more.

find someone to cite for salient signals in cscw

Emphasize visibility and monitoring as a useful concept for thinking about governance. Visibility and salient signals are two different mechanisms that our two hypotheses try to tease apart.

Define flagging.

Make it clear what our results demonstrate directly and indirectly.

Email Bo Cowgill and ask for updates. I drafted an email in outlook. Send on Friday.

Create table of strata sample sizes and weights for the appendix.

Get scores from https://quarry.wmflabs.org/query/40712 if the missing data is bad.
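A minimal sketch of the backfill step, assuming the Quarry results get exported by hand and that both tables share a rev_id column and a score column (all names here are guesses at the schema); tiny inline frames stand in for the real files.

  import pandas as pd

  # stand-ins for the existing sample and the exported Quarry results
  edits = pd.DataFrame({"rev_id": [1, 2, 3], "score": [0.12, None, None]})
  quarry_scores = pd.DataFrame({"rev_id": [2, 3], "score": [0.55, 0.91]})

  # fill missing scores from the Quarry export, keeping existing scores
  filled = edits.merge(quarry_scores, on="rev_id", how="left", suffixes=("", "_quarry"))
  filled["score"] = filled["score"].fillna(filled["score_quarry"])
  filled = filled.drop(columns="score_quarry")
  print(filled)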

add controversial revert variable to dataset.

update data set with scores from quarry and reverted-reverts.

compare re-scored missingness to old missingness

run revscoring with feature injection (don’t do it for now)

robustness check in the local linear regression where we include wikis with the live site issue.

simplify wiki_weeks generation

Check revscoring results and thresholds

Add scores to sample with the revert in 24 var

Try a less restricted time series model: see whether a long-run spline plus a short-run spline (or a lagged DV) leaves residuals that pass the Breusch-Godfrey test for serial correlation. (Do this after I’m done with other things while the Stan models run.)
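
A rough sketch of that check with statsmodels, assuming the wiki-week panel has a p_reverted column; a simple linear trend plus a lagged DV stands in for the long-run/short-run splines, and the data here are synthetic placeholders.

  import numpy as np
  import pandas as pd
  import statsmodels.formula.api as smf
  from statsmodels.stats.diagnostic import acorr_breusch_godfrey

  # synthetic stand-in for the wiki-week panel
  rng = np.random.default_rng(0)
  weeks = np.arange(104)
  wiki_weeks = pd.DataFrame({
      "week": weeks,
      "p_reverted": 0.1 + 0.02 * np.sin(weeks / 8) + rng.normal(0, 0.01, weeks.size),
  })
  wiki_weeks["lag_p_reverted"] = wiki_weeks["p_reverted"].shift(1)

  # OLS version of the less restricted spec: trend + lagged DV
  fit = smf.ols("p_reverted ~ week + lag_p_reverted", data=wiki_weeks.dropna()).fit()

  # Breusch-Godfrey test for remaining serial correlation in the residuals
  lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(fit, nlags=4)
  print(f"BG LM p-value: {lm_pval:.3f}")  # small p => serial correlation remains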

Control for week of year with fixed effects for month instead of fixed effects for week.

[#B] Score huge sample

model selection for panel models of different spline degrees of freedom using LOO
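
A hedged sketch of that comparison with ArviZ, which computes PSIS-LOO from the pointwise log-likelihood; the two built-in example posteriors below are placeholders for our panel fits with different spline degrees of freedom.

  import arviz as az

  # placeholders: swap in the InferenceData objects from the real Stan fits
  fits = {
      "spline_df_3": az.load_arviz_data("centered_eight"),
      "spline_df_5": az.load_arviz_data("non_centered_eight"),
  }

  # rank candidate models by expected log predictive density (LOO is the default)
  comparison = az.compare(fits)
  print(comparison)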

rerun panel models (also using p_reverted)

[#A] Collect thresholds for each deployment

Use latest model for scoring when we are pre-cutoff

convert to cscw template

fix missing data in revscoring (deleted revisions, zhwiki, this is fucking up the weights!)

[#A] knit bias analysis

run bias analysis on static model version.

This actually isn’t that important and we probably don’t have to do it unless reviewers ask. It’s probably enough to keep it up to date with the new wikis. Also, it’s a bit of a hassle.

backup joal’s wikidata snapshot (at least for the records that I use).

[#B] plot model proto wiki

[#B] create pooled bias analysis

integrate bias analysis with main repo

label rdd reverts only if they are damaging

Run analysis using a makefile

create dependent variable p_reverted (proportion of anon/newcomer edits that are reverted)
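
A minimal sketch of that aggregation, assuming an edit-level frame with wiki, week, and reverted columns (the column names and toy rows below are guesses at the schema, not the real data).

  import pandas as pd

  edits = pd.DataFrame({
      "wiki": ["enwiki", "enwiki", "dewiki", "dewiki"],
      "week": ["2018-01", "2018-01", "2018-01", "2018-01"],
      "reverted": [True, False, True, True],
  })

  # collapse to wiki-weeks and compute the proportion reverted
  wiki_weeks = (
      edits.groupby(["wiki", "week"], as_index=False)
           .agg(n_edits=("reverted", "size"), n_reverted=("reverted", "sum"))
  )
  wiki_weeks["p_reverted"] = wiki_weeks["n_reverted"] / wiki_weeks["n_edits"]
  print(wiki_weeks)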

[#B] Inspect the messagewalls-style models

[#B] RDD data points using data from multiple wikis (get the N big enough to convince mako :) ?)

Prep for CGSA meeting (reply to Salt’s email)

[#A] run revscoring on new sample.

[#A] regenerate wikiweeks

make a new outline

make it so I never have to run revscoring again

regenerate the commit cutoff db to include euwiki

[#A] Model anons and newcomers separately.

Drop wikis without enough observations.

[#A] Submit to CSCW

[#A] Model with estimates for average wiki

This is somewhat fraught. It seems like between-wiki heterogeneity makes it difficult to estimate a pooling effect. So let’s hold off on that and either present an average-edit model or separate models for each wiki. But which?

What’s the right way to do this? Have equal sized samples from each wiki and don’t weight.

[#B] assign thresholds to edits! (there seems to be a bug in getting defaults)

Score pre-treatment edits using latest model versions (instead of earliest model versions)

WONT DO [#C] Model pooling estimates across thresholds

[#A] Robustness Check: Run on a sample of much earlier edits and different cutoffs.

[#C] RDD: Plot density conditional on outcomes to test for control over assignment.
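
A rough sketch of what that plot could look like, with synthetic scores standing in for ORES damaging scores and a made-up cutoff value; the idea is to look for bunching just below or above the threshold within each outcome group. A formal McCrary-style density test would be the follow-up if the histograms look suspicious.

  import numpy as np
  import matplotlib.pyplot as plt

  # synthetic stand-ins for ORES damaging scores and revert outcomes
  rng = np.random.default_rng(0)
  cutoff = 0.9   # hypothetical "very likely damaging" threshold
  scores = rng.uniform(0, 1, 5000)
  reverted = rng.random(5000) < (0.2 + 0.5 * (scores > cutoff))

  # density of the running variable near the cutoff, split by outcome
  fig, ax = plt.subplots()
  bins = np.linspace(cutoff - 0.1, cutoff + 0.1, 41)
  for outcome, label in [(True, "reverted"), (False, "not reverted")]:
      ax.hist(scores[reverted == outcome], bins=bins, histtype="step",
              density=True, label=label)
  ax.axvline(cutoff, linestyle="--")
  ax.set_xlabel("ORES damaging score")
  ax.set_ylabel("density")
  ax.legend()
  plt.show()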

Compare models using LOO or LRT

[#A] Investigate spikes in wiki-weeks data.

I didn’t find a good explanation, but I noticed that I wasn’t removing bots. Also we should model p.reverted instead of n.reverted. I’ll try again later.

[#C] Try fitting models using MLE

We don’t need to do this since we’ll want to compare estimates and so have a need for bayes.

[#A] Fit time series models with splines for time and loo-based model selection.

[#A] Visualize reversion rates in buckets.

[#A] Debug newcomer panel data model.

We probably should be fitting binomial models predicting the proportion reverted instead. It fits OK when we don’t do QR decomposition.
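
A sketch of the binomial alternative using statsmodels (outside Stan, just to check the shape of the model): pass the (reverted, not-reverted) counts as a two-column response. The counts below are synthetic placeholders for the real wiki-week data.

  import numpy as np
  import statsmodels.api as sm

  # synthetic wiki-week counts standing in for the real panel
  rng = np.random.default_rng(1)
  weeks = np.arange(52)
  n_edits = rng.integers(200, 400, size=52)
  n_reverted = rng.binomial(n_edits, 0.15)

  # binomial GLM on the proportion reverted: endog = (successes, failures)
  endog = np.column_stack([n_reverted, n_edits - n_reverted])
  exog = sm.add_constant(weeks)
  fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
  print(fit.summary())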

make time-series plots with data and model predicted values.

[#A] (fit and interpret) time series models for new hypotheses

BLOCKED [#B] Do separate RDD analyses for each wiki

Robustness checks with varying neighborhood sizes.

Fit RDD models on newcomer and anonymous editors.

[#A] Run RDDs

Make pretty discontinuity plots for every wiki.

[#C] Model with time to revert as outcome

figure out best way to model multiple cutoffs (with missing data)

Maybe it’s one cutoff per model but we exclude data on the other sides of the other cutoffs. Or we don’t. Mako might be helpful with that.
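
A minimal sketch of the “one cutoff per model” option, with made-up cutoff values: for each cutoff, keep only edits whose score falls between the adjacent cutoffs, so a single discontinuity sits in the estimation window.

  import numpy as np
  import pandas as pd

  cutoffs = [0.15, 0.63, 0.92]   # hypothetical maybe/likely/very-likely thresholds
  edits = pd.DataFrame({"score": np.random.default_rng(2).uniform(0, 1, 1000)})

  def estimation_window(cutoff, cutoffs, scores, lo=0.0, hi=1.0):
      """Keep only scores between the cutoffs adjacent to `cutoff`."""
      below = [c for c in cutoffs if c < cutoff]
      above = [c for c in cutoffs if c > cutoff]
      lower = max(below) if below else lo
      upper = min(above) if above else hi
      return scores[(scores > lower) & (scores < upper)]

  sample = estimation_window(0.63, cutoffs, edits["score"])
  print(len(sample), "edits in the 0.63 window")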

(preliminary) threshold analysis.

Fit model from litschig_impact_2013

Fit per-wiki models.

Why did we lose user ids?

[#A] Fit kink model to check that funny cutoffs aren’t due to misspecification.

Make it so I never have to score edits again.

move to git-annex from git-lfs

rerun archaeologist on new sample

make plots for threshold analysis.

fix error handling in archaeologist

Make new sample

fix remaining bug in archaeologist.

rebase from github to code.communitydata

get default cutoffs

Send halfak sample of edits around the threshold.

see if they will install git-annex on the wmf machines.

Git-annex isn’t installed on wmf machines. So I need to ask about it.

Future project

  • Two different extreme assumptions could be: 1. the same damage gets reverted, but it takes more work; 2. stuff doesn’t get reverted at all, and the cost of debiasing is more damage getting through.

make code for making threshold ME plots