Buyer beware: this session is NOT a tool playground.
This afternoon... we think!
Columbia foam strike
- Size: about 1,200 cubic inches
- Speed: 477 mph (relative to vehicle)
Certified as “safe” by the CRATER micro-meteorite model.
- A typical experiment in CRATER’s DB:
- Size: about 3 cubic inches
- Speed: under 100 mph
Columbia, and its crew, die on re-entry.
- Lesson: conclusions should come with a “certification envelope”
- If new inputs fall outside the envelope of the training set, then raise an alert
- Knowledge is not universal, lying around like diamonds
- WRONG: Mind is a passive system that gathers its contents from its environment and, through the act of knowing, produces a copy of the order of reality
- Rather, in the act of knowing, it is the human mind that actively gives meaning and order to the reality to which it is responding. We see, we react, then we know
e.g. Explanation
- When explaining something, one size does not fit all. Explanations need to be tailored to the knowledge and goals of the audience.
e.g. Kelly's personal construct theory:
- humans reason about the differences between their constructs rather than the constructs themselves
- Kelly, George (1991) [first published 1955]. The Psychology of Personal Constructs. London; New York: Routledge in association with the Centre for Personal Construct Psychology. ISBN 0415037999. OCLC 21760190.
Roger Schank: people don't think, they remember
- Reasoning = reflecting over one's own library of exemplars of past events
- Learning = finding good exemplars
Note that most data sets contain such exemplars:
- A "well-supported" model is built from N examples, using multiple examples fron N to support different parts of the model
- That is, the N examples contain many repeated effects (and we try to avoid modeling the rare, one-off anomalous effects)
- So any data set that supports a model can be condensed by removing the repeats.
E.g. Reverse nearest neighbors
- Everyone point to their nearest neighbor
- Count how many times someone is pointing to you
- That is your reverse nearest neighbor score
- Sort rows by their rnn score; print only the highest scoring ones
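Here is a minimal sketch (gawk, not the lecture's own scripts) of that scoring, run over a tiny hard-coded table of already-binned rows; the abs() helper and the Manhattan distance are choices made just for this sketch:
function abs(z) { return z < 0 ? -z : z }
BEGIN {
  n = 4; cols = 3
  split("1 1 1 1 1 2 3 3 3 1 1 1", flat, " ")      # four rows, three columns
  for(i=1; i<=n; i++)
    for(c=1; c<=cols; c++)
      x[i,c] = flat[(i-1)*cols + c]
  for(i=1; i<=n; i++) {                            # each row points at its nearest neighbor
    best = 10^32
    for(j=1; j<=n; j++)
      if (i != j) {
        d = 0
        for(c=1; c<=cols; c++) d += abs(x[i,c] - x[j,c])
        if (d < best) { best = d; near[i] = j } } }
  for(i=1; i<=n; i++) rnn[near[i]]++               # count how often each row is pointed at
  for(i=1; i<=n; i++) print "row " i " rnn " rnn[i]+0   # row 1 scores highest: two rows point at it
}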
e.g. 768 examples in diabetes.arff. If we just show the ones with rnn >= 2 then:
#n, /class, rnn, :pedi, :mass, :insu, :skin, :preg, :pres, :age, :plas
#====, =====, =====, =====, =====, =====, =====, =====, =====, =====
# -, -, 100, 98, 95, 89, 84, 72, 46, 0
24, tested_negative, 2, 1, 1, 1, 1, 1, 1, 1, 1 <== 24 repeats
1, tested_negative, 2, 3, 1, 1, 1, 1, 1, 1, 1
%-------------------------------------------------------------------------------
1, tested_positive, 2, 1, 1, 1, 1, 1, 1, 1, 1
1, tested_positive, 2, 1, 1, 1, 1, 3, 1, 2, 1
1, tested_positive, 2, 1, 3, 1, 1, 3, 2, 1, 3
1, tested_positive, 2, 1, 3, 1, 3, 1, 2, 1, 3
1, tested_positive, 2, 1, 3, 2, 3, 1, 1, 1, 3
2, tested_negative, 3, 3, 3, 2, 3, 1, 2, 2, 3
1, tested_positive, 3, 3, 1, 1, 1, 3, 2, 2, 3
1, tested_positive, 3, 3, 3, 2, 3, 1, 2, 1, 3
1, tested_negative, 4, 1, 1, 2, 1, 1, 1, 1, 3
1, tested_negative, 4, 3, 1, 2, 3, 1, 1, 1, 1
1, tested_positive, 4, 1, 3, 2, 3, 3, 2, 2, 3
1, tested_positive, 6, 3, 3, 2, 1, 3, 2, 2, 3
1, tested_positive, 7, 3, 3, 2, 3, 1, 2, 1, 3
1, tested_positive, 12, 3, 3, 2, 3, 3, 2, 2, 3
1, tested_positive, 18, 3, 3, 2, 3, 3, 2, 2, 3
Note that, in the above, the numbers have been binned into a small number of groups; e.g. age has 2 groups and preg has 3. Why? See below.
BTW, there are many other ways to find these centroids:
- cluster, then for each cluster:
- take center of each cluster
- take the N most remote points in the cluster
- take N random points in the cluster
- decision tree learning; then for each leaf:
- take center of each leaf
- take the N most remote points in the leaf
- take N random points in the leaf
- see also, the (very cool) literature on prototype selection; a.k.a. finding exemplars.
- One example is a point in space
- e.g. a=1,b=2,c=3
- Rules are volumes in space
- e.g. a < 1, b< 2 (and c equals anything)
- So, to generalize:
- Combine points into bins (discretization)
- Find epsilon:
- e.g. something less than what the business users can control
- e.g. some small fraction of the standard deviation
- Cohen's rule is a heuristic for detecting trivially small differences: a difference is trivial if it is less than cohen*sd
- Cohen = (0.2, 0.5, 0.8) = small, medium, large
- But this is somewhat controversial
- Divide the data into bins that are at least epsilon wide
- Find enough:
- e.g. sqrt(N)
- Divide numbers into bins holding more than enough items
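- For example (made-up numbers): if a column has sd = 10 and cohen = 0.2, then epsilon = 0.2*10 = 2; and if there are N = 768 rows with enough = sqrt(N), then enough ≈ 27.7; so no bin should span less than 2 units or hold fewer than about 28 rows.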
The following code applies epsilon and enough to split the data in nums.all (see the third and fourth if statements). The other if statements handle some weird end cases.
function binsGrow(t,nums,b,
                  epsilon,enough,r,ranks,j,most,val) {
  epsilon = nums.sd * b.cohen        # ignore differences smaller than this
  enough  = length( t.rows )^b.m     # minimum number of items per bin
  most    = last(nums.all)           # largest value in the sorted column
  has(b.ranks, 1, "Num")             # start the first bin
  r = 1
  for(j=1; j<=t.n; j++) {
    val = nums.all[j]
    NumAdd( b.ranks[r], val )                           # add value to the current bin
    if ( most - val > epsilon )                         # end case: room left above?
      if( t.n - j > enough )                            # end case: items left below?
        if( b.ranks[r].n > enough )                     # current bin has enough items?
          if( b.ranks[r].hi - b.ranks[r].lo > epsilon ) # and spans at least epsilon?
            has( b.ranks, ++r, "Num" ) }                # then start a new bin
  return r
}
Don't split things unless those splits decrease confusion.
- For numbers, confusion is measured by variance; for symbols, by entropy.
First, variance:
- A class for collecting means and standard deviations
- incremental computation of variance and mean (a sketch of such an incremental collector appears later in these notes)
- detecting different distributions: see the diff function, below, for the core test
Second, entropy
- entropy = number of bits needed to encode a signal
- e.g. yes, yes, yes, no, no = (3/5, 2/5)
- ent = -1 * sum p*log2(p) = -1*(3/5*log2(3/5) + 2/5*log2(2/5)) ≈ 0.97
If a division of the data does not decrease variance/entropy, then that is a bad division:
- e.g. 10 oranges, 5 green apples, 1 watermelon
- before division: ent = -1*(10/16*log2(10/16) + 5/16*log2(5/16) + 1/16*log2(1/16)) ≈ 1.2
- consider dividing on color=orange vs color=green
- after division there are 10 orange things and 6 green things
- ent = -1*(10/16*log2(10/16) + 6/16*log2(6/16)) ≈ 0.95
- code: see the sketch just below
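A minimal gawk sketch of those entropy sums (the ent() helper here is written just for this sketch, not part of the lecture's code):
function log2(z)         { return log(z) / log(2) }
function ent(a,   k,n,e) {                 # entropy of an array of symbol counts
  for(k in a) n += a[k]
  for(k in a) e -= a[k]/n * log2(a[k]/n)
  return e }
BEGIN {
  b4["orange"] = 10; b4["apple"] = 5; b4["watermelon"] = 1
  after["color=orange"] = 10; after["color=green"] = 6
  print ent(b4), ent(after)                # prints roughly 1.2 and 0.95
}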
Example. The following code is run as a post-processor to binsGrow (shown above). Bins that should be combined get the same rank.
Aside: this code uses expected values:
- If there are observations "x1" and "x2" seen in "n1" and "n2" examples, then:
- n = n1 + n2
- Expected value = E = n1/n * x1 + n2/n * x2
- e.g. see the tmp line, below.
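- For example (made-up numbers): if x1 = 10 is seen in n1 = 4 examples and x2 = 20 in n2 = 6 examples, then n = 10 and E = 4/10*10 + 6/10*20 = 16.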
The following code uses these expected values to recursively split the bins generated above:
function binsCuts(lo,hi,ranks,b4,r,
                  j,best,n,mid,x,y,xx,yy,tmp,cut) {
  best = b4.sd                          # only accept cuts that beat this score
  if (lo < hi)
    for(mid=lo+1; mid<=hi; mid++) {     # try every split point between lo and hi
      xpect(x, lo,    mid, ranks)       # stats of the bins up to the split
      xpect(y, mid+1, hi,  ranks)       # stats of the bins above the split
      tmp = x.n/b4.n*x.sd + y.n/b4.n*y.sd  #<== for symbols, change .sd to .ent
      if (tmp < best)
        if ( diff(x,y) ) {              #<== explained below
          cut  = mid
          best = tmp
          copy(x,xx)
          copy(y,yy)
        }}
  if (cut) {
    r = binsCuts(lo,    cut, ranks, xx, r) + 1   # <== recursive call
    r = binsCuts(cut+1, hi,  ranks, yy, r)
  } else
    for(j=lo; j<=hi; j++)               # no useful cut: all these bins share one rank
      ranks[j].rank = r
  return r
}
Aside: this code needs two helper functions:
function Xpect(i) {            # reset an accumulator for expected values
  new(i)
  i.n = i.mu = i.sd = 0
}
function xpect(x,lo,hi,ranks,    j,n1) {
  Xpect(x)
  for(j=lo; j<=hi; j++)        # total number of items in bins lo..hi
    x.n += ranks[j].n
  for(j=lo; j<=hi; j++) {      # weighted (expected) value of mu and sd
    n1    = ranks[j].n
    x.mu += n1 / x.n * ranks[j].mu
    x.sd += n1 / x.n * ranks[j].sd
  }}
Use statistical theory to block spurious splits
- See the diff call, made above.
Which means we need to know some stats:
- Significance tests (e.g. the t-test) check if two distributions differ by more than random chance
- Effect size tests (e.g. Hedges' g) comment on the size of the difference
- We use both
function diff(x,y,    s) {
  Stats(s)                               # set up the stats settings (e.g. s.conf)
  return hedges(x,y,s) && ttest(x,y,s)   # "different" only if both tests agree
}
For this to work, we need fast ways to incrementally collect mu and standard deviation:
In the following, we assume x and y can report their sample size, mean, and standard deviation as x.n, x.mu, x.sd (and y.n, y.mu, y.sd).
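As an aside, here is a minimal gawk sketch of such an incremental collector (not the lecture's Num class), using Welford-style updates; since plain gawk has no dot syntax, x.n is written here as x["n"]:
function NumAdd(x, v,   d) {
  x["n"]++
  d        = v - x["mu"]
  x["mu"] += d / x["n"]                                    # running mean
  x["m2"] += d * (v - x["mu"])                             # running sum of squared deviations
  x["sd"]  = x["n"] > 1 ? sqrt(x["m2"] / (x["n"] - 1)) : 0 # running standard deviation
}
BEGIN { split("4 8 15 16 23 42", v, " ")
        for(i=1; i<=6; i++) NumAdd(num, v[i])
        print num["n"], num["mu"], num["sd"] }             # prints 6 18 13.4907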
The Hedges effect size test uses magic numbers from this study. It checks if the difference in the means, divided by the pooled standard deviation, is bigger than some magic number (0.38).
function hedges(x,y,s,    nom,denom,sp,g,c) {
  nom   = (x.n - 1)*x.sd^2 + (y.n - 1)*y.sd^2
  denom = (x.n - 1) + (y.n - 1)
  sp    = sqrt( nom / denom )                 # pooled standard deviation
  g     = abs(x.mu - y.mu) / sp               # Hedges' g effect size
  c     = 1 - 3.0 / (4*(x.n + y.n - 2) - 1)   # small-sample correction
  return g * c > 0.38                         # from https://goo.gl/w62iIL
}
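For example (made-up numbers): if x.mu = 10, y.mu = 10.1, both sd = 2 and both n = 30, then sp = 2, g = 0.1/2 = 0.05, c = 1 - 3/231 ≈ 0.987, and g*c ≈ 0.05 < 0.38; so that difference is too small to matter.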
The t-test significance test is very similar. It uses ttest1 to look up a critical threshold value. For details, see here.
function ttest(x,y,s,    t,a,b,df,c) {
  # debugged using https://goo.gl/CRl1Bz
  t  = (x.mu - y.mu) / sqrt(max(10^-64,
         x.sd^2/x.n + y.sd^2/y.n ))                        # Welch's t statistic
  a  = x.sd^2/x.n
  b  = y.sd^2/y.n
  df = (a + b)^2 / (10^-64 + a^2/(x.n-1) + b^2/(y.n - 1))  # degrees of freedom
  c  = ttest1(s, int( df + 0.5 ), s.conf)                  # critical value at confidence s.conf
  return abs(t) > c
}
The above methods:
- divide many continuous distributions into a handful of bins
- select a handful of rows
- allow us to find a handful of most useful columns
- and that reduced space works just as well for data mining as the original space
BTW, once we know how to do exemplars:
- big data moves to little data
- big data is useful for initial data collection
- little data is where all the subsequent inference happens
- The secret of big data is little data
- we can ship models with a certification envelope; i.e. the core examples from which we defined a model
- ring an alarm bell if new data is not in the certification envelope (a minimal sketch of that check follows this list)
- no more crashing Columbia space shuttles.
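Here is a minimal gawk sketch of that alarm bell (the outside() and dist() helpers, the two exemplars, and the distance threshold are all made up for this sketch): a new row raises an alert if it is too far from every exemplar in the certification envelope.
function dist(row, ex, i, cols,   c, d) {          # Manhattan distance to exemplar i
  for(c=1; c<=cols; c++)
    d += (row[c] > ex[i,c]) ? row[c] - ex[i,c] : ex[i,c] - row[c]
  return d }
function outside(row, ex, rows, cols, maxd,   i, d, closest) {
  closest = 10^32
  for(i=1; i<=rows; i++) {
    d = dist(row, ex, i, cols)
    if (d < closest) closest = d }
  return closest > maxd }                          # 1 = outside the envelope
BEGIN {
  split("1 1 1", tmp, " "); for(c=1; c<=3; c++) ex[1,c] = tmp[c]   # exemplar one
  split("3 3 3", tmp, " "); for(c=1; c<=3; c++) ex[2,c] = tmp[c]   # exemplar two
  split("9 9 9", row, " ")                                         # a new arrival
  if (outside(row, ex, 2, 3, 2)) print "alert: outside the certification envelope"
}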
Maths that matters: my spiel would be cognitive. Stats on their own are "meh"; the real game is showing decision makers things that let them do things. So: informative "visualizations" that highlight the deltas between things (Kelly's personal construct theory: folks don't know "things", what they really know are the minimal differences between things). Which leads to the mathematics of diversity (entropy, variance), to how to distinguish between things (bootstrapping, effect size), and to how to apply that theory of differences to data mining (clustering, discretization, recursive tree learning). The final cherry on this cake would be the Scott-Knott work seen at ICSE recently, where results are chunked into big groups and we no longer focus on micro deltas but on the really big differences. Cognitive maths. Maths that fuels human decision making:
- explanation is task-dependent: supervised learning rules (different tasks, different learning)
- instance-based reasoning: don't think, remember over a small library of exemplars
- Don't sweat the small stuff: effect size, Scott-Knott
- Maths of the n-dimensional sphere: low-dimensional or not at all (+)
- Kelly's personal construct theory: don't show what is, show what is different
- Minimize confusion: reduce the expected value of variance or entropy
(+) note: cool if we are human intelligences. May not hold for super-human intelligence.