<!-- === begin markdown block =====================================================
generated by markdown 1.0.0 on Ruby 1.8.7 (2012-02-08) [universal-darwin12.0]
on Sun Jun 02 15:55:13 -0700 2013 with Markdown engine kramdown (0.14.1)
using options { !to be done! }
-->
<h1 id="introduction">Introduction</h1>
<p><code>lda-match2.py</code> and <code>lda-match3.py</code> are Python 2 and Python 3 versions
(respectively) of scripts which allow the user to generate a subsample of
data (represented by a comma-separated values file) in which two groups do
not differ (according to a two-tailed unequal variance <em>t</em>-test) on an
arbitrary number of real-valued measures.</p>
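<p>As a rough illustration of the matching criterion, the unequal-variance
(Welch) <em>t</em> statistic for one measure can be computed as below. This is a
minimal sketch using only the Python 3 standard library, not the scripts' own
code; the function name <code>welch_t</code> and the sample ages are hypothetical.
Turning the statistic into a two-tailed <em>p</em>-value additionally requires the
<em>t</em> distribution's CDF, which the standard library does not provide.</p>

```python
import math
from statistics import mean, variance


def welch_t(xs, ys):
    """Return (t, df) for a two-sample unequal-variance (Welch) t-test."""
    nx, ny = len(xs), len(ys)
    # Squared standard error of each group mean: sample variance / n.
    sx, sy = variance(xs) / nx, variance(ys) / ny
    t = (mean(xs) - mean(ys)) / math.sqrt(sx + sy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df


# Hypothetical chronological ages (in years) for two groups.
group_a = [5.1, 6.0, 6.4, 7.2, 7.9]
group_b = [6.8, 7.5, 8.1, 8.3, 9.0]
t, df = welch_t(group_a, group_b)
print("t = {:.3f}, df = {:.2f}".format(t, df))
```

<p>Two groups count as matched on a measure when the two-tailed <em>p</em>-value
derived from <code>t</code> and <code>df</code> is at or above the chosen alpha level.</p>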
<p>The approach here is similar to the “greedy” algorithm used by van Santen
et al. (2010; <em>Autism</em>) but in general results in larger subgroups, as it
uses linear discriminant analysis (LDA) to identify outliers. To a first
approximation, when the <em>t</em>-test assumptions (normality or large samples)
hold, the assumptions of LDA will also hold, and therefore this will give
something close to an “optimal” subsample according to a criterion which
favors large subsamples of approximately the same size.</p>
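<p>To make the role of LDA more concrete, the sketch below computes a Fisher
discriminant direction for two groups measured on two features and uses it to
rank the observations of one group. It is an illustration under assumptions,
not the scripts' actual removal rule (see the source for that): the helper
names <code>fisher_direction</code> and <code>most_extreme</code> are hypothetical, the
code is limited to exactly two features, and it uses only the Python 3 standard
library rather than <code>numpy</code>.</p>

```python
from statistics import mean


def fisher_direction(a, b):
    """Fisher discriminant direction for two groups of (x, y) pairs."""
    def centered(rows):
        mx, my = mean(r[0] for r in rows), mean(r[1] for r in rows)
        return (mx, my), [(x - mx, y - my) for x, y in rows]

    (ax, ay), ca = centered(a)
    (bx, by), cb = centered(b)
    # Pooled within-group scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in ca + cb)
    sxy = sum(x * y for x, y in ca + cb)
    syy = sum(y * y for _, y in ca + cb)
    det = sxx * syy - sxy * sxy
    dx, dy = ax - bx, ay - by
    # w = S^-1 (mean_a - mean_b), using the closed-form 2x2 inverse.
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)


def most_extreme(a, b):
    """Index of the row of `a` lying farthest from group `b` along w."""
    w = fisher_direction(a, b)
    scores = [w[0] * x + w[1] * y for x, y in a]
    return max(range(len(a)), key=scores.__getitem__)


# Hypothetical (feature 1, feature 2) values for two groups.
group_a = [(5, 5), (6, 5), (5, 6), (9, 9)]
group_b = [(1, 1), (2, 1), (1, 2), (2, 2)]
print(most_extreme(group_a, group_b))
```

<p>Because the direction points from the mean of the second group toward the
mean of the first, the highest-scoring observation in the first group is the one
that most separates the groups, which makes it a natural candidate for
removal.</p>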
<h1 id="installation-instructions">Installation instructions</h1>
<p>This code requires Python (either 2 or 3) and two additional packages,
<code>numpy</code> and <code>rpy2</code>. You will need all the following:</p>
<ul>
<li>Python (try <code>python --version</code>) </li>
<li>R (try <code>R --version</code>)</li>
<li>C compiler (try <code>cc --version</code>)</li>
<li><code>pip</code>, the Python package manager (try <code>pip --version</code>)</li>
<li>
<p><code>numpy</code> and <code>rpy2</code>:</p>
<p><code>sudo pip install numpy rpy2</code></p>
</li>
</ul>
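<p>The checklist above can also be run in one step. The following is a small
sketch, not part of the distributed scripts, and it assumes Python 3.4 or later
(<code>shutil.which</code> and <code>importlib.util.find_spec</code> are not available
in Python 2); the function name <code>check_prerequisites</code> is hypothetical.</p>

```python
import importlib.util
import shutil


def check_prerequisites():
    """Map each prerequisite to True if it can be found on this machine."""
    # Command-line tools are looked up on the PATH.
    found = {tool: shutil.which(tool) is not None for tool in ("R", "cc", "pip")}
    # Python packages are looked up without importing them.
    for package in ("numpy", "rpy2"):
        found[package] = importlib.util.find_spec(package) is not None
    return found


for name, ok in sorted(check_prerequisites().items()):
    print("{:6} {}".format(name, "found" if ok else "MISSING"))
```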
<h1 id="usage">Usage</h1>
<p>Users <em>must</em> specify the labels for the two groups (<code>-a</code>, <code>-b</code>), the
column name containing the groups (<code>-g</code>), the column name(s) of the
feature(s) to match on (<code>-m</code>), and the location of the input file. Users
<em>may</em> specify that observations are only to be removed from the first
group (<code>-d</code>), the two-tailed alpha level (<code>-p</code>; by default, 0.2), and
the location for the output file (<code>-o</code>; by default, results are printed to
<code>STDOUT</code>).</p>
<p>For more information, refer to the worked examples below.</p>
<h1 id="worked-examples">Worked examples</h1>
<ul>
<li>
<p>Match TD and SLI children in <code>DX.csv</code> on chronological age (<code>CA</code>), ADOS severity score (<code>ADOS</code>), and SCQ total score (<code>SCQ</code>) at two-tailed alpha >= 0.2, and write the resulting set to a file called <code>TD-SLI.csv</code>:</p>
<p><code>python lda-match2.py -a TD -b SLI -g DX -m CA -m ADOS -m SCQ -p 0.2 -o TD-SLI.csv DX.csv</code></p>
</li>
<li>
<p>Match TD and ALN/ALI children in <code>DX.csv</code> on chronological age (<code>CA</code>) and non-verbal IQ (<code>NVIQ</code>) at two-tailed alpha >= .5 and write the resulting set to a file called <code>TD-ASD-p5.csv</code>:</p>
<p><code>python lda-match2.py -a TD -b ALN -b ALI -g DX -m CA -m NVIQ -p 0.5 -o TD-ASD-p5.csv DX.csv</code></p>
</li>
<li>
<p>Alternative conventions for generating the previous set:</p>
<p><code>python lda-match2.py -aTD -bALN,ALI -gDX -mCA,NVIQ -p.5 DX.csv > TD-ASD-p5.csv</code></p>
</li>
</ul>
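<p>The two option styles used in the examples above (repeated flags versus
attached, comma-separated values) can be accepted by a single parser. The sketch
below shows one way to do this with <code>argparse</code>; it is an assumption about
how such an interface could be built, not the scripts' actual parser, and the
names <code>build_parser</code>, <code>split_commas</code>, and <code>parse</code> are
hypothetical.</p>

```python
import argparse


def split_commas(values):
    """Flatten repeated and comma-separated option values into one list."""
    return [item for value in values for item in value.split(",")]


def build_parser():
    parser = argparse.ArgumentParser(prog="lda-match")
    parser.add_argument("-a", action="append", required=True,
                        help="label(s) of the first group")
    parser.add_argument("-b", action="append", required=True,
                        help="label(s) of the second group")
    parser.add_argument("-g", required=True,
                        help="name of the column containing the groups")
    parser.add_argument("-m", action="append", required=True,
                        help="name(s) of the column(s) to match on")
    parser.add_argument("-d", action="store_true",
                        help="only remove observations from the first group")
    parser.add_argument("-p", type=float, default=0.2,
                        help="two-tailed alpha level")
    parser.add_argument("-o", default=None,
                        help="output file (standard output if omitted)")
    parser.add_argument("input", help="input CSV file")
    return parser


def parse(argv):
    args = build_parser().parse_args(argv)
    # Accept both "-m CA -m NVIQ" and "-mCA,NVIQ".
    args.a, args.b, args.m = (split_commas(v) for v in (args.a, args.b, args.m))
    return args


long_form = parse("-a TD -b ALN -b ALI -g DX -m CA -m NVIQ -p 0.5 DX.csv".split())
short_form = parse("-aTD -bALN,ALI -gDX -mCA,NVIQ -p.5 DX.csv".split())
print(long_form.b == short_form.b, long_form.m == short_form.m)
```

<p>Both argument lists yield the same group labels, feature names, and alpha
level, so the choice between them is purely a matter of convenience.</p>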
<h1 id="some-results-on-the-erpa-data">Some results on the ERPA data</h1>
<ul>
<li>TD (36) vs. SLI (20): CA! [+1]</li>
<li>ALN (25) vs. TD (28): CA, NVIQ, VIQ! [+1]</li>
<li>ALN (23) vs. ALI (24): CA, ADOS!, SCQ [+2]</li>
<li>TD (42) vs. ASD (ALN = 24, ALI = 19): CA, NVIQ! [0]</li>
<li>LN (ALN = 26, TD = 39) vs. LI (ALI = 26, SLI = 20): CA! [+2]</li>
<li>ALI (26) vs. SLI (20): CA, NVIQ, VIQ (no matching necessary) [+3]</li>
<li>ASD (ALN = 25, ALI = 26) vs. nASD (TD = 44, SLI = 20): CA, NVIQ (no matching necessary) [+6]</li>
</ul>
<p>Key:</p>
<ul>
<li>!: last feature matched</li>
<li>[+N]: change in overall subsample size compared to the “greedy” method
operating at the same alpha level</li>
</ul>
<h1 id="license">License</h1>
<p>BSD-like (see the source)</p>
<h1 id="author">Author</h1>
<p>Kyle Gorman (<a href="mailto:gormanky@ohsu.edu">gormanky@ohsu.edu</a>), with thanks to Steven Bedrick</p>
<!-- === end markdown block ===================================================== -->