README.html

<!-- === begin markdown block =====================================================

      generated by markdown 1.0.0 on Ruby 1.8.7 (2012-02-08) [universal-darwin12.0]
                on Sun Jun 02 15:55:13 -0700 2013 with Markdown engine kramdown (0.14.1)
                  using options { !to be done! }
  -->
<h1 id="introduction">Introduction</h1>

<p><code>lda-match2.py</code> and <code>lda-match3.py</code> are Python 2 and Python 3 versions 
(respectively) of scripts which allow the user to generate a subsample of 
data (represented by a comma-separated values file) in which two groups do
not differ (according to a two-tailed unequal variance <em>t</em>-test) on an 
arbitrary number of real-valued measures.</p>

<p>The approach here is similar to the &#8220;greedy&#8221; algorithm used by van Santen
et al. (2010; <em>Autism</em>) but in general results in larger subgroups, as it
uses linear discriminant analysis (LDA) to identify outliers. To a first
approximation, when the <em>t</em>-test assumptions (normality or large samples)
hold, the assumptions of LDA will also hold, and therefore this will give
something close to an &#8220;optimal&#8221; subsample according to a criterion which 
favors large subsamples of approximately the same size.</p>
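<p>For readers unfamiliar with the matching criterion, the statistic behind a two-tailed unequal variance (Welch) <em>t</em>-test can be sketched as follows. This is an illustration using only the Python 3 standard library, not the code used by the scripts; the function name <code>welch_t</code> and the sample values are hypothetical. Converting the statistic into a <em>p</em>-value additionally requires the CDF of the <em>t</em> distribution, which is omitted here.</p>

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for a two-sample unequal variance (Welch) t-test.

    Hypothetical helper for illustration; not taken from lda-match.
    """
    nx, ny = len(x), len(y)
    # Squared standard error of each group mean (sample variance / n).
    sx, sy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(sx + sy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df


# Made-up measurements for two groups on one measure.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
```

<p>Two groups count as matched on a measure when the two-tailed <em>p</em>-value derived from <code>t_stat</code> and <code>dof</code> is at or above the chosen alpha level.</p>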

<h1 id="installation-instructions">Installation instructions</h1>

<p>This code requires Python (either 2 or 3) and two additional packages, 
<code>numpy</code> and <code>rpy2</code>. You will need all the following:</p>

<ul>
  <li>Python (try <code>python --version</code>) </li>
  <li>R (try <code>R --version</code>)</li>
  <li>C compiler (try <code>cc --version</code>)</li>
  <li><code>pip</code>, the Python package manager (try <code>pip --version</code>)</li>
  <li>
    <p><code>numpy</code> and <code>rpy2</code>:</p>

    <p><code>sudo pip install numpy rpy2</code></p>
  </li>
</ul>
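<p>The checklist above can also be run from Python. The following is a minimal sketch, assuming Python 3 and that the command names <code>R</code>, <code>cc</code>, and <code>pip</code> are the ones on your <code>PATH</code>; the function <code>check_requirements</code> is hypothetical and not part of this package.</p>

```python
import importlib.util
import shutil


def check_requirements(modules=("numpy", "rpy2"), programs=("R", "cc", "pip")):
    """Map each requirement name to True if it appears to be available.

    Hypothetical helper: it only checks that modules can be located and
    that programs are on PATH, not that they work or which version they are.
    """
    status = {}
    for name in modules:
        status[name] = importlib.util.find_spec(name) is not None
    for name in programs:
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(name.ljust(6), "found" if ok else "MISSING")
```

<p>Anything reported as missing needs to be installed before running the scripts.</p>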

<h1 id="usage">Usage</h1>

<p>Users <em>must</em> specify the labels for the two groups (<code>-a</code>, <code>-b</code>), the 
column name containing the groups (<code>-g</code>), the column name(s) of the 
feature(s) to match on (<code>-m</code>), and the location of the input file. Users 
<em>may</em> specify that observations are only to be removed from the first 
group (<code>-d</code>), the two-tailed alpha level (<code>-p</code>; 0.2 by default), and the
location for the output file (<code>-o</code>; by default, results are printed to 
<code>STDOUT</code>).</p>

<p>For more information, refer to the worked examples below.</p>
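<p>To make the option conventions concrete, here is a sketch of a parser that accepts the flags described above, including repeated and comma-separated values. It is an assumption about how such an interface could be built with <code>argparse</code>, not a description of how the scripts parse their arguments; <code>build_parser</code>, <code>split_commas</code>, and <code>parse</code> are hypothetical names.</p>

```python
import argparse


def split_commas(values):
    """Flatten repeated and comma-separated option values into one list."""
    return [item for value in values for item in value.split(",") if item]


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("-a", action="append", required=True, help="label(s) of the first group")
    parser.add_argument("-b", action="append", required=True, help="label(s) of the second group")
    parser.add_argument("-g", required=True, help="name of the column containing the groups")
    parser.add_argument("-m", action="append", required=True, help="feature column(s) to match on")
    parser.add_argument("-d", action="store_true", help="remove observations from the first group only")
    parser.add_argument("-p", type=float, default=0.2, help="two-tailed alpha level")
    parser.add_argument("-o", default=None, help="output file (default: standard output)")
    parser.add_argument("input", help="input CSV file")
    return parser


def parse(argv):
    """Parse argv and normalize the list-valued options."""
    args = build_parser().parse_args(argv)
    for name in ("a", "b", "m"):
        setattr(args, name, split_commas(getattr(args, name)))
    return args


# Both spellings from the worked examples yield the same group and feature lists.
long_form = parse(["-a", "TD", "-b", "ALN", "-b", "ALI", "-g", "DX",
                   "-m", "CA", "-m", "NVIQ", "-p", "0.5", "DX.csv"])
short_form = parse(["-aTD", "-bALN,ALI", "-gDX", "-mCA,NVIQ", "-p.5", "DX.csv"])
```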

<h1 id="worked-examples">Worked examples</h1>

<ul>
  <li>
    <p>Match ALN and ALI children in <code>DX.csv</code> on chronological age (<code>CA</code>), ADOS severity score (<code>ADOS</code>), and SCQ total score (<code>SCQ</code>) at two-tailed alpha &gt;= 0.2, and write the resulting set to a file called <code>ALN-ALI.csv</code>:</p>

    <p><code>python lda-match2.py -a ALN -b ALI -g DX -m CA -m ADOS -m SCQ -p 0.2 -o ALN-ALI.csv DX.csv</code></p>
  </li>
  <li>
    <p>Match TD and ALN/ALI children in <code>DX.csv</code> on chronological age (<code>CA</code>) and non-verbal IQ (<code>NVIQ</code>) at two-tailed alpha &gt;= 0.5, and write the resulting set to a file called <code>TD-ASD-p5.csv</code>:</p>

    <p><code>python lda-match2.py -a TD -b ALN -b ALI -g DX -m CA -m NVIQ -p 0.5 -o TD-ASD-p5.csv DX.csv</code></p>
  </li>
  <li>
    <p>Alternative conventions for generating the previous set:</p>

    <p><code>python lda-match2.py -aTD -bALN,ALI -gDX -mCA,NVIQ -p.5 DX.csv &gt; TD-ASD-p5.csv</code></p>
  </li>
</ul>
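<p>The examples above assume an input file with one row per observation, a group column, and one column per measure. The sketch below shows one way such a file could be read and split into the two groups being compared. Only the column names <code>DX</code>, <code>CA</code>, and <code>NVIQ</code> and the group labels come from the examples; the <code>ID</code> column, the data values, and the function <code>split_groups</code> are made up for illustration and do not reflect the scripts' internals.</p>

```python
import csv
import io

# Hypothetical file contents; real data will differ.
SAMPLE = """ID,DX,CA,NVIQ
s01,TD,6.1,112
s02,ALN,6.4,108
s03,ALI,7.0,85
s04,SLI,6.8,92
s05,TD,5.9,118
"""


def split_groups(handle, group_col, labels_a, labels_b, measures):
    """Return two lists of measure vectors, one per group.

    Rows whose group label is in neither set are ignored.
    """
    group_a, group_b = [], []
    for row in csv.DictReader(handle):
        values = [float(row[m]) for m in measures]
        if row[group_col] in labels_a:
            group_a.append(values)
        elif row[group_col] in labels_b:
            group_b.append(values)
    return group_a, group_b


# Mirrors "-a TD -b ALN -b ALI -g DX -m CA -m NVIQ".
td, asd = split_groups(io.StringIO(SAMPLE), "DX", {"TD"}, {"ALN", "ALI"}, ["CA", "NVIQ"])
```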

<h1 id="some-results-on-the-erpa-data">Some results on the ERPA data</h1>

<ul>
  <li>TD (36) vs. SLI (20): CA! [+1]</li>
  <li>ALN (25) vs. TD (28): CA, NVIQ, VIQ! [+1]</li>
  <li>ALN (23) vs. ALI (24): CA, ADOS!, SCQ [+2]</li>
  <li>TD (42) vs. ASD (ALN = 24, ALI = 19): CA, NVIQ! [0]</li>
  <li>LN (ALN = 26, TD = 39) vs. LI (ALI = 26, SLI = 20): CA! [+2]</li>
  <li>ALI (26) vs. SLI (20): CA, NVIQ, VIQ (no matching necessary) [+3]</li>
  <li>ASD (ALN = 25, ALI = 26) vs. nASD (TD = 44, SLI = 20): CA, NVIQ (no matching necessary) [+6]</li>
</ul>

<p>Key:</p>

<ul>
  <li>!: last feature matched</li>
  <li>[+N]: change in overall subsample size compared to the &#8220;greedy&#8221; method
operating at the same alpha level</li>
</ul>

<h1 id="license">License</h1>

<p>BSD-like (see the source)</p>

<h1 id="author">Author</h1>

<p>Kyle Gorman (<a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#103;&#111;&#114;&#109;&#097;&#110;&#107;&#121;&#064;&#111;&#104;&#115;&#117;&#046;&#101;&#100;&#117;">&#103;&#111;&#114;&#109;&#097;&#110;&#107;&#121;&#064;&#111;&#104;&#115;&#117;&#046;&#101;&#100;&#117;</a>), with thanks to Steven Bedrick</p>
<!-- === end markdown block ===================================================== -->