NB: this is now deprecated in favor of the similarly named ldamatch, written in pure R and introducing additional features.
lda-match2.py and lda-match3.py are Python 2 and Python 3 versions (respectively) of a script that lets the user generate a subsample of data (represented by a comma-separated values file) in which two groups do not differ (according to a two-tailed unequal-variance t-test) on an arbitrary number of real-valued measures.
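As an illustration of the matching criterion, here is a minimal pure-Python sketch of a two-tailed unequal-variance (Welch) t-test. It is not the code used by the scripts; the function names (`two_tailed_p`, `welch`, `groups_matched`) are hypothetical, and the p-value is obtained by numerically integrating the t density rather than by calling a statistics library.

```python
import math

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value for a t statistic with df degrees of freedom.

    Integrates the t density from 0 to |t| with Simpson's rule
    (steps must be even); adequate for illustration, not for extreme |t|.
    """
    t = abs(t)
    if t == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def welch(x, y):
    """Return (t, df, p) for a two-tailed unequal-variance t-test."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sx, sy = vx / nx, vy / ny
    t = (mx - my) / math.sqrt(sx + sy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df, two_tailed_p(t, df)

def groups_matched(x, y, alpha=0.2):
    """True if the two groups do not differ at the given alpha level."""
    return welch(x, y)[2] >= alpha
```

Under this sketch, two groups count as matched on a measure when the p-value is at or above the chosen alpha, so a larger alpha is a stricter matching criterion.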
The approach here is similar to the "greedy" algorithm used by van Santen et al. (2010; Autism) but in general results in larger subgroups, as it uses linear discriminant analysis (LDA) to identify outliers. To a first approximation, when the t-test assumptions (normality or large samples) hold, the assumptions of LDA will also hold, and therefore this will give something close to an "optimal" subsample according to a criterion which favors large subsamples of approximately the same size.
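To make the idea of using LDA to identify outliers concrete, the following sketch computes a two-group Fisher discriminant direction in pure Python and picks the observation that lies farthest out on its own group's side of that axis. This is an assumption about how such a removal step could look, not a description of what the scripts do; all function names are hypothetical and the data are made up.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, mu):
    """Sum of outer products of deviations from the group mean."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        dev = [v - m for v, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += dev[i] * dev[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def lda_direction(a_rows, b_rows):
    """Fisher discriminant w = Sw^-1 (mean_a - mean_b) for two groups."""
    mu_a, mu_b = mean_vec(a_rows), mean_vec(b_rows)
    sa, sb = scatter(a_rows, mu_a), scatter(b_rows, mu_b)
    sw = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(sa, sb)]
    return solve(sw, [p - q for p, q in zip(mu_a, mu_b)])

def removal_candidate(a_rows, b_rows):
    """Return (group, index) of the observation farthest out on its own
    group's side of the discriminant axis (a hypothetical removal rule)."""
    w = lda_direction(a_rows, b_rows)

    def score(row):
        return sum(wi * v for wi, v in zip(w, row))

    sa = [score(r) for r in a_rows]
    sb = [score(r) for r in b_rows]
    mid = (sum(sa) / len(sa) + sum(sb) / len(sb)) / 2
    ia = max(range(len(sa)), key=sa.__getitem__)  # group a projects high
    ib = min(range(len(sb)), key=sb.__getitem__)  # group b projects low
    return ("a", ia) if sa[ia] - mid >= mid - sb[ib] else ("b", ib)

# Made-up two-feature rows; the last row of group a is an obvious outlier.
a = [[1, 0], [2, 1], [3, 0], [10, 1]]
b = [[0, 1], [1, 0], [2, 1], [3, 0]]
print(removal_candidate(a, b))  # ('a', 3)
```

A matching loop built on this would remove the candidate, re-test every measure, and repeat until no measure differs at the chosen alpha level.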
This code requires Python (either 2 or 3) and two additional packages, numpy and rpy2. You will need all the following:
- Python (try python --version)
- R (try R --version)
- C compiler (try cc --version)
- pip, the Python package manager (try pip)
- numpy and rpy2: sudo pip install numpy rpy2
Users must specify the labels for the two groups (-a, -b), the column name containing the groups (-g), the column name(s) of the feature(s) to match on (-m), and the location of the input file. Users may also specify that observations are only to be removed from the first group (-d), the two-tailed alpha level (-p; by default .2), and the location for the output file (-o; by default, results are printed to stdout). For more information, refer to the worked examples below.
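The options described above could be declared roughly as follows with `argparse`. This is only a sketch of the interface as documented here: whether the scripts themselves use `argparse` is an assumption, and `build_parser` and `split_commas` are hypothetical helper names.

```python
import argparse

def split_commas(values):
    """Flatten repeated and comma-separated option values into one list."""
    return [item for value in values for item in value.split(",") if item]

def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        description="Match two groups on one or more measures.")
    parser.add_argument("-a", action="append", required=True,
                        help="label(s) of the first group")
    parser.add_argument("-b", action="append", required=True,
                        help="label(s) of the second group")
    parser.add_argument("-g", required=True,
                        help="name of the column containing the groups")
    parser.add_argument("-m", action="append", required=True,
                        help="feature column(s) to match on")
    parser.add_argument("-d", action="store_true",
                        help="only remove observations from the first group")
    parser.add_argument("-p", type=float, default=0.2,
                        help="two-tailed alpha level (default: .2)")
    parser.add_argument("-o", default=None,
                        help="output file (default: print to stdout)")
    parser.add_argument("infile", help="input CSV file")
    return parser

# Both calling conventions parse to the same group and feature lists.
long_form = build_parser().parse_args(
    "-a TD -b ALN -b ALI -g DX -m CA -m NVIQ -p 0.5 DX.csv".split())
short_form = build_parser().parse_args(
    "-aTD -bALN,ALI -gDX -mCA,NVIQ -p.5 DX.csv".split())
print(split_commas(long_form.b) == split_commas(short_form.b))  # True
```

Using `action="append"` plus a comma split is one way to accept both repeated flags (-m CA -m NVIQ) and comma-separated values (-mCA,NVIQ).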
Match TD and SLI children in DX.csv on chronological age (CA), ADOS severity score (ADOS), and SCQ total score (SCQ) at two-tailed alpha >= 0.2, and write the resulting set to a file called TD-SLI.csv:

    python lda-match2.py -a TD -b SLI -g DX -m CA -m ADOS -m SCQ -p 0.2 -o TD-SLI.csv DX.csv
Match TD and ALN/ALI children in DX.csv on chronological age (CA) and non-verbal IQ (NVIQ) at two-tailed alpha >= .5, and write the resulting set to a file called TD-ASD-p5.csv:

    python lda-match2.py -a TD -b ALN -b ALI -g DX -m CA -m NVIQ -p 0.5 -o TD-ASD-p5.csv DX.csv
Alternative conventions for generating the previous set:

    python lda-match2.py -aTD -bALN,ALI -gDX -mCA,NVIQ -p.5 DX.csv > TD-ASD-p5.csv
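After running a command like those above, it can be useful to check how many observations remain in each group. The sketch below counts rows per group label in a CSV file; it assumes the output keeps a header row and the group column (DX in these examples), and the rows shown are made up for illustration.

```python
import csv
import io
from collections import Counter

def group_sizes(handle, group_column):
    """Count rows per group label in an open CSV file."""
    return Counter(row[group_column] for row in csv.DictReader(handle))

# Made-up rows standing in for a matched output file such as TD-ASD-p5.csv.
sample = io.StringIO(
    "DX,CA,NVIQ\n"
    "TD,6.1,110\n"
    "ALN,6.3,108\n"
    "ALI,6.0,101\n"
    "TD,5.9,112\n"
)
sizes = group_sizes(sample, "DX")
print(sizes["TD"], sizes["ALN"] + sizes["ALI"])  # 2 2
```

For a real file, pass `open("TD-ASD-p5.csv", newline="")` in place of the `io.StringIO` object.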
- TD (36) vs. SLI (20): CA! [+1]
- ALN (25) vs. TD (28): CA, NVIQ, VIQ! [+1]
- ALN (23) vs. ALI (24): CA, ADOS!, SCQ [+2]
- TD (42) vs. ASD (ALN = 24, ALI = 19): CA, NVIQ! [0]
- LN (ALN = 26, TD = 39) vs. LI (ALI = 26, SLI = 20): CA! [+2]
- ALI (26) vs. SLI (20): CA, NVIQ, VIQ (no matching necessary) [+3]
- ASD (ALN = 25, ALI = 26) vs. nASD (TD = 44, SLI = 20): CA, NVIQ (no matching necessary) [+6]
- !: last feature matched
- [+N]: change in overall subsample size compared to the "greedy" method operating at the same alpha level
BSD-like (see the source)
Kyle Gorman, with thanks to Steven Bedrick