how to binarize the data #15

Open
GoogleCodeExporter opened this issue Jun 1, 2015 · 2 comments

Comments

@GoogleCodeExporter

Please enter your question:

Based on a limited study, how can we binarize the data?

Original issue reported on code.google.com by [email protected] on 13 Nov 2014 at 4:13

@GoogleCodeExporter

First, we have a paper accepted for publication that describes the math and 
statistics underlying this software. The introduction addresses this issue. 
Please check it out, as it may help you better understand the software and the 
discussion below.

Chen, W.; Wunderlich, A.; Petrick, N. A. & Gallas, B. D. (Accepted 2014), 'A 
general framework for MRMC reader studies with binary assessments: simulation 
for validation and sizing.' J Med Img.

The iMRMC-binary software expects as input binary outcomes for each reader-case 
evaluation. For example, "zero" may mean that the reader did not agree with the 
reference and "one" may mean that the reader did agree with the reference. This 
agreement could come from a qualitative comparison of a reader's free-text 
report and the reference report (truth).

If the data being compared are quantitative (e.g. size), agreement with the 
reference could be defined by some error tolerance. In other words, a "zero" is 
given to an observation if the size reported by the study reader is outside the 
tolerance region about the reference reader measurement, and a "one" is given 
if the size is within the tolerance region. Alternatively, a threshold can be 
introduced and a 2x2 table can be created for this kind of data.

                              Study reader         Study reader
                              size < threshold     size ≥ threshold
Reference size < threshold    A                    B
Reference size ≥ threshold    C                    D

Given a table like the one above, you could get a binary outcome that reflects:
• Total agreement: Every case in squares A or D gets a “one”. Every case 
in squares B or C gets a “zero”.
• Agreement given reference size < threshold: Every case in square A gets a 
“one” and every case in square B gets a “zero”. Cases in squares C and 
D are not included in the analysis.
• Agreement given reference size ≥ threshold: Every case in square D gets a 
“one” and every case in square C gets a “zero”. Cases in squares A and 
B are not included in the analysis.
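
As a minimal sketch (not part of iMRMC-binary), the tolerance rule and the three table-based outcomes above could be coded as follows. All function names are hypothetical, and treating a difference exactly equal to the tolerance as agreement is an assumption; excluded cases are returned as `None` so they can be dropped before analysis.

```python
from typing import Optional


def within_tolerance(reader_size: float, reference_size: float,
                     tolerance: float) -> int:
    """Return 1 if the reader's size is within the tolerance region, else 0.

    Assumption: the region is symmetric and includes its boundary.
    """
    return 1 if abs(reader_size - reference_size) <= tolerance else 0


def table_cell(reader_size: float, reference_size: float,
               threshold: float) -> str:
    """Return the 2x2 table square (A, B, C, or D) for one observation."""
    reader_high = reader_size >= threshold
    reference_high = reference_size >= threshold
    if not reference_high:
        return "B" if reader_high else "A"
    return "D" if reader_high else "C"


def total_agreement(cell: str) -> int:
    """Squares A and D score one; squares B and C score zero."""
    return 1 if cell in ("A", "D") else 0


def agreement_given_reference_below(cell: str) -> Optional[int]:
    """A scores one, B scores zero; C and D are excluded (None)."""
    return {"A": 1, "B": 0}.get(cell)


def agreement_given_reference_at_or_above(cell: str) -> Optional[int]:
    """D scores one, C scores zero; A and B are excluded (None)."""
    return {"D": 1, "C": 0}.get(cell)


# Hypothetical observation: reader reports 6.0, reference reports 4.5,
# threshold is 5.0, so the case falls in square B.
cell = table_cell(reader_size=6.0, reference_size=4.5, threshold=5.0)
print(cell, total_agreement(cell), agreement_given_reference_below(cell))
```
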
It is possible to generalize to multiple thresholds by considering all squares 
on the diagonal, where the study reader and the reference fall in the same size 
interval (total agreement), or all squares in a particular row (agreement given 
a particular reference result). 
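
One way to sketch the multi-threshold generalization is to map each size to the index of the interval it falls in, then compare indices. This is an illustration only: the function names are hypothetical, the thresholds are assumed to be sorted in ascending order, and a size equal to a threshold is placed in the upper interval to match the "≥ threshold" convention of the table.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def interval_index(size: float, thresholds: Sequence[float]) -> int:
    """Index of the interval containing size (0 = below the first threshold).

    Assumes thresholds is sorted ascending; size == threshold goes to the
    upper interval.
    """
    return bisect_right(thresholds, size)


def total_agreement_multi(reader_size: float, reference_size: float,
                          thresholds: Sequence[float]) -> int:
    """1 if reader and reference fall in the same interval (diagonal), else 0."""
    same = (interval_index(reader_size, thresholds)
            == interval_index(reference_size, thresholds))
    return 1 if same else 0


def agreement_given_reference_row(reader_size: float, reference_size: float,
                                  thresholds: Sequence[float],
                                  row: int) -> Optional[int]:
    """Agreement restricted to cases whose reference falls in interval `row`.

    Cases in other rows are excluded from the analysis and return None.
    """
    if interval_index(reference_size, thresholds) != row:
        return None
    return 1 if interval_index(reader_size, thresholds) == row else 0


# Hypothetical thresholds defining three size intervals.
thresholds = [5.0, 10.0]
print(total_agreement_multi(6.0, 9.0, thresholds))
print(agreement_given_reference_row(12.0, 7.0, thresholds, row=1))
```
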

If the data being compared are ordinal or qualitative, a similar table can be 
constructed and binary outcomes can be determined. For example, qualitative 
data may be one of three disease types. Total agreement could be determined by 
assigning a case a "one" when the reader and the reference choose the same 
disease type and a "zero" when they choose different disease types.
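
For qualitative data, total agreement reduces to an equality check per reader-case pair. The sketch below assumes reader results are keyed by (reader, case) and the reference by case; the function names, reader and case IDs, and disease-type labels are all made up for illustration and do not reflect the iMRMC-binary input format.

```python
from typing import Dict, Tuple


def same_category(reader_label: str, reference_label: str) -> int:
    """Return 1 if reader and reference chose the same category, else 0."""
    return 1 if reader_label == reference_label else 0


def binarize_study(reader_labels: Dict[Tuple[str, str], str],
                   reference_labels: Dict[str, str]
                   ) -> Dict[Tuple[str, str], int]:
    """Produce one binary outcome for each (reader, case) evaluation."""
    return {
        (reader, case): same_category(label, reference_labels[case])
        for (reader, case), label in reader_labels.items()
    }


# Hypothetical study: two readers, two cases, three possible disease types.
reference = {"case1": "typeA", "case2": "typeC"}
readings = {
    ("reader1", "case1"): "typeA",
    ("reader1", "case2"): "typeB",
    ("reader2", "case1"): "typeB",
    ("reader2", "case2"): "typeC",
}
outcomes = binarize_study(readings, reference)
print(outcomes)
```
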

The most important issue to consider when binarizing data is that the measure 
of agreement and the rules for dichotomizing the data make sense to clinicians, 
you, your collaborators, and your audience.

BTW, for ordinal data, we are investigating concordance measures for 
implementation in iMRMC. These would allow you to have multi-level truth 
instead of binary truth. AUC is a special case of a conditional concordance 
measure. Other common concordance measures are Kendall’s tau-a and tau-b. 
Please refer to the following for more information.
Kim, J.-O. (1971), 'Predictive Measures of Ordinal Association.' Am J Sociol, 
76, (5), 891-907.
Smith, W. D.; Dutton, R. C. & Smith, N. T. (1996), 'A Measure of Association 
for Assessing Prediction Accuracy That is a Generalization of Non-Parametric 
ROC Area.' Stat Med, 15, (11), 1199-1215.
Kendall, M. G. (1938), 'A New Measure of Rank Correlation.' Biometrika, 30, 
(1/2), 81-93.
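
To make the idea of a concordance measure concrete, here is a small sketch of Kendall's tau-a: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. It is only an illustration of the textbook definition, not the measure or implementation planned for iMRMC, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / (n * (n - 1) / 2).

    Tied pairs count as neither concordant nor discordant, but they still
    contribute to the denominator.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical multi-level truth and reader scores for four cases.
truth = [1, 2, 3, 4]
scores = [10, 30, 20, 40]
print(kendall_tau_a(truth, scores))
```
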

Original comment by Brandon.Gallas on 28 Nov 2014 at 4:01

  • Changed state: Answered
  • Added labels: Type-Question
  • Removed labels: Type

@GoogleCodeExporter

Here is the link to the paper discussing the iMRMC-binary software
http://imrmc.googlecode.com/svn/standalone_application/docs/Chen2014_J-Med-Img_accepted.pdf

Here is the table above formatted a little better:
                              Study reader         Study reader
                              size < threshold     size ≥ threshold
Reference size < threshold    A                    B
Reference size ≥ threshold    C                    D

Original comment by Brandon.Gallas on 28 Nov 2014 at 4:09
