
Not fully crossed study - BDG warning #185

Open
JessieGommers opened this issue Oct 9, 2024 · 3 comments

@JessieGommers

We conducted a reader study with two different reading conditions using two datasets, each containing 30 exams with a 1:1 ratio of malignant to normal cases. Each of the 37 readers participated in a single reading session, reviewing both datasets: one with condition 1 and the other with condition 2. Due to logistical constraints, our study design is not fully crossed. We know that we pay a statistical price for this, but hope that using 37 readers mitigates the loss.

[image]

We conducted iMRMC analyses using the Java iMRMC software for AUC, sensitivity, and specificity, but encountered warnings with the BDG method stating that the DF_BDG is below a minimum and has been set to 29.0.

e.g. for AUC:
[image]

e.g. for specificity:
[image]

This warning does not appear when we use the MLE analysis. We observed that the p-values of the BDG and MLE estimates differ, most notably for specificity, which turned out to be significantly different between the two conditions when using BDG (p=0.0003, with warning) but not when using MLE (p=0.204).

We are uncertain which method would be more appropriate for our study. I understand that MLE can avoid a negative total variance estimate. However, the total variance estimate from the BDG method does not appear to be negative in our case. I would greatly appreciate your guidance on the best approach for our context.

@brandon-gallas
Member

I think I understand your application and question.

It looks like you are using the Java GUI. This is a static piece of software that is no longer being maintained. I recommend that you use the R package moving forward. You can find information here: iMRMC: Software to do Multi-reader Multi-case Statistical Analysis of Reader Studies | Center for Devices and Radiological Health (fda.gov)
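For reference, a minimal sketch of the R-package workflow, following the example in the CRAN documentation for `doIMRMC` (the simulation helpers here only stand in for real data, and the components returned may vary across package versions):

```r
library(iMRMC)

# Simulate a fully crossed two-modality MRMC data set purely to
# illustrate the workflow; substitute your own data frame in
# iMRMC format here.
config <- sim.gRoeMetz.config()
dFrame.imrmc <- sim.gRoeMetz(config)

# Run the MRMC analysis (U-statistic/BDG and MLE variance estimates)
result <- doIMRMC(dFrame.imrmc)

# Inspect the components returned by the analysis
str(result, max.level = 1)
```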

The warning about the degrees of freedom (DF) is not a problem. If the DF estimates fall below the lower bound, they are set to the lower bound. The DF estimates have uncertainty in them, especially where data are limited. In your case, the number of exams is 15+15+15+15 (2 case sets x 2 truths), which is small for ROC analysis. The lower bound of 29 (= 30 - 1) comes from the number of signal-present cases for sensitivity, signal-absent cases for specificity, or the minimum of these for ROC. It’s a bit of a waste of effort to have 37 readers evaluate the same small number of cases. Please see this paper on split-plot studies:

  • W. Chen, Q. Gong, and B. D. Gallas, “Paired split-plot designs of multireader multicase studies,” Journal of Medical Imaging, vol. 5, p. 031410, 2018, doi: 10.1117/1.JMI.5.3.031410.

There is certainly something funny about the specificity results. The DF_BDG for specificity is calculated as 0.93!!! That is not good … red flag. I wouldn't use any p-values from the software. Notice the DF_BDG is ~24 for AUC. That is healthy. My guess is that many readers are making the exact same interpretations on the signal-absent cases … little to no reader variability. I’m curious to know if this is true.
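One hypothetical way to check this in R, assuming a long-format data frame `df` with columns `readerID`, `caseID`, `score` (binary), and `truth` (0 = signal-absent); these column names are illustrative, not a required format:

```r
# Keep only the signal-absent cases
absent <- df[df$truth == 0, ]

# For each signal-absent case, the fraction of readers giving the
# majority score; values near 1 mean the readers agree almost
# perfectly, i.e., little to no reader variability on those cases.
agreement <- tapply(absent$score, absent$caseID, function(s) {
  max(table(s)) / length(s)
})
summary(agreement)
```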

Your issue is causing me to think I should return a different error if DF_BDG is below 3 or even 5.

Without any more of the output or input data, it is hard to give more of a response. p-values are only one kind of output; they can be misinterpreted or completely inappropriate. Point estimates and confidence intervals tell a much more complete story. I don’t have a solution for you except to refer to the per-reader results ... BUT it isn't entirely clear whether the cases in the two datasets/modalities are independent or just different reading conditions. If they only differ by the reading condition, they should carry the same case ID.

Finally, I would avoid using the MLE results. They are not validated when the study design is not fully crossed, and I’ve observed weird results in such cases. Your data is not fully crossed. Your question is nudging me to remove the MLE results completely from the current software.

@JessieGommers
Author

Thank you so much for your reply, Brandon Gallas.

Instead of using the Java GUI, we moved forward with the R package, and our results remain similar.

It is good to know that the warning about the degrees of freedom is not a problem, as there is uncertainty in the estimates. Given the extremely low DF_BDG for specificity, we checked whether readers were more often making the exact same binary interpretations on the signal-absent cases than on the signal-present cases. While there is notable reader agreement, this trend is present for both signal-absent and signal-present cases.

For testing purposes, we slightly modified our data to increase agreement, which resulted in marginally higher degrees of freedom (1.05) but triggered a negative-estimate warning. Conversely, we adjusted the data to reduce agreement, which led to higher degrees of freedom (10.38), though still relatively low.

To clarify, the two datasets consist of independent cases, with a total of 60 unique exams, each assigned a distinct case ID. The study design ensures that no reader reviews the same exam twice during their single reading session, while still allowing every reader to participate in both study modalities. The readers were distributed across the following combinations:
• Set 1, Modality A followed by Set 2, Modality B
• Set 1, Modality B followed by Set 2, Modality A
• Set 2, Modality A followed by Set 1, Modality B
• Set 2, Modality B followed by Set 1, Modality A
Set 1 consists of 30 exams (1:1 ratio of signal-present to signal-absent cases) with no overlap with Set 2, which also consists of 30 exams (1:1 ratio of signal-present to signal-absent cases). The four conditions were approximately equally distributed among the readers; a sketch of this assignment is given below.
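To make the assignment concrete, an illustrative reconstruction in R (reader and set labels are hypothetical):

```r
# The four session orderings described above, cycled approximately
# evenly over the 37 readers.
readers <- sprintf("reader%02d", 1:37)
designs <- data.frame(
  first  = c("Set1-A", "Set1-B", "Set2-A", "Set2-B"),
  second = c("Set2-B", "Set2-A", "Set1-B", "Set1-A")
)
assignment <- designs[rep(1:4, length.out = length(readers)), ]
assignment$readerID <- readers
head(assignment)

# Each reader reads all 60 cases exactly once and uses each modality
# on exactly one case set: a paired split-plot design.
```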

We are now considering reporting only point estimates and confidence intervals, omitting p-values.

Thanks,
Jessie

@brandon-gallas
Member

Good luck.

BTW, 10 degrees of freedom is much better than 1. Think about it like this. Would you be more confident in a variance estimate from 10 independent observations or just one observation?
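To make that intuition concrete, a tiny simulation sketch in R (recall that a variance estimate with k degrees of freedom comes from k + 1 independent observations):

```r
set.seed(1)
# Spread of sample variances of a standard normal at 10 DF vs. 1 DF
v10 <- replicate(1e4, var(rnorm(11)))  # 10 DF per estimate
v1  <- replicate(1e4, var(rnorm(2)))   #  1 DF per estimate
sd(v10)  # ~0.45: the estimate is fairly stable
sd(v1)   # ~1.41: a 1-DF variance estimate is extremely noisy
```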

Thanks for the feedback.
