field | value |
---|---|
title | Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data |
abstract | This work introduces a novel approach to model regularization and explanation in Vision Transformers (ViTs), particularly beneficial for small-scale but high-dimensional data regimes such as healthcare. We introduce stochastic embedded feature selection in the context of echocardiography video analysis, focusing specifically on the EchoNet-Dynamic dataset and the prediction of left ventricular ejection fraction (LVEF). Our proposed method, termed G-ViT, augments the Video Vision Transformer (ViViT), a performant transformer architecture for videos, with Concrete Autoencoders (CAEs), a common dataset-level feature selection technique, to enhance ViViT's generalization and interpretability. The key contribution lies in stochastic token selection performed individually for each video frame during training. This token selection regularizes the training of ViViT, improves its interpretability, and is achieved by differentiable sampling of categorical variables using the Gumbel-Softmax distribution. Our experiments on EchoNet-Dynamic demonstrate a consistent and notable regularization effect. The G-ViT model outperforms both a random-selection baseline and the standard ViViT. G-ViT is also compared against recent works on EchoNet-Dynamic, where it exhibits state-of-the-art performance among end-to-end learned methods. Finally, we explore model explainability by visualizing selected patches, providing insight into how G-ViT utilizes regions known to be crucial for LVEF prediction by humans. The proposed approach therefore extends beyond regularization, offering enhanced interpretability for ViTs. |
year | 2024 |
volume | 248 |
publisher | PMLR |
series | Proceedings of Machine Learning Research |
software | |
layout | inproceedings |
issn | 2640-3498 |
id | nilsson24a |
month | 0 |
tex_title | Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data |
firstpage | 155 |
lastpage | 168 |
page | 155-168 |
order | 155 |
cycles | false |
bibtex_author | Nilsson, Alfred and Azizpour, Hossein |
author | |
date | 2024-07-24 |
address | |
container-title | Proceedings of the fifth Conference on Health, Inference, and Learning |
genre | inproceedings |
issued | |
extras | |
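The abstract describes per-frame stochastic token selection made differentiable through the Gumbel-Softmax relaxation, but gives no implementation detail. The PyTorch sketch below is a minimal illustration of that general idea, not the authors' code: the module name `GumbelPatchSelector`, the shared learnable selection logits, and the soft-training versus hard-evaluation split are assumptions made here purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelPatchSelector(nn.Module):
    """Illustrative per-frame stochastic patch (token) selection.

    k categorical distributions over the n patch tokens of each frame are
    relaxed with the Gumbel-Softmax so that selection stays differentiable
    during training; selection is made hard (argmax) at evaluation time.
    """

    def __init__(self, num_patches: int, num_selected: int, tau: float = 1.0):
        super().__init__()
        # One learnable logit vector per selected slot (shared across frames here;
        # an input-conditioned, per-frame scorer is an equally plausible variant).
        self.logits = nn.Parameter(torch.zeros(num_selected, num_patches))
        self.tau = tau

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, num_patches, dim)
        b, t = tokens.shape[:2]
        if self.training:
            # Relaxed (soft) one-hot selection weights, sampled independently
            # for every frame in every clip of the batch.
            logits = self.logits.expand(b, t, -1, -1)           # (b, t, k, n)
            weights = F.gumbel_softmax(logits, tau=self.tau, hard=False, dim=-1)
        else:
            # Deterministic hard selection at evaluation time.
            weights = F.one_hot(self.logits.argmax(dim=-1),
                                num_classes=self.logits.shape[-1]).to(tokens.dtype)
            weights = weights.expand(b, t, -1, -1)              # (b, t, k, n)
        # Weighted combination keeps (approximately) k tokens per frame.
        return torch.einsum("btkn,btnd->btkd", weights, tokens)
```

As a usage example under these assumptions, for clips of 16 frames with 196 patch tokens of dimension 768, `GumbelPatchSelector(196, 32)` maps a token grid of shape (batch, 16, 196, 768) to (batch, 16, 32, 768) while keeping the selection step differentiable with respect to the selection logits.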