This repository hosts our PyTorch implementation of 3D-Jointsformer, a novel approach for real-time hand gesture recognition in video sequences. Traditional methods struggle to model temporal dependencies while maintaining real-time performance. To address this, we propose a hybrid approach combining 3D-CNNs and Transformers. Our method uses a 3D-CNN to compute high-level semantic skeleton embeddings that capture local spatial and temporal characteristics; a Transformer network with self-attention then efficiently captures long-range temporal dependencies. Evaluation on the Briareo and Multi-Modal Hand Gesture datasets yielded accuracy scores of 95.49% and 97.25%, respectively. Importantly, our approach achieves real-time performance on standard CPUs, distinguishing it from GPU-dependent methods. The hybrid 3D-CNN and Transformer approach outperforms existing methods in both accuracy and speed, effectively addressing real-time recognition challenges.
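For intuition, below is a minimal PyTorch sketch of the hybrid idea: a 3D-CNN builds local spatio-temporal skeleton embeddings, and a Transformer encoder attends over the resulting frame tokens. All layer sizes, kernel shapes, and the class count are illustrative assumptions, not the exact 3D-Jointsformer configuration (see configs/transformer/ for the real model).

import torch
import torch.nn as nn

class Hybrid3DCNNTransformer(nn.Module):
    """Sketch only: sizes are illustrative, not the published model."""

    def __init__(self, in_channels=3, embed_dim=64, num_heads=4,
                 num_layers=2, num_classes=12):
        super().__init__()
        # 3D convolution over (frames, joints) to build local
        # spatio-temporal skeleton embeddings.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(in_channels, embed_dim,
                      kernel_size=(3, 3, 1), padding=(1, 1, 0)),
            nn.BatchNorm3d(embed_dim),
            nn.ReLU(inplace=True),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (batch, channels, frames, joints, 1) skeleton clip
        feat = self.cnn3d(x)                 # (B, C, T, V, 1)
        feat = feat.mean(dim=3).squeeze(-1)  # pool joints -> (B, C, T)
        feat = feat.transpose(1, 2)          # frame tokens: (B, T, C)
        feat = self.transformer(feat)        # long-range temporal attention
        return self.head(feat.mean(dim=1))   # clip-level logits

# Example: batch of 2 clips, 3 coords, 32 frames, 21 hand joints
logits = Hybrid3DCNNTransformer()(torch.randn(2, 3, 32, 21, 1))
print(logits.shape)  # torch.Size([2, 12])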
conda create -n 3DJointsformer python=3.9 -y
conda activate 3DJointsformer
conda install pytorch=1.11.0 torchvision=0.12.0 cudatoolkit=11.3 -c pytorch -y
pip install 'mmcv-full==1.5.0' -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.11.0/index.html
pip install mmaction2 # tested mmaction2 v0.24.0
In this work, we have tested the proposed model on two datasets: Briareo and the Multi-Modal Hand Gesture Dataset. The hand keypoints are obtained with MediaPipe; we have also included code to generate these hand keypoints (see data_preprocessing and the sketch below).
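As a simplified stand-in for that preprocessing step, the following sketch uses the MediaPipe Hands solution to turn a video into a (frames, 21, 3) array of normalized keypoints. The function name, frame cap, and padding strategy for missed detections are illustrative assumptions; the repository's actual pipeline lives in data_preprocessing.

import cv2
import mediapipe as mp
import numpy as np

def extract_hand_keypoints(video_path, max_frames=32):
    """Sketch: returns a (frames, 21, 3) array of (x, y, z) landmarks."""
    hands = mp.solutions.hands.Hands(
        static_image_mode=False, max_num_hands=1,
        min_detection_confidence=0.5)
    cap = cv2.VideoCapture(video_path)
    keypoints = []
    while cap.isOpened() and len(keypoints) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            lm = result.multi_hand_landmarks[0].landmark
            keypoints.append([(p.x, p.y, p.z) for p in lm])
        else:
            # No detection: repeat the last pose (zeros for the first frame).
            keypoints.append(keypoints[-1] if keypoints
                             else [(0.0, 0.0, 0.0)] * 21)
    cap.release()
    hands.close()
    return np.asarray(keypoints, dtype=np.float32)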
You can use the following command to train a model.
./tools/run.sh ${CONFIG_FILE} ${GPU_IDS} ${SEED}
Example: train the model on the joint data of the Briareo dataset using 2 GPUs with seed 0.
./tools/run.sh configs/transformer/jointsformer3d_briareo.py 0,1 0
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: inference on the joint data of the Briareo dataset.
python tools/test.py configs/transformer/jointsformer3d_briareo.py \
work_dirs/jointsformer3d/best_top1_acc_epoch_475.pth \
--eval top_k_accuracy --cfg-options "gpu_ids=[0]"
If you find this project useful, please consider citing our paper.
@Article{s23167066,
AUTHOR = {Zhong, Enmin and del-Blanco, Carlos R. and Berjón, Daniel and Jaureguizar, Fernando and García, Narciso},
TITLE = {Real-Time Monocular Skeleton-Based Hand Gesture Recognition Using 3D-Jointsformer},
JOURNAL = {Sensors},
VOLUME = {23},
YEAR = {2023},
NUMBER = {16},
ARTICLE-NUMBER = {7066},
URL = {https://www.mdpi.com/1424-8220/23/16/7066},
PubMedID = {37631602},
ISSN = {1424-8220},
DOI = {10.3390/s23167066}
}
Our code is based on SkelAct, MMAction2, and SlowFast. Sincere thanks to the authors for their wonderful work.
This project is released under the Apache 2.0 license.