Independent Sign Language Recognition with 3D Body, Hands, and Face Reconstruction

Sign language recognition is a crucial task for the sign language community. However, it is difficult because three different channels of information (face, hands, and body) must be combined.

Unfortunately, while current techniques handle each of these channels well on its own, no adequate method for recognizing sign language from all three channels together has been developed.

Hence, a recent study suggests employing SMPL-X, a parametric body model that enables 3D body shape, face, and hand information to be extracted jointly from a single RGB image. Body, hand, and facial 2D keypoints are detected first, and the model is then fitted so that its projected 3D joints agree with the detected 2D locations.
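The idea of making projected 3D joints agree with detected 2D keypoints can be illustrated with a simple reprojection error. The following is a minimal sketch, not the study's fitting procedure: the pinhole camera, the function names, the focal length, and the toy joint positions are all assumptions chosen for illustration, and a real fitting pipeline would minimize such a term over the model parameters alongside additional priors.

```python
import numpy as np

def project_pinhole(joints_3d, focal_length, principal_point):
    """Project (N, 3) camera-space joints to (N, 2) pixel coordinates.

    Assumes a simple pinhole camera (an illustrative choice).
    """
    x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    u = focal_length * x / z + principal_point[0]
    v = focal_length * y / z + principal_point[1]
    return np.stack([u, v], axis=1)

def reprojection_error(joints_3d, keypoints_2d, confidence,
                       focal_length, principal_point):
    """Confidence-weighted sum of squared pixel distances between
    projected 3D joints and detected 2D keypoints."""
    projected = project_pinhole(joints_3d, focal_length, principal_point)
    sq_dist = np.sum((projected - keypoints_2d) ** 2, axis=1)
    return float(np.sum(confidence * sq_dist))

# Toy data: two hypothetical joints, 2 m in front of the camera.
focal = 1000.0
center = (320.0, 240.0)
joints = np.array([[0.0, 0.0, 2.0],
                   [0.5, -0.25, 2.0]])

# Pretend the 2D detector found exactly the projections of these joints.
detected = project_pinhole(joints, focal, center)
weights = np.ones(len(joints))

# Joints that match the detections give zero error ...
error_perfect = reprojection_error(joints, detected, weights, focal, center)

# ... while joints shifted 10 cm sideways are penalized.
shifted = joints + np.array([0.1, 0.0, 0.0])
error_shifted = reprojection_error(shifted, detected, weights, focal, center)
```

In this toy setup the shifted joints land 50 pixels away from each detection, so the error grows from zero to a large positive value. An optimizer would adjust the pose parameters to push that value back down.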

The sequence of SMPL-X parameters across the frames of a sign is then used as input to a recurrent neural network that classifies the sign. This approach achieved higher accuracy than both an I3D-type network fed raw RGB images and their optical flow and a recurrent neural network fed 2D OpenPose skeletons. It was also shown that omitting any of the three channels reduces the accuracy of the model significantly.
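To make the classification step concrete, here is a minimal sketch of a recurrent network consuming one parameter vector per frame and emitting a distribution over signs. It is only an illustration under stated assumptions: the plain tanh recurrence, the random untrained weights, and the sizes `T`, `D`, `H`, and `C` are all hypothetical and are not taken from the study.

```python
import numpy as np

def rnn_classify(frames, W_in, W_rec, b_h, W_out, b_out):
    """Run a plain tanh RNN over per-frame feature vectors.

    frames: (T, D) array, one parameter vector per video frame.
    Returns a (C,) array of class probabilities.
    """
    h = np.zeros(W_rec.shape[0])
    for x_t in frames:
        # Update the hidden state from the current frame and the previous state.
        h = np.tanh(W_in @ x_t + W_rec @ h + b_h)
    # Classify the whole sign from the final hidden state.
    logits = W_out @ h + b_out
    logits = logits - logits.max()  # numerical stability for softmax
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)

# Hypothetical sizes: frames per sign, features per frame,
# hidden units, and number of sign classes.
T, D, H, C = 30, 85, 64, 10

# Random stand-in for a sequence of per-frame body model parameters.
frames = rng.normal(size=(T, D))

# Untrained random weights; a real model would learn these from labeled signs.
W_in = rng.normal(scale=0.1, size=(H, D))
W_rec = rng.normal(scale=0.1, size=(H, H))
b_h = np.zeros(H)
W_out = rng.normal(scale=0.1, size=(C, H))
b_out = np.zeros(C)

probs = rnn_classify(frames, W_in, W_rec, b_h, W_out, b_out)
predicted_sign = int(np.argmax(probs))
```

Dropping a channel, as in the ablation described above, would correspond to removing the matching entries from each frame vector before it reaches the network.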

Independent Sign Language Recognition is a complex visual recognition problem that combines several challenging tasks of Computer Vision due to the necessity to exploit and fuse information from hand gestures, body features and facial expressions. While many state-of-the-art works have managed to deeply elaborate on these features independently, to the best of our knowledge, no work has adequately combined all three information channels to efficiently recognize Sign Language. In this work, we employ SMPL-X, a contemporary parametric model that enables joint extraction of 3D body shape, face and hands information from a single image. We use this holistic 3D reconstruction for SLR, demonstrating that it leads to higher accuracy than recognition from raw RGB images and their optical flow fed into the state-of-the-art I3D-type network for 3D action recognition and from 2D OpenPose skeletons fed into a Recurrent Neural Network. Finally, a set of experiments on the body, face and hand features showed that neglecting any of these significantly reduces the classification accuracy, proving the importance of jointly modeling body shape, facial expression and hand pose for Sign Language Recognition.

Link: https://arxiv.org/abs/2012.05698

