Data preprocessing
Two data preprocessing solutions were used for the SLR tasks. The first was MediaPipe Holistic, which combines three components to provide real-time detection and tracking of hand, body-pose, and facial landmarks. MediaPipe Hands detects and tracks the hands, providing 21 3D landmarks per hand. MediaPipe Pose estimates the body pose, providing 33 3D landmarks, each with a visibility score, covering the key body joints. MediaPipe Face Mesh detects and tracks facial landmarks, providing 468 3D mesh points that represent the geometric structure of the face.
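To make the per-frame output concrete, the sketch below flattens the Holistic landmarks into one fixed-length feature vector per frame. It is a minimal sketch assuming the standard MediaPipe Python solution API; the helper names extract_keypoints and video_to_keypoints are illustrative, and the 468 face-mesh points are left out here to match the feature selection used in the experiment described later.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten Holistic landmarks into one fixed-length vector per frame.

    Pose: 33 landmarks x (x, y, z, visibility); each hand: 21 landmarks x (x, y, z).
    Missing detections are zero-filled so every frame yields 33*4 + 2*21*3 = 258 values.
    """
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    left = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[lm.x, lm.y, lm.z]
                       for lm in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, left, right])

def video_to_keypoints(path):
    """Run MediaPipe Holistic over a video file and return a (num_frames, 258) array."""
    vectors = []
    cap = cv2.VideoCapture(path)
    with mp_holistic.Holistic(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads BGR
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            vectors.append(extract_keypoints(results))
    cap.release()
    return np.array(vectors)
```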
The second solution comprised frame sampling strategies for the sign language video data. Videos in the WLASL dataset vary in length, and this inconsistency in frame count posed challenges for model training. The project therefore developed several algorithms to standardize the number of frames used for training; the same sampling strategy is currently applied to both models, and future work could explore different sampling strategies within the same model. Together, these preprocessing steps could help improve the accuracy and effectiveness of SLR models.
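As one example of such a strategy, the sketch below uniformly subsamples long clips and pads short ones to a fixed length. The target length of 30 frames and the pad-with-last-frame choice are illustrative assumptions, not necessarily the project's exact algorithm.

```python
import numpy as np

def sample_frames(keypoint_seq, target_len=30):
    """Standardize a (num_frames, feat_dim) keypoint sequence to exactly target_len frames."""
    n = len(keypoint_seq)
    if n >= target_len:
        # uniformly spaced frame indices over the whole clip
        idx = np.linspace(0, n - 1, target_len).astype(int)
        return keypoint_seq[idx]
    # short clip: pad by repeating the last frame (one possible choice)
    pad = np.repeat(keypoint_seq[-1:], target_len - n, axis=0)
    return np.concatenate([keypoint_seq, pad], axis=0)
```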
Transformer with Self-Attention Model for CSLR
The Transformer with Self-Attention model was a neural network that used self-attention to focus on different parts of the input sequence without prior knowledge of the relationships between its elements. In SLR using MediaPipe key points, each frame of the continuous sequence could be represented as a set of key points, each corresponding to a specific joint or feature of the signer's body pose and hands. The Transformer with Self-Attention could be trained on this input sequence to recognize different sign language gestures or phrases. During inference, the trained model predicted the sign language gesture or phrase for each new input sequence of key points in real time. The self-attention mechanism allowed the model to adapt to different sign language styles and variations while remaining robust to noise and other types of input variability.
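The sketch below illustrates how such a model can be built over the per-frame keypoint vectors. It is a minimal single-block encoder, assuming a TensorFlow/Keras implementation; the sequence length of 30, feature dimension of 258, ten output classes, and the model widths are assumptions for illustration, and a positional encoding is deliberately omitted to keep the sketch short.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_classifier(seq_len=30, feat_dim=258, num_classes=10,
                           d_model=128, num_heads=4, ff_dim=256):
    """Single-block Transformer encoder over a sequence of per-frame keypoint vectors."""
    inputs = layers.Input(shape=(seq_len, feat_dim))
    x = layers.Dense(d_model)(inputs)            # project raw keypoints to the model width
    # NOTE: a positional encoding would normally be added here; omitted for brevity.
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=d_model // num_heads)(x, x)
    x = layers.LayerNormalization()(x + attn)    # residual connection + layer norm
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    x = layers.LayerNormalization()(x + ff)
    x = layers.GlobalAveragePooling1D()(x)       # pool attention features over time
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```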
Experiment
The experiment was conducted on the Word-Level American Sign Language (WLASL) dataset for sign language recognition (SLR). Initially, a custom dataset of three categories that included key facial points was used; later, the first ten classes of the WLASL dataset were used and facial key points were excluded from the feature selection. The project highlighted the difficulty of accurately capturing the impact of facial expressions on SLR and the need to focus on the most critical aspects of sign language gestures.
Two models, an LSTM and a Transformer with Self-Attention, were used for the task of continuous sign language recognition (CSLR). The models were trained and evaluated on benchmark datasets for CSLR, namely the self-recorded three-class dataset and the WLASL dataset. The project emphasized the importance of evaluating the models' stability during the training phase and their robustness to noise. These findings could help improve the accuracy and effectiveness of SLR models and their evaluation on benchmark datasets.
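For completeness, the sketch below shows the kind of LSTM baseline that could serve as the second model over the same keypoint sequences, again assuming a TensorFlow/Keras implementation; the layer sizes, sequence length, and ten-class output are illustrative assumptions rather than the project's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def lstm_classifier(seq_len=30, feat_dim=258, num_classes=10):
    """Stacked-LSTM baseline over the same per-frame keypoint sequences."""
    model = tf.keras.Sequential([
        layers.Input(shape=(seq_len, feat_dim)),
        layers.LSTM(64, return_sequences=True),   # frame-level temporal features
        layers.LSTM(128),                         # sequence-level summary
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```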