This report describes the design of an online application service that accepts human skeletal key points of a sign video and returns the label of the sign in a JSON response. The document covers the extraction of key points from the videos using TensorFlow's PoseNet library and four different machine learning models that classify American Sign Language gestures into six different signs: {buy, fun, hope, really, communicate, mother}. It also covers hosting the service as a Flask API on PythonAnywhere and the steps involved in handling HTTP requests coming from different users.
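The following is a minimal sketch of the service interface described above: a Flask endpoint that accepts key points as JSON and returns the predicted sign label. The route name, payload shape, and the `sign_classifier.pkl` model file are illustrative assumptions, not the report's exact implementation.

```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
# Hypothetical pre-trained classifier loaded once at startup.
model = joblib.load("sign_classifier.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # Assumed payload shape: {"keypoints": [x1, y1, x2, y2, ...]}
    keypoints = request.get_json()["keypoints"]
    label = model.predict([keypoints])[0]
    return jsonify({"label": str(label)})
```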
First, we accumulated all the raw video data sets recorded as part of Assignment-1 and extracted frames from the relevant portion of each video's timeline. Then, we used TensorFlow's PoseNet library to extract key points from these frames, which serve as the training data for the models. We tried three different approaches to preprocessing the data and picked the one that gave the best accuracy for the trained models.
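As a sketch of this step, the snippet below flattens one frame of PoseNet output into a numeric row. It assumes each frame was exported as JSON in PoseNet's usual shape, `{"keypoints": [{"part": "nose", "position": {"x": ..., "y": ...}}, ...]}`; the file name and helper are illustrative, not the project's exact pipeline.

```python
import json
import numpy as np

PARTS = ["nose", "leftShoulder", "rightShoulder",
         "leftElbow", "rightElbow", "leftWrist", "rightWrist"]

def frame_to_row(frame_json):
    """Flatten one frame's keypoints into [x1, y1, x2, y2, ...]."""
    pos = {kp["part"]: kp["position"] for kp in frame_json["keypoints"]}
    return np.array([coord for p in PARTS
                     for coord in (pos[p]["x"], pos[p]["y"])])

with open("video_frame_000.json") as f:   # hypothetical file name
    row = frame_to_row(json.load(f))
```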
Approach 1: We scaled down the raw data using the Universal Normalization technique, extracted features such as standard deviation, moving mean with a window size of 5, zero crossing rate, and dynamic time warping distance, and built a feature matrix. We then applied PCA to the feature matrix and, using K-fold cross-validation, trained four models: a convolutional neural network, k-nearest neighbors, a support vector machine, and a random forest. The average accuracy of these models lay between 60% and 65%.
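A sketch of this PCA-plus-cross-validation loop is shown below, using KNN as the example model. The component count, fold count, and placeholder data are illustrative assumptions, not the report's tuned values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(120, 40)          # placeholder feature matrix
y = np.random.randint(0, 6, 120)     # placeholder labels for the 6 signs

# Reduce the feature matrix, then evaluate with 5-fold cross-validation.
X_reduced = PCA(n_components=10).fit_transform(X)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X_reduced, y, cv=cv)
print("mean accuracy:", scores.mean())
```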
Approach 2: As part of the second approach, we eliminated some features by observing the movement of each body part in the videos for different signs and built a feature matrix of only the important features. We then applied StandardScaler and MinMaxScaler to normalize the data and trained our models as in the first approach.
This increased the average accuracy of the models by about 10%.
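The normalization step might look like the sketch below, chaining both scalers as the text describes; whether they were applied in this order is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.random.rand(120, 14)                      # placeholder feature matrix
X_std = StandardScaler().fit_transform(X)        # zero mean, unit variance
X_scaled = MinMaxScaler().fit_transform(X_std)   # rescale to [0, 1]
```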
Approach 3: We observed in the second approach that our models were only considering the absolute coordinates of each body part, so we subtracted each body part's coordinates from those of a static body part and processed the data in the same manner. With this approach we obtained our highest average accuracy, between 85% and 90%.
INITIAL FEATURE EXTRACTION:
- Zero Crossing Rate
- Moving Average Window
- Standard Deviation
- Dynamic Time Warping Distance
Zero Crossing Rate: The zero crossing rate is the rate of sign changes along a signal, i.e., the rate at which the signal changes from positive, through zero, to negative or vice versa. The zero crossing rate can also be used as a primitive pitch detection algorithm in signal processing.
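A common way to compute this for a 1-D signal is sketched below; this is the textbook definition, not necessarily the project's exact formula.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of consecutive samples whose signs differ."""
    signs = np.sign(signal)          # note: exact zeros count as their own sign
    return np.mean(signs[:-1] != signs[1:])
```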
Moving Average Window: A moving average is optimal for reducing random noise while retaining a sharp step response, which makes it a premier filter for time-domain encoded signals.
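The moving mean with window size 5 used in the feature set can be computed with a uniform convolution kernel, as in this sketch:

```python
import numpy as np

def moving_mean(signal, window=5):
    """Average each sample with its neighbors over a sliding window."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="valid")
```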
Standard Deviation: The standard deviation is a measure of how far a signal fluctuates from its mean; it depicts how the data disperse around the mean of a particular data series.
Dynamic Time Warping Distance: DTW measures the similarity between two temporal data series. Any linear sequence data can be analyzed with DTW; it aligns two sequences of feature vectors by warping the time axis iteratively until an optimal match between the two sequences is found.
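Below is a textbook dynamic-programming implementation of the DTW distance between two 1-D sequences, included as a sketch of the concept rather than the project's exact code.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between sequences a and b via dynamic programming."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Warp the time axis: take the cheapest of the three moves.
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```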
Feature Engineering:
We eliminated a few features by observing the movement of each body part for different signs and built a feature matrix with only the important features. Below is the list of features we considered for training the models.
["nose_x", "nose_y", "leftShoulder_x", "leftShoulder_y", "rightShoulder_x", "rightShoulder_y", "leftElbow_x", "leftElbow_y", "rightElbow_x", "rightElbow_y", "leftWrist_x", "leftWrist_y", "rightWrist_x", "rightWrist_y"]
Here, we observed that certain body parts hold a roughly static position over time, so we subtracted the static body part's coordinate values from each body part's coordinates. We chose the "nose" as the static body part and subtracted its X and Y coordinates from the corresponding X and Y coordinates of every other body part.
This makes it simpler for the models to learn the movement of each body part: since every body part now has a position relative to the nose, the model can more easily recognize certain gestures by examining whether the coordinates fall on the positive or negative side.
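A sketch of this nose-relative transformation is shown below, assuming the feature matrix is a pandas DataFrame with the columns listed above; the helper name is illustrative.

```python
import pandas as pd

def make_relative(df: pd.DataFrame) -> pd.DataFrame:
    """Subtract the nose coordinates from every other body part's coordinates."""
    out = df.copy()
    for col in df.columns:
        if col.startswith("nose"):
            continue
        axis = col.rsplit("_", 1)[1]            # "x" or "y"
        out[col] = df[col] - df[f"nose_{axis}"]
    return out
```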
- K Nearest Neighbor
- Support Vector Machine
- Convolutional Neural Network
- Random Forest
K Nearest Neighbor: The K Nearest Neighbor classifier is one of the simplest machine learning algorithms; it relies on the distance between feature vectors. It classifies unknown data by finding the most common class among the k nearest examples and assigning that majority class label to the unknown data. As KNN is a lazy learning algorithm, it defers all computation to prediction time, and it works well when the data set is distributed across multiple classes.
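A minimal KNN sketch on placeholder keypoint features follows; k=5 and the random split are illustrative choices, not the report's tuned values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(120, 14)           # placeholder keypoint features
y = np.random.randint(0, 6, 120)      # placeholder labels for the 6 signs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k=5 is illustrative
knn.fit(X_train, y_train)                   # lazy learner: just stores the data
print("accuracy:", knn.score(X_test, y_test))
```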
Support Vector Machine: The core idea of the Support Vector Machine is to find a hyperplane that separates two sets of objects belonging to different classes. It uses a technique called the kernel trick to transform the data and, based on this transformation, finds an optimal boundary. It is considered one of the most robust and accurate algorithms among classifiers.
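To illustrate the kernel trick, the sketch below fits an SVM with an RBF kernel, reusing the train/test split from the KNN sketch above; the kernel choice and hyperparameters are assumptions, not the report's settings.

```python
from sklearn.svm import SVC

# RBF kernel implicitly maps the features into a higher-dimensional space.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print("accuracy:", svm.score(X_test, y_test))
```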
Random Forest: Random Forest is an ensemble classifier; it takes multiple individual models and combines them into a more powerful aggregate model. Any individual model may overfit to some part of the data set, so by combining several of them we reduce the chance of error. A random forest is built by aggregating n decision trees, each generated from a random sample of the data set's rows. As the data set grows, the number of possible random decision trees grows as well, and aggregating these different decision trees increases the accuracy of the aggregate model.
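A corresponding sketch, again reusing the split above; 100 trees is an illustrative default rather than the report's tuned value.

```python
from sklearn.ensemble import RandomForestClassifier

# Each of the 100 trees is trained on a bootstrap sample of the rows.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("accuracy:", rf.score(X_test, y_test))
```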