Sensors (Basel). 2015 Jan; 15(1): 135–147.
Sign Language Recognition with the Kinect Sensor Based on Conditional Random Fields
Received 2014 Sep 4; Accepted 2014 Dec 19.
Abstract
Sign language is a visual language used by deaf people. One difficulty of sign language recognition is that sign instances vary in both motion and shape in three-dimensional (3D) space. In this research, we use 3D depth data of hand motions, generated by Microsoft's Kinect sensor, and employ a hierarchical conditional random field (CRF) that recognizes hand signs from the hand motions. The proposed method uses a hierarchical CRF to detect candidate segments of signs using hand motions, and then a BoostMap embedding method to verify the hand shapes of the segmented signs. Experiments demonstrated that the proposed method could recognize signs from signed sentence data at a rate of 90.4%.
Keywords: sign language recognition, conditional random field, BoostMap embedding
1. Introduction
Sign language is a visual language used by deaf people, which consists of two types of action: signs and finger spellings. Signs are dynamic gestures characterized by continuous hand motions and hand configurations, while finger spellings are static postures discriminated by a combination of continuous hand configurations [1–3]. The term "gesture" means that the character is performed with hand motions, while "posture" refers to a character that can be described with a static hand configuration [4]. Sign language recognition has been researched using various input devices, such as color cameras, stereo cameras, data gloves, Microsoft's Kinect sensor, time-of-flight (TOF) cameras, etc. [5]. Although data glove-based sign language recognition systems have achieved better performance than other systems, data gloves are too expensive and too uncomfortable to use, which limits their popularity.
Several approaches to sign language recognition acquire information from range sensors such as TOF cameras or the Kinect, which was developed to interact with video games as a means for full-body tracking of body movements and gestures [6]. Many researchers have developed applications with gesture and sign language recognition systems using these sensors, such as interactive displays [7], physical rehabilitation [8], robot guidance [9,10], gesture recognition [11], sign language recognition [12,13], hand gesture recognition [14], etc.
Depth information-based sign language recognition has become more widespread because of improved interactivity, user comfort, and the development of consumer-priced depth sensors, such as Microsoft's Kinect [5]. Depth information-based approaches are generally more accurate and can recognize a wider vocabulary than color or 2D-based approaches.
Numerous studies have attempted to utilize the Microsoft Kinect to identify hand gestures. Zafrulla et al. investigated the potential of the Kinect depth-mapping camera for sign language recognition [12]. They collected a total of 1000 American Sign Language (ASL) phrases and used a hidden Markov model (HMM) to recognize the signed phrases. Ren et al. researched a robust hand gesture recognition system using a Kinect [5]. They proposed a modified Finger-Earth Mover's Distance metric (FEMD) in order to distinguish noisy hand shapes obtained from the Kinect sensor. They achieved a 93.2% mean accuracy on a 10-gesture dataset.
Chai et al. proposed a sign language recognition and translation system based on 3D trajectory matching algorithms in order to connect the hearing-impaired community with non-hearing-impaired people [13]. They extracted 3D trajectories of hand motions using the Kinect, and collected a total of 239 Chinese sign language words to validate the performance of the proposed system. They achieved rank-1 and rank-5 recognition rates of 83.51% and 96.32%, respectively. Moreira Almeida et al. also proposed a sign language recognition system using an RGB-D sensor. They extracted seven vision-based features from RGB-D data, and achieved an average recognition rate of 80% [15].
In addition to the Kinect, other methods of recognizing hand gestures have also been explored. Shotton et al. predicted 3D positions of body joints from a single depth image without using temporal information [16]. Palacios et al. proposed a system for hand gesture recognition that combined RGB and 3D data provided by a vision and depth sensor, the Asus Xtion Pro Live [6]. This method, using a defined 10-gesture dictionary, used maximums of curvature and convexity defects to detect fingertips.
Other methods for hand movement recognition include a study by Lahamy and Lichti that used a range camera to recognize the ASL alphabet [4]. A heuristic and voxel-based signature was designed, and a Kalman filter was used to track the hand motions. This method proposed a rotation-invariant 3D hand posture signature. They achieved a 93.88% recognition rate after testing 14,732 samples of 12 postures taken from the ASL alphabet. In addition, Yang et al. [1–3] used a threshold model with a CRF, which performed adaptive thresholding for distinguishing between in-vocabulary signs and out-of-vocabulary non-signs. They proposed augmenting the CRF model by adding one additional label to overcome the weaknesses of the fixed threshold method.
In this paper, we focus on recognizing signs in a signed sentence using 3D information. The difficulty of sign language recognition comes from the fact that sign occurrences vary in terms of hand motion, shape, and location. The following three problems are considered in this research: (1) signs and non-sign patterns are interspersed within a continuous hand-motion stream; (2) some signs share patterns; and (3) each sign begins and ends with a specific hand shape.
In order to solve the first and second problems, a hierarchical CRF (H-CRF) is applied [2]. The H-CRF can discriminate between signs and non-sign patterns using both hand motions and hand locations. The locations of the face and both hands are needed to extract features for sign language recognition. The subject's 3D upper-body skeletal structure can be inferred in real time using the Kinect. Information about body components in 3D allows us to locate various structural feature points on the face and hands. The H-CRF can recognize the shared patterns among the signs. An error in the middle of a sign implies that the sign has been confused with another sign because of the shared patterns, or that an improper temporal boundary has been detected.
In order to solve the third problem, BoostMap embeddings are used to recognize the hand shapes. The BoostMap embeddings are robust to various scales, rotations, and sizes of the signer's hand, which makes this method well suited for this application. The main goal of this hand shape verification method is to determine whether or not to accept a sign spotted by means of the H-CRF. This helps to disambiguate signs that may have similar overall hand motions but different hand shapes.
Figure 1 shows the framework of our sign language recognition system. We use the Kinect, which acquires both a color image and its corresponding depth map. The hand and face locations are robustly detected under varying lighting conditions. After detecting the locations of the face and hands, an H-CRF is used to detect candidate sign segments using hand motions and locations. Then, the BoostMap embedding method is used to verify the hand shapes of the segmented signs.
2. Sign Language Recognition
2.1. Face and Hand Detection
The face and hand positions are robustly detected using the hand tracking function in the Kinect for Windows software development kit. The skeletal model consists of 10 feature points that are approximated from the upper body, as shown in Figure 2.
The hand region is obtained by applying a threshold around the hand position, as shown in Figure 3a. The signer wears a black wristband to aid in segmenting the hand shape [5]. RANdom SAmple Consensus (RANSAC) [17] is used to detect the black wristband, as shown in Figure 3c. The detected hand shape is normalized.
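As a rough illustration of this step, the sketch below crops a window around the tracked hand joint in a depth map and keeps pixels near the hand's depth. The window size and depth band are illustrative assumptions, not values from the paper, and the wristband-based RANSAC refinement is omitted.

```python
import numpy as np

def segment_hand_region(depth_map, hand_px, hand_py, hand_depth,
                        depth_band=80, window=60):
    """Crop a window around the tracked hand joint and keep only pixels
    whose depth lies within +/- depth_band (mm) of the hand depth."""
    h, w = depth_map.shape
    x0, x1 = max(hand_px - window, 0), min(hand_px + window, w)
    y0, y1 = max(hand_py - window, 0), min(hand_py + window, h)
    crop = depth_map[y0:y1, x0:x1]

    # Binary mask of candidate hand pixels (depth 0 = invalid Kinect reading).
    mask = (np.abs(crop.astype(np.int32) - hand_depth) < depth_band) & (crop > 0)
    return crop, mask

# Example with synthetic data: a 480x640 depth frame and a hand at (320, 240), 900 mm away.
depth = np.full((480, 640), 2000, dtype=np.uint16)
depth[200:280, 290:350] = 900                      # fake hand blob
crop, mask = segment_hand_region(depth, 320, 240, 900)
print(mask.sum(), "hand pixels found")
```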
2.2. Feature Extraction
Six features and one feature are extracted in 3D and 2D space, respectively, using the detected hand and face regions, as shown in Table 1 [1–3].
Table 1.
Features | Meanings |
---|---|
HFL | Position of the left hand with respect to the signer's face |
HFR | Position of the right hand with respect to the signer's face |
HHL | Position of the left hand with respect to the previous left hand position |
HHR | Position of the right hand with respect to the previous right hand position |
FSL | Position of the left hand with respect to the shoulder center |
FSR | Position of the right hand with respect to the shoulder center |
OHLR | Occlusion of the two hands |
The feature HFL represents the location of the left hand with respect to the signer's face in 3D space. The distance between the face and the left hand, DHFL, and the angle from the face to the left hand, θHFL, are measured. In order to extract 3D features, the coordinates of the left hand are projected onto the x, y, and z axes. The angle between the face and the left hand, θHFL = {θx, θy, θz}, is extracted. Then, the feature vector {DHFL, θHFL} is clustered into an index using an expectation-maximization (EM)-based Gaussian Mixture Model (GMM) [1]. The features HHL, HFR, HHR, FSL, and FSR are calculated and clustered in the same way.
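The sketch below illustrates this clustering step with scikit-learn's EM-trained GaussianMixture: a per-frame {distance, angle} feature vector is mapped to a discrete cluster index that serves as the observation symbol. The number of mixture components is an assumption for illustration; the paper does not specify it here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Training data: feature vectors {D_HFL, theta_x, theta_y, theta_z} collected over many frames.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(5000, 4))          # placeholder for real hand/face features

# Fit an EM-based GMM; the number of components (16) is an assumption, not the paper's value.
gmm = GaussianMixture(n_components=16, covariance_type='full', random_state=0)
gmm.fit(train_feats)

# At run time, each frame's feature vector is mapped to a discrete cluster index,
# which becomes the observation symbol fed to the CRF.
frame_feat = rng.normal(size=(1, 4))
hfl_index = int(gmm.predict(frame_feat)[0])
print("HF_L cluster index:", hfl_index)
```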
The hand occlusion feature, OHLR, is determined from the ratio of the overlapping regions of the two hands in the frontal view:
(1)
where Hl is the left hand region, Hr is the right hand region, Ro is the overlapping region between the two hands, and To is the threshold for hand occlusion (To = 0.3, as determined by experimentation).
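A minimal sketch of this occlusion feature follows, assuming the two hands are available as binary masks and that the overlap is normalized by the smaller hand area; the paper states only that a ratio of overlapping regions is thresholded at To = 0.3, so the exact normalization is an assumption.

```python
import numpy as np

def hand_occlusion(left_mask, right_mask, t_o=0.3):
    """O_HLR: 1 if the two hand regions overlap substantially, else 0.
    The overlap is normalized by the smaller hand area (an assumption)."""
    overlap = np.logical_and(left_mask, right_mask).sum()
    smaller = min(left_mask.sum(), right_mask.sum())
    if smaller == 0:
        return 0
    return int(overlap / smaller > t_o)

# Toy example: two 100x100 masks that share a 30x40 patch.
left = np.zeros((100, 100), dtype=bool);  left[10:70, 10:60] = True
right = np.zeros((100, 100), dtype=bool); right[40:90, 20:80] = True
print(hand_occlusion(left, right))   # overlap 1200 / smaller 3000 = 0.4 > 0.3 -> 1
```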
2.3. CRF-Based Sign Language Recognition
A hierarchical CRF framework is used to recognize the sign language [2]. In the first step, a threshold model (T-CRF) is used to distinguish between signs and non-sign patterns [1]. In this step, non-sign patterns are defined by the label "N-S" and signs are defined by the labels in the vocabulary. When constructing the T-CRF, a conventional CRF is constructed first. The conventional CRF includes the labels SC = {Y1, ⋯, YL}, where Y1 through YL are labels for signs, and L is the number of signs in the vocabulary [1].
In a CRF, the probability of a label sequence y, given an observation sequence x, is found using a normalized product of potential functions. The product of potential functions is represented by [1]:
pθ(y|x) = (1/Zθ(x)) exp( Σi Fθ(yi−1, yi, x, i) )  (2)
where Fθ(yi−1, yi, x, i) = Σv λv tv(yi−1, yi, x, i) + Σm μm sm(yi, x, i); tv(yi−1, yi, x, i) is a transition feature function of the observation sequence x at positions i and i − 1; sm(yi, x, i) is a state feature function of the observation sequence x at position i; yi−1 and yi are the labels of the observation sequence x at positions i − 1 and i; and λv and μm are the weights of the transition and state feature functions, respectively. θ represents the set of weights of the transition and state feature functions, and Zθ(x) is the normalization factor.
The feature vector xt of the observation sequence x at time t is expressed as:
xt = {HFL, HFR, HHL, HHR, FSL, FSR, OHLR}  (3)
CRF parameter learning is based on the principle of maximum entropy. Maximum likelihood training selects parameters that maximize the log-likelihood of the training data [1]. The T-CRF is built using the weights from the constructed conventional CRF. In addition, the label "N-S" for non-sign patterns is added to the conventional CRF. Thus, the T-CRF includes the labels ST = {Y1, ⋯, YL, N-S}. The starting and ending points of in-vocabulary signs are calculated by back-tracking the Viterbi path after a forward pass [1].
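The sketch below shows the kind of Viterbi forward pass and back-tracking this step relies on, using generic log-score arrays in place of the learned CRF potentials; extracting segments from contiguous non-"N-S" runs is a simplified reading of the procedure in [1].

```python
import numpy as np

def viterbi_decode(emission, transition):
    """emission: (T, L) per-frame log scores; transition: (L, L) log transition scores.
    Returns the most probable label path via a forward pass plus back-tracking."""
    T, L = emission.shape
    delta = np.zeros((T, L))
    backptr = np.zeros((T, L), dtype=int)
    delta[0] = emission[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + transition      # (prev label) x (current label)
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + emission[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

def sign_segments(path, non_sign_label):
    """Start/end frames of contiguous runs whose label is not the non-sign label."""
    segs, start = [], None
    for t, lab in enumerate(path + [non_sign_label]):    # sentinel closes the last run
        if lab != non_sign_label and start is None:
            start = t
        elif lab == non_sign_label and start is not None:
            segs.append((start, t - 1, path[start]))
            start = None
    return segs
```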
The weights of the transition feature functions from other labels to the non-sign pattern label "N-S" and vice versa are assigned by:
(4)
where κ is the weight of the self-transition feature function of the non-sign pattern label "N-S" [1].
After constructing the T-CRF, i.e., the first layer of the hierarchical CRF, the second-layer CRF, which models common sign actions, is constructed. The output of the first layer is the input of the second layer: it contains the segmented signs whose probability is higher than that of the non-sign pattern label "N-S". As a result, the second-layer CRF only has the labels SC = {Y1, ⋯, YL}. The detailed algorithm is described in [1].
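A minimal sketch of this hand-off between layers follows; the segment and probability data structures are invented for illustration and are not the paper's actual interface.

```python
def first_layer_output(segments, probabilities):
    """Keep only segments where some in-vocabulary sign is more probable than 'N-S'.
    segments: list of (start, end); probabilities: list of dicts label -> probability."""
    kept = []
    for (start, end), probs in zip(segments, probabilities):
        best_sign = max((lab for lab in probs if lab != "N-S"), key=lambda lab: probs[lab])
        if probs[best_sign] > probs["N-S"]:
            kept.append((start, end, best_sign))
    return kept

segs = [(10, 42), (43, 60)]
probs = [{"OUT": 0.55, "PAST": 0.15, "N-S": 0.30},
         {"OUT": 0.20, "PAST": 0.25, "N-S": 0.55}]
print(first_layer_output(segs, probs))   # only the first segment survives -> fed to layer 2
```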
Finally, the probability of the recognized sign is calculated as:
(5)
where pθ(yi^t|x) is the marginal probability of the sign yi at time t, and Ss and Se are the start and end frames of the segmented sign, respectively.
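Since Equation (5) is not reproduced above, the sketch below simply averages the framewise marginals over the segment as a stand-in for the segment probability, and flags the segment for shape verification when that value falls below a threshold; both the averaging and the threshold value are assumptions.

```python
import numpy as np

def segment_confidence(marginals, sign_idx, s_start, s_end):
    """marginals: (T, L) array of framewise marginal probabilities p(y^t = Y_l | x).
    Averages the marginals of one sign over its segment (assumption standing in
    for Equation (5))."""
    return float(marginals[s_start:s_end + 1, sign_idx].mean())

# A segment is passed to shape-based verification when its confidence is below
# a threshold (value chosen here only for illustration).
ACCEPT_THRESHOLD = 0.6
marginals = np.random.default_rng(1).dirichlet(np.ones(25), size=200)  # 200 frames, 24 signs + N-S
conf = segment_confidence(marginals, sign_idx=3, s_start=40, s_end=75)
needs_shape_check = conf < ACCEPT_THRESHOLD
print(conf, needs_shape_check)
```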
2.4. Shape-Based Sign Language Verification
The hierarchical CRF is useful for recognizing hand motions; however, it has difficulty distinguishing between different hand shapes. The main goal of the hand shape-based sign verification is to determine whether or not a sign spotted by the H-CRF should be accepted as a sign. The shape-based sign verification module is executed at the end frame of a recognized sign, when P(yi^t) in Equation (5) is lower than a threshold.
BoostMap embeddings are applied in order to recognize the hand shape. This method accommodates various scales, rotations, and sizes of the signer's hands [2,18]. Synthetic hand images to train the model are generated using the Poser 7 animation software. Each sign begins and ends with a specific hand shape, and each alphabet has unique hand shapes. Table 2 and Figure 4 show examples of hand shapes for sign language recognition. Our system uses a database with 17 hand shapes. For each hand shape, 864 images are generated.
Table 2.
Signs | Dominant Hand Shape | Non-Dominant Hand Shape |
---|---|---|
Car (T) | S | S |
Past (O) | Open B > Bent B | D.C. |
Out (O) | Flat C > Flat O | D.C. |
The hand shapes are verified over several frames, and a detected sign is accepted when the voting value Vs(yi^t) exceeds a threshold Ts. The voting value Vs(yi^t) is calculated as:
(6)
where yi^t is the sign detected by the H-CRF at position t, and ta is the window size. Ca(yi^t, B(j)) is:
(7)
where B(j) is the recognition result of the BoostMap embedding method at time j.
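The sketch below follows one plausible reading of Equations (6) and (7): each frame in a window of size ta casts a vote when the BoostMap hand-shape result B(j) matches the hand shape expected for the candidate sign, and the sign is accepted when the vote total exceeds Ts. The window size, threshold, and expected-shape lookup are illustrative assumptions.

```python
def shape_vote(candidate_sign, boostmap_results, t, window, expected_shape, t_s):
    """Count frames in [t - window, t] whose BoostMap result B(j) matches the hand
    shape expected for the candidate sign (an indicator standing in for C_a).
    The sign is accepted when the vote total V_s exceeds the threshold T_s."""
    votes = 0
    for j in range(max(t - window, 0), t + 1):
        if boostmap_results[j] == expected_shape[candidate_sign]:
            votes += 1
    return votes, votes > t_s

# Illustrative values only: per-frame shape labels and an expected-shape lookup.
expected_shape = {"PAST": "Bent B", "OUT": "Flat O"}
frames = ["Open B", "Bent B", "Bent B", "Flat O", "Bent B", "Bent B"]
votes, accepted = shape_vote("PAST", frames, t=5, window=4,
                             expected_shape=expected_shape, t_s=2)
print(votes, accepted)   # 4 matching frames in frames[1..5] -> accepted
```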
3. Experimental Results and Analysis
3.1. Experimental Environment
For training the CRFs and H-CRFs, 10 sequences for each sign in the 24-sign lexicon were collected. The signer wore a black wristband during data collection. The start and end points of the ASL signs were added manually to the training data and, as the ground truth, were used for testing the performance of the proposed method. We captured the video with a Kinect device. Of the 24 signs, 7 were one-handed signs and 17 were two-handed signs, as shown in Table 3. Figure 5 shows two examples of signs used in the experiments. In general, most sign language recognition tasks face three types of errors: substitution errors, insertion errors, and deletion errors.
Table 3.
One-handed signs | And, Know, Man, Out, Past, Tell, Yesterday |
Two-handed signs | Arrive, Big, Born, Car, Decide, Different, Finish, Here, Many, Maybe, Now, Rain, Read, Take-off, Together, What, Wow |
An insertion error occurs when the spotter reports a nonexistent sign. A deletion error occurs when the spotter fails to spot a sign in an input sequence. A substitution error occurs when an input sign is incorrectly classified [1–3]. The sign error rate (SER) and correct recognition rate (R) are calculated by:
SER = (S + I + D) / N × 100 (%),  R = C / N × 100 (%)  (8)
where N, S, I, D, and C are the numbers of signs, substitution errors, insertion errors, deletion errors, and correctly detected signs, respectively. An H-CRF was implemented, and the sign language recognition results were compared in terms of accuracy in both 2D and 3D feature space.
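The metrics in Equation (8) follow directly from these counts, as in the short sketch below; the N = 240 used in the example is inferred from the row sums of Table 4 rather than stated explicitly in the text.

```python
def spotting_metrics(n_signs, substitutions, insertions, deletions, correct):
    """Sign error rate and correct recognition rate as percentages:
    SER = (S + I + D) / N * 100,  R = C / N * 100."""
    ser = (substitutions + insertions + deletions) / n_signs * 100.0
    r = correct / n_signs * 100.0
    return ser, r

# Reproducing the H-CRF3D row of Table 4 (N = 240 test signs, inferred from C + S + D).
ser, r = spotting_metrics(240, substitutions=12, insertions=9, deletions=11, correct=217)
print(f"SER = {ser:.1f}%, R = {r:.1f}%")   # SER = 13.3%, R = 90.4%
```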
3.2. Sign Language Recognition with Continuous Data
As shown in Table 4, 3D features decrease insertion and substitution errors, and slightly decrease deletion errors, compared to the model with 2D features. As a result, the SER of the H-CRF3D decreases, while the correct recognition rate of the H-CRF3D increases.
Table 4.
 | C | S | I | D | SER (%) | R (%) |
---|---|---|---|---|---|---|
CRF2D | 185 | 34 | 25 | 21 | 33.3 | 77.0 |
T-CRF2D [1] | 197 | 27 | 24 | 16 | 27.9 | 82.0 |
H-CRF2D [2] | 202 | 23 | 15 | 15 | 22.0 | 84.1 |
H-CRF3D | 217 | 12 | 9 | 11 | 13.3 | 90.4 |
Figures 6 and 7 show the sign recognition results for a sign sequence that contains the two in-vocabulary signs "OUT" and "PAST" with H-CRF2D and H-CRF3D, respectively. The time evolutions of the probabilities for in-vocabulary signs and non-sign patterns are illustrated by curves. The probabilities of the signs "OUT" and "PAST" fluctuate while the sign is performed, as shown in Figure 6, because of the similar hand motions of these two signs in 2D space. On the other hand, as shown in Figure 7, the label for non-sign patterns has the greatest probability during the first 63 frames. Then, it is followed by the sign "OUT." After 63 frames, the probability of the sign "OUT" drops to nearly 0.1, and a non-sign pattern follows.
Figure 8 shows the sign recognition results with H-CRF3D. Hand shape recognition is executed over several frames when the probability of the recognized sign is lower than the threshold, as discussed in Section 2.4. As shown in the time evolutions of the probabilities, the probabilities of the signs "Different" and "Finish" are similar to each other between frames 117 and 129. The probabilities, P(yi^t), of the signs "Different" and "Finish" are over the threshold in frame 132.
Figure 9 shows the hand shape verification results with the BoostMap embeddings in the sign segment of Figure 8. The frame-wise fingerspelling inference results are presented. The hand appearances of all signs over the threshold are verified as described in Equation (6). Then the sign that has the maximum Vs(·) is selected, using:
ŷ = argmax yi∈C Vs(yi^t)
where C is the set of signs whose probability P(yi^t) is over the threshold.
The BoostMap embedding method decreases the insertion and substitution errors by verifying the hand shape; however, it can reduce the correct detection rate because of its own classification errors.
4. Conclusions and Further Research
Sign language recognition with depth sensors is becoming more widespread. However, it is difficult to detect meaningful signs in a continuous hand-motion stream because the signs vary in both motion and shape in 3D space. In our work, we recognized meaningful sign language in a continuous hand-motion stream using a hierarchical CRF framework. The first layer, a T-CRF, is applied to distinguish signs from non-sign patterns. The second layer, a conventional CRF, is applied to distinguish between the shared patterns among the signs.
In this paper, a novel method for recognizing sign language hand gestures was proposed. In order to detect the 3D locations of the hands and face, depth information generated with Microsoft's Kinect was used. A hierarchical threshold CRF was used to recognize meaningful sign language gestures from continuous hand motions. Then, the segmented signs were verified with the BoostMap embedding method. Experiments demonstrated that the proposed method recognized signs from signed sentence data at a rate of 90.4%. Near-term future work includes improving the detection accuracy of the upper-body components.
Acknowledgments
This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2011-013-D00097). The author thanks Professor Dimitri Van De Ville for his help and cooperation in this research.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Yang H.-D., Sclaroff S., Lee S.-W. Sign language spotting with a threshold model based on conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell. 2009;31:1264–1277. [PubMed] [Google Scholar]
2. Yang H.-D., Lee S.-W. Simultaneous spotting of signs and fingerspellings based on hierarchical conditional random fields and BoostMap embeddings. Pattern Recognit. 2010;43:2858–2870. [Google Scholar]
3. Yang H.-D., Lee S.-W. Robust sign language recognition by combining manual and non-manual features based on conditional random field and support vector machine. Pattern Recognit. Lett. 2013;34:2051–2056. [Google Scholar]
4. Lahamy H., Lichti D.D. Towards real-time and rotation-invariant American Sign Language alphabet recognition using a range camera. Sensors. 2012;12:14416–14441. [PMC free article] [PubMed] [Google Scholar]
5. Ren Z., Yuan J., Meng J., Zhang Z. Robust part-based hand gesture recognition using Kinect sensor. IEEE Trans. Multimed. 2013;15:1110–1120. [Google Scholar]
6. Palacios J., Sagüés C., Montijano E., Llorente S. Human-computer interaction based on hand gestures using RGB-D sensors. Sensors. 2013;13:11842–11860. [PMC free article] [PubMed] [Google Scholar]
7. Zhang S., He W., Yu Q., Zheng X. Low-Cost Interactive Whiteboard Using the Kinect. Proceedings of the International Conference on Image Analysis and Signal Processing; Huangzhou, China. 9–11 November 2012; pp. 1–5. [Google Scholar]
8. Chang Y.J., Chen S.F., Huang J.D. A Kinect-based system for physical rehabilitation: A pilot study for young adults with motor disabilities. Res. Dev. Disabil. 2011;32:2566–2570. [PubMed] [Google Scholar]
9. Ramey A., Gonzalez-Pacheco V., Salichs M.A. Integration of a Low-Cost RGB-D Sensor in a Social Robot for Gesture Recognition. Proceedings of the 6th ACM/IEEE International Conference on Human-Robot Interaction; Lausanne, Switzerland. 6–9 March 2011; pp. 229–230. [Google Scholar]
10. Van den Bergh M., Carton D., De Nijs R., Mitsou N., Landsiedel C., Kuehnlenz K., Wollherr D., van Gool L., Buss M. Real-time 3D hand gesture interaction with a robot for understanding directions from humans. Proceedings of the IEEE RO-MAN; Atlanta, GA, USA. 31 July–3 August 2011; pp. 357–362. [Google Scholar]
11. Xu D., Chen Y.L., Lin C., Kong X., Wu X. Real-Time Dynamic Gesture Recognition System Based on Depth Perception for Robot Navigation. Proceedings of the IEEE International Conference on Robotics and Biomimetics; Guangzhou, China. 11–14 December 2012; pp. 689–694. [Google Scholar]
12. Zafrulla Z., Brashear H., Starner T., Hamilton H., Presti P. American sign language recognition with the Kinect. Proceedings of the International Conference on Multimodal Interfaces; Alicante, Spain. 14–18 November 2011; pp. 279–286. [Google Scholar]
13. Chai X., Li G., Lin Y., Xu Z., Tang Y., Chen X., Zhou M. Sign Language Recognition and Translation with Kinect. Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition; Shanghai, China. 22–26 April 2013. [Google Scholar]
14. Cheng H., Dai Z., Liu Z. Image-to-class dynamic time warping for 3D hand gesture recognition. Proceedings of the IEEE Conference on Multimedia and Expo; San Jose, CA, USA. 15–19 July 2013; pp. 1–16. [Google Scholar]
15. Moreira Almeida S., Guimarães F., Arturo Ramírez J. Feature extraction in Brazilian Sign Language Recognition based on phonological structure and using RGB-D sensors. Expert Syst. Appl. 2014;41:7259–7271. [Google Scholar]
16. Shotton J., Fitzgibbon A., Cook M., Sharp T., Finocchio M., Moore R., Kipman A., Blake A. Real-time Human Pose Recognition in Parts from Single Depth Images. Proceedings of the IEEE Conference on CVPR; Colorado Springs, CO, USA. 20–25 June 2011; pp. 1297–1304. [Google Scholar]
17. Fischler M.A., Bolles R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM. 1981;24:381–395. [Google Scholar]
18. Athitsos V., Alon J., Sclaroff S., Kollios G. BoostMap: An embedding method for efficient nearest neighbor retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2008;30:89–104. [PubMed] [Google Scholar]