Facial Expression Recognition – A Comprehensive Review

In this paper, we provide a comprehensive review of modern facial expression recognition systems. The history of the technology and its current status, in terms of both accomplishments and challenges, are emphasized. First, we highlight some modern applications of the technology. The best methods of face detection, an essential component of automatic facial expression systems, are then discussed. The Facial Action Coding System, the cumulative database of research and development of micro-expressions within behavioral science, is also described. Next, various facial expression databases and the types of recognition are explained in detail. Finally, we present the procedures of facial expression recognition from feature extraction to classification, emphasizing the best modern approaches. The challenges encountered when comparing results with those of others are highlighted, and suggestions to alleviate these problems are provided.


Introduction
The studies of facial expressions originally emerged as physiognomy, which is the general assessment of a person's character or personality from his outer appearance, especially, from the face (Highfield et al., 2009). In a general sense the term refers to features of the face, when these features are used to infer the relatively enduring character or temperament of an individual. Most of these facial features have their basis in the bony structure of the skull, on which the soft tissues lie. These features include the shapes and positions of the major areas and landmarks of the face, such as the forehead, eyebrows, nose, cheeks, and mouth.
In China and other Asian cultures, formal systems of face reading developed in the first millennium B.C.E. and were integrated with religious beliefs such as Confucianism (Highfield et al., 2009). Substantial confidence in such methods developed in these cultures, and physiognomic inferences included descriptions of character, suitability for certain positions, and predictions about life and death. In Western cultures, the association of facial features with a person's character was first noted in the writings of the ancient Greeks in the 4th century BC. Much later, several pseudo-scientific and cultish movements exploited the inference of character from physiognomic features. Over time, physiognomy studies slowly faded and were replaced by facial expression recognition.
The study of facial expressions has its roots in the 17th century, when Le Brun gave a series of presentations on the subject in 1667. In his posthumously published treatise, "Méthode pour apprendre à dessiner les passions" (1698), he promoted the expression of the emotions in painting, which had great influence on art theory for two centuries. Much of Le Brun's thinking is found in "The Perfect Imitation of Genuine Facial Expression" (Montagu, 1994). However, the modern study of facial expression recognition started with Charles Darwin in the nineteenth century. Darwin established the implications of facial expressions in humans and animals and introduced deformation patterns to facial expression recognition (Darwin, 1998). "Among his many extraordinary contributions Darwin gathered evidence that some emotions have a universal facial expression. He cited examples and published pictures suggesting that emotions are evident in other animals, and proposed principles explaining why particular expressions occur for particular emotions - general principles he applied to all animals" (Ekman & Darwin, 2003). In the 20th century, Ekman & Friesen (1978) developed the Facial Action Coding System (FACS), which is the most widely used and versatile method for measuring and describing facial behaviors.
Today, facial expression studies attract researchers from human-computer interaction (HCI), computer vision and pattern recognition. The technology is applied in a wide variety of contexts, including robotics, digital signs, mobile applications and medicine. It is reported that "some robots can operate by first recognizing expressions" (Bruce, 1993). In behavioral sciences and medicine, for instance, expression recognition is applied to intensive care monitoring (Morik et al., 1999). Systems are currently being developed that are capable of making routine examinations of facial behavior during pain in clinical settings. Observers have been trained to perform real-time measurement of pain expression during physical examinations of low-back pain patients. In infants, the Neonatal Facial Coding System (NFCS) has been employed for real-time assessment of infants of 32 to 33 weeks post-conceptional age who are undergoing a heel lance. The technology is also being used to reduce accidents through automated detection of driver drowsiness in public transport. These systems relay intermittent information about the drivers' facial expressions or emotional states to observers for effective surveillance and awareness. Another important application is the invention of wearable expression glasses, which sense facial muscle movements and recognize significant expressions such as confusion or interest (Scheirer et al., 1999). The glasses use piezoelectric sensors hidden in a visor extension to a pair of glasses, providing compactness, user control, and anonymity. Fig. 1 shows a picture of the expression glasses.
The rest of the work is organized as follows: Section 2 discusses the universality of facial expressions; section 3 talks about face detection and pose estimation; section 4 discusses facial action units and facial action coding system; section 5 gives a note of facial expression types; section 6 provides types of expression databases; section 7 presents feature extraction; section 8 talks about feature selection and classification; section 9 discusses results of expression recognition systems; finally section 10 concludes the work.

The universality of facial expression recognition
The argument for universal facial expressions across cultures can be traced back about 150 years. In an attempt to prove that expressions are universal, Darwin (1872) surveyed English citizens living in Africa, America, Australia, Borneo, China, India, Malaysia and New Zealand and concluded that facial expressions are universal. However, the lack of substantive evidence and the intercultural mixing among the respondents led many eminent social psychologists, such as Klineberg (1940), to regard the results as inconclusive.
Darwin's claims were revived about a century later by Tomkins (1962, 1963). Tomkins suggested that emotion was the basis of human motivation and that the seat of emotion was in the face. He conducted the first study to demonstrate that facial expressions were reliably associated with certain emotional states (Tomkins & McCarter, 1964). Later, Ekman et al. (1969) and Matsumoto et al. (2008) demonstrated that when emotions are aroused, the same facial expressions of emotion are reliably produced by people all around the world and from all walks of life. Fig. 2 shows a model of the six universal facial expressions.

Face detection and pose estimation
Face detection is a very important component of facial expression recognition systems: faces in images must be detected before they can be processed further for expression recognition. Face detection methods are grouped into feature-based and classification-based methods. Feature-based techniques rely largely on search algorithms, which must locate candidate facial features such as the eyes, nose and mouth and group them into faces according to their geometric relationships. Because the performance of feature-based methods depends on consistently locating facial features, these methods are highly vulnerable to partial occlusion, deformation, and poor image quality. Classification-based methods, in contrast, move a search window over the input image, and a classifier labels the local image in each window as face or non-face. Classification-based methods have received considerable interest from researchers because they are more effective; hence techniques such as support vector machines (SVM) (Heisele et al., 2003), neural networks (Bouzalmat et al., 2011; Khatun et al., 2011), local binary patterns (Lili et al., 2012) and Bayesian techniques (Liu, 2003) are used extensively.
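The search-window idea behind classification-based detection can be illustrated with a minimal Python sketch. It is not taken from any of the cited detectors: the function names, the window size, the stride and in particular `toy_classifier` (a mean-intensity rule standing in for a trained SVM or neural network) are assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (row, col, height, width)


def sliding_window_detect(
    image: Image,
    is_face: Callable[[Image], bool],
    window: int = 4,
    stride: int = 2,
) -> List[Box]:
    """Slide a square window over the image and keep the windows
    that the supplied classifier labels as a face."""
    rows, cols = len(image), len(image[0])
    detections: List[Box] = []
    for r in range(0, rows - window + 1, stride):
        for c in range(0, cols - window + 1, stride):
            patch = [row[c:c + window] for row in image[r:r + window]]
            if is_face(patch):
                detections.append((r, c, window, window))
    return detections


def toy_classifier(patch: Image) -> bool:
    """Hypothetical stand-in for a trained face/non-face classifier:
    accepts a patch when its mean intensity exceeds 0.5."""
    total = sum(sum(row) for row in patch)
    return total / (len(patch) * len(patch[0])) > 0.5


# Toy input: an 8x8 dark image with one bright 4x4 block at rows 2-5, cols 2-5.
image = [[0.0] * 8 for _ in range(8)]
for r in range(2, 6):
    for c in range(2, 6):
        image[r][c] = 1.0

boxes = sliding_window_detect(image, toy_classifier)
print(boxes)
```

In a real detector the window would also be scanned at several scales and overlapping detections would be merged; both steps are omitted here to keep the sketch short.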

Facial Action Units (FAUs) and Facial Action Coding System (FACS)
Human emotion is conveyed through thousands of expressions, although most of them differ only in subtle changes of a few facial features. The deformations of the facial muscles combine in groups to give meaning to a particular facial expression. These involuntary or voluntary deformations of the facial muscles are referred to here as action units (AUs). An AU represents the muscular activity that produces momentary changes in facial appearance. Stated another way, an action unit is a numeric code that describes the movement of facial muscles. The mapping between AUs and facial muscles is not necessarily 1:1; some AUs involve more than one muscle, and other AUs describe separate movements of the same muscle. Thus the term action unit is used because more than one action can be separated from what most anatomists describe as one muscle. The sustained exploration of action units gave rise to Facial Action Units (FAUs), a system for categorizing human facial expressions. The FAU system was first developed by Hjortsjo (1969) and later expanded by Ekman and Friesen (1978). Further years of study of AUs or FAUs resulted in the development of a database of AUs or FAUs known as the Facial Action Coding System (FACS). The primary goal in developing FACS was a comprehensive system that could distinguish all possible visually distinguishable facial movements. Scientifically, FACS is the cumulative database of research and development of micro-expressions within behavioral science. The system was last updated in 2002 (Hagar, Ekman and Friesen, 2002).
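The point that AUs are numeric codes whose mapping to muscles is not 1:1 can be made concrete with a small lookup table. The sketch below is illustrative only: the four entries follow commonly cited FACS summaries rather than anything stated in this review, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

# Illustrative subset: AU number -> (descriptive name, muscles usually listed).
ACTION_UNITS: Dict[int, Tuple[str, List[str]]] = {
    1: ("Inner brow raiser", ["frontalis (pars medialis)"]),
    2: ("Outer brow raiser", ["frontalis (pars lateralis)"]),
    4: ("Brow lowerer", ["corrugator supercilii", "depressor supercilii"]),
    12: ("Lip corner puller", ["zygomaticus major"]),
}


def base_muscle(name: str) -> str:
    """Drop the parenthesised sub-part, e.g. 'frontalis (pars medialis)' -> 'frontalis'."""
    return name.split(" (")[0]


def aus_with_multiple_muscles(table: Dict[int, Tuple[str, List[str]]]) -> List[int]:
    """AUs that involve more than one muscle."""
    return sorted(au for au, (_, muscles) in table.items() if len(muscles) > 1)


def muscles_shared_by_aus(table: Dict[int, Tuple[str, List[str]]]) -> Dict[str, List[int]]:
    """Muscles whose separate movements are coded as different AUs."""
    by_muscle: Dict[str, List[int]] = {}
    for au, (_, muscles) in sorted(table.items()):
        for muscle in {base_muscle(m) for m in muscles}:
            by_muscle.setdefault(muscle, []).append(au)
    return {m: aus for m, aus in by_muscle.items() if len(aus) > 1}


multi = aus_with_multiple_muscles(ACTION_UNITS)
shared = muscles_shared_by_aus(ACTION_UNITS)
print(multi, shared)
```

Here AU 4 illustrates one AU drawing on several muscles, while AUs 1 and 2 illustrate two codes assigned to separate movements of the same muscle.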
FACS has become the most common manual for facial behavior analysis. Ekman and Friesen (1978) defined 46 AUs to describe each independent facial movement. Although only a small number of AUs are defined, over 7000 distinct AU combinations have been observed so far (Ekman, 1982). FACS assesses all observable facial muscle movements, such as mouth and eye movements, and not just those presumed to be related to emotion or any other condition. Recent studies conducted by Vick et al. (2007) show that FACS can be adapted to compare facial repertoires across similar species, such as humans and chimpanzees. According to these studies, FACS can be modified by taking differences in underlying morphology into account. "Such considerations enable a comparison of the FACS present in humans and chimpanzees, to show that the facial expressions of both species result from extremely notable appearance changes". A cross-species analysis of facial expressions can help to answer the question of which emotions are uniquely human (Vick et al., 2007). The revised FACS coding manual gives a detailed description of all the appearance changes occurring with a given action unit.

Facial expression recognition types
A typical facial expression recognition system recognizes the expressions of the face irrespective of the facial orientation. The earliest research focused on 2D data only, and nearly all the accessible collections of expressive faces were of restricted range, containing only deliberately posed affective displays. These were mainly the six universal prototypical expressions (anger, fear, disgust, surprise, sadness and happiness), recorded in a highly constrained environment in terms of illumination and viewing angle. The problem with this single-view 2D analysis is that it cannot fully exploit the information displayed by the face, since 2D static images and 2D video recordings have difficulty capturing out-of-plane transformations of the facial surface. This is unsurprising, because the human face is a 3D rather than a 2D object. A purely 2D projection of the face is therefore sensitive to changes in illumination and pose angle (Pantic & Rothkrantz, 2000). Given these setbacks, 2D face data can achieve good expression recognition performance only if illumination can be normalized and head pose corrected, which is difficult to achieve in practice. Nevertheless, the majority of current facial expression recognition systems are based on 2D facial images and videos, which offer appreciable performance only for data captured under the restricted conditions described above. As a result, there has recently been a paradigm shift towards the use of 3D facial data, which is fundamentally robust to changes in pose and illumination (Yin et al., 2006) and therefore yields better recognition accuracy.
Moreover, since the facial data are the primary source of information for the facial expression recognition task, the processing complexity and the overall success of the resulting system depend heavily on the techniques employed to capture the data.
Current progress in structured light scanning, stereo photogrammetry and photometric stereo has made the acquisition of 3D facial structure and motion a viable task. A variety of devices and methods are now used to capture 3D facial expression data; examples include single-image reconstruction, structured light technologies, and stereo reconstruction algorithms. 3D imaging systems or scanners can scan the expressive face and generate a geometric point cloud corresponding to samples taken from the observed 3D facial surface. Although these scanners provide accurate surface measurements, they also require excessively long acquisition times. Non-contact scanners are much more suitable for 3D facial data acquisition. Notwithstanding these recent advances, it must be stressed that 3D face acquisition does not resolve all the problems. For example, it does not alleviate the concerns associated with occlusions, typical examples of which include subjects wearing scarves, make-up or glasses, or having long hair that covers portions of the face. Occlusion thus remains a problem that must be resolved to make facial expression systems more attractive and dependable. For problems caused by pose variation, some researchers have proposed the use of multiple views of the face (Pantic & Rothkrantz, 2004) and deformable 3D models fitted to 2D images (Wen & Huang, 2003) or 3D images as tenable solutions. The success of 3D face models may well advance view-independent facial expression recognition, which is vital for spontaneous facial expression recognition because the subject can be recorded in less controlled real-world settings.
Current efforts have also been directed at the recognition of complex and spontaneous emotional phenomena such as stress, frustration, depression and boredom, rather than the recognition of deliberately displayed prototypical expressions of emotion (Gunes & Pantic, 2010; Zeng et al., 2009; Nicolaou et al., 2011; Vinciarelli et al., 2012).

Facial expression databases
One of the central contributors to the success of facial expression recognition research is the development of 2D and, more recently, 3D and 4D databases. A database is needed to build the system through training, to test its reliability and dependability, and to analyze and evaluate it through meaningful comparison with other systems. The choice of a particular database depends on the type of facial data used. Facial expression databases fall into four main types: 2D static, 2D video, 3D static and 3D dynamic (also known as 4D). As facial expression recognition studies evolved from 2D datasets, the majority of present databases fall into the 2D categories; 3D and, more recently, 4D datasets are not numerous. One would expect that facial expression recognition, being such an important research area with immense applications, would by now have a de facto standard dataset serving as a benchmark for measuring the success of all developed work; to date, however, no such benchmark exists. This has made the comparison of results a very difficult and challenging task, a problem discussed in detail in section 9 of this work. We now discuss some of the influential databases that are publicly available for research. The descriptions are summarized in Table 1, and sample images from some of them are displayed in Fig. 6 (Gross et al., 2010; Lyons et al., 1999).
Size: 213 images of 10 subjects
Description: The database has 7 facial expressions (6 basic facial expressions and 1 neutral) posed by Japanese female models. The database was assembled at the Psychology Department in Kyushu University (Lyons et al., 1999).

(Pilz et al., 2006)
Description: The database contains video sequences of four different expressions: anger, disgust, surprise and gratefulness. Each expression was recorded from five different views concurrently. All facial expressions are available in three repetitions, in two intensities, as well as from three different camera angles (Kaulard et al., 2012).

BU-3DFE
Type: 3D static
Size: 100 subjects
Description: The database was developed at Binghamton University for the purpose of 3D facial expression analysis. It contains 100 subjects aged 18 to 70 years, with a diversity of ethnic origins: Whites, Blacks, East-Asians, Middle-East Asians, Indians and Hispanics. Each subject displayed seven expressions, consisting of neutral and six prototypical facial expressions at four intensity levels. There are 25 3D facial scans containing different expressions for each subject, giving a total of 2,500 facial scans. Each 3D facial scan in the database contains 13,000 to 21,000 polygons with 8,711 to 9,325 vertices.

Size: images of 101 subjects
Description: This database is an extension of BU-3DFE. It contains sequences of images, captured at 25 frames per second (fps), of the six prototypical facial expressions with their temporal segments (onset, apex and offset), with each sequence lasting about 4 s (Matuszewski et al., 2011). The database provides no AU annotation.

ADSIP
Type: 4D (3D dynamic)
Size: First edition contains 210 images of 10 subjects. Final edition will have 100 more added subjects.
Description: The first version (ADSIPmark1) was released in 2008. The participants were graduates from the School of Performing Arts in the University of Central Lancashire (Frowd, 2009). Each subject displayed seven expressions (anger, disgust, happiness, fear, sadness, surprise and pain) at three intensity levels (mild, normal and extreme). Each sequence was captured at 24 fps and lasted within 3 s. The final objective of the ADSIP database is to contain 3D dynamic facial data of over 100 control subjects and another 100 subjects with different facial dysfunctions (Quan et al., 2010a).

Feature Extraction
Extraction of the expressive facial features is an important step after the face is detected. Though there are many options, the choice of method depends on the face representation and the kind of input image, whether static or dynamic. Facial representation features are classified into geometric and appearance features. The geometric features embody the permanent features, which represent the shape and locations of facial parts; the most significant ones are the eyes, eyebrows, nose and mouth. The muscles around the eyes and the mouth are the major indicators for action unit recognition (Cohn & Zlochower, 1995). The essential task is the accurate extraction of these features to represent the face geometry. The appearance features are the transient features of the face, which constitute the skin texture and provide critical information for the recognition of certain AUs. Such features include wrinkles, bulges and furrows, which are extracted by applying image filters to either the entire face or specific sections of it. Based on these two facial feature representations, there are broadly two main feature extraction methods: holistic approaches, where the face is processed as a whole, and local approaches, where attention is given only to a set of specific facial features or sections according to the targeted AU. Holistic feature extraction is the most extensively used technique. In this approach, each pixel of an image is considered valuable information. Many methods have been applied successfully with this approach. For instance, Lanitis et al. (1997) presented an Active Appearance Model powered by principal component analysis (PCA) to locate facial features and recover 3D pose as well. Bartlett et al. (1997) convolved entire face images with a set of Gabor wavelet kernels; the resulting Gabor wavelet magnitude responses were used as the input to a recognition engine.
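To show what a Gabor magnitude response looks like in practice, the following pure-Python sketch filters a toy striped image with two complex Gabor kernels of different orientation. It is a generic illustration, not the filter bank of Bartlett et al.: the kernel size, wavelength, `sigma`, the isotropic Gaussian envelope and all function names are assumptions chosen for the example.

```python
import cmath
import math
from typing import List


def gabor_kernel(size: int, wavelength: float, theta: float, sigma: float) -> List[List[complex]]:
    """Complex Gabor kernel: an isotropic Gaussian envelope times a complex
    sinusoid whose phase varies along the direction given by theta."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            x_theta = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            carrier = cmath.exp(1j * 2.0 * math.pi * x_theta / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_magnitude(image: List[List[float]], kernel: List[List[complex]]) -> List[List[float]]:
    """Magnitude of the 'valid' correlation of a real image with a complex kernel."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            acc = 0j
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            out_row.append(abs(acc))
        out.append(out_row)
    return out


# Toy 9x9 image whose intensity varies along the columns with period 4 pixels.
image = [[1.0 + math.cos(math.pi * c / 2.0) for c in range(9)] for _ in range(9)]

# One kernel is tuned to the pattern (theta = 0), the other is orthogonal to it.
matched = gabor_magnitude(image, gabor_kernel(7, 4.0, 0.0, 2.0))
orthogonal = gabor_magnitude(image, gabor_kernel(7, 4.0, math.pi / 2.0, 2.0))

print(min(min(row) for row in matched), max(max(row) for row in orthogonal))
```

The kernel whose carrier matches the orientation and wavelength of the pattern responds far more strongly than the orthogonal one; a full filter bank would repeat this over several orientations and wavelengths and pass all the magnitudes on as features.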
Koelstra et al. (2010) proposed an appearance-based dynamic model that detects AUs and their temporal phases (onset, apex and offset) by using Free Form Deformations and Motion History Images as descriptors. Local feature methods have also been very successful. Valstar and Pantic (2011) applied a facial point tracker to track a sparse set of 20 facial points. From the tracked points, both static and dynamic features were computed to detect the temporal phases: onset, apex, and offset. Kakumanu and Bourbakis (2006) employed a local graph to track facial features. Chang et al. (2006) proposed a method of matching the overlapping regions around the nose. Gundimada and Asari (2009) extracted local facial features by means of modular kernel eigenspaces for multi-dimensional spaces. A third approach, a hybrid of holistic and local methods, has also been proposed. Kimura and Yachida (1997) used a Potential Net to fit a normalized image and computed the edges with a differential filter after Gaussian filtering. Tian et al. (2005) indicated that the hybrid approach performs better than either of the traditional methods. Extensive discussion of feature extraction methods is beyond the scope of this material. For detailed discussions, we refer interested readers to (Pantic & Rothkrantz, 2000; Shan, 2010; Tian et al., 2001; De la Torre & Cohn, 2011).

Expression feature selection and classification
The last step of expression recognition systems is classification. The expression classifier is trained with the extracted features to recognize expressions of unknown datasets also known as test datasets.

Feature selection
Sometimes it is necessary to select a small representative subset of the extracted facial features to avoid misclassification. For instance, Gabor and Haar wavelet features are very numerous and contain redundant data that can inhibit their practical use, which is a potential cause of misclassification in training. Such large feature sets need to be processed by a feature selection tool that picks fewer but substantial representations, reducing the feature dimensions passed to the classifier. Quite a few feature selection algorithms have been developed. For instance, the AdaBoost algorithm originally proposed by Freund and Schapire (1995) has been very successful in selecting extracted facial features for classification (Shen & Bai, 2006; Sandbach et al., 2011; Zhou et al., 2006). The boosting algorithm improves the performance of weak binary classifiers by concentrating training on misclassified samples. The scheme of the algorithm is to weight a set of weak classifiers according to a function of the classification error (Freund & Schapire, 1995). The final strong classifier H is a weighted combination of the weak classifiers h_t followed by a threshold; in the standard form it can be written as:
$$H(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}$$
where \alpha_t is the weight of weak classifier h_t, T is the number of weak classifiers, and the binary digits 1 and 0 represent the respective classified and misclassified sets. Principal component analysis (PCA) is a traditional method that has also been successfully used for dimensionality reduction in a number of facial expression applications (Soyel & Demirel, 2009).
The selection consists of searching among n principal components for the m components (m < n) that are the most discriminative for facial expression recognition. Normally an iterative procedure is employed to select the components stepwise and assemble the optimal set. Linear Discriminant Analysis (LDA) (Soyel et al., 2010) has also been used to calculate an optimal discriminant subspace of a given number of dimensions. Under this procedure, LDA searches for the vectors in the underlying space that best discriminate among the classes. Lastly, Sha et al. (2011) proposed the normalized cutting based filter (NCBF) to select optimal features for pattern classification. The NCBF consists of two main parts and works on the principle that features with higher discriminative ability are more strongly correlated with each other.
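The boosting scheme described above, used as a feature selector, can be sketched as follows. This is a minimal discrete AdaBoost with threshold stumps as weak classifiers, assuming the common form in which the strong classifier outputs 1 when the weighted vote reaches half of the total weight. The weight update, the stump definition, the clamping constant and the toy data are all assumptions for illustration, not the procedure of any cited work.

```python
import math
from typing import List, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: List[float]) -> int:
    feature, threshold, polarity = stump
    return 1 if polarity * x[feature] >= polarity * threshold else 0


def best_stump(X: List[List[float]], y: List[int], w: List[float]) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = (0, 0.0, 1), float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X: List[List[float]], y: List[int], rounds: int) -> List[Tuple[float, Stump]]:
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        total = sum(w)
        w = [wi / total for wi in w]                 # normalise the sample weights
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)      # clamp to avoid log(0)
        beta = err / (1.0 - err)
        ensemble.append((math.log(1.0 / beta), stump))
        # Down-weight correctly classified samples so later rounds focus on errors.
        w = [wi * (beta if stump_predict(stump, xi) == yi else 1.0)
             for xi, yi, wi in zip(X, y, w)]
    return ensemble


def strong_classify(ensemble: List[Tuple[float, Stump]], x: List[float]) -> int:
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0.5 * sum(alpha for alpha, _ in ensemble) else 0


# Made-up toy data: four features per sample, only feature 2 separates the classes.
X = [[0.2, 0.9, 0.1, 0.5], [0.8, 0.1, 0.2, 0.4], [0.5, 0.5, 0.3, 0.6],
     [0.1, 0.3, 0.7, 0.5], [0.9, 0.8, 0.8, 0.4], [0.4, 0.2, 0.9, 0.6]]
y = [0, 0, 0, 1, 1, 1]

ensemble = adaboost(X, y, rounds=3)
selected = sorted({stump[0] for _, stump in ensemble})
predictions = [strong_classify(ensemble, x) for x in X]
print(selected, predictions)
```

The indices in `selected` are the features the boosted stumps actually used; in a feature selection setting only those columns would be passed on to the final classifier.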

Expression classification
Classification is the final essential component of an expression recognition system. The classifier has to be trained to recognize expressions in unknown datasets. The training process depends on the type of classifier and the type of input image. For instance, both spatial and spatio-temporal classification approaches have been employed to identify AUs for expression recognition.
The idea of spatial classification is to locate the units on a spatial network and assign them to a concurrent set of structured classes that are compatible with the network. One of the traditional classifiers employed is discriminant analysis. For instance, Cohn et al. (2004) applied this approach to automatic facial expression recognition in video images. Their system focused on two action units, brow raising and brow lowering, and achieved a recognition rate of 89% for two-state recognition and 76% for three-state recognition. Independent Component Analysis (ICA) (Bartlett et al., 2002) has also been used to recognize both facial characteristics and expression constructions. Neural network methods have also been very successful in expression recognition (Bazzo & Lamar, 2004; Ma & Khorasani, 2004). Ma and Khorasani (2004) employed the DCT over the whole face image as a feature detector and used a one-hidden-layer feed-forward neural network as an expression classifier for five expressions, including smile, anger, sadness, and surprise. The best recognition rates on their system were 100% and 93.75% (without rejection). The problem with neural network classifiers is that they are rather difficult to train on unconstrained facial behavior, where there may be several thousand AU combinations. SVM classifiers are very popular in expression recognition systems and have been very successful in the recognition of several AUs (Gundimada & Asari, 2009; Valstar et al., 2005). Shan et al. (2009) extracted facial features using Gabor filters and selected features with the AdaBoost technique; the selected features were input into LDA and SVM classifiers, of which the SVM showed superior performance.
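As an illustration of using the DCT over a whole image as a feature detector, the sketch below computes an orthonormal 2D DCT-II of a small block and keeps its low-frequency corner as a feature vector. It is not the implementation of Ma and Khorasani: keeping a square low-frequency corner, the block size and the function names are assumptions, and the neural network that would consume the features is omitted.

```python
import math
from typing import List


def dct2(block: List[List[float]]) -> List[List[float]]:
    """Orthonormal 2D DCT-II of a square block (direct O(N^4) evaluation)."""
    n = len(block)

    def scale(k: int) -> float:
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


def low_frequency_features(coeffs: List[List[float]], keep: int) -> List[float]:
    """Flatten the top-left keep x keep corner, which holds the lowest frequencies."""
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


# Toy input: a constant 4x4 'image', so all energy should land in the DC term.
block = [[0.5] * 4 for _ in range(4)]
features = low_frequency_features(dct2(block), keep=2)
print(features)
```

For a real face image the retained coefficients summarise its coarse structure in far fewer numbers than the raw pixels, which is what makes them convenient inputs for a small classifier.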
Spatio-temporal methods have also been successful, though they are not as popular as spatial methods. The most common classifier among these approaches is the Hidden Markov Model (HMM). The power of HMMs lies in their ability to model the dynamics of facial actions (Otsuka & Ohya, 1998; Cohn et al., 1997). HMM classification is performed by selecting the AU, or combination of AUs, that maximizes the likelihood of the extracted facial features. Outside the set of spatial and spatio-temporal classifiers, other classifiers have also emerged strongly. For example, Bayesian Networks (BN) are a force to be reckoned with (Cohen et al., 2003; Zhang & Ji, 2005). Cohen et al. (2004) proposed a stochastic structure search (SSS) algorithm to train a BN classifier that recognizes facial expressions from labeled and unlabeled datasets.
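The maximum-likelihood decision rule used with HMMs can be sketched with the forward algorithm. The example below is a toy: the two models, their probabilities, the three observation symbols and the state layout are all hypothetical and serve only to show how one HMM per AU is scored against an observed sequence.

```python
from typing import Dict, List, Sequence


def forward_likelihood(
    obs: Sequence[int],
    start: List[float],
    trans: List[List[float]],
    emit: List[List[float]],
) -> float:
    """P(obs | model) for a discrete HMM, computed with the forward algorithm."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)


# Hypothetical quantised motion symbols for a tracked eyebrow point.
NEUTRAL, UP, DOWN = 0, 1, 2

# Both toy models share two hidden states (rest, active) and the same dynamics.
START = [0.8, 0.2]
TRANS = [[0.6, 0.4], [0.3, 0.7]]

# Emission tables: one row per state, one column per symbol (made-up values).
MODELS: Dict[str, List[List[float]]] = {
    "brow_raise": [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    "brow_lower": [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],
}


def classify(obs: Sequence[int]) -> str:
    """Pick the model under which the observed sequence is most likely."""
    return max(MODELS, key=lambda name: forward_likelihood(obs, START, TRANS, MODELS[name]))


label = classify([NEUTRAL, UP, UP, NEUTRAL])
print(label)
```

For long sequences the products underflow, so practical implementations usually work with scaled or log-domain forward variables; that refinement is left out here.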

Comparing results of expression recognition systems
One troubling problem that researchers in facial expression recognition face is how to make realistic comparisons between the performance of a proposed method and that of existing ones. There is no common benchmark for comparing results, so research groups use their own methods and databases; as a consequence, direct comparisons cannot be made, since the choice of methods and databases differs from one research group to another (Shan et al., 2009; Lee, Huang & Shih, 2010). What is typically reported instead is either a comparison of recognition rates on a common database (see Table 2), or a comparison of different methods that the same group has run on the same datasets under similar conditions (see Table 3). As fellow researchers in the field, we think research groups can do better by publishing the results of their replicates, as detailed by Tong et al. (2007), so that other researchers comparing performance can go a step further and apply statistical significance tests.
The use of statistical tests to assess the conclusions drawn in an experimental study is becoming a must in both the pattern recognition and soft computing communities. Researchers can compare their results using significance tests such as the Wilcoxon test or the paired t-test.
Performance comparisons of facial expression recognition in JAFFE database (Shih et al., 2008).
(2010): Facial expression recognition rates (%) obtained for the proposed 2D and 3D classifier and the 2D appearance-based classifier under the per frame classification scenario.
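As a worked illustration of the paired t-test mentioned above, the sketch below computes the test statistic for two methods evaluated on the same folds. The recognition rates are made-up numbers used only to show the arithmetic, and the function name is hypothetical; in practice a statistics package would also supply the p-value.

```python
import math
import statistics
from typing import List, Tuple


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched results a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Made-up recognition rates (%) of two methods on the same five folds.
method_a = [90.0, 88.0, 92.0, 91.0, 89.0]
method_b = [86.0, 85.0, 90.0, 88.0, 86.0]

t_value, dof = paired_t_statistic(method_a, method_b)
print(round(t_value, 3), dof)
```

With these toy numbers the statistic is about 9.49 with 4 degrees of freedom, well above the two-sided 5% critical value of roughly 2.776, so the difference would be judged significant. Such a test is only possible when the per-fold (replicate) results of both methods are published, which is the point made above.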

Conclusion
Facial expression recognition has a long history rooted in the ancient study of physiognomy, going back as far as the first millennium B.C.E. Years of successful studies have led to a common agreement that expressions are universal across cultures. Since the facial data are the primary source of information for the facial expression recognition task, the processing complexity and the overall success of an automatic expression recognition system depend on the techniques employed to capture the data. Face detection, which is an early step, used to pose many challenges, but the arrival of good algorithms has greatly improved the process. Extraction of the expressive facial features is an important step after the face is detected. The last component of an expression recognition system is classification: the classifier has to be trained to recognize expressions in unknown datasets. Among the classifiers reviewed, SVM combined with AdaBoost as a feature selection tool has been the most successful. The development of 2D, 3D and 4D databases and of the Facial Action Coding System (FACS) has also contributed immensely to the success of expression recognition.
Though many of the early problems faced with 2D data have been solved with the arrival of 3D and 4D data acquisition, 3D and 4D face acquisition does not resolve all the problems. For example, facial occlusions, including subjects wearing scarves, make-up or glasses, or long hair that covers portions of the face, are still major problems yet to be solved. For problems caused by pose variation, it has been proposed that the use of multiple views of the face and deformable 3D models fitted to 2D or 3D images are tenable solutions.
Finally, the lack of a standardized benchmark for comparing results is a major challenge. Since research groups use different methods and databases, making direct comparisons with other findings is very difficult. As a step towards standardization, we suggest that research groups publish their replicate results so that others can make statistically meaningful comparisons with them. The use of statistical tests to assess the conclusions drawn in an experimental study is becoming a must in both the pattern recognition and soft computing communities. Researchers can compare their results using significance tests such as the Wilcoxon test or the paired t-test.