IdenNet: Facial Action Unit Detection Using Identity Normalization

   Facial action unit (AU) detection is an important task that enables emotion recognition from facial movements. Figure 1 shows a few AU examples, which illustrate how AUs can describe facial expressions across primary categories such as neutral, happy, and sad. Prof. Hsu’s team conducts this research to improve the social ability of robots when they interact with humans. To that end, the team proposes a novel algorithm that uses identity normalization to address a challenging problem in AU detection: different subjects have significantly different appearances for the same AU, which causes existing methods to perform poorly in cross-domain scenarios, especially when the training and test datasets are dissimilar. The proposed method is implemented as convolutional neural network cascades consisting of two sub-tasks: face clustering and AU detection. The face clustering network, trained on a large identity-annotated face dataset, learns a transformation that extracts identity-dependent image features, which the second network uses for normalization when predicting AU labels. The cascades are jointly trained on AU- and identity-annotated datasets that contain numerous subjects to improve the method's applicability. Experimental results show that the proposed method achieves state-of-the-art AU detection performance on the benchmark datasets BP4D (Zhang et al., 2014), UNBC-McMaster (Lucey et al., 2011), and DISFA (Mohammad Mavadati, Mahoor, Bartlett, Trinh, & Cohn, 2013).
Proposed method
   Prof. Hsu’s team proposes multi-task convolutional neural network (CNN) cascades containing two sub-tasks, face clustering and AU detection, with shared convolutional layers that extract low-level features, as shown in Figure 2. The face clustering task is trained on identity-annotated datasets to exploit their large number of face images and wide range of subject attributes. It is designed to extract features from face images such that feature vectors are close to each other in the feature space if they come from the same subject and far apart if they do not, as illustrated in Figure 3. The extracted features serve as parameters of a translation transformation in the feature space that significantly reduces the differences caused by subject identity, as illustrated in Figure 4. Once the normalized feature vectors are obtained, a fully connected layer estimates the probability values of the AU label vector.
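   The translation-based normalization described above can be illustrated with a minimal sketch. It assumes that the translation is a plain subtraction of an identity-dependent offset and that the final layer is a fully connected layer followed by a sigmoid. The function names, vector sizes, and numbers are hypothetical and are not taken from the IdenNet implementation.

```python
import math


def identity_normalize(features, identity_offset):
    """Translate a feature vector by subtracting an identity-dependent offset.

    Assumption: the translation is a simple element-wise subtraction.
    """
    return [f - o for f, o in zip(features, identity_offset)]


def au_probabilities(features, weights, biases):
    """Fully connected layer plus sigmoid: one probability per AU label."""
    probs = []
    for w_row, b in zip(weights, biases):
        logit = sum(w * x for w, x in zip(w_row, features)) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Toy data (hypothetical): two subjects showing the same AU.
subject_a = [2.0, 1.0, 0.5]
identity_a = [1.5, 0.5, 0.5]
subject_b = [-0.5, 3.5, 1.0]
identity_b = [-1.0, 3.0, 1.0]

norm_a = identity_normalize(subject_a, identity_a)
norm_b = identity_normalize(subject_b, identity_b)

# One AU output with hand-picked weights, for illustration only.
weights = [[1.0, 1.0, 0.0]]
biases = [-1.0]
probs_a = au_probabilities(norm_a, weights, biases)
probs_b = au_probabilities(norm_b, weights, biases)
```

   In this toy example, the two subjects start from different raw features but map to the same normalized vector, so the classifier assigns them the same AU probabilities.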
   Because our training process requires not only face images but also identity labels, we use the CelebA dataset (Liu, Luo, Wang, & Tang, 2015), which contains a large number of samples with identity labels.
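   The clustering objective (features from the same subject close together, features from different subjects far apart) is commonly expressed with a pairwise contrastive loss. The source does not state which loss IdenNet uses, so the following is only a sketch under that assumption. The margin value and function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_subject, margin=1.0):
    """Pairwise contrastive loss (an assumed stand-in for the clustering loss).

    Same-subject pairs are penalized by their squared distance.
    Different-subject pairs are penalized only when closer than `margin`.
    """
    d = euclidean(u, v)
    if same_subject:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy pair (hypothetical): two feature vectors at distance 5.
anchor = [0.0, 0.0]
other = [3.0, 4.0]

loss_same = contrastive_loss(anchor, other, same_subject=True)
loss_diff_far = contrastive_loss(anchor, other, same_subject=False, margin=1.0)
loss_diff_near = contrastive_loss(anchor, other, same_subject=False, margin=7.0)
```

   Minimizing such a loss over many identity-labeled pairs pulls each subject's features into a tight cluster while pushing different subjects at least a margin apart.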
   We evaluate our method on three benchmark datasets, i.e., the BP4D, UNBC-McMaster, and DISFA datasets, and report its performance in Tables 1, 2, and 3 against the state-of-the-art AU detection methods DRML (Zhao, Chu, & Zhang, 2016), SVTPT (Zen, Sangineto, Ricci, & Sebe, 2014), and ROINet (Li, Abtahi, & Zhu, 2017). AlexNet (Krizhevsky, Sutskever, & Hinton, 2012) and LightCNN (Wu, He, Sun, & Tan, 2015) are also listed as baseline algorithms for comparison because AlexNet is a widely used object recognition method and LightCNN is a subset of our proposed network. We are unable to compare against ROINet on the UNBC-McMaster and DISFA datasets because its code is not available to us and its authors have not reported results on those datasets.
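   The tables report F1-frame scores. Assuming the usual frame-level definition, in which each video frame counts as one binary prediction for a given AU, the score can be computed as follows. The function name and the example labels are illustrative only.

```python
def f1_frame(y_true, y_pred):
    """Frame-level F1 score for one AU, given binary labels per frame."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when the AU never occurs
    # and is never predicted.
    return 2 * tp / denom if denom else 0.0


# Hypothetical ground truth and predictions for five frames.
truth = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
score = f1_frame(truth, pred)  # 2 TP, 1 FP, 1 FN
```

   Because F1 ignores true negatives, it is less affected than accuracy by the fact that most frames in AU datasets do not contain a given AU.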
   The proposed method outperforms existing AU detection methods on the three benchmark datasets. In addition, the proposed network cascades are more compact in terms of model size, as shown in Table 4.
   We propose IdenNet, a novel CNN-cascade method that exploits identity normalization for AU detection. The cascades extract identity-dependent features in the face clustering task and use them to normalize the features in the AU detection network. After the influence of individual identities is reduced, the feature vectors better represent the differences caused by AUs. Experimental results on benchmark datasets validate the effectiveness and robustness of the proposed method.
Table 1. F1-frame on the BP4D dataset with 3-fold random splits.
Table 2. F1-frame on the UNBC-McMaster dataset with 3-fold random splits. 
Table 3. F1-frame on the DISFA dataset with 3-fold random splits.
Table 4. Methods compared with respect to model size. Because the full size of ROINet is unknown, we conservatively report the size of its feature extractor.
Examples of facial action units
Figure 1. Facial action units code nearly any anatomically possible facial expression and can be used for emotion recognition in an ambient intelligent environment.
Overview of the proposed IdenNet
Figure 2. The method is implemented in an architecture of multi-task network cascades in which the two sub-tasks, face clustering and AU detection, share a common network and have their own task-specific CNN layers.
Local distribution in a feature space
Figure 3. The proposed method aims to learn an image representation in which feature vectors of images captured from the same subject are close to each other but far from the feature vectors of other subjects’ images. The blue, green, and yellow ovals represent three clusters of features captured from three different subjects.
Identity normalization for improved AU representation
Figure 4. Although the appearance changes caused by facial expressions are small, they still contribute to the feature vectors and form an AU-specific distribution that is highly similar across subjects in the feature space. Because the vectors’ local structures are similar, we normalize the vectors by removing their identity-level signals so that the new feature vectors better represent the AUs contained in the images.
1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Conference on Neural Information Processing Systems.
2. Li, W., Abtahi, F., & Zhu, Z. (2017). Action Unit Detection with Region Adaptation, Multi-labeling Learning and Optimal Temporal Fusing. IEEE Conference on Computer Vision and Pattern Recognition.
3. Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep Learning Face Attributes in the Wild. IEEE International Conference on Computer Vision.
4. Lucey, P., Cohn, J. F., Prkachin, K. M., Solomon, P. E., & Matthews, I. (2011). Painful Data: The UNBC-McMaster Shoulder Pain Expression Archive Database. IEEE Conference on Automatic Face and Gesture Recognition.
5. Mohammad Mavadati, S., Mahoor, M. H., Bartlett, K., Trinh, P., & Cohn, J. F. (2013). DISFA: A Spontaneous Facial Action Intensity Database. IEEE Transactions on Affective Computing, 4(2), 151-160.
6. Wu, X., He, R., Sun, Z., & Tan, T. (2015). A Light CNN for Deep Face Representation with Noisy Labels. arXiv preprint. arXiv:1511.02683 [cs.CV]
7. Zen, G., Sangineto, E., Ricci, E., & Sebe, N. (2014). Unsupervised Domain Adaptation for Personalized Facial Emotion Recognition. International Conference on Multimodal Interaction.
8. Zhang, X., Yin, L., Cohn, J. F., Canavan, S., Reale, M., Horowitz, A., & Girard, J. M. (2014). BP4D-Spontaneous: A High-resolution Spontaneous 3D Dynamic Facial Expression Database. Image and Vision Computing, 32(10), 692-706.
9. Zhao, K., Chu, W.-S., & Zhang, H. (2016). Deep Region and Multi-label Learning for Facial Action Unit Detection. IEEE Conference on Computer Vision and Pattern Recognition.
Andy Cheng-Hao Tu
Department of Computer Science and Information Engineering
Chih-Yuan Yang
Post-doctoral Researcher, Department of Computer Science and Information Engineering
Jane Yen-jen Hsu
Professor, Department of Computer Science and Information Engineering