Multi-face tracking in unconstrained videos is a challenging problem, as faces of one person often appear drastically different across shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features, which are not sufficiently discriminative for identifying faces under such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches, which are trained only offline on large-scale face image datasets, we use contextual constraints to generate a large number of training samples for a given video and adapt the pre-trained face CNN to that specific video. Using these training samples, we optimize the embedding space by minimizing a triplet loss function, so that Euclidean distances correspond to a measure of semantic face similarity. With the learned discriminative features, we apply a hierarchical clustering algorithm to link tracklets across multiple shots into trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and on YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvement over existing techniques.
We tackle the problem of tracking multiple faces while maintaining the identity of each person in unconstrained videos. Such videos consist of many shots captured by different cameras. The main challenge is handling the large appearance variations of faces across shots due to changes in pose, viewing angle, scale, makeup, illumination, camera motion, and heavy occlusion.
Our multi-face tracking algorithm has four main steps: (a) Pre-training a CNN on a large-scale face recognition dataset to learn identity-preserving features, (b) Generating face pairs or face triplets from the tracklets in a specific video with the proposed spatio-temporal constraints and contextual constraints, (c) Adapting the pre-trained CNN to learn video-specific features from the automatically generated training samples, and (d) Linking tracklets within each shot and then across shots to form the face trajectories.
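The adaptation in step (c) minimizes a triplet loss: the distance between an anchor face and a positive sample (same identity) should be smaller, by a margin, than the distance to a negative sample (different identity). A minimal numerical sketch of this loss on pre-computed embeddings is shown below; the margin value and the toy 2-D embeddings are illustrative, not the paper's actual settings.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss on embedded face features.

    Pushes the anchor-positive squared Euclidean distance below the
    anchor-negative distance by at least `margin`. The margin here is
    an illustrative choice, not the paper's reported hyperparameter.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: anchor and positive are close, negative is far.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([2.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0: the margin constraint is satisfied
print(triplet_loss(a, n, p))  # positive loss: positive/negative swapped
```

During adaptation, gradients of this loss with respect to the network weights pull same-identity faces together and push different-identity faces apart in the embedding space.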
Here, we label the faces in T1 and T3 as the same identity given the sufficiently high similarity between their contextual features. With this additional constraint, we can propagate the constraints transitively and infer that the faces from T1 and T4 (or T5, T6) belong to different identities, and that the faces from T3 and T2 are from different people.
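This transitive propagation can be sketched with a union-find structure for must-link constraints plus a set of cannot-link pairs keyed by cluster roots. The class below is an illustrative toy, not the paper's implementation; the tracklet names mirror the T1-T6 example in the text, and the specific cannot-link pairs are assumed for the demonstration.

```python
class ConstraintGraph:
    """Toy must-link / cannot-link propagation over tracklets."""

    def __init__(self, tracklets):
        self.parent = {t: t for t in tracklets}  # union-find forest
        self.cannot = set()                      # frozenset pairs of roots

    def find(self, t):
        # Find the root of t with path compression.
        while self.parent[t] != t:
            self.parent[t] = self.parent[self.parent[t]]
            t = self.parent[t]
        return t

    def must_link(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            # Re-key existing cannot-links against the new roots.
            self.cannot = {frozenset(self.find(x) for x in pair)
                           for pair in self.cannot}

    def cannot_link(self, a, b):
        self.cannot.add(frozenset((self.find(a), self.find(b))))

    def different_identity(self, a, b):
        return frozenset((self.find(a), self.find(b))) in self.cannot

g = ConstraintGraph(["T1", "T2", "T3", "T4", "T5", "T6"])
g.cannot_link("T1", "T2")   # co-occurring tracklets: different people
g.cannot_link("T3", "T4")   # assumed co-occurrence for this sketch
g.must_link("T1", "T3")     # contextual features match
print(g.different_identity("T1", "T4"))  # True, propagated via T1 = T3
print(g.different_identity("T3", "T2"))  # True, propagated via T1 = T3
```

Merging T1 and T3 makes every cannot-link of one tracklet apply to the other, which is exactly the propagation described above.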
Clustering purity versus the number of clusters for different features on the YouTube music video, Big Bang Theory, and BUFFY datasets. The ideal line indicates that all faces are correctly grouped into ideal clusters, with a weighted purity of 1. The more discriminative a feature is, the faster its purity approaches 1 as the number of clusters increases. The legend lists each feature's purity at the ideal number of clusters.
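Weighted purity scores each cluster by the fraction of its faces that carry the majority identity label, weighted by cluster size. A minimal sketch of this metric, with made-up labels for illustration:

```python
from collections import Counter

def weighted_purity(clusters):
    """Weighted clustering purity.

    `clusters` is a list of clusters, each a list of identity labels.
    Each cluster contributes its majority-label count; the sum is
    normalized by the total number of faces, so a perfect clustering
    (every cluster contains a single identity) scores 1.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# Two clusters: one pure, one containing a mislabeled face.
print(weighted_purity([["A", "A", "A"], ["B", "B", "A"]]))  # 5/6
```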
[Video results: Tara, Pussycat Dolls, Bruno Mars]
Shun Zhang, Jia-Bin Huang, Jongwoo Lim, Yihong Gong, Jinjun Wang, Narendra Ahuja and Ming-Hsuan Yang, "Tracking Persons-of-Interest via Unsupervised Representation Adaptation", submitted to IJCV 2018. [paper] [supp] [video_demo] [data: Dropbox or BaiduYun] [Code coming soon]
Shun Zhang, Yihong Gong, Jia-Bin Huang, Jongwoo Lim, Jinjun Wang, Narendra Ahuja and Ming-Hsuan Yang, "Tracking Persons-of-Interest via Adaptive Discriminative Features", ECCV 2016.
Last updated: Feb. 6, 2018