PI: Shao-Yi Chien (National Taiwan University), Co-PI: Yu Tsao (Academia Sinica)
Champions: Yen-Kuang Chen (Intel), Shao-Wen Yang (Intel)
The main objective of this project is to apply state-of-the-art deep learning algorithms to improve the performance and overcome the obstacles of the Internet of Things (IoT) and augmented collective beings (ACB) framework. So far, we have made substantial progress in five directions: object-based video summarization, object tracking with re-identification, the wearable social camera, optimized CNN design, and cascading DNN.
Object-based video summarization achieves better results than our previous frame-based approach. With object detection and two-layer KNN clustering, summarization quality improves considerably: the average F1 score rises from 0.4 to 0.6. In the object tracking sub-project, thanks to the re-identification system based on a convolutional neural network together with the traveling-time model, we achieve 81.54% precision and 90.21% recall, which is state-of-the-art performance. In addition, novel and effective interaction features have been developed for the wearable social camera sub-project. Meanwhile, advanced algorithms have been derived that enable CNNs and DNNs to increase throughput while minimizing computation, storage, and bandwidth requirements under power constraints. The five sub-projects have resulted in significant accomplishments: (1) One paper has been accepted for publication at the IEEE International Workshop on Mobile Multimedia Computing (MMC 2016), held in conjunction with the 2016 IEEE International Conference on Multimedia & Expo (ICME 2016); one paper has been accepted by the 2017 IEEE International Symposium on Circuits and Systems (ISCAS 2017); and one paper has been submitted to the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2) The PI and Co-PI have been invited to serve as Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology, Associate Editor of IEICE Transactions on Information and Systems, and Editor of SpringerPlus.
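For readers who want to compare the tracking figures with the F1 scores quoted for summarization, precision and recall can be combined into a single F1 value. The following is a minimal sketch in Python, assuming the standard harmonic-mean definition of F1. The function name `f1_score` is illustrative, and the resulting tracking F1 is derived here from the reported precision and recall; it is not a figure reported by the project, and F1 values for different tasks are not directly comparable.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the object tracking sub-project.
tracking_precision = 0.8154
tracking_recall = 0.9021

# Derived value (an assumption of this sketch, not a reported result).
tracking_f1 = f1_score(tracking_precision, tracking_recall)
print(f"Tracking F1: {tracking_f1:.4f}")
```

Because F1 is a harmonic mean, it always lies between the precision and the recall and sits closer to the lower of the two.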
In the future, we will continue in three directions: object tracking with re-identification, the wearable social camera, and optimized CNN. For the re-identification sub-project, we plan to further improve the spatio-temporal model with the developed dataset, and will also consider re-training the neural network on the fly. For the wearable social camera sub-project, we plan to extend the regression engine by combining it with body language vectors. Moreover, since the human voice also plays an important role in social interactions, we will explore combining voice features with our interaction features. In addition to automatic speech recognition, we plan to derive novel deep-learning-based methods for speech emotion recognition, speaker identification, and paralinguistic analysis of speech, in order to obtain rich context-aware information about human behavior and the environment. For the optimized CNN, we will focus on binarized networks and the associated hardware architecture design. (Updated February 2017)