Selected Publications

We propose an end-to-end unified model, the Invertible Question Answering Network (iQAN), which introduces question generation as a dual task of question answering to improve VQA performance. With our proposed invertible bilinear fusion module and parameter sharing scheme, iQAN can accomplish VQA and its dual task, visual question generation (VQG), simultaneously. By being jointly trained on the two tasks, our model gains a better understanding of the interactions among images, questions, and answers. Evaluated on the CLEVR and VQA2 datasets, iQAN improves the top-1 accuracy of the baseline MUTAN VQA method by 1.33% and 0.88%. We also show that our proposed dual training framework consistently improves model performance on many popular VQA architectures.
Spotlight in CVPR 2018

To leverage the mutual connections across semantic levels, we propose a novel neural network model, the Multi-level Scene Description Network (MSDN), to solve the three vision tasks jointly in an end-to-end manner. Object, phrase, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. A feature refining structure then passes messages across the three levels of semantic tasks through the graph. We benchmark the learned model on the three tasks and show that joint learning across them with our proposed method brings mutual improvements over previous models. In particular, on the scene graph generation task, our method outperforms the state-of-the-art method by a margin of more than 3%.
In ICCV 2017

In this paper, we formulate visual relationship detection as three inter-connected recognition problems and propose a Visual Phrase guided Convolutional Neural Network (ViP-CNN) to address them simultaneously. In ViP-CNN, we present a Phrase-guided Message Passing Structure (PMPS) to establish connections among the relationship components and help the model consider the three problems jointly. We also propose a corresponding non-maximum suppression method and model training strategy. Experimental results show that ViP-CNN outperforms the state-of-the-art method in both speed and accuracy. We further pretrain ViP-CNN on our cleansed Visual Genome Relationship dataset, which we find performs better than pretraining on ImageNet for this task.
In CVPR 2017

Recent Posts

A simple collection of the popular datasets in Computer Vision for your reference.



I have been a teaching assistant for the following courses at CUHK:

  • ENGG1110: Introduction to Programming
  • ENGG5202: Pattern Recognition