Our most recent works on Scene Graph Generation will be presented during ICCV2017 at poster session. We propose a novel neural network model, termed as Multi-level Scene Description Network (denoted as MSDN), to solve the object detection, scene graph generation, and region captioning jointly in an end-to-end manner. Objects, phrases, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark the learned model on three tasks, and show the joint learning across three tasks with our proposed method can bring mutual improvements over previous models. Particularly, on the scene graph generation task, our proposed method outperforms the state-of-art method with more than 3% margin.