A successful autonomous system needs not only to understand the visual world but also to communicate its understanding to humans. To make this possible, language can serve as a natural link between high-level semantic concepts and low-level visual perception. We'll present our recent work in the interdisciplinary domain of vision and language. We'll show how we can exploit the alignment between movies and the books they are based on to build more descriptive captioning systems. We'll also discuss our efforts toward automatic understanding of stories in long and complex videos.