As the number of videos on the web rapidly increases, it becomes more and more important to have a good visual presentation of videos that enables us to effectively perform high-level tasks such as video search, summarization, and captioning. In the image domain, it has been common to use a deep network-based representation and it has shown promising results in many image understanding problems. In the video domain, however, the progress is still far behind that in the image domain due to the lack of large-scale training data and a huge computational requirement for training. We'll give a brief review on the state-of-the-art techniques for video representation and talk about our ongoing research effort for video representation learning. We'll conclude with discussions about challenges and future work.