Optimizing data movement between host and device memories is an important step when porting applications to GPUs. This is true for any programming model (CUDA, OpenACC, OpenMP 4+, etc.), and becomes even more challenging with complex aggregate data structures (arrays of structs with dynamically allocated array members). The CUDA and OpenACC APIs expose the separate host and device memories, requiring the programmer or compiler to explicitly manage the data allocation and coherence. The OpenACC committee is designing directives to extend this explicit data management for aggregate data structures. CUDA C++ has managed memory allocation routines and CUDA Fortran has the managed attribute for allocatable arrays, allowing the CUDA driver to manage data movement and coherence. Future NVIDIA GPUs will support true unified memory, with operating system and driver support for sharing the entire address space between the host and the GPU. We'll compare and contrast the current and future explicit memory movement with driver- and system-managed memory, and discuss how future developments will affect application development and performance.
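To make the contrast concrete, here is a minimal sketch in CUDA C++ of the two approaches the abstract describes, using a hypothetical `Vector` struct with a dynamically allocated array member. The explicit path must deep-copy by hand, translating the embedded pointer; the managed path relies on `cudaMallocManaged` so one address is valid on both host and device. Names and error handling are illustrative, not taken from any particular application.

```cuda
#include <cuda_runtime.h>

struct Vector { int n; double *data; };  // aggregate with a dynamic member

// Explicit deep copy: allocate device storage for the array member,
// copy its contents, then fix up the pointer inside the device struct.
Vector *copy_to_device(const Vector *host) {
    Vector tmp = *host;                       // shallow copy on the host
    cudaMalloc(&tmp.data, host->n * sizeof(double));
    cudaMemcpy(tmp.data, host->data, host->n * sizeof(double),
               cudaMemcpyHostToDevice);       // copy the array member
    Vector *dev;
    cudaMalloc(&dev, sizeof(Vector));
    cudaMemcpy(dev, &tmp, sizeof(Vector),
               cudaMemcpyHostToDevice);       // device struct now holds a device pointer
    return dev;
}

// Managed allocation: the driver migrates pages on demand, so the same
// pointers work in host and device code and no fix-up is needed.
Vector *alloc_managed(int n) {
    Vector *v;
    cudaMallocManaged(&v, sizeof(Vector));
    cudaMallocManaged(&v->data, n * sizeof(double));
    v->n = n;
    return v;
}
```

The explicit version is what the proposed OpenACC deep-copy directives would generate automatically; the managed version shows why driver-managed memory removes the pointer-translation burden entirely, at the cost of leaving placement and migration decisions to the runtime.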