Matrix factorization (MF) is widely used in recommender systems, topic modeling, word embedding, and more. Stochastic gradient descent (SGD) for MF is memory bound. Single-node CPU systems with caching perform well only on small datasets, while distributed systems offer higher aggregate memory bandwidth but suffer from relatively slow network connections. These observations inspire us to accelerate MF by exploiting GPUs' high memory bandwidth and fast intra-node connections. We present cuMF_SGD, a CUDA-based SGD solution for large-scale MF problems. On a single GPU, we design two workload scheduling schemes, batch-Hogwild! and wavefront-update, that fully exploit the massive number of cores. In particular, batch-Hogwild!, a vectorized version of Hogwild!, overcomes the issue of memory discontinuity. On three datasets, using only one Maxwell or Pascal GPU, cuMF_SGD runs 3.1 to 28.2 times as fast as state-of-the-art CPU solutions on 1 to 64 CPU nodes.
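To make the memory-bound claim concrete, the following is a minimal serial sketch of the standard SGD update for MF, written in plain Python rather than CUDA. It is not the cuMF_SGD kernel and does not implement batch-Hogwild! or wavefront-update; the function names `sgd_mf` and `rmse`, the L2-regularized squared-error objective, and all hyperparameter values are assumptions chosen for illustration. Each rating touches one row of `P` and one row of `Q` and performs only a few arithmetic operations per value loaded, which is the access pattern that tends to make the computation limited by memory bandwidth.

```python
import random


def sgd_mf(ratings, m, n, k=8, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Serial SGD for R ~ P Q^T on (row, col, value) triples.

    Illustrative only: hyperparameters and the regularized squared-error
    objective are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    # Small random initialization of the two factor matrices.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit ratings in random order each epoch
        for u, v, r in samples:
            p, q = P[u], Q[v]  # the only two vectors this update touches
            err = r - sum(a * b for a, b in zip(p, q))
            for f in range(k):
                pf, qf = p[f], q[f]  # keep old values so both updates use them
                p[f] += lr * (err * qf - reg * pf)
                q[f] += lr * (err * pf - reg * qf)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error of the factorization on the given triples."""
    total = 0.0
    for u, v, r in ratings:
        pred = sum(a * b for a, b in zip(P[u], Q[v]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


if __name__ == "__main__":
    # Tiny made-up 3x3 example with six observed entries.
    data = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
            (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
    P, Q = sgd_mf(data, m=3, n=3)
    print("training RMSE:", rmse(data, P, Q))
```

Two updates conflict only when they share a row of `P` or a row of `Q`. Hogwild!-style schemes exploit the sparsity of the rating matrix by running many such updates concurrently without locks, on the assumption that such conflicts are rare.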