Community detection is a key kernel in the analysis of complex networks across a variety of fields. We present our implementation of a new GPU algorithm for community detection based on the Louvain method. Our approach parallelizes the access to individual edges, which enables load balancing on networks whose nodes have highly varying degrees. We obtain speedups of up to a factor of 270 compared to the sequential algorithm. The algorithm consistently outperforms other recent shared-memory implementations and is only one order of magnitude slower than the current fastest parallel Louvain method, which runs on a Blue Gene/Q supercomputer using more than 500K threads.
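
To illustrate why edge-level parallelization helps with skewed degree distributions, the following minimal Python sketch compares the work assigned to each worker under two hypothetical schemes: one where each worker owns whole nodes, and one where the flat edge list is split into equal chunks. This is not the GPU kernel described here; the function names, the round-robin node assignment, and the star-like example graph are all assumptions made for illustration.

```python
from typing import List


def node_parallel_load(degrees: List[int], num_workers: int) -> List[int]:
    """Edges processed per worker when each worker owns whole nodes.

    Nodes are assigned round-robin (an assumption for this sketch), so a
    worker's load is the sum of the degrees of the nodes it owns.
    """
    load = [0] * num_workers
    for node, degree in enumerate(degrees):
        load[node % num_workers] += degree
    return load


def edge_parallel_load(degrees: List[int], num_workers: int) -> List[int]:
    """Edges processed per worker when the flat edge list is split evenly.

    Only the total number of edge entries matters, so loads differ by at
    most one regardless of how skewed the degrees are.
    """
    total_edges = sum(degrees)
    base, extra = divmod(total_edges, num_workers)
    return [base + (1 if worker < extra else 0) for worker in range(num_workers)]


# Hypothetical star-like graph: one hub with 1000 neighbors and 1000 leaves
# with a single neighbor each.
degrees = [1000] + [1] * 1000

print("node-parallel load:", node_parallel_load(degrees, 4))
print("edge-parallel load:", edge_parallel_load(degrees, 4))
```

In the node-based scheme, the worker that owns the hub receives most of the work while the others sit nearly idle. In the edge-based scheme, every worker receives the same number of edges, which is the kind of balance that per-edge access is meant to provide.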