Learn how to efficiently parallelize maximal independent set computations on GPUs. Our CUDA implementation is at least three times faster than the leading GPU codes on each of the 16 real-world and synthetic graphs we tested. Moreover, it produces a larger maximal independent set in all but one case. It is asynchronous, atomic-free, and requires fewer than 30 statements of kernel code. We'll present the code optimizations that enable this previously unreached performance and describe how to exploit monotonicity to minimize the memory footprint of this important irregular graph algorithm.
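To make the underlying idea concrete, here is a minimal host-side sketch of priority-based (Luby-style) maximal independent set computation: every vertex gets a random priority, and a vertex joins the set once it out-prioritizes all of its still-undecided neighbors, forcing those neighbors out. This is an illustrative sequential sketch, not the CUDA implementation discussed above; the CSR graph layout, function name, and tie-breaking rule are assumptions made for the example.

```cpp
#include <random>
#include <vector>

// Vertex status only ever moves forward: UNDECIDED -> IN or OUT, never back.
enum Status { UNDECIDED, IN, OUT };

// Priority-based MIS sketch over a graph in CSR form (rowPtr/col).
// Each while-iteration loosely mimics one pass of a GPU kernel.
std::vector<Status> maximalIndependentSet(const std::vector<int>& rowPtr,
                                          const std::vector<int>& col) {
  const int n = static_cast<int>(rowPtr.size()) - 1;
  std::mt19937 gen(12345);                 // fixed seed for reproducibility
  std::vector<unsigned> prio(n);
  for (int v = 0; v < n; v++) prio[v] = gen();  // random per-vertex priority
  std::vector<Status> stat(n, UNDECIDED);

  bool changed = true;
  while (changed) {
    changed = false;
    for (int v = 0; v < n; v++) {
      if (stat[v] != UNDECIDED) continue;
      bool best = true;  // does v beat all undecided neighbors this round?
      for (int e = rowPtr[v]; e < rowPtr[v + 1]; e++) {
        const int u = col[e];
        if (stat[u] == IN) {               // an in-set neighbor forces v out
          stat[v] = OUT;
          best = false;
          changed = true;
          break;
        }
        if (stat[u] == UNDECIDED &&
            (prio[u] > prio[v] || (prio[u] == prio[v] && u > v))) {
          best = false;                    // a neighbor wins; v waits
        }
      }
      if (best) {                          // v out-prioritizes all neighbors
        stat[v] = IN;
        changed = true;
      }
    }
  }
  return stat;
}
```

The monotone status transitions are the hook for the memory-footprint optimization mentioned above: since a vertex's state only ever advances, status and priority can presumably be fused into a single small per-vertex value rather than stored in separate arrays.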