Take your kernels to the next level with performance-enhancing techniques for every level of the CUDA memory hierarchy. We'll share lessons learned from bringing demanding image-processing algorithms into the real-time visual-simulation world. From CPU prototype to optimized GPU implementation, one algorithm saw a 150,000X speedup. Techniques presented include: instantaneous image decimation; cumulative distribution functions (CDFs) via warp shuffle; block and grid shapes for easy-to-program cache optimization; designing XY-separable kernels and their intermediate data; and sliding-window tradeoffs for maximum cache locality. Straightforward examples will make these optimizations easy to add to your CUDA toolbox.
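As a taste of the warp-shuffle technique mentioned above, here is a minimal sketch (not the speakers' actual code) of an inclusive prefix sum computed entirely in registers with `__shfl_up_sync` — one common building block for turning a histogram into a CDF without touching shared memory. The kernel name and the 32-bin layout are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Inclusive prefix sum across one warp using warp shuffles.
// Each step pulls the running sum from the lane 'offset' positions
// below; after log2(32) = 5 steps every lane holds the cumulative sum.
__device__ unsigned int warpInclusiveScan(unsigned int v) {
    for (int offset = 1; offset < 32; offset <<= 1) {
        unsigned int below = __shfl_up_sync(0xffffffffu, v, offset);
        if ((threadIdx.x & 31) >= offset)  // lanes with a valid source
            v += below;
    }
    return v;
}

// Illustrative kernel: one warp converts 32 histogram bins into a CDF
// in place, with no shared memory and no inter-warp synchronization.
__global__ void cdfFromHistogram(unsigned int* bins) {
    bins[threadIdx.x] = warpInclusiveScan(bins[threadIdx.x]);
}
```

Because the scan never leaves registers, it avoids shared-memory bank conflicts and block-wide barriers, which is what makes the warp-shuffle formulation attractive for small fixed-size reductions like a per-tile CDF.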