Real-time stereo matching is a requirement in many practical applications, so matching algorithms must run at high speed. We'll present a semi-global matching (SGBM) algorithm, which has several advantages, along with our hybrid CPU/GPU implementation, which runs around 23x faster than well-known OpenCV implementations. We'll present a simplified approach that breaks the problem into multiple modules, ports the suitable sections to CUDA, and optimizes the sequential sections on the CPU itself. Our CUDA implementation runs on a Tesla K20 card with the Kepler architecture. We focused on basic CUDA performance optimizations such as coalesced memory access patterns, collapsing of nested loops, and reducing iterative data transfers between the CPU and GPU.
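
As a rough illustration of what collapsing nested loops can look like, the host-side C++ sketch below builds a simple absolute-difference matching cost volume twice: once with three nested loops and once with a single flat loop. It is not the implementation described in this talk. The cost function, the memory layout, and all names (`pixelCost`, `costNested`, `costCollapsed`, `kInvalidCost`) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical types and names, chosen only for this sketch.
using Image = std::vector<unsigned char>;  // row-major grayscale image

constexpr int kInvalidCost = 255;  // assumed cost when x - d falls outside the image

// Absolute-difference cost between left(x, y) and right(x - d, y).
inline int pixelCost(const Image& left, const Image& right,
                     int width, int x, int y, int d) {
    if (x - d < 0) return kInvalidCost;
    return std::abs(int(left[y * width + x]) - int(right[y * width + x - d]));
}

// Nested version: three loops over rows, columns, and disparities.
// Assumed layout: cost[(y * width + x) * maxDisp + d].
std::vector<int> costNested(const Image& left, const Image& right,
                            int width, int height, int maxDisp) {
    assert(width > 0 && height > 0 && maxDisp > 0);
    std::vector<int> cost(std::size_t(width) * height * maxDisp);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            for (int d = 0; d < maxDisp; ++d)
                cost[(std::size_t(y) * width + x) * maxDisp + d] =
                    pixelCost(left, right, width, x, y, d);
    return cost;
}

// Collapsed version: one flat loop whose index is decoded into (y, x, d).
// In a CUDA kernel, idx would typically be computed as
// blockIdx.x * blockDim.x + threadIdx.x, so that neighbouring threads
// write neighbouring elements of cost (a coalesced pattern).
std::vector<int> costCollapsed(const Image& left, const Image& right,
                               int width, int height, int maxDisp) {
    assert(width > 0 && height > 0 && maxDisp > 0);
    const std::size_t total = std::size_t(width) * height * maxDisp;
    std::vector<int> cost(total);
    for (std::size_t idx = 0; idx < total; ++idx) {
        const int d = int(idx % maxDisp);
        const int x = int((idx / maxDisp) % width);
        const int y = int(idx / (std::size_t(maxDisp) * width));
        cost[idx] = pixelCost(left, right, width, x, y, d);
    }
    return cost;
}
```

Both functions fill the same buffer in the same layout, so the flat version can replace the nested one without changing downstream code. With the disparity index varying fastest, consecutive values of `idx` map to consecutive memory addresses, which is the property a coalesced GPU write pattern relies on.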