No
Yes
View More
View Less
Working...
Close
OK
Cancel
Confirm
System Message
Delete
Schedule
An unknown error has occurred and your request could not be completed. Please contact support.
Scheduled
Wait Listed
Personal Calendar
Speaking
Conference Event
Meeting
Interest
Schedule TBD
Conflict Found
This session is already scheduled at another time. Would you like to...
Loading...
Please enter a maximum of {0} characters.
Please enter a maximum of {0} words.
must be 50 characters or less.
must be 40 characters or less.
Session Summary
We were unable to load the map image.
This has not yet been assigned to a map.
Search Catalog
Reply
Replies ()
Search
New Post
Microblog
Microblog Thread
Post Reply
Post
Your session timed out.
This web page is not optimized for viewing on a mobile device. Visit this site in a desktop browser to access the full set of features.
2017 GTC San Jose

S7255 - cuTT: A High-Performance Tensor Transpose Library for GPUs

Session Speakers
Session Description

We'll introduce cuTT, a tensor transpose library for GPUs that on average achieves over 70% of the attainable memory bandwidth, independent of tensor rank. Tensor transposing is important in many applications such as multi-dimensional Fast Fourier Transforms and deep learning, and in quantum chemistry calculations. Until now, no runtime library existed that fully utilized the remarkable memory bandwidth of GPUs and could perform well independent of tensor rank. We'll describe two transpose algorithms, "Tiled" and "Packed," which achieve high-memory bandwidth in most use cases, as well as their variations that take care of many important corner cases. We'll also discuss a heuristic method based on GPU performance modeling that helps cuTT choose the optimal algorithm for the particular use case. Finally, we'll present benchmarks for tensor ranks 2 to 12 and show that cuTT, a fully runtime library, performs as well as an approach based on code generation.


Additional Session Information
Intermediate
Talk
Algorithms Deep Learning and AI Performance Optimization Tools and Libraries
25 minutes
Session Schedule