We'll cover how to automatically select the best parallelism for a deep learning algorithm. Current deep learning systems such as TensorFlow and MXNet focus on a single parallelization strategy, data parallelism, which requires large training batch sizes to scale. An alternative, model parallelism, has no such requirement but is inefficient when the model's parameters are large. Choosing the right parallelism is tedious for users because it requires extensive analysis of the whole program. We therefore propose Tofu, a system that automatically parallelizes a deep learning program. We cast the search for the best parallelization strategy as the search for the tiling that partitions the computation with the least overall communication, and we give an algorithm that is provably optimal; the resulting solution is a hybrid of data and model parallelism. Tofu automatically transforms the dataflow graph captured by an existing deep learning system's frontend into a parallel dataflow graph based on the optimal tiling it finds.
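To make the two tiling choices concrete, here is a minimal NumPy sketch (not Tofu's actual mechanism; all names are illustrative) showing how the same matrix multiply `Y = X @ W` can be partitioned across two workers either by batch (data parallelism) or by parameters (model parallelism), and why each choice implies a different communication cost:

```python
import numpy as np

# Illustrative sketch only: two ways to tile Y = X @ W across 2 workers.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # batch x features (activations)
W = rng.standard_normal((6, 8))   # features x hidden (parameters)
Y = X @ W                         # reference, unpartitioned result

# Data parallelism: tile X along the batch dimension. Each worker keeps a
# full replica of W and computes a slice of Y's rows; during training the
# gradients of W must be all-reduced, so communication scales with the
# parameter size -- costly when W is large.
x_shards = np.split(X, 2, axis=0)
Y_data = np.concatenate([x @ W for x in x_shards], axis=0)

# Model parallelism: tile W along its output dimension. Each worker keeps
# only a slice of the parameters but needs the full activations X, so
# communication scales with the activation (batch) size instead.
w_shards = np.split(W, 2, axis=1)
Y_model = np.concatenate([X @ w for w in w_shards], axis=1)

# Both tilings reproduce the unpartitioned computation.
assert np.allclose(Y, Y_data)
assert np.allclose(Y, Y_model)
```

A hybrid strategy of the kind Tofu searches for can mix these choices per operator, picking whichever tiling minimizes total communication for the whole dataflow graph.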