Parallelized GPU Code of City-Level Large Eddy Simulation
Daisuke Tsuji 1, Taisuke Boku 2, Ryosaku Ikeda 2,4, Takuto Sato 3, Hiroto Tadano 2 and Hiroyuki Kusaka 2
1 Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba
2 Center for Computational Sciences, University of Tsukuba
3 Doctoral Program in Geoenvironmental Sciences, Graduate School of Life and Environmental Sciences, University of Tsukuba
4 Weathernews Inc.
dtsuji@hpcs.cs.tsukuba.ac.jp, taisuke@cs.tsukuba.ac.jp, iked-r@wni.com, s1830199@s.tsukuba.ac.jp, tadano@cs.tsukuba.ac.jp, kusaka.hiroyuki.ff@u.tsukuba.ac.jp
In this paper, we describe the GPU implementation of our City-LES code, focusing on 1) a performance comparison between CUDA and OpenACC, and 2) how to reduce data exchange between CPU and GPU memories. Using a number of NVIDIA Tesla V100 GPUs, we found that today's OpenACC compiler by PGI achieves performance comparable to CUDA on the main part of the LES calculation. We also apply OpenACC aggressively, even to kernels that run slower on the GPU than on the CPU, so that all data stays encapsulated in GPU memory and data copies between GPU and CPU are avoided.
With our optimized OpenACC (partially CUDA) code, we doubled the performance of the full-GPU version by removing most of the GPU-CPU data copies present in the original GPU code. In a strong-scaling test on up to 32 nodes of the Cygnus GPU cluster at CCS, where each node is equipped with two Intel Xeon CPUs and four NVIDIA Tesla V100 GPUs, the full-GPU version achieved 4.7x to 10x the performance of the CPU version. In weak scaling, the full-GPU version achieved more than 9x the performance of the CPU version on up to 32 nodes of parallel execution.
Keywords: GPU, large eddy simulation, climate, cluster, CUDA, OpenACC