Paper #9
Parallelized GPU Code of City-Level Large Eddy Simulation
Daisuke Tsuji 1, Taisuke Boku 2, Ryosaku Ikeda 2,4, Takuto Sato 3, Hiroto Tadano 2 and Hiroyuki Kusaka 2
1 Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba
2 Center for Computational Sciences, University of Tsukuba
3 Doctoral Program in Geoenvironmental Sciences, Graduate School of Life and Environmental Sciences, University of Tsukuba
4 Weathernews Inc.
dtsuji @ hpcs cs tsukuba ac jp, taisuke @ cs tsukuba ac jp, iked-r @ wni com, s1830199 @ s tsukuba ac jp, tadano @ cs tsukuba ac jp, kusaka.hiroyuki.ff @ u tsukuba ac jp
Abstract: GPU computing has become a mainstream approach to accelerated computing, especially on large-scale supercomputers, because it achieves a high performance-to-power ratio. GPUs are also well suited to applications that require a large degree of SIMD-type parallelism and high memory bandwidth. Climate simulation is a typical example. At the Center for Computational Sciences (CCS), University of Tsukuba, we have been developing City-LES, a city-level high-resolution weather model based on Large Eddy Simulation (LES). City-LES models not only the surface of the region but also building blocks and surface materials such as ground, grass, and concrete. Because LES requires both heavy computation and frequent memory access, simulations take a long time, and a GPU implementation of the code is highly desirable for speeding them up.
In this paper, we describe the GPU implementation of City-LES, focusing on 1) a performance comparison between CUDA and OpenACC, and 2) how to reduce data exchange between CPU and GPU memories. Using multiple NVIDIA Tesla V100 GPUs, we found that the current OpenACC compiler by PGI achieves performance comparable to CUDA on the main part of the LES calculation. We also apply OpenACC aggressively, even to parts that run slower on the GPU than on the CPU, so that all data stay in GPU memory and copies between GPU and CPU are avoided.
With our optimized OpenACC code (partially written in CUDA), the full GPU version, which removes most of the GPU-CPU data copies present in the original GPU code, achieves twice the performance of that original code. The scaling tests were run on the GPU cluster Cygnus at CCS, where each node is equipped with two Intel Xeon CPUs and four NVIDIA Tesla V100 GPUs. Under strong scaling on up to 32 nodes, the full GPU version achieved 4.7x to 10x the performance of the CPU version. Under weak scaling on up to 32 nodes, the full GPU version achieved more than 9x the performance of the CPU version.
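The idea of keeping all data resident in GPU memory can be illustrated with a minimal sketch in C with OpenACC directives. This is not the City-LES code: the function names (`smooth`, `copy_back`, `run_steps`) and the 3-point smoothing stencil are hypothetical stand-ins for the real LES kernels. The point it shows is that a single `data` region encloses the whole time loop, so the field is copied to the GPU once and back once, while every kernel inside uses `present` to work on data already on the device.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an LES kernel: a 3-point smoothing stencil.
   present(...) states that both arrays already reside in GPU memory,
   so no host-device transfer happens here. */
static void smooth(int n, const double *restrict in, double *restrict out)
{
    #pragma acc parallel loop present(in[0:n], out[0:n])
    for (int i = 0; i < n; ++i) {
        if (i == 0 || i == n - 1)
            out[i] = in[i];                       /* boundaries stay fixed */
        else
            out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
    }
}

/* Device-side copy, so the result returns to `field` without touching the host. */
static void copy_back(int n, const double *restrict src, double *restrict dst)
{
    #pragma acc parallel loop present(src[0:n], dst[0:n])
    for (int i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Runs `steps` smoothing steps on `field` (length n, n >= 1).
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int run_steps(int n, int steps, double *field)
{
    double *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* One data region around the whole time loop:
       field is copied in once and out once; tmp exists only on the device. */
    #pragma acc data copy(field[0:n]) create(tmp[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            smooth(n, field, tmp);
            copy_back(n, tmp, field);
        }
    }

    free(tmp);
    return 0;
}
```

Compiled without OpenACC support, the directives are ignored and the same code runs serially on the CPU, which makes the sketch easy to check. Even a kernel that is slow on the GPU, such as the plain copy here, may be worth keeping on the device if moving it to the CPU would force a transfer in every time step.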
Keywords: GPU, large eddy simulation, climate, cluster, CUDA, OpenACC