We present a method that reduces random memory access on GPUs by appropriately partitioning computations. The target application is unstructured low-order finite element analysis, a core application in manufacturing analysis. To reduce memory access cost, we apply the element-by-element method to the matrix-vector multiplication in the analysis. This method performs a local matrix-vector computation for each element in parallel. Because atomic and cache hardware in GPUs has improved, we can exploit the data locality in the element-node connectivity by using atomic functions to add the local results into the global vector. We port the code to GPUs using OpenACC directives and attain high performance at low development cost. We also report performance on the NVIDIA DGX-1, which contains eight Pascal GPUs.