Explicit Vector Programming with OpenMP 4.0 SIMD Extensions

By Xinmin Tian and Bronis R. de Supinski | November 19, 2014

6 – SUMMARY

A rich body of compiler development explores how to exploit vector-level parallelism through auto-vectorization for modern CPU [9, 10] and GPU [8] cores with powerful SIMD hardware support [1, 2, 3, 4, 5, 7, 8, 12, 14, 15]. However, modern SIMD architectures impose constraints such as data alignment, masking for control flow, non-unit-stride memory accesses and the fixed-length nature of SIMD vectors. Although significant effort has been directed at these challenges over the past decade [1, 5, 7, 14], automatic vectorization still often fails to vectorize application programs or to generate optimized SIMD code, for reasons such as loop trip counts that are unknown at compile time, memory access patterns or strides, alignment and control-flow complexity. To overcome these obstacles, programmers have had to resort to low-level SIMD intrinsic programming or inline ASM code in order to use SIMD hardware resources effectively [6].

Driven by the increasing prevalence of SIMD architectures in modern CPU and GPU processors, OpenMP 4.0 leveraged Intel’s explicit SIMD extensions [11, 16] to provide an industry-standard set of high-level SIMD vector extensions [13]. These extensions form a thin abstraction layer between the programmer and the hardware, which the programmer can use to harness the computational power of SIMD vector units without the low-productivity work of writing SIMD intrinsics or inline ASM code directly. With these SIMD extensions, compiler vendors can now deliver excellent performance on modern CPU and GPU processors.
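As a rough illustration of what this abstraction layer looks like in source code, the sketch below uses two OpenMP 4.0 constructs, `#pragma omp simd` and `#pragma omp declare simd`, on loops of the kind that auto-vectorizers may decline to vectorize (a function call in the loop body, a conditional reduction). The function names `scale_add`, `saxpy` and `sum_positive` are hypothetical and chosen only for this example. Whether a given compiler actually emits vector code still depends on the compiler and its flags, and the code assumes the caller passes non-overlapping arrays.

```c
#include <stddef.h>

/* Hypothetical element-wise helper. "declare simd" asks the compiler to
   generate a vector variant in addition to the scalar one, so that the
   call can be vectorized inside a SIMD loop. "notinbranch" states that
   the function is never called under a condition in such a loop. */
#pragma omp declare simd notinbranch
static float scale_add(float x, float y, float a)
{
    return a * x + y;
}

/* The simd construct tells the compiler that the iterations may be
   executed in SIMD lanes. The programmer, not the compiler, takes
   responsibility for x and y not overlapping (an assumption here). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        y[i] = scale_add(x[i], y[i], a);
    }
}

/* Control flow inside the loop is typically handled by masking; the
   reduction clause gives each SIMD lane a private partial sum that is
   combined into "sum" after the loop. */
float sum_positive(size_t n, const float *x)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        if (x[i] > 0.0f) {
            sum += x[i];
        }
    }
    return sum;
}
```

If the pragmas are not enabled (for example, when OpenMP support is switched off at compile time), they are ignored and the code remains valid scalar C with the same results, which is part of what makes the directive-based approach less intrusive than intrinsics.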

[References]

[1] A. Bik, M. Girkar, P. M. Grey, and X. Tian. Automatic Intra-Register Vectorization for the Intel Architecture. International Journal of Parallel Programming, (2):65-98, April 2002.

[2] A. Eichenberger, K. O’Brien, K. O’Brien, P. Wu, T. Chen, P. Oden, D. Prener, J. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, and M. Gschwind. Optimizing Compiler for the CELL Processor. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, 2005.

[3] A. Krall and S. Lelait. Compilation Techniques for Multimedia Processors. International Journal of Parallel Programming, (4):347-361, August 2000.

[4] Crescent Bay Software. VAST-F/AltiVec: Automatic Fortran Vectorizer for PowerPC Vector Unit. 2004.

[5] D. Nuzman, I. Rosen, A. Zaks. Auto-Vectorization of Interleaved Data for SIMD. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, 2006.

[6] G. Ren, P. Wu, and D. Padua. A Preliminary Study on the Vectorization of Multimedia Applications for Multimedia Extensions. In 16th International Workshop of Languages and Compilers for Parallel Computing, October 2003.

[7] G. Cheong and M. S. Lam. An Optimizer for Multimedia Instruction Sets, In Second SUIF Compiler Workshop, August 1997.

[8] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Transactions on Graphics, 23(3):777-786, 2004.

[9] Intel Corporation, “Intel® Xeon Phi™ Coprocessor System Software Developers Guide,” November 2012.

[10] Intel Corporation. Intel® Advanced Vector Extensions Programming Reference, Document number 319433-011, June 2011.

[11] M. Klemm, A. Duran, X. Tian, H. Saito, D. Caballero, X. Martorell, Extending OpenMP with Vector Constructs for Modern Multicore SIMD Architectures. International Workshop on OpenMP, June 2012: pp.59-72.

[12] M. J. Wolfe, High Performance Compilers for Parallel Computers, Addison-Wesley Publishing Company, Redwood City, California, 1996.

[13] OpenMP Architecture Review Board, “OpenMP Application Program Interface,” Version 4.0, July 2013, http://www.openmp.org

[14] P. Wu, A. E. Eichenberger, and A. Wang. Efficient SIMD Code Generation for Runtime Alignment. In Proceedings of the Symposium on Code Generation and Optimization, 2005.

[15] S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, pp.145-156, June 2000.

[16] X. Tian, H. Saito, M. Girkar, S. Preis, S. Kozhukhov, A.G. Cherkasov, C. Nelson, N. Panchenko, R. Geva, Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on Multicore-SIMD Processors. In Proc. of IEEE 26th International Parallel and Distributed Processing Symposium – Multicore and GPU Prog. Models, Lang. and Compilers Workshop, pp.2349 – 2358, 2012.

[17] X. Tian, Y.K. Chan, M. Girkar, S. Ge, R. Lienhart, and S. Shah, “Exploring the Use of Hyper-Threading Technology for Multimedia Applications with Intel OpenMP Compiler”, In Proc. of IEEE International Parallel and Distributed Processing Symposium, Nice, France, April 22-26, 2003.
