BPLG

Documentation

User Guide
Publications
FAQ

Project Documentation

August 19th, 2016

BPLG was created using a series of operators or building blocks. For detailed information about the library design, please refer to the associated publications. BPLG has a very simple interface, as described in the documentation and the header files. In contrast to other libraries, it does not need to create execution plans.

User Guide

Provisional user reference guide.

Publications

Jacobo Lobeiras, M. Amor, R. Doallo. Designing Efficient Index-Digit Algorithms for CUDA GPU Architectures
IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS). 2016. DOI: 10.1109/TPDS.2015.2450718
Adrián P. Diéguez, M. Amor, R. Doallo. New Tridiagonal Systems Solver on GPU architectures
22nd International Conference on High Performance Computing (HiPC). 2015. DOI: 10.1109/HiPC.2015.17
Adrián P. Diéguez, M. Amor, R. Doallo. BPLG-BMCS: GPU-Sorting Algorithm using a Tuning Skeleton Library.
Journal of Supercomputing. 2015. DOI: 10.1007/s11227-015-1591-9
J. Lobeiras, M. Amor, R. Doallo. BPLG: A Tuned Butterfly Processing Library for GPU Architectures.
International Journal of Parallel Programming (IJPP). 2015. DOI: 10.1007/s10766-014-0323-8
J. Lobeiras, M. Amor, R. Doallo. SPLG: A Tuned Signal Processing Library for GPU Architectures.
25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2013). October 2013. DOI: 10.1109/SBAC-PAD.2013.30

FAQ

Why does the library return an error for large problems?
The library is restricted to problems that fit in the GPU shared memory, thus large problems are not currently supported. To overcome this limitation we are working on a multi-kernel version for some of the algorithms.
Why the real FFT output has only N/2 complex elements?
In reality, the FFT of a real signal of size N has N complex elements as output. However, half of the data is redundant, since:

The real FFT is a specialization that exploits this property to reduce GPU bandwidth requirements. Normally, due to this symmetry the output would require N/2+1 elements. However, since both are pure real numbers, it is a common trick to store the last element in the complex part of the first element. Thus, the FFT of a real signal of size N will require only N/2 complex elements, enabling in-place transforms.
Why BPLG is slightly slower than the CUFFT?
The CUFFT has improved over time, and the current implementation is very efficient. It uses several kernels, depending on the GPU hardware and problem size. On the other hand, BPLG was created with simplicity in mind, with just a single kernel per algorithm to process a wide range of problem sizes. Thanks to the modular design it can be easily extended or customized.
What about other transforms?
As long as the transform can be expressed with some kind of permutation and butterfly operations, it can be implemented with the provided building blocks.
Is any kind of pivoting used for the resolution of tridiagonal systems?
Similarly to other GPU libraries, the current version of BPLG does not use pivoting, thus, it is not suitable for non diagonal-dominant problems. We have plans to improve this in a future version, complementing the algorithms with another method when some stability issue is detected.

BPLG

Butterfly Processing Library for GPUs

Project Documentation

User Guide

Publications

FAQ