Publications

Books

Heterogeneous System Architecture
Yeh-Ching Chung, Benedict R. Gaster, Juan Gómez-Luna, Derek Hower, Lee Howes, Shih-Hao Hung, Thomas B. Jablin, David Kaeli, Phil Rogers, Ben Sander, I-Jui (Ray) Sung, Wen-Mei Hwu. Morgan Kaufmann 2015.
Abstract

Heterogeneous System Architecture – a new compute platform infrastructure presents a next-generation hardware platform, and associated software, that allows processors of different types to work efficiently and cooperatively in shared memory from a single source program. HSA also defines a virtual ISA for parallel routines or kernels, which is vendor and ISA independent, thus enabling single source programs to execute across any HSA-compliant heterogeneous processor, from those used in smartphones to supercomputers.

The book begins with an overview of the evolution of heterogeneous parallel processing, associated problems, and how they are overcome with HSA. Later chapters provide a deeper perspective on topics such as the runtime, memory model, queuing, context switching, the architected queuing language, simulators, and tool chains. Finally, three real-world examples are presented, which provide an early demonstration of how HSA can deliver significantly higher performance through C++ based applications. Contributing authors are HSA Foundation members who are experts from both academia and industry.

Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 edition
Benedict R. Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry and Dana Schaa. Morgan Kaufmann 2012.
Abstract

Heterogeneous Computing with OpenCL teaches OpenCL and parallel programming for complex systems that may include a variety of device architectures: multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units (APUs) such as AMD Fusion technology. Designed to work on multiple platforms and with wide industry support, OpenCL will help you more effectively program for a heterogeneous future.

This book will give you hands-on OpenCL experience to address a range of fundamental parallel algorithms. The book explores memory spaces, optimization techniques, graphics interoperability, extensions, and debugging and profiling. Intended to support a parallel programming course, Heterogeneous Computing with OpenCL includes detailed examples throughout, plus additional online exercises and other supporting materials.

Heterogeneous Computing with OpenCL
Benedict R. Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry and Dana Schaa. Morgan Kaufmann 2011.
Abstract

Heterogeneous Computing with OpenCL teaches OpenCL and parallel programming for complex systems that may include a variety of device architectures: multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units (APUs) such as AMD Fusion technology. Designed to work on multiple platforms and with wide industry support, OpenCL will help you more effectively program for a heterogeneous future.

This book will give you hands-on OpenCL experience to address a range of fundamental parallel algorithms. The book explores memory spaces, optimization techniques, graphics interoperability, extensions, and debugging and profiling. Intended to support a parallel programming course, Heterogeneous Computing with OpenCL includes detailed examples throughout, plus additional online exercises and other supporting materials.

Journal and Magazine articles

HRF-Relaxed: Adapting HRF to the Complexities of Industrial Heterogeneous Memory Models
Benedict R. Gaster and Derek Hower and Lee Howes. ACM TACO April 2015.
Abstract

Memory consistency models, or memory models, allow both programmers and program language implementers to reason about concurrent accesses to one or more memory locations. Memory model specifications balance the often conflicting needs for precise semantics, implementation flexibility, and ease of understanding. Toward that end, popular programming languages like Java, C, and C++ have adopted memory models built on the conceptual foundation of Sequential Consistency for Data-Race-Free programs (SC for DRF). These SC for DRF languages were created with general-purpose homogeneous CPU systems in mind, and all assume a single, global memory address space. Such a uniform address space is usually power and performance prohibitive in heterogeneous Systems on Chips (SoCs), and for that reason most heterogeneous languages have adopted split address spaces and operations with nonglobal visibility.

There have recently been two attempts to bridge the disconnect between the CPU-centric assumptions of the SC for DRF framework and the realities of heterogeneous SoC architectures. Hower et al. proposed a class of Heterogeneous-Race-Free (HRF) memory models that provide a foundation for understanding many of the issues in heterogeneous memory models. At the same time, the Khronos Group developed the OpenCL 2.0 memory model that builds on the C++ memory model. The OpenCL 2.0 model includes features not addressed by HRF: primarily support for relaxed atomics and a property referred to as scope inclusion. In this article, we generalize HRF to allow formalization of and reasoning about more complicated models using OpenCL 2.0 as a point of reference. With that generalization, we (1) make the OpenCL 2.0 memory model more accessible by introducing a platform for feature comparisons to other models, (2) consider a number of shortcomings in the current OpenCL 2.0 model, and (3) propose changes that could be adopted by future OpenCL 2.0 revisions or by other, related, models.

Bibtex

@article{Gaster:2015:HAH:2744295.2701618, author = {Gaster, Benedict R. and Hower, Derek and Howes, Lee}, title = {HRF-Relaxed: Adapting HRF to the Complexities of Industrial Heterogeneous Memory Models}, journal = {ACM Trans. Archit. Code Optim.}, issue_date = {April 2015}, volume = {12}, number = {1}, month = apr, year = {2015}, issn = {1544-3566}, pages = {7:1–7:26}, articleno = {7}, numpages = {26}, url = {http://doi.acm.org/10.1145/2701618}, doi = {10.1145/2701618}, acmid = {2701618}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {Memory models, computer architecture, formal models, programming languages}, }

Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?
Benedict R. Gaster and Lee Howes. IEEE Computer, pages 42-52 August 2012.
Bibtex

@article{10.1109/MC.2012.257, author = {Benedict R. Gaster and Lee Howes}, title = {Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?}, journal ={Computer}, volume = {45}, issn = {0018-9162}, year = {2012}, pages = {42-52}, doi = {http://doi.ieeecomputersociety.org/10.1109/MC.2012.257}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, }

A Systematic Design Space Exploration Approach to Customising Multi-Processor Architectures: Exemplified Using Graphics Processors
Ben Cope, Peter Y. K. Cheung, Wayne Luk and Lee W. Howes. Transactions on High-Performance Embedded Architectures and Compilers, pages 63-83. 2011.
Bibtex

@article{DBLP:journals/thipeac/CopeCLH11, author = {Ben Cope and Peter Y. K. Cheung and Wayne Luk and Lee W. Howes}, title = { A Systematic Design Space Exploration Approach to Customising Multi-Processor Architectures: Exemplified Using Graphics Processors}, journal = {T. HiPEAC}, volume = {4}, year = {2011}, pages = {63-83}, ee = {http://dx.doi.org/10.1007/978-3-642-24568-8_4}, crossref = {DBLP:journals/thipeac/2011-4}, bibsource = {DBLP, http://dblp.uni-trier.de} }

Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study
Ben Cope, Peter Y. K. Cheung, Wayne Luk and Lee W. Howes. IEEE Transactions on Computers, pages 433-448. April 2010.
Bibtex

@article{DBLP:journals/tc/CopeCLH10, author = {Ben Cope and Peter Y. K. Cheung and Wayne Luk and Lee W. Howes}, title = {Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study}, journal = {IEEE Trans. Computers}, volume = {59}, number = {4}, year = {2010}, pages = {433-448}, ee = {http://doi.ieeecomputersociety.org/10.1109/TC.2009.179}, bibsource = {DBLP, http://dblp.uni-trier.de} }

PhD Thesis

Indexed dependence metadata and its applications in software performance optimisation
Lee William Howes. Imperial College London 2010.
Abstract

To achieve continued performance improvements, modern microprocessor design is tending to concentrate an increasing proportion of hardware on computation units with less automatic management of data movement and extraction of parallelism. As a result, architectures increasingly include multiple computation cores and complicated, software-managed memory hierarchies. Compilers have difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic generation of efficient code in any but the most straightforward of cases.

We propose the concept of indexed dependence metadata to improve application development and mapping onto such architectures. The metadata represent both the iteration space of a kernel and the mapping of that iteration space from a given index to the set of data elements that iteration might use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping allows the compiler or runtime to optimise the program more efficiently, and improves the program structure for the developer. We argue that this form of explicit interface specification reduces the need for premature, architecture-specific optimisation. It improves program portability, supports intercomponent optimisation and enables generation of efficient data movement code.

We offer the following contributions: an introduction to the concept of indexed dependence metadata as a generalisation of stream programming, a demonstration of its advantages in a component programming system, the decoupled access/execute model for C++ programs, and how indexed dependence metadata might be used to improve the programming model for GPU-based designs. Our experimental results with prototype implementations show that indexed dependence metadata supports automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive loop fusion optimisations in image processing, linear algebra and multigrid application case studies.

Bibtex

@phdthesis{Howes2010, author = {Howes, Lee William}, title = {Indexed dependence metadata and its applications in software performance optimisation}, year = {2010}, school = {Imperial College London}, }

Conference papers

Efficient Parallel Image Clustering and Search on a Heterogeneous Platform
Dong Ping Zhang, Lifan Xu, Lee Howes. 22nd High Performance Computing Symposium (HPC), Best Paper Award, 2014.
Abstract

We present a parallel image clustering and search framework for large scale datasets that does not require image annotation, segmentation or registration. This work addresses the image search problem while avoiding the need for user-specified or auto-generated metadata. Instead we rely on image data alone to avoid the ambiguity inherent in user-provided information. We propose a parallel algorithm exploiting heterogeneous hardware resources to generate global descriptors for the set of input images. Given a group of query images we derive the global descriptors in parallel. Secondly, we propose to build a customisable search tree of the image database by performing a hierarchical K-means (H-Kmeans) clustering of the corresponding descriptors. Lastly, we design a novel parallel vBFS algorithm to search through the H-Kmeans tree and locate the set of closest matches for query image descriptors.

To validate our design we analyse the search performance and energy efficiency under a range of hardware clock frequencies and in comparison with alternative approaches. The result of our analysis shows that the framework greatly increases the search efficiency and thereby reduces the energy consumption per query.

Bibtex

@inproceedings{Zhang:2014:EPI:2663510.2663527, author = {Zhang, Dong Ping and Xu, Lifan and Howes, Lee}, title = {Efficient Parallel Image Clustering and Search on a Heterogeneous Platform}, booktitle = {Proceedings of the High Performance Computing Symposium}, series = {HPC ’14}, year = {2014}, location = {Tampa, Florida}, pages = {17:1–17:8}, articleno = {17}, numpages = {8}, url = {http://dl.acm.org/citation.cfm?id=2663510.2663527}, acmid = {2663527}, publisher = {Society for Computer Simulation International}, address = {San Diego, CA, USA}, keywords = {OpenCL image search, energy efficiency, hierarchical K-means, hybrid vector-based breadth first search (vBFS), parallel GIST descriptor generation}, }

KMA: A Dynamic Memory Manager for OpenCL
R. Spliet, L. Howes, B. R. Gaster, A. L. Varbanescu. 7th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU 7). March 2014.
Abstract

OpenCL is becoming a popular choice for the parallel programming of both multi-core CPUs and GPGPUs. One of the features missing in OpenCL, yet commonly found in irregular parallel applications, is dynamic memory allocation. In this paper, we propose KMA, a first dynamic memory allocator for OpenCL. KMA’s design is based on a thorough analysis of a set of 11 algorithms, which shows that dynamic memory allocation is a necessary commodity, typically used for implementing complex data structures (arrays, lists, or trees) that need constant restructuring at run-time. Taking into account both the survey findings and the OpenCL challenges, we design KMA as a two-layer memory manager that makes smart use of these patterns: its basic functionality provides generic malloc() and free() APIs, while the higher layer provides support for building and efficiently managing dynamic data structures. Our experiments focus on the performance and usability of KMA, for both micro-benchmarks and a real-life case-study, and our results show that when dynamic allocation is mandatory, KMA is a competitive allocator. We conclude that embedding dynamic memory allocation in OpenCL is feasible, but it is a complex, delicate task due to the massive parallelism of the platform and the requirement for portability.

Bibtex

@inproceedings{Spliet:2014:KDM:2588768.2576781, author = {Spliet, Roy and Howes, Lee and Gaster, Benedict R. and Varbanescu, Ana Lucia}, title = {KMA: A Dynamic Memory Manager for OpenCL}, booktitle = {Proceedings of Workshop on General Purpose Processing Using GPUs}, series = {GPGPU-7}, year = {2014}, isbn = {978-1-4503-2766-4}, location = {Salt Lake City, UT, USA}, pages = {9:9–9:18}, articleno = {9}, numpages = {10}, url = {http://doi.acm.org/10.1145/2576779.2576781}, doi = {10.1145/2576779.2576781}, acmid = {2576781}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {Dynamic memory allocation, Massive parallelism, Multi-/many-cores, OpenCL kernels}, }

Vasculature segmentation using parallel multi-hypothesis template tracking on heterogeneous platforms.
D. P. Zhang, L. Howes. SPIE Electronic Imaging: Parallel Processing in Image Processing Systems 2013.
Abstract

We present a parallel multi-hypothesis template tracking algorithm on heterogeneous platforms using a layered dispatch programming model. The contributions of this work are: an architecture-specific optimised solution for vasculature structure enhancement, an approach to segment the vascular lumen network from volumetric CTA images and a layered dispatch programming model to free the developers from hand-crafting mappings to particularly constrained execution domains on high throughput architecture. This abstraction is demonstrated through a vasculature segmentation application and can also be applied in other real-world applications.

Current GPGPU programming models define a grouping concept which may lead to poorly scoped local/shared memory regions and an inconvenient approach to projecting complicated iteration spaces. To improve on this situation, we propose a simpler and more flexible programming model that leads to easier computation projections and hence a more convenient mapping of the same algorithm to a wide range of architectures.

We first present an optimised image enhancement solution step-by-step, then solve a separable nonlinear least squares problem using a parallel Levenberg-Marquardt algorithm for template matching, and perform the energy efficiency analysis and performance comparison on a variety of platforms, including multi-core CPUs, discrete GPUs and APUs. We propose and discuss the efficiency of a layered-dispatch programming abstraction for mapping algorithms onto heterogeneous architectures.

Bibtex

@inproceedings{ZhangH13, author = {Zhang, Dong Ping and Howes, Lee}, title = {Vasculature segmentation using parallel multi-hypothesis template tracking on heterogeneous platforms}, booktitle = {SPIE Electronic Imaging: Parallel Processing in Image Processing Systems}, volume = {8655}, pages = {86550P-86550P-9}, year = {2013}, doi = {10.1117/12.2002698}, url = {http://dx.doi.org/10.1117/12.2002698} }

Efficient implementation of GPGPU synchronization primitives on CPUs
Jayanth Gummaraju, Ben Sander, Laurent Morichetti, Benedict Gaster and Lee Howes, ACM international conference on Computing Frontiers 2010.
Bibtex

@inproceedings{Gummaraju:2010:EIG:1787275.1787295, author = {Gummaraju, Jayanth and Sander, Ben and Morichetti, Laurent and Gaster, Benedict and Howes, Lee}, title = {Efficient implementation of GPGPU synchronization primitives on CPUs}, booktitle = {Proceedings of the 7th ACM international conference on Computing frontiers}, series = {CF ’10}, year = {2010}, isbn = {978-1-4503-0044-5}, location = {Bertinoro, Italy}, pages = {85–86}, numpages = {2}, url = {http://doi.acm.org/10.1145/1787275.1787295}, doi = {http://doi.acm.org/10.1145/1787275.1787295}, acmid = {1787295}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {gpgpu, multicore, synchronization}, }

High-performance SIMT code generation in an active visual effects library
Jay L.T. Cornwall, Lee Howes, Paul H.J. Kelly, Phil Parsonage, Bruno Nicoletti. ACM international conference on Computing Frontiers 2009.
Abstract

SIMT (Single-Instruction Multiple-Thread) is an emerging programming paradigm for high-performance computational accelerators, pioneered in current and next generation GPUs and hybrid CPUs. We present a domain-specific active-library supported approach to SIMT code generation and optimisation in the field of visual effects. Our approach uses high-level metadata and runtime context to guide and to ensure the correctness of optimisation-driven code transformations and to implement runtime-context-sensitive optimisations. Our advanced optimisations require no analysis of the original C++ kernel code and deliver 1.3x to 6.6x speedups over syntax-directed translation on GeForce 8800 GTX and GTX 260 GPUs with two commercial visual effects.

Bibtex

@inproceedings{CornwallHKPN09, author = {Cornwall, Jay L.T. and Howes, Lee and Kelly, Paul H.J. and Parsonage, Phil and Nicoletti, Bruno}, title = {High-performance SIMT code generation in an active visual effects library}, booktitle = {CF ’09: Proceedings of the 6th ACM conference on Computing frontiers}, year = {2009}, isbn = {978-1-60558-413-3}, pages = {175–184}, location = {Ischia, Italy}, doi = {http://doi.acm.org/10.1145/1531743.1531772}, publisher = {ACM}, address = {New York, NY, USA}, }

A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation
David B. Thomas, Lee Howes and Wayne Luk. Proceedings of FPGA 2009.
Abstract

The future of high-performance computing is likely to rely on the ability to efficiently exploit huge amounts of parallelism. One way of taking advantage of this parallelism is to formulate problems as “embarrassingly parallel” Monte-Carlo simulations, which allow applications to achieve a linear speedup over multiple computational nodes, without requiring a super-linear increase in inter-node communication. However, such applications are reliant on a cheap supply of high quality random numbers, particularly for the three main maximum entropy distributions: uniform, used as a general source of randomness; Gaussian, for discrete-time simulations; and exponential, for discrete-event simulations. In this paper we look at four different types of platform: conventional multi-core CPUs (Intel Core2); GPUs (NVidia GTX 200); FPGAs (Xilinx Virtex-5); and Massively Parallel Processor Arrays (Ambric AM2000). For each platform we determine the most appropriate algorithm for generating each type of number, then calculate the peak generation rate and estimated power efficiency for each device.

Bibtex

@inproceedings{ThomasHL09, author = {David B. Thomas and Lee Howes and Wayne Luk}, title = {A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation}, booktitle = {Proceedings of FPGA}, year = {2009}, }

Deriving efficient data movement from decoupled Access/Execute specifications
Lee W. Howes, Anton Lokhmotov, Alastair F. Donaldson and Paul H.J. Kelly. Proceedings of the 4th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC, AR: 28%) Paphos, Cyprus. January, 2009.
Abstract

On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the execution schedule of a computation kernel, the compiler or run-time system can derive efficient data movement, even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled Access/Execute specifications, allowing for automatic communication optimisations such as software pipelining and data reuse. We demonstrate the ease and efficiency of programming Sony/Toshiba/IBM’s Cell BE architecture using these classes by implementing a set of benchmarks, which exhibit data reuse and non-affine access functions, and by comparing these implementations against alternative implementations, which use hand-written DMA transfers and software-based caching.

Bibtex

@inproceedings{HowesLDK09, author = {Lee W. Howes and Anton Lokhmotov and Alastair F. Donaldson and Paul H.J. Kelly}, title = {Deriving Efficient Data Movement From Decoupled Access/Execute Specifications}, booktitle = {Proceedings of the 4th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC)}, year = {2009}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, volume = {5409}, pages = {168–182}, }

Comparing FPGAs to Graphics Accelerators and the Playstation 2 using a unified source description
Lee W. Howes, Paul Price, Oskar Mencer, Olav Beckmann, Oliver Pell. Proc. of the IEEE Conference on Field Programmable Logic and Applications, Madrid, Spain. 2006
Abstract

Field programmable gate arrays (FPGAs), graphics processing units (GPUs) and Sony’s PlayStation 2 vector units offer scope for hardware acceleration of applications. We compare the performance of these architectures using a unified description based on A Stream Compiler (ASC) for FPGAs, which has been extended to target GPUs and PS2 vector units. Programming these architectures from a single description enables us to reason about optimizations for the different architectures. Using the ASC description we implement a Monte Carlo simulation, a Fast Fourier Transform (FFT) and a weighted sum algorithm. Our results show that without much optimization the GPU is suited to the Monte Carlo simulation, while the weighted sum is better suited to PS2 vector units. FPGA implementations benefit particularly from architecture-specific optimizations which ASC allows us to easily implement by adding simple annotations to the shared code.

Bibtex

@INPROCEEDINGS{ HowesPMBP06, booktitle = {International Conference on Field-Programmable Logic}, title = {{Comparing FPGAs to Graphics Accelerators and the Playstation 2 Using a Unified Source Description}}, author = {Howes, Lee and Beckmann, Olav and Mencer, Oskar and Pell, Oliver and Price, Paul}, year = {2006}, url = {http://pubs.doc.ic.ac.uk/asc-fpga-gpu-ps2/} }

Design Space Exploration with A Stream Compiler
Oskar Mencer, David J. Pearce, Lee W. Howes, Wayne Luk. Proc. of the IEEE Conference on Field Programmable Technology (FPT) Tokyo, Japan. 2003
Abstract

We consider speeding up general-purpose applications with hardware accelerators. Traditionally hardware accelerators are tediously hand-crafted to achieve top performance. ASC (A Stream Compiler) simplifies exploration of hardware accelerators by transforming the hardware design task into a software design process using only gcc and make to obtain a hardware netlist. ASC enables programmers to customize hardware accelerators at three levels of abstraction: the architecture level, the functional block level, and the bit level. All three customizations are based on one uniform representation: a single C++ program with custom types and operators for each level of abstraction. This representation allows ASC users to express and reason about the design space, extract parallelism at each level and quickly evaluate different design choices. In addition, since the user has full control over each gate-level resource in the entire design, ASC accelerator performance can always be equal to or better than hand-crafted designs, usually with much less effort. We present several ASC benchmarks, including wavelet compression and Kasumi encryption.

Bibtex

@INPROCEEDINGS{ MencerPHL03, booktitle = {International Conference on Field-Programmable Technology}, title = {{Design space exploration with A Stream Compiler}}, author = {Mencer, Oskar and Pearce, David J. and Howes, Lee W. and Luk, Wayne}, year = {2003}, pages = {270 – 277} }

Book Chapters

Cloth Simulation in the Bullet physics SDK
Lee Howes. In OpenCL Programming Guide, Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg. Addison-Wesley Professional 2012.
Bibtex

@InCollection{howes-2011-opencl_programming_guide_cloth_simulation, author = {Lee Howes}, editor = {Aaftab Munshi and Benedict R Gaster and Timothy G Mattson and James Fung and Dan Ginsburg}, booktitle = {{OpenCL} Programming Guide}, title = {Cloth Simulation in the Bullet physics SDK}, chapter = 17, publisher = {Addison Wesley}, month = July, year = 2011, pages = {425–448} }

Efficient random number generation and application using CUDA
Lee Howes, David Thomas. GPU Gems 3 2007
Abstract

We describe various methods for performing Gaussian random number generation using CUDA and NVIDIA’s GPU hardware.

Bibtex

@InCollection{howes-2007-gpugems3_random_number_generation_cuda, author = {Lee Howes and David Thomas}, editor = {Hubert Nguyen}, booktitle = {{GPU} Gems 3}, title = {Efficient random number generation and application using {CUDA}}, chapter = 37, publisher = {Addison Wesley}, month = July, year = 2007, pages = {805–830} }

Workshop papers

OpenCL C++
Benedict R. Gaster, Lee Howes. Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs (GPGPU-6) Houston, Texas. March 2013
Abstract

With the success of programming models such as Khronos’ OpenCL, heterogeneous computing is going mainstream. However, these models are low-level, even when considering them as systems programming models. For example, OpenCL is effectively an extended subset of C99, limited to the type unsafe procedural abstraction that C has provided for more than 30 years. Computer systems programming has for more than two decades been able to do a lot better. One successful case in point is the systems programming language C++, known for its strong(er) type system, templates, and object-oriented abstraction features.

In this paper we introduce OpenCL C++, an object-oriented programming model (based on C++11) for heterogeneous computing and an alternative for developers targeting OpenCL enabled devices. We show that OpenCL C’s address space qualifiers, and by implication Embedded C’s, can be lifted into C++’s type system. A novel application of C++11’s new type inference features (auto/decltype) with respect to address space qualifiers allows natural and generic use of the this pointer. We qualitatively show that OpenCL C++ is a simpler and a more expressive development platform than its OpenCL C counterpart.

Bibtex

@inproceedings{GasterH13a, author = {Benedict R. Gaster and Lee Howes}, title = {OpenCL C++}, booktitle = { Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs (GPGPU-6)}, publisher = {ACM}, year = {2013}, date = {16 March 2013}, location = {Houston, TX, USA}, }

Formalizing Address Spaces with application to Cuda, OpenCL, and beyond
Benedict R. Gaster, Lee Howes. Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs (GPGPU-6) Houston, Texas. March 2013
Abstract

Cuda and OpenCL are aimed at programmers developing parallel applications targeting GPUs and embedded micro-processors. These systems often have explicitly managed memories exposed directly through a notion of disjoint address spaces. OpenCL address spaces are based on a similar concept found in Embedded C. A limitation of OpenCL is that a specific pointer must be assigned to a particular address space and thus functions, for example, must say which pointer arguments point to which address spaces. This leads to a loss of composability and moreover can lead to implementing multiple versions of the same function. This problem is compounded in the OpenCL C++ variant where a class’ implicit this pointer can be applied to multiple address spaces.

Modern GPUs, such as AMD’s Graphics Core Next and Nvidia’s Fermi, support an additional generic address space that dynamically determines an address’ disjoint address space, submitting the correct load/store operation to the particular memory subsystem. Generic address spaces allow for dynamic casting between generic and non-generic address spaces that is similar to the dynamic subtyping found in object-oriented languages. The advantage of the generic address space is that it simplifies the programming model, but sometimes at the cost of decreased performance, both dynamically and due to the optimization a compiler can safely perform.

This paper describes a new type system for inferring Cuda and OpenCL style address spaces. We show that the address space system can be inferred. We extend this base system with a notion of generic address space, including dynamic casting, and show that there also exists a static translation to architectures without support for generic address spaces, though it comes at a potential performance cost. This performance cost can be reclaimed when an architecture directly supports generic address space.

Bibtex

@inproceedings{GasterH13b, author = {Benedict R. Gaster and Lee Howes}, title = {Formalizing Address Spaces with application to Cuda, OpenCL, and beyond}, booktitle = { Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs (GPGPU-6)}, publisher = {ACM}, year = {2013}, date = {16 March 2013}, location = {Houston, TX, USA}, }

Towards metaprogramming for parallel systems on a chip.
Lee Howes, Anton Lokhmotov, Alastair F. Donaldson, Paul H.J. Kelly. Proceedings of the 3rd Euro-Par Workshop on Highly Parallel Processing on a Chip (HPPC, AR: 27.8%) Delft, The Netherlands. August 2009
Abstract

We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata which allow the programmer to specify both execution constraints and memory access pattern of a computation kernel.

Bibtex

@inproceedings{HowesLDK09b, author = {Lee Howes and Anton Lokhmotov and Alastair F. Donaldson and Paul H.J. Kelly}, title = {Towards metaprogramming for parallel systems on a chip}, booktitle = {Proceedings of the 3rd Euro-Par Workshop on Highly Parallel Processing on a Chip (HPPC)}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, year = {2009}, date = {25 August 2009}, location = {Delft, The Netherlands}, }

Decoupled Access/Execute Metaprogramming for GPU-Accelerated Systems
Lee Howes, Anton Lokhmotov, Paul H.J. Kelly, Alastair F. Donaldson. Symposium on Application Accelerators in High Performance Computing (SAAHPC, AR: 27.8%) Urbana-Champaign, Illinois. July 2009.
Bibtex

@inproceedings{howes-2009-decoupled_access_execute_metaprogramming, author = {Lee Howes and Anton Lokhmotov and Paul H.J. Kelly and Alastair F. Donaldson}, title = {Decoupled Access/Execute Metaprogramming for GPU-Accelerated Systems}, booktitle = {Symposium on Application Accelerators in High Performance Computing (SAAHPC)}, year = {2009}, date = {27 July 2009}, location = {Urbana-Champaign, Illinois, USA}, }

Optimising component composition using indexed dependence metadata.
Lee W. Howes, Anton Lokhmotov, Paul H.J. Kelly, Anthony J. Field. Proceedings of the 1st International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC) Lake Como, Italy. November 8, 2008.
Abstract

This paper explores the use of dependence metadata for optimising composition in component-based parallel programs. The idea is for each component to carry additional information about how points in its iteration space map to memory locations associated with its input and output data structures. When two components are composed this information can be used to implement optimisations that would otherwise require expensive analysis of the components’ code at the time of composition. This dependence metadata facilitates a number of cross-component optimisations; in this paper we focus on loop fusion and array contraction. We describe a prototype framework, based on the CLooG loop generator tool, that embodies these ideas and report experimental performance results for three non-trivial parallel benchmarks. Our results show execution time reductions of up to 50% using the proposed framework on an eight-core Intel Xeon system.

Bibtex

@inproceedings{HowesLKF08, author = {Lee W. Howes and Anton Lokhmotov and Paul H.J. Kelly and A. J. Field}, title = {Optimising component composition using indexed dependence metadata}, booktitle = {Proceedings of the 1st International Workshop on New Frontiers in High-performance and Hardware-aware Computing}, year = {2008}, }

Abstracts

Automating generation of data movement code for processors with distributed memories
Lee Howes, Anton Lokhmotov, Paul Kelly, Alastair Donaldson. 5th HiPEAC industrial workshop HP Labs, Barcelona, Spain. June 2008.
FPGAs, GPUs and the PS2 – A Single Programming Methodology
Lee W. Howes, Paul Price, Oskar Mencer, Olav Beckmann. IEEE Symposium on Field Programmable Custom Computing Machines, 2006 (poster session). Napa Valley, California, USA, April 2006.
Abstract

Field programmable gate arrays (FPGAs), graphics processing units (GPUs) and Sony’s PlayStation 2 vector units offer scope for hardware acceleration of applications. Implementing algorithms on multiple architectures can be a long and complicated process. We demonstrate an approach to compiling for FPGAs, GPUs and PS2 vector units using a unified description based on A Stream Compiler (ASC) for FPGAs. As an example of its use we implement a Monte Carlo simulation using ASC. The unified description allows us to evaluate optimisations for specific architectures on top of a single base description, saving time and effort.

Bibtex

@INPROCEEDINGS{HowesPMB06, booktitle = {The Fourteenth Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Napa Valley, CA, April 24-26, 2006}, title = {{FPGAs, GPUs and the PS2 - A Single Programming Methodology}}, author = {Howes, Lee and Price, Paul and Mencer, Oskar and Beckmann, Olav}, year = {2006}, url = {http://pubs.doc.ic.ac.uk/fpgas-gpus-ps2/} }

Accelerating the Development of Hardware Accelerators
Lee W. Howes, Oliver Pell, Oskar Mencer, Olav Beckmann. EDGE Workshop, University of North Carolina, Chapel Hill. 2006.

White Papers

Kite: Braided Parallelism for Heterogeneous Systems
J. Garrett Morris, Benedict R. Gaster, and Lee Howes. AMD 2010.
Abstract

Modern processors are evolving into hybrid, heterogeneous processors with both CPU and GPU cores used for general purpose computation. Several languages, such as BrookGPU, CUDA, and more recently OpenCL, have been developed to harness the potential of these processors. These languages typically involve control code running on a host CPU, while performance-critical, massively data-parallel kernel code runs on the GPUs.

In this paper we present Kite, a rethinking of the GPGPU programming model for heterogeneous braided parallelism: a mix of task and data-parallelism that executes code from a single source efficiently on CPUs and/or GPUs.

The Kite research programming language demonstrates that despite the limitations of today’s GPGPU architectures, it is still possible to move beyond the currently pervasive data-parallel models. We qualitatively demonstrate that opening the GPGPU programming model to braided parallelism allows the expression of yet-unported algorithms, while simultaneously improving programmer productivity by raising the level of abstraction. We further demonstrate Kite’s usefulness as a theoretical foundation for exploring alternative models for GPGPU by deriving task extensions for the C-based data-parallel programming language OpenCL.

Bibtex

@techreport{ Morris2012, title = {Kite: Braided Parallelism for Heterogeneous Systems}, author = {Morris, J. Garrett and Gaster, Benedict R and Howes, Lee}, year = {2010}, institution = {Advanced Micro Devices, Inc.}, url = {http://developer.amd.com/wordpress/media/2012/10/kite.pdf} }