{"id":5,"date":"2011-06-09T10:26:27","date_gmt":"2011-06-09T18:26:27","guid":{"rendered":"http:\/\/www.leehowes.com\/wordpress\/?page_id=5"},"modified":"2021-12-07T08:24:43","modified_gmt":"2021-12-07T16:24:43","slug":"publications","status":"publish","type":"page","link":"http:\/\/www.leehowes.com\/?page_id=5","title":{"rendered":"Publications"},"content":{"rendered":"<h1>Books<\/h1>\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n      Heterogeneous System Architecture\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Yeh-Ching Chung, Benedict R. Gaster, Juan G\u00f3mez-Luna, Derek Hower, Lee Howes, Shih-Hao Hung, Thomas B. Jablin, David Kaeli, Phil Rogers, Ben Sander, I-Jui (Ray) Sung, Wen-Mei Hwu.\r\n      <\/span> \r\n      <span class=\"publisher\">Morgan Kaufmann<\/span><span class=\"pubDate\">2015.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'hsa_book_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"hsa_book_abstract\">\r\n        <p class=\"abstract\">\r\n          Heterogeneous System Architecture &#8211; a new compute platform infrastructure presents a next-generation hardware platform, and associated software, that allows processors of different types to work efficiently and cooperatively in shared memory from a single source program. 
HSA also defines a virtual ISA for parallel routines or kernels, which is vendor and ISA independent, thus enabling single source programs to execute across any HSA compliant heterogeneous processor, from those used in smartphones to supercomputers.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\nThe book begins with an overview of the evolution of heterogeneous parallel processing, associated problems, and how they are overcome with HSA. Later chapters provide a deeper perspective on topics such as the runtime, memory model, queuing, context switching, the architected queuing language, simulators, and tool chains. Finally, three real-world examples are presented, which provide an early demonstration of how HSA can deliver significantly higher performance through C++ based applications. Contributing authors are HSA Foundation members who are experts from both academia and industry.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"http:\/\/store.elsevier.com\/product.jsp?isbn=9780128003862\">More information<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n      Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 edition\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Benedict R. Gaster, Lee Howes, David R. 
Kaeli, Perhaad Mistry and Dana Schaa.\r\n      <\/span> \r\n      <span class=\"publisher\">Morgan Kaufmann<\/span><span class=\"pubDate\">2012.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'hc1.2_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"hc1.2_abstract\">\r\n        <p class=\"abstract\">\r\n          Heterogeneous Computing with OpenCL teaches OpenCL and parallel programming for complex \r\n          systems that may include a variety of device architectures: multi-core CPUs, GPUs, and \r\n          fully-integrated Accelerated Processing Units (APUs) such as AMD Fusion technology. \r\n          Designed to work on multiple platforms and with wide industry support, OpenCL will \r\n          help you more effectively program for a heterogeneous future.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          This book will give you hands-on OpenCL experience to address a range of fundamental \r\n          parallel algorithms. The book explores memory spaces, optimization techniques, graphics \r\n          interoperability, extensions, and debugging and profiling. Intended to support a parallel \r\n          programming course, Heterogeneous Computing with OpenCL includes detailed examples \r\n          throughout, plus additional online exercises and other supporting materials.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"http:\/\/www.heterogeneouscomputingwithopencl.org\">More information<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n      Heterogeneous Computing with OpenCL\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Benedict R. Gaster, Lee Howes, David R. 
Kaeli, Perhaad Mistry and Dana Schaa.\r\n      <\/span> \r\n      <span class=\"publisher\">Morgan Kaufmann<\/span><span class=\"pubDate\">2011.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'hc1.1_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"hc1.1_abstract\">\r\n        <p class=\"abstract\">\r\n          Heterogeneous Computing with OpenCL teaches OpenCL and parallel programming for complex \r\n          systems that may include a variety of device architectures: multi-core CPUs, GPUs, and \r\n          fully-integrated Accelerated Processing Units (APUs) such as AMD Fusion technology. \r\n          Designed to work on multiple platforms and with wide industry support, OpenCL will \r\n          help you more effectively program for a heterogeneous future.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          This book will give you hands-on OpenCL experience to address a range of fundamental \r\n          parallel algorithms. The book explores memory spaces, optimization techniques, graphics \r\n          interoperability, extensions, and debugging and profiling. 
Intended to support a parallel \r\n          programming course, Heterogeneous Computing with OpenCL includes detailed examples \r\n          throughout, plus additional online exercises and other supporting materials.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"http:\/\/www.heterogeneouscomputingwithopencl.org\">More information<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n<h1>Journal and Magazine articles<\/h1>\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n      HRF-Relaxed: Adapting HRF to the Complexities of Industrial Heterogeneous Memory Models\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Benedict R. Gaster and Derek Hower and Lee Howes.\r\n      <\/span> \r\n      <span class=\"publisher\">ACM TACO<\/span>\r\n      <span class=\"pubDate\">April 2015.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'hrfrelaxed_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"hrfrelaxed_abstract\">\r\n        <p class=\"abstract\">\r\n          Memory consistency models, or memory models, allow both programmers and program language implementers to reason about concurrent accesses to one or more memory locations. Memory model specifications balance the often conflicting needs for precise semantics, implementation flexibility, and ease of understanding. Toward that end, popular programming languages like Java, C, and C++ have adopted memory models built on the conceptual foundation of Sequential Consistency for Data-Race-Free programs (SC for DRF). These SC for DRF languages were created with general-purpose homogeneous CPU systems in mind, and all assume a single, global memory address space. 
Such a uniform address space is usually power and performance prohibitive in heterogeneous Systems on Chips (SoCs), and for that reason most heterogeneous languages have adopted split address spaces and operations with nonglobal visibility.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          There have recently been two attempts to bridge the disconnect between the CPU-centric assumptions of the SC for DRF framework and the realities of heterogeneous SoC architectures. Hower et al. proposed a class of Heterogeneous-Race-Free (HRF) memory models that provide a foundation for understanding many of the issues in heterogeneous memory models. At the same time, the Khronos Group developed the OpenCL 2.0 memory model that builds on the C++ memory model. The OpenCL 2.0 model includes features not addressed by HRF: primarily support for relaxed atomics and a property referred to as scope inclusion. In this article, we generalize HRF to allow formalization of and reasoning about more complicated models using OpenCL 2.0 as a point of reference. With that generalization, we (1) make the OpenCL 2.0 memory model more accessible by introducing a platform for feature comparisons to other models, (2) consider a number of shortcomings in the current OpenCL 2.0 model, and (3) propose changes that could be adopted by future OpenCL 2.0 revisions or by other, related, models.\r\n        <\/p> \r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'hrfrelaxed_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"hrfrelaxed_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @article{Gaster:2015:HAH:2744295.2701618,\r\n           author = {Gaster, Benedict R. 
and Hower, Derek and Howes, Lee},\r\n           title = {HRF-Relaxed: Adapting HRF to the Complexities of Industrial Heterogeneous Memory Models},\r\n           journal = {ACM Trans. Archit. Code Optim.},\r\n           issue_date = {April 2015},\r\n           volume = {12},\r\n           number = {1},\r\n           month = apr,\r\n           year = {2015},\r\n           issn = {1544-3566},\r\n           pages = {7:1&#8211;7:26},\r\n           articleno = {7},\r\n           numpages = {26},\r\n           url = {http:\/\/doi.acm.org\/10.1145\/2701618},\r\n           doi = {10.1145\/2701618},\r\n           acmid = {2701618},\r\n           publisher = {ACM},\r\n           address = {New York, NY, USA},\r\n           keywords = {Memory models, computer architecture, formal models, programming languages},\r\n          } \r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <div id=\"item2701618\"><a href=\"http:\/\/dl.acm.org\/authorize?N97977\" title=\"HRF-Relaxed: Adapting HRF to the Complexities of Industrial Heterogeneous Memory Models\">Download<\/a><\/div>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n      Can GPGPU Programming be Liberated from the Data Parallel Bottleneck\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Benedict R. 
Gaster and Lee Howes.\r\n      <\/span> \r\n      <span class=\"publisher\">IEEE Computer, pages 42-52<\/span>\r\n      <span class=\"pubDate\">August 2012.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'liberated_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"liberated_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @article{10.1109\/MC.2012.257,\r\n            author = {Benedict R. Gaster and Lee Howes},\r\n            title = {Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?},\r\n            journal ={Computer},\r\n            volume = {45},\r\n            issn = {0018-9162},\r\n            year = {2012},\r\n            pages = {42-52},\r\n            doi = {http:\/\/doi.ieeecomputersociety.org\/10.1109\/MC.2012.257},\r\n            publisher = {IEEE Computer Society},\r\n            address = {Los Alamitos, CA, USA},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n      A Systematic Design Space Exploration Approach to Customising Multi-Processor Architectures: Exemplified Using Graphics Processors\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Ben Cope, Peter Y. K. Cheung, Wayne Luk and Lee W. 
Howes\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Transactions on High-Performance Embedded Architectures and Compilers, pages 63-83.\r\n      <\/span>\r\n      <span class=\"pubDate\">2011.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'systematic_design_space_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"systematic_design_space_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @article{DBLP:journals\/thipeac\/CopeCLH11,\r\n            author  = {Ben Cope and Peter Y. K. Cheung and Wayne Luk and Lee W. Howes},\r\n            title = {\r\n              A Systematic Design Space Exploration Approach to Customising\r\n              Multi-Processor Architectures: Exemplified Using Graphics\r\n              Processors},\r\n            journal = {T. HiPEAC},\r\n            volume = {4},\r\n            year = {2011},\r\n            pages = {63-83},\r\n            ee  = {http:\/\/dx.doi.org\/10.1007\/978-3-642-24568-8_4},\r\n            crossref  = {DBLP:journals\/thipeac\/2011-4},\r\n            bibsource = {DBLP, http:\/\/dblp.uni-trier.de}\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/cope-2011-systematic_design_space_exploration_approach.pdf\">Download<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n     Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Ben Cope, Peter Y. K. Cheung, Wayne Luk and Lee W. 
Howes\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        IEEE Transactions on Computers, pages 433-448.\r\n      <\/span>\r\n      <span class=\"pubDate\">April 2010.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'performance_comparison_graphics_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"performance_comparison_graphics_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @article{DBLP:journals\/tc\/CopeCLH10,\r\n            author    = {Ben Cope and Peter Y. K. Cheung and Wayne Luk and Lee W. Howes},\r\n            title     = {Performance Comparison of Graphics Processors to Reconfigurable\r\n              Logic: A Case Study},\r\n            journal   = {IEEE Trans. Computers},\r\n            volume    = {59},\r\n            number    = {4},\r\n            year      = {2010},\r\n            pages     = {433-448},\r\n            ee        = {http:\/\/doi.ieeecomputersociety.org\/10.1109\/TC.2009.179},\r\n            bibsource = {DBLP, http:\/\/dblp.uni-trier.de}\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/cope-2010-performance_comparison.pdf\">Download<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n<h1>PhD Thesis<\/h1>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n     Indexed dependence metadata and its applications in software performance optimisation\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee William Howes.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Imperial College London\r\n      <\/span>\r\n      <span class=\"pubDate\">2010.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" 
class=\"blockOpener\" onclick=\"openBlock(this, 'thesis_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"thesis_abstract\">\r\n        <p class=\"abstract\">\r\n          To achieve continued performance improvements, modern microprocessor design is tending to concentrate \r\n          an increasing proportion of hardware on computation units with less automatic management of data movement \r\n          and extraction of parallelism. As a result, architectures increasingly include multiple computation cores and \r\n          complicated, software-managed memory hierarchies. Compilers have difficulty characterizing the behaviour of \r\n          a kernel in a general enough manner to enable automatic generation of efficient code in any but the most \r\n          straightforward of cases.\r\n      <\/p>\r\n      <p class=\"abstract\">We propose the concept of indexed dependence metadata to improve application development \r\n          and mapping onto such architectures. The metadata represent both the iteration space of a kernel and the mapping \r\n          of that iteration space from a given index to the set of data elements that iteration might use: thus the dependence \r\n          metadata is indexed by the kernel\u2019s iteration space. This explicit mapping allows the compiler or runtime to optimise \r\n          the program more efficiently, and improves the program structure for the developer. We argue that this form of explicit \r\n          interface specification reduces the need for premature, architecture-specific optimisation. It improves program portability, \r\n          supports intercomponent optimisation and enables generation of efficient data movement code. 
\r\n      <\/p>\r\n      <p class=\"abstract\">We offer the following contributions: an introduction to the concept of indexed dependence metadata \r\n          as a generalisation of stream programming, a demonstration of its advantages in a component programming system, \r\n          the decoupled access\/execute model for C++ programs, and how indexed dependence metadata might be used to \r\n          improve the programming model for GPU-based designs. Our experimental results with prototype implementations \r\n          show that indexed dependence metadata supports automatic synthesis of double-buffered data movement for the Cell \r\n          processor and enables aggressive loop fusion optimisations in image processing, linear algebra and multigrid application \r\n          case studies.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'thesis_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"thesis_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @phdthesis{Howes2010,\r\n            author = {Howes, Lee William},\r\n            title = {Indexed dependence metadata and its applications in software performance optimisation},\r\n            year = {2010},\r\n            school = {Imperial College London},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/HowesThesis2010.pdf\">Download<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n\r\n\r\n<h1>Conference papers<\/h1>\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n      Efficient Parallel Image Clustering and Search on a Heterogeneous Platform\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Dong Ping Zhang, Lifan Xu, Lee 
Howes\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        22nd High Performance Computing Symposium (HPC), Best Paper Award,\r\n      <\/span>\r\n      <span class=\"pubDate\">2014.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'HPC2014_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"HPC2014_abstract\">\r\n        <p class=\"abstract\">\r\n          We present a parallel image clustering and search framework for large scale datasets that does not require image annotation, segmentation or registration. This work addresses the image search problem while avoiding the need for user-specified or auto-generated metadata. Instead we rely on image data alone to avoid the ambiguity inherent in user-provided information. We propose a parallel algorithm exploiting heterogeneous hardware resources to generate global descriptors for the set of input images. Given a group of query images we derive the global descriptors in parallel. Secondly, we propose to build a customisable search tree of the image database by performing a hierarchical K-means (H-Kmeans) clustering of the corresponding descriptors. Lastly, we design a novel parallel vBFS algorithm to search through the H-Kmeans tree and locate the set of closest matches for query image descriptors.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\nTo validate our design we analyse the search performance and energy efficiency under a range of hardware clock frequencies and in comparison with alternative approaches. 
The result of our analysis shows that the framework greatly increases the search efficiency and thereby reduces the energy consumption per query.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'HPC2014_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"HPC2014_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{Zhang:2014:EPI:2663510.2663527,\r\n           author = {Zhang, Dong Ping and Xu, Lifan and Howes, Lee},\r\n           title = {Efficient Parallel Image Clustering and Search on a Heterogeneous Platform},\r\n           booktitle = {Proceedings of the High Performance Computing Symposium},\r\n           series = {HPC &#8217;14},\r\n           year = {2014},\r\n           location = {Tampa, Florida},\r\n           pages = {17:1&#8211;17:8},\r\n           articleno = {17},\r\n           numpages = {8},\r\n           url = {http:\/\/dl.acm.org\/citation.cfm?id=2663510.2663527},\r\n           acmid = {2663527},\r\n           publisher = {Society for Computer Simulation International},\r\n           address = {San Diego, CA, USA},\r\n           keywords = {OpenCL image search, energy efficiency, hierarchical K-means, hybrid vector-based breadth first search (vBFS), parallel GIST descriptor generation},\r\n          } \r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/zhang-2014-HPC.pdf\">Download paper<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n      KMA: A Dynamic Memory Manager for OpenCL\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        R. Spliet, L. Howes, B. R. Gaster, A. L. 
Varbanescu\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        7th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU 7). March 2014.\r\n      <\/span>\r\n      <span class=\"pubDate\">2014.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPGPU2014_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPGPU2014_abstract\">\r\n        <p class=\"abstract\">\r\n          OpenCL is becoming a popular choice for the parallel programming of both multi-core CPUs and GPGPUs. \r\n          One of the features missing in OpenCL, yet commonly found in irregular parallel applications, is dynamic memory allocation. \r\n          In this paper, we propose KMA, a first dynamic memory allocator for OpenCL. \r\n          KMA\u2019s design is based on a thorough analysis of a set of 11 algorithms, which shows that dynamic memory allocation is a necessary commodity, typically used for implementing complex data structures (arrays, lists, or trees) that need constant restructuring at run-time. \r\n          Taking into account both the survey findings and the OpenCL challenges, we design KMA as a two-layer memory manager that makes smart use of these patterns: its basic functionality provides generic malloc() and free() APIs, while the higher layer provides support for building and efficiently managing dynamic data structures. \r\n          Our experiments focus on the performance and usability of KMA, for both micro-benchmarks and a real-life case-study, and our results show that when dynamic allocation is mandatory, KMA is a competitive allocator. 
\r\n          We conclude that embedding dynamic memory allocation in OpenCL is feasible, but it is a complex, delicate task due to the massive parallelism of the platform and the requirement for portability. \r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPGPU2014_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPGPU2014_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{Spliet:2014:KDM:2588768.2576781,\r\n            author = {Spliet, Roy and Howes, Lee and Gaster, Benedict R. and Varbanescu, Ana Lucia},\r\n            title = {KMA: A Dynamic Memory Manager for OpenCL},\r\n            booktitle = {Proceedings of Workshop on General Purpose Processing Using GPUs},\r\n            series = {GPGPU-7},\r\n            year = {2014},\r\n            isbn = {978-1-4503-2766-4},\r\n            location = {Salt Lake City, UT, USA},\r\n            pages = {9:9&#8211;9:18},\r\n            articleno = {9},\r\n            numpages = {10},\r\n            url = {http:\/\/doi.acm.org\/10.1145\/2576779.2576781},\r\n            doi = {10.1145\/2576779.2576781},\r\n            acmid = {2576781},\r\n            publisher = {ACM},\r\n            address = {New York, NY, USA},\r\n            keywords = {Dynamic memory allocation, Massive parallelism, Multi-\/many-cores, OpenCL kernels},\r\n          } \r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/spliet-2014-KMA-DynamicMemoryManager.pdf\">Download paper<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n     Vasculature segmentation using parallel multi-hypothesis template tracking on heterogeneous platforms.\r\n    <\/div>\r\n    <div 
class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        D. P. Zhang, L. Howes.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        SPIE Electronic Imaging: Parallel Processing in Image Processing Systems\r\n      <\/span>\r\n      <span class=\"pubDate\">2013.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'SPIE2013_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"SPIE2013_abstract\">\r\n        <p class=\"abstract\">\r\n          We present a parallel multi-hypothesis template tracking algorithm on \r\n          heterogeneous platforms using a layered dispatch programming model. \r\n          The contributions of this work are: an architecture-specific optimised solution\r\n          for vasculature structure enhancement, an approach to segment the vascular \r\n          lumen network from volumetric CTA images and a layered dispatch programming \r\n          model to free the developers from hand-crafting mappings to particularly \r\n          constrained execution domains on high throughput architecture. This \r\n          abstraction is demonstrated through a vasculature segmentation application \r\n          and can also be applied in other real-world applications. \r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          Current GPGPU \r\n          programming models define a grouping concept which may lead to poorly \r\n          scoped local\/shared memory regions and an inconvenient approach to \r\n          projecting complicated iteration spaces. 
To improve on this situation, \r\n          we propose a simpler and more flexible programming model that leads to \r\n          easier computation projections and hence a more convenient mapping \r\n          of the same algorithm to a wide range of architectures. \r\n        <\/p>          \r\n        <p class=\"abstract\">\r\n          We first present \r\n          an optimised image enhancement solution step-by-step, then solve a \r\n          separable nonlinear least squares problem using a parallel \r\n          Levenberg-Marquardt algorithm for template matching, and perform the energy \r\n          efficiency analysis and performance comparison on a variety of platforms, \r\n          including multi-core CPUs, discrete GPUs and APUs. We propose and \r\n          discuss the efficiency of a layered-dispatch programming abstraction\r\n          for mapping algorithms onto heterogeneous architectures.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'SPIE2013_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"SPIE2013_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @article{ZhangH13,\r\n            author = {Zhang, Dong Ping and Howes, Lee},\r\n            title = {\r\n              Vasculature segmentation using parallel multi-hypothesis template tracking on heterogeneous platforms\r\n            },\r\n            booktitle = {SPIE Electronic Imaging: Parallel Processing in Image Processing Systems},\r\n            volume = {8655},\r\n            number = {},\r\n            pages = {86550P-86550P-9},\r\n            year = {2013},\r\n            doi = {10.1117\/12.2002698},\r\n            URL = {http:\/\/dx.doi.org\/10.1117\/12.2002698},\r\n            eprint = {}\r\n          }\r\n        <\/p>\r\n      
<\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/zhang-2013-spie_electronic_imaging.pdf\">Download paper<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n     Efficient implementation of GPGPU synchronization primitives on CPUs\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Jayanth\u00a0Gummaraju, Ben Sander, Laurent Morichetti, Benedict Gaster and Lee Howes,\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        ACM international conference on Computing Frontiers\r\n      <\/span>\r\n      <span class=\"pubDate\">2010.<\/span>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPGPU_synchronization_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPGPU_synchronization_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{Gummaraju:2010:EIG:1787275.1787295,\r\n            author = {Gummaraju, Jayanth and Sander, Ben and Morichetti, Laurent and Gaster, Benedict and Howes, Lee},\r\n            title = {Efficient implementation of GPGPU synchronization primitives on CPUs},\r\n            booktitle = {Proceedings of the 7th ACM international conference on Computing frontiers},\r\n            series = {CF &#8217;10},\r\n            year = {2010},\r\n            isbn = {978-1-4503-0044-5},\r\n            location = {Bertinoro, Italy},\r\n            pages = {85&#8211;86},\r\n            numpages = {2},\r\n            url = {http:\/\/doi.acm.org\/10.1145\/1787275.1787295},\r\n            doi = {http:\/\/doi.acm.org\/10.1145\/1787275.1787295},\r\n            acmid = {1787295},\r\n            publisher = {ACM},\r\n            address = {New York, NY, USA},\r\n            keywords = {gpgpu, 
multicore, synchronization},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/gummaraju-2010-efficient_implementation_of_GPGPU_synchronization_primitives_on_CPUs.pdf\">Download paper<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n     High-performance SIMT code generation in an active visual effects library\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Jay L.T. Cornwall, Lee Howes, Paul H.J. Kelly, Phil Parsonage, Bruno Nicoletti.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        ACM international conference on Computing Frontiers\r\n      <\/span>\r\n      <span class=\"pubDate\">2009.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'SIMT_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"SIMT_abstract\">\r\n        <p class=\"abstract\">\r\n          SIMT (Single-Instruction Multiple-Thread) is an emerging programming paradigm for high-performance \r\n          computational accelerators, pioneered in current and next generation GPUs and hybrid CPUs. We \r\n          present a domain-specific active-library supported approach to SIMT code generation and optimisation \r\n          in the field of visual effects. Our approach uses high-level metadata and runtime context to guide and to \r\n          ensure the correctness of optimisation-driven code transformations and to implement runtime-context-sensitive \r\n          optimisations. 
Our advanced optimisations require no analysis of the original C++ kernel code and \r\n          deliver 1.3x to 6.6x speedups over syntax-directed translation on GeForce 8800 GTX and GTX 260 GPUs \r\n          with two commercial visual effects.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'SIMT_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"SIMT_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{CornwallHKPN09,\r\n            author = {Cornwall, Jay L.T. and Howes, Lee and Kelly, Paul H.J. and Parsonage, Phil and Nicoletti, Bruno},\r\n            title = {High-performance SIMT code generation in an active visual effects library},\r\n            booktitle = {CF &#8217;09: Proceedings of the 6th ACM conference on Computing frontiers},\r\n            year = {2009},\r\n            isbn = {978-1-60558-413-3},\r\n            pages = {175&#8211;184},\r\n            location = {Ischia, Italy},\r\n            doi = {http:\/\/doi.acm.org\/10.1145\/1531743.1531772},\r\n            publisher = {ACM},\r\n            address = {New York, NY, USA},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/cornwall-2009-high-performance_SIMT_code_generation.pdf\">Download paper<\/a> | \r\n      <a href=\"..\/files\/cornwall-2009-high-performance_SIMT_code_generation-presentation.pdf\">Download presentation<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        David B. 
Thomas, Lee Howes and Wayne Luk.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Proceedings of FPGA\r\n      <\/span>\r\n      <span class=\"pubDate\">2009.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'FPGA_RNGS_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"FPGA_RNGS_abstract\">\r\n        <p class=\"abstract\">\r\n          The future of high-performance computing is likely to rely on the ability to efficiently exploit \r\n          huge amounts of parallelism. One way of taking advantage of this parallelism is to formulate \r\n          problems as \u201cembarrassingly parallel\u201d Monte-Carlo simulations, which allow applications to \r\n          achieve a linear speedup over multiple computational nodes, without requiring a super-linear \r\n          increase in inter-node communication. However, such applications are reliant on a cheap \r\n          supply of high quality random numbers, particularly for the three main maximum entropy \r\n          distributions: uniform, used as a general source of randomness; Gaussian, for discrete-time \r\n          simulations; and exponential, for discrete-event simulations. In this paper we look at four different \r\n          types of platform: conventional multi-core CPUs (Intel Core2); GPUs (NVidia GTX 200); FPGAs \r\n          (Xilinx Virtex-5); and Massively Parallel Processor Arrays (Ambric AM2000). 
For each platform \r\n          we determine the most appropriate algorithm for generating each type of number, then calculate \r\n          the peak generation rate and estimated power efficiency for each device.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'FPGA_RNGS_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"FPGA_RNGS_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{ThomasHL09,\r\n            author = {David B. Thomas and Lee Howes and Wayne Luk},\r\n            title = {A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation},\r\n            booktitle = {Proceedings of FPGA},\r\n            year = {2009},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/thomas-2009-comparison_of_cpus_gpus_fpgs_random_number_generation.pdf\">Download paper<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Deriving efficient data movement from decoupled Access\/Execute specifications\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee W. Howes, Anton Lokhmotov, Alastair F. Donaldson and Paul H.J. 
Kelly.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Proceedings of the 4th International Conference on High-Performance and Embedded Architectures and Compilers (<a href=\"http:\/\/www.hipeac.net\/conference\/\">HiPEAC<\/a>, AR: 28%)\r\n      <\/span>\r\n      <span class=\"pubLocation\">Paphos, Cyprus.<\/span>\r\n      <span class=\"pubDate\">January, 2009.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'AECUTE_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"AECUTE_abstract\">\r\n        <p class=\"abstract\">\r\n          On multi-core architectures with software-managed memories, effectively orchestrating \r\n          data movement is essential to performance, but is tedious and error-prone. In this paper \r\n          we show that when the programmer can explicitly specify both the memory access pattern \r\n          and the execution schedule of a computation kernel, the compiler or run-time system can \r\n          derive efficient data movement, even if analysis of kernel code is difficult or impossible. We \r\n          have developed a framework of C++ classes for decoupled Access\/Execute specifications, \r\n          allowing for automatic communication optimisations such as software pipelining and data \r\n          reuse. 
We demonstrate the ease and efficiency of programming Sony\/Toshiba\/IBM&#8217;s Cell BE \r\n          architecture using these classes by implementing a set of benchmarks, which exhibit data \r\n          reuse and non-affine access functions, and by comparing these implementations against \r\n          alternative implementations, which use hand-written DMA transfers and software-based caching.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'AECUTE_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"AECUTE_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{HowesLDK09,\r\n            author = {Lee W. Howes and Anton Lokhmotov and Alastair F. Donaldson and Paul H.J. Kelly},\r\n            title = {Deriving Efficient Data Movement From Decoupled Access\/Execute Specifications},\r\n            booktitle = {Proceedings of the 4th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC)},\r\n            year = {2009},\r\n            publisher = {Springer},\r\n            series = {Lecture Notes in Computer Science},\r\n            volume = {5409},\r\n            pages = {168&#8211;182},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/howes-2009-aecute.pdf\">Download paper<\/a> | <a href=\"..\/files\/howes-2009-aecute-presentation.pdf\">Download presentation<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Comparing FPGAs to Graphics Accelerators and the Playstation 2 using a unified source description\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee W. 
Howes, Paul Price, Oskar Mencer, Olav Beckmann, Oliver Pell.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Proc. of the IEEE Conference on Field Programmable Logic and Applications,\r\n      <\/span>\r\n      <span class=\"pubLocation\">Madrid, Spain.<\/span>\r\n      <span class=\"pubDate\">2006<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'comparing_FPGAS_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"comparing_FPGAS_abstract\">\r\n        <p class=\"abstract\">\r\n          Field programmable gate arrays (FPGAs), graphics processing units (GPUs) \r\n          and Sony&#8217;s PlayStation 2 vector units offer scope for hardware acceleration of \r\n          applications. We compare the performance of these architectures using a unified \r\n          description based on <em>A Stream Compiler<\/em> (ASC) for FPGAs, which has \r\n          been extended to target GPUs and PS2 vector units. Programming these \r\n          architectures from a single description enables us to reason about optimizations \r\n          for the different architectures. Using the ASC description we implement a Monte Carlo \r\n          simulation, a Fast Fourier Transform (FFT) and a weighted sum algorithm. Our \r\n          results show that without much optimization the GPU is suited to the Monte Carlo \r\n          simulation, while the weighted sum is better suited to PS2 vector units. 
FPGA \r\n          implementations benefit particularly from architecture specific optimizations which \r\n          ASC allows us to easily implement by adding simple annotations to the shared code.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'comparing_FPGAS_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"comparing_FPGAS_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @INPROCEEDINGS{ HowesPMBP06,\r\n            booktitle = {International Conference on Field-Programmable Logic},\r\n            title = {{Comparing FPGAs to Graphics Accelerators and the Playstation 2 Using a Unified Source Description}},\r\n            author = {Howes, Lee and Beckmann, Olav and Mencer, Oskar and Pell, Oliver and Price, Paul},\r\n            year = {2006},\r\n            url = {http:\/\/pubs.doc.ic.ac.uk\/asc-fpga-gpu-ps2\/}\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/howes-2006-comparing_fpags_to_graphics_accelerators_and_the_playstation2.pdf\">Download paper<\/a> | \r\n      <a href=\"..\/files\/howes-2006-comparing_fpags_to_graphics_accelerators_and_the_playstation2-presentation.pdf\">Download presentation<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Design Space Exploration with A Stream Compiler\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Oskar Mencer, David J. Pearce, Lee W. Howes, Wayne Luk.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Proc. 
of the IEEE Conference on Field Programmable Technology (FPT)\r\n      <\/span>\r\n      <span class=\"pubLocation\">Tokyo, Japan.<\/span>\r\n      <span class=\"pubDate\">2003<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'design_space_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"design_space_abstract\">\r\n        <p class=\"abstract\">\r\n          We consider speeding up general-purpose applications with hardware accelerators. \r\n          Traditionally hardware accelerators are tediously hand-crafted to achieve top performance. \r\n          ASC (A Stream Compiler) simplifies exploration of hardware accelerators by transforming \r\n          the hardware design task into a software design process using only gcc and make to \r\n          obtain a hardware netlist. ASC enables programmers to customize hardware accelerators \r\n          at three levels of abstraction: the architecture level, the functional block level, and the bit \r\n          level. All three customizations are based on one uniform representation: a single C++ \r\n          program with custom types and operators for each level of abstraction. This representation \r\n          allows ASC users to express and reason about the design space, extract parallelism at each \r\n          level and quickly evaluate different design choices. In addition, since the user has full control \r\n          over each gate-level resource in the entire design, ASC accelerator performance can always \r\n          be equal to or better than hand-crafted designs, usually with much less effort. 
We present \r\n          several ASC benchmarks, including wavelet compression and Kasumi encryption.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'design_space_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"design_space_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @INPROCEEDINGS{ MencerPHL03,\r\n            booktitle = {International Conference on Field-Programmable Technology},\r\n            title = {{Design space exploration with A Stream Compiler}},\r\n            author = {Mencer, Oskar and Pearce, David J. and Howes, Lee W. and Luk, Wayne},\r\n            year = {2003},\r\n            pages = {270 &#8211; 277}\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/mencer_2003-design_space_exploration_with_a_stream_compiler.pdf\">Download paper<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n\r\n<h1>Book Chapters<\/h1>\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Cloth Simulation in the Bullet physics SDK\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee Howes\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        In <em>OpenCL Programming Guide<\/em> Aaftab Munshi, Benedict R Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg. 
Addison-Wesley Professional 2012\r\n      <\/span>\r\n      <span class=\"pubDate\">2012<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'programming_guide_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"programming_guide_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @InCollection{howes-2011-opencl_programming_guide_cloth_simulation,\r\n            author = {Aaftab Munshi and Benedict R Gaster and Timothy G Mattson and James Fung and Dan Ginsburg},\r\n            booktitle = {{OpenCL} Programming Guide},\r\n            title = {Cloth Simulation in the Bullet physics SDK},\r\n            chapter = 17,\r\n            publisher = {Addison Wesley},\r\n            month = July,\r\n            year = 2011,\r\n            pages = {425&#8211;448}\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Efficient random number generation and application using CUDA\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee Howes, David Thomas.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        GPU Gems 3\r\n      <\/span>\r\n      <span class=\"pubDate\">2007<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPUGems_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPUGems_abstract\">\r\n        <p class=\"abstract\">\r\n          We describe various methods for performing Gaussian random \r\n          number generation using CUDA and NVIDIA&#8217;s GPU hardware.\r\n       
 <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPUGems_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPUGems_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @InCollection{howes-2007-gpugems3_random_number_generation_cuda,\r\n            author = {Lee Howes and David Thomas},\r\n            editor = {Hubert Nguyen},\r\n            booktitle = {{GPU} Gems 3},\r\n            title = {Efficient random number generation and application using {CUDA}},\r\n            chapter = 37,\r\n            publisher = {Addison Wesley},\r\n            month = July,\r\n            year = 2007,\r\n            pages = {805&#8211;830}\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/howes-2007-efficient_random_number_generation_CUDA.pdf\">Download chapter<\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n\r\n\r\n<h1>Workshop papers<\/h1>\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       OpenCL C++\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Benedict R. 
Gaster, Lee Howes\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs (<a href=\"http:\/\/www.ece.neu.edu\/groups\/nucar\/GPGPU\/GPGPU6\/\">GPGPU-6<\/a>)\r\n      <\/span>\r\n      <span class=\"pubLocation\"> Houston, Texas.<\/span>\r\n      <span class=\"pubDate\">March 2013<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPGPU6a_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPGPU6a_abstract\">\r\n        <p class=\"abstract\">\r\n          With the success of programming models such as Khronos\u2019 OpenCL,\r\n          heterogeneous computing is going mainstream. However, these\r\n          models are low-level, even when considering them as systems\r\n          programming models. For example, OpenCL is effectively an extended\r\n          subset of C99, limited to the type unsafe procedural abstraction\r\n          that C has provided for more than 30 years. Computer systems\r\n          programming has for more than two decades been able to do a lot\r\n          better. One successful case in point is the systems programming\r\n          language C++, known for its strong(er) type system, templates, and\r\n          object-oriented abstraction features.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          In this paper we introduce OpenCL C++, an object-oriented\r\n          programming model (based on C++11) for heterogeneous computing\r\n          and an alternative for developers targeting OpenCL enabled\r\n          devices. We show that OpenCL C\u2019s address space qualifiers, and by\r\n          implication Embedded C\u2019s, can be lifted into C++\u2019s type system. 
A\r\n          novel application of C++11\u2019s new type inference features\r\n          (auto\/decltype) with respect to address space qualifiers allows natural and\r\n          generic use of the this pointer. We qualitatively show that OpenCL\r\n          C++ is a simpler and a more expressive development platform than\r\n          its OpenCL C counterpart.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPGPU6a_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPGPU6a_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{GasterH13a,\r\n            author = {Benedict R. Gaster and Lee Howes},\r\n            title = {OpenCL C++},\r\n            booktitle = {\r\n              Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs (GPGPU-6)},\r\n            publisher = {ACM},\r\n            year = {2013},\r\n            date = {16 March 2013},\r\n            location = {Houston, TX, USA},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/gaster-2013-openclc++.pdf\">\r\n        Download paper\r\n      <\/a>\r\n      |\r\n      <a href=\"..\/files\/gaster-2013-openclc++_presentation.pdf\">\r\n        Download presentation\r\n      <\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Formalizing Address Spaces with application to Cuda, OpenCL, and beyond\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Benedict R. 
Gaster, Lee Howes\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs (<a href=\"http:\/\/www.ece.neu.edu\/groups\/nucar\/GPGPU\/GPGPU6\/\">GPGPU-6<\/a>)\r\n      <\/span>\r\n      <span class=\"pubLocation\"> Houston, Texas.<\/span>\r\n      <span class=\"pubDate\">March 2013<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPGPU6b_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPGPU6b_abstract\">\r\n        <p class=\"abstract\">\r\n          Cuda and OpenCL are aimed at programmers developing\r\n          parallel applications targeting GPUs and embedded micro-processors.\r\n          These systems often have explicitly managed memories exposed\r\n          directly through a notion of disjoint address spaces. OpenCL\r\n          address spaces are based on a similar concept found in Embedded C.\r\n          A limitation of OpenCL is that a specific pointer must be assigned\r\n          to a particular address space and thus functions, for example, must\r\n          say which pointer arguments point to which address spaces. This\r\n          leads to a loss of composability and moreover can lead to\r\n          implementing multiple versions of the same function. 
This problem is\r\n          compounded in the OpenCL C++ variant where a class\u2019 implicit\r\n          this pointer can be applied to multiple address spaces.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          Modern GPUs, such as AMD\u2019s Graphics Core Next and Nvidia\u2019s\r\n          Fermi, support an additional generic address space that\r\n          dynamically determines an address\u2019 disjoint address space, submitting the\r\n          correct load\/store operation to the particular memory subsystem.\r\n          Generic address spaces allow for dynamic casting between generic\r\n          and non-generic address spaces that is similar to the dynamic\r\n          subtyping found in object-oriented languages. The advantage of the\r\n          generic address space is that it simplifies the programming model but\r\n          sometimes at the cost of decreased performance, both dynamically\r\n          and due to the optimization a compiler can safely perform.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          This paper describes a new type system for inferring Cuda and\r\n          OpenCL style address spaces. We show that the address space\r\n          system can be inferred. We extend this base system with a notion\r\n          of generic address space, including dynamic casting, and show that\r\n          there also exists a static translation to architectures without support\r\n          for generic address spaces but comes at a potential performance\r\n          cost. 
This performance cost can be reclaimed when an architecture\r\n          directly supports generic address space.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'GPGPU6b_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"GPGPU6b_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{GasterH13b,\r\n            author = {Benedict R. Gaster and Lee Howes},\r\n            title = {Formalizing Address Spaces with application to Cuda, OpenCL, and beyond},\r\n            booktitle = {\r\n              Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs (GPGPU-6)},\r\n            publisher = {ACM},\r\n            year = {2013},\r\n            date = {16 March 2013},\r\n            location = {Houston, TX, USA},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/gaster-2013-formalizing_address_spaces.pdf\">\r\n        Download paper\r\n      <\/a>\r\n      |\r\n      <a href=\"..\/files\/gaster-2013-formalizing_address_spaces_presentation.pdf\">\r\n        Download presentation\r\n      <\/a>\r\n\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Towards metaprogramming for parallel systems on a chip.\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee Howes, Anton Lokhmotov, Alastair F. Donaldson, Paul H.J. 
Kelly.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Proceedings of the 3rd Euro-Par Workshop on Highly Parallel Processing on a Chip (<a href=\"http:\/\/www.hppc-workshop.org\/\">HPPC<\/a>, AR: 27.8%)\r\n      <\/span>\r\n      <span class=\"pubLocation\"> Delft, The Netherlands.<\/span>\r\n      <span class=\"pubDate\">August 2009<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'HPPC_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"HPPC_abstract\">\r\n        <p class=\"abstract\">\r\n          We demonstrate that the performance of commodity parallel \r\n          systems significantly depends on low-level details, such \r\n          as storage layout and iteration space mapping, which \r\n          motivates the need for tools and techniques that separate \r\n          a high-level algorithm description from low-level mapping \r\n          and tuning. We propose to build a tool based on the concept \r\n          of decoupled Access\/Execute metadata which allow the \r\n          programmer to specify both execution constraints and memory \r\n          access pattern of a computation kernel.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'HPPC_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"HPPC_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{HowesLDK09b,\r\n            author = {Lee Howes and Anton Lokhmotov and Alastair F. Donaldson and Paul H.J. 
Kelly},\r\n            title = {Towards metaprogramming for parallel systems on a chip},\r\n            booktitle = {\r\n              Proceedings of the 3rd Euro-Par Workshop on Highly Parallel Processing on a Chip (HPPC)},\r\n            publisher = {Springer},\r\n            series = {Lecture Notes in Computer Science},\r\n            year = {2009},\r\n            date = {25 August 2009},\r\n            location = {Delft, The Netherlands},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/howes-2009-towards_metaprogramming_parallel_systems_on_a_chip.pdf\">\r\n        Download paper\r\n      <\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Decoupled Access\/Execute Metaprogramming for GPU-Accelerated Systems\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee Howes, Anton Lokhmotov, Paul H.J. Kelly, Alastair F. Donaldson.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        Symposium on Application Accelerators in High Performance Computing (<a href=\"http:\/\/saahpc.ncsa.illinois.edu\/09\/\">SAAHPC<\/a>, AR: 27.8%)\r\n      <\/span>\r\n      <span class=\"pubLocation\">Urbana-Champaign, Illinois.<\/span>\r\n      <span class=\"pubDate\">July 2009.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'SAAHPC_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"SAAHPC_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{howes-2009-decoupled_access_execute_metaprogramming,\r\n            author = {Lee Howes and Anton Lokhmotov and Paul H.J. Kelly and Alastair F. 
Donaldson},\r\n            title = {Decoupled Access\/Execute Metaprogramming for GPU-Accelerated Systems},\r\n            booktitle = {Symposium on Application Accelerators in High Performance Computing (SAAHPC)},\r\n            series = {},\r\n            year = {2009},\r\n            date = {27 July 2009},\r\n            location = {Urbana-Champaign, Illinois},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Optimising component composition using indexed dependence metadata.\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee W. Howes, Anton Lokhmotov, Paul H.J. Kelly, Anthony J. Field.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        <a href=\"http:\/\/uvka.ubka.uni-karlsruhe.de\/shop\/isbn\/978-3-86644-298-6\" target=\"_top\" rel=\"noopener\">\r\n          Proceedings\r\n        <\/a> \r\n        of the 1st International Workshop on New Frontiers in High-performance \r\n        and Hardware-aware Computing (\r\n        <a href=\"http:\/\/www.hiphac.org\/\" target=\"_top\" rel=\"noopener\">HipHaC<\/a>)\r\n      <\/span>\r\n      <span class=\"pubLocation\">Lake Como, Italy.<\/span>\r\n      <span class=\"pubDate\">November 8, 2008.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'HipHaC_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"HipHaC_abstract\">\r\n        <p class=\"abstract\">\r\n          This paper explores the use of <em>dependence metadata<\/em> \r\n          for optimising composition in component-based parallel \r\n          programs. 
The idea is for each component to carry additional \r\n          information about how points in its iteration space map to \r\n          memory locations associated with its input and output data \r\n          structures. When two components are composed this information \r\n          can be used to implement optimisations that would otherwise \r\n          require expensive analysis of the components&#8217; code at the time \r\n          of composition. This dependence metadata facilitates a number \r\n          of cross-component optimisations &#8212; in this paper we focus \r\n          on loop fusion and array contraction. We describe a prototype \r\n          framework, based on the CLooG loop generator tool, that \r\n          embodies these ideas and report experimental performance \r\n          results for three non-trivial parallel benchmarks. Our \r\n          results show execution time reductions of up to 50% \r\n          using the proposed framework on an eight-core Intel Xeon system.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'HipHaC_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"HipHaC_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @inproceedings{HowesLKF08,\r\n            author = {Lee W. Howes and Anton Lokhmotov and Paul H.J. Kelly and A. J. 
Field},\r\n            title = {Optimising component composition using indexed dependence metadata},\r\n            booktitle = {\r\n              Proceedings of the 1st International Workshop on New Frontiers \r\n              in High-performance and Hardware-aware Computing},\r\n            year = {2008},\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/howes-2008-optimising_component_composition_using_indexed_dependence_metadata.pdf\">\r\n        Download paper\r\n      <\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n\r\n<h1>Abstracts<\/h1>\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Automating generation of data movement code for processors with distributed memories\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee Howes, Anton Lokhmotov, Paul Kelly, Alastair Donaldson.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        5th HiPEAC industrial workshop\r\n      <\/span>\r\n      <span class=\"pubLocation\">HP Labs, Barcelona, Spain.<\/span>\r\n      <span class=\"pubDate\">June 2008.<\/span>\r\n    <\/div>    \r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       FPGAs, GPUs and the PS2 &#8211; A Single Programming Methodology\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee W. 
Howes, Paul Price, Oskar Mencer, Olav Beckmann.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        IEEE Symposium on Field Programmable Custom Computing Machines, 2006 (poster session).\r\n      <\/span>\r\n      <span class=\"pubLocation\">Napa Valley, California, USA.<\/span>\r\n      <span class=\"pubDate\">April 2006.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'FCCM_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"FCCM_abstract\">\r\n        <p class=\"abstract\">\r\n          Field programmable gate arrays (FPGAs), \r\n          graphics processing units (GPUs) and Sony&#8217;s Playstation 2 \r\n          vector units offer scope for hardware acceleration of \r\n          applications. Implementing algorithms on multiple \r\n          architectures can be a long and complicated process. \r\n          We demonstrate an approach to compiling for FPGAs, GPUs \r\n          and PS2 vector units using a unified description based \r\n          on A Stream Compiler (ASC) for FPGAs. As an example of \r\n          its use we implement a Monte Carlo simulation using ASC. 
\r\n          The unified description allows us to evaluate optimisations \r\n          for specific architectures on top of a single base \r\n          description, saving time and effort.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'FCCM_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"FCCM_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @INPROCEEDINGS{ HowesPMB06,\r\n            booktitle = {\r\n              The Fourteenth Annual IEEE Symposium on Field-Programmable \r\n              Custom Computing Machines Napa Valley, CA, April 24-26, 2006},\r\n            title = {{FPGAs, GPUs and the PS2 -A Single Programming Methodology}},\r\n            author = {Howes, Lee and Price, Paul and Mencer, Oskar and Beckmann, Olav},\r\n            year = {2006},\r\n            url = {http:\/\/pubs.doc.ic.ac.uk\/fpgas-gpus-ps2\/}\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/howes-2006-fpgs_gpus_ps2_single_programming_methodology.pdf\">\r\n        Download paper\r\n      <\/a> \r\n      | \r\n      <a href=\"..\/files\/howes-2006-fpgs_gpus_ps2_single_programming_methodology-poster.pdf\">\r\n        Download poster\r\n      <\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Accelerating the Development of Hardware Accelerators\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        Lee W. 
Howes, Oliver Pell, Oskar Mencer, Olav Beckmann.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        EDGE Workshop, 2006\r\n      <\/span>\r\n      <span class=\"pubLocation\">University of North Carolina, Chapel Hill.<\/span>\r\n      <span class=\"pubDate\">2006<\/span>\r\n    <\/div>    \r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/howes-2006-accelerating_the_development_of_hardware_accelerators.pdf\">\r\n        Download poster\r\n      <\/a>\r\n    <\/div>\r\n  <\/div>\r\n\r\n\r\n<h1>White Papers<\/h1>\r\n  <div class=\"publication\">\r\n    <div class=\"pubTitle\">\r\n       Kite: Braided Parallelism for Heterogeneous Systems\r\n    <\/div>\r\n    <div class=\"pubInfo\">\r\n      <span class=\"authors\">\r\n        J. Garrett Morris, Benedict R. Gaster, and Lee Howes.\r\n      <\/span> \r\n      <span class=\"publisher\">\r\n        AMD\r\n      <\/span>\r\n      <span class=\"pubDate\">2010.<\/span>\r\n    <\/div>    \r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'Kite_abstract')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Abstract<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"Kite_abstract\">\r\n        <p class=\"abstract\">\r\n          Modern processors are evolving into hybrid, \r\n          heterogeneous processors with both CPU and \r\n          GPU cores used for general purpose computation. \r\n          Several languages, such as BrookGPU, CUDA, and \r\n          more recently OpenCL, have been developed to \r\n          harness the potential of these processors. 
\r\n          These languages typically involve control code \r\n          running on a host CPU, while performance-critical, \r\n          massively data-parallel kernel code runs on the GPUs.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          In this paper we present Kite, a rethinking of the \r\n          GPGPU programming model for heterogeneous braided \r\n          parallelism: a mix of task and data-parallelism that \r\n          executes code from a single source efficiently on \r\n          CPUs and\/or GPUs.\r\n        <\/p>\r\n        <p class=\"abstract\">\r\n          The Kite research programming language demonstrates \r\n          that despite the limitations of today\u2019s GPGPU \r\n          architectures, it is still possible to move beyond \r\n          the currently pervasive data-parallel models. We \r\n          qualitatively demonstrate that opening the GPGPU \r\n          programming model to braided-parallelism allows \r\n          the expression of yet unported algorithms, while \r\n          simultaneously improving programmer productivity \r\n          by raising the level of abstraction. We further \r\n          demonstrate Kite\u2019s usefulness as a theoretical \r\n          foundation for exploring alternative models for \r\n          GPGPU by deriving task extensions for the C-based \r\n          data-parallel programming language OpenCL.\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"hidingBlock\">\r\n      <div>\r\n        <img decoding=\"async\" class=\"blockOpener\" onclick=\"openBlock(this, 'Kite_bibtex')\" src=\"files\/closedBlock.png\" \/>\r\n        <span class=\"blockOpenerText\">Bibtex<\/span>\r\n      <\/div>\r\n      <div class=\"openingBlock\" id=\"Kite_bibtex\">\r\n        <p class=\"bibtex\">\r\n          @techreport{ Morris2012,\r\n            title = {Kite: Braided Parallelism for Heterogeneous Systems},\r\n            author = {Morris, J. 
Garrett and Gaster, Benedict R and Howes, Lee},\r\n            year = {2010},\r\n            institution = {Advanced Micro Devices, Inc.},\r\n            url = {http:\/\/developer.amd.com\/wordpress\/media\/2012\/10\/kite.pdf}\r\n          }\r\n        <\/p>\r\n      <\/div>\r\n    <\/div>\r\n    <div class=\"publicationDownload\">\r\n      <a href=\"..\/files\/morris-2010-kite.pdf\">Download paper<\/a>\r\n    <\/div>\r\n  <\/div>\r\n","protected":false},"excerpt":{"rendered":"<p>Books Heterogeneous System Architecture Yeh-Ching Chung, Benedict R. Gaster, Juan G\u00f3mez-Luna, Derek Hower, Lee Howes, Shih-Hao Hung, Thomas B. Jablin, David Kaeli, Phil Rogers, Ben Sander, I-Jui (Ray) Sung, Wen-Mei Hwu. Morgan Kaufman2015. Abstract Heterogeneous System Architecture &#8211; a new compute platform infrastructure presents a next-generation hardware platform, and associated software, that allows processors of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"fullwidth-page.php","meta":{"footnotes":""},"class_list":["post-5","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/www.leehowes.com\/index.php?rest_route=\/wp\/v2\/pages\/5","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.leehowes.com\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/www.leehowes.com\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/www.leehowes.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.leehowes.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5"}],"version-history":[{"count":10,"href":"http:\/\/www.leehowes.com\/index.php?rest_route=\/wp\/v2\/pages\/5\/revisions"}],"predecessor-version":[{"id":184,"href":"http:\/\/www.leehowes.com\/index.php?rest_route=\/wp\/v2\/pages\/5\/revisions\/184"}],"wp:attachment":[{"href":"http:\/\
/www.leehowes.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}