A multi-level Implementation of Image Amplification on the General Purpose Graphical Processing Unit

Poster Number

5

Lead Author Affiliation

Master's in Engineering Science/School of Electrical and Computer Engineering

Lead Author Status

Master's Student

Second Author Affiliation

School of Electrical and Computer Engineering

Second Author Status

Faculty

Introduction/Abstract

With rapid advances in high-impact fields such as medical imaging, dental imaging, navigation, and microbiology, the amount of information stored in images has increased drastically. Although the images generated in these fields are comprehensive, scientists are often interested in the smallest details. In this research, we efficiently parallelize a state-of-the-art image amplification algorithm that allows users to amplify images to examine minute details. The algorithm aims to preserve the edges of the image, thereby capturing a rich representation of its content. It comprises four computationally intensive stages: 1) edge detection via the Canny algorithm; 2) edge preservation in the vertical direction (vertical edge-keeping); 3) edge preservation in the horizontal direction (horizontal edge-keeping); and 4) interpolation of the remaining pixels via mean-keeping. The computationally intensive nature of these stages makes the algorithm a strong match for massively parallel architectures such as GPGPU devices. We construct an effective implementation hierarchy that maps the algorithm stages step-by-step to an Nvidia Tesla GPGPU device using the Compute Unified Device Architecture (CUDA) programming model. Our step-by-step exposition of the stage mapping not only elucidates various CUDA optimization techniques but also enables users to relate the parallelization strategies to their own applications.
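The four stages above can be sketched in outline. The following Python sketch is a hypothetical illustration of the pipeline's structure, not the authors' implementation: the edge map is computed with a stand-in gradient threshold rather than a full Canny detector, and the edge-keeping stages are shown as plain neighbour averaging, whereas the actual algorithm interpolates in an edge-aware manner.

```python
import numpy as np

def edge_map(img, thresh=32.0):
    # Stand-in for stage 1: a real implementation would run the full
    # Canny detector; a gradient-magnitude threshold suffices here to
    # show where the edge map enters the pipeline.
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy) > thresh

def amplify_2x(img):
    """Hypothetical sketch of the four-stage 2x amplification pipeline."""
    h, w = img.shape
    out = np.zeros((2 * h, 2 * w))
    out[::2, ::2] = img                  # original pixels on the even grid
    edges = edge_map(img)                # stage 1: edge detection
    # Stage 2 (vertical edge-keeping): fill pixels between vertical
    # neighbours; the real algorithm would consult `edges` to interpolate
    # along, not across, detected edges.
    out[1:-1:2, ::2] = 0.5 * (out[:-2:2, ::2] + out[2::2, ::2])
    # Stage 3 (horizontal edge-keeping): fill pixels between horizontal
    # neighbours on the even rows.
    out[::2, 1:-1:2] = 0.5 * (out[::2, :-2:2] + out[::2, 2::2])
    # Stage 4 (mean-keeping): remaining pixels take the mean of their
    # four already-filled orthogonal neighbours.
    out[1:-1:2, 1:-1:2] = 0.25 * (out[:-2:2, 1:-1:2] + out[2::2, 1:-1:2]
                                  + out[1:-1:2, :-2:2] + out[1:-1:2, 2::2])
    return out
```

Each stage reads only values written by earlier stages, which is what lets every stage be expressed as an independent, massively data-parallel kernel.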

Purpose

The purpose of this research is to accelerate an artifact-free, state-of-the-art image amplification algorithm that requires substantial execution time on a conventional desktop processor. The use of GPGPU devices is a popular approach in high-performance computing for solving problems that involve complex calculations. With this research effort, users will be able to magnify images quickly and reliably to analyze minute details. The end users include medical specialists, biologists, and surveillance specialists, among many others.

Method

We mapped the four computationally intensive algorithm stages step-by-step to the Nvidia Tesla K20Xm GPGPU device by building device functions called kernels. This step-by-step stage mapping yields four implementations of the algorithm, thereby creating an implementation hierarchy. The first implementation, Implementation 1, builds the Canny edge detection kernel on the GPGPU device. Implementation 2 builds on Implementation 1 with the addition of the vertical edge-keeping kernel. Implementation 3 builds on Implementation 2 with the inclusion of the horizontal edge-keeping kernel. The last implementation of the hierarchy, Implementation 4, extends Implementation 3 by including the mean-keeping kernel. This complete mapping ensures that the focus remains on the highly beneficial GPGPU computations while keeping communication costs to a minimum. We devised an empirical method of identifying effective GPGPU kernel configurations to maximize device utilization. To evaluate the benefits of the GPGPU implementations, we performed an end-to-end application runtime comparison between the CPU-only and GPGPU implementations. For testing, we used an image repository containing Lenna images of sizes 1024×1024 (approximately 1 megapixel), 2048×2048, 4096×4096, 5120×5120, 7680×7680, 8192×8192, and 10240×10240 (approximately 105 megapixels).
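The empirical search for effective kernel configurations can be sketched as follows. This is a hypothetical illustration, not the authors' tooling: `run_kernel` stands in for a CUDA kernel launch at a given block size, and the search simply times each candidate configuration and keeps the fastest.

```python
import time

def best_block_size(run_kernel, candidates=(32, 64, 128, 256, 512, 1024),
                    repeats=3):
    """Time each candidate block size and return the fastest one.

    `run_kernel(block)` is a hypothetical callable that launches the
    GPGPU kernel with `block` threads per block and waits for it to
    complete (e.g. via a device synchronize) before returning.
    """
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_kernel(block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best        # keep the best of `repeats` runs
    return min(timings, key=timings.get)
```

Taking the minimum over several repeats reduces the impact of timing noise, and sweeping powers of two covers the block sizes a CUDA device typically supports.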

Results

For the largest image size, 10240×10240, the Canny edge detection kernel (stage 1) achieves a speedup of 84x; the vertical edge-keeping kernel (stage 2) achieves a 19x speedup; the horizontal edge-keeping kernel (stage 3) achieves a 9x speedup; and the mean-keeping kernel (stage 4) achieves a 90x speedup versus the CPU-only implementation. Additionally, the four hierarchical implementations achieve successively higher end-to-end application speedups of 3.2x, 3.7x, 4.5x, and 19x, respectively.

Significance

Image amplification is an important process that benefits various high-impact fields, including remote sensing, medical imaging, surveillance, and navigation. For various reasons, such as low-grade sensors, captured images may require magnification. Because researchers are often interested in the minute details of an image, this research provides a faster way to produce high-resolution images for rapid analysis.

Location

DUC Ballroom A&B

Format

Poster Presentation

Poster Session

Morning


Start Date

Apr 29th, 10:00 AM

End Date

Apr 29th, 12:00 PM
