Today Amazon Web Services takes another step on the continuous innovation path by announcing a new Amazon EC2 instance type: The Cluster GPU Instance. Based on the Cluster Compute instance type, the Cluster GPU instance adds two NVIDIA Tesla M2050 GPUs, offering GPU-based computational power of over one TeraFLOPS per instance. This incredible power is available for anyone to use in the usual pay-as-you-go model, removing the investment barrier that has kept many organizations from adopting GPUs for their workloads even though they knew there would be significant performance benefit.

From financial processing and traditional oil & gas exploration HPC applications to the integration of complex 3D graphics into online and mobile applications, the applications of GPU processing appear to be limitless. We believe that making these GPU resources available for everyone to use at low cost will drive new innovation in the application of highly parallel programming models.

From CPU to GPU

Building general purpose architectures has always been hard; there are often so many conflicting requirements that you cannot derive an architecture that will serve all, so we have often ended up focusing on one side of the requirements in order to serve that area really well. For example, the most fundamental abstraction trade-off has always been latency versus throughput. These trade-offs have even impacted the way the lowest level building blocks in our computer architectures have been designed. Modern CPUs strongly favor lower latency of operations, with clock cycles in the nanoseconds, and we have built general purpose software architectures that can exploit these low latencies very well. Now that our ability to generate higher and higher clock rates has stalled and CPU architectural improvements have shifted focus towards multiple cores, we see that it is becoming harder and harder to use these computer systems effectively.

One area of trade-offs where our general purpose CPUs do not perform well is massive fine-grained parallelism. Graphics processing is one such area, with huge computational requirements but where each of the tasks is relatively small and where often a set of operations is performed on data in the form of a pipeline. The throughput of this pipeline is more important than the latency of the individual operations. Because of its focus on latency, the generic CPU yielded rather inefficient systems for graphics processing. This led to the birth of the Graphics Processing Unit (GPU), which focuses on providing a very fine-grained parallel model, with processing organized in multiple stages through which the data flows. The model of a GPU is one of task parallelism describing the different stages in the pipeline, combined with data parallelism within each stage, resulting in a highly efficient, high-throughput computation architecture.

Early GPU systems were very vendor specific and mostly consisted of graphics operators implemented in hardware, able to operate on data streams in parallel. This gave rise to a whole new style of computer architecture where suddenly relatively simple workstations could be used for very complex graphics tasks such as computer-aided design. However, these fixed-function pipelines for vertex and fragment operations eventually became too restrictive for the evolution of next generation graphics, so new GPU architectures were developed in which user-specific programs could run in each stage of the pipeline. As each of these programs became more complex and the demand for new geometry-processing operations increased, GPU architectures evolved into a long feed-forward pipeline of general purpose 32-bit processing units that handle both task and data parallelism. The different stages are then load balanced across the available units.


General Purpose GPU programming

Programming the GPU evolved in a similar fashion; it started with early APIs that mainly exposed operations programmed directly in the hardware. The second generation of GPU APIs was still graphics oriented, but under the covers implemented dynamic assignment of specialized tasks onto the general purpose pipeline. The third generation of APIs, however, left the graphics-specific interfaces behind and instead focused on exposing the pipeline as a generic, highly parallel engine supporting both task and data parallelism.

Already with the second generation of APIs, researchers and engineers had started to use GPUs for general purpose computing, as the general purpose processing units of modern GPUs are very well suited for any system that can be decomposed into fine-grained parallel tasks. But it was with the third generation of interfaces that the true power of General Purpose GPU programming was unlocked. In the taxonomy of traditional parallelism, programming the pipeline is a combination of SIMD (single instruction, multiple data) within a stage and SPMD (single program, multiple data) describing how results are routed between stages. A programmer writes a series of threads, each defining individual SIMD tasks, and then SPMD programs to execute these threads and to collect, store and combine the results of these operations. Input data is often organized as a grid.

NVIDIA's CUDA SDK provides a higher level interface, with extensions to the C language that support both multi-threading and data parallelism. A developer writes a single C function, called a "kernel", that operates on data and is executed by many threads according to an execution configuration. To easily accommodate different input models, threads can be organized into thread blocks, which form a hierarchy of one-, two- and three-dimensional processing structures suitable for vectors, matrices and volumes. Memory is organized into global memory, per-thread-block memory, and per-thread private memory.
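To make this concrete, here is a minimal sketch (not from the original post) of what a CUDA kernel and its execution configuration look like; the kernel name and sizes are illustrative only. Each thread computes one element, a guard handles the case where the grid is larger than the data, and the launch syntax maps the problem onto a one-dimensional grid of thread blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each thread computes one element of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may overshoot n
        y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *x, *y;
    cudaMalloc(&x, bytes);  // global (device) memory
    cudaMalloc(&y, bytes);

    // Execution configuration: a 1-D grid of 1-D thread blocks,
    // sized so that blocks * threadsPerBlock >= n.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

The same pattern extends to two- and three-dimensional grids by using `dim3` block and grid dimensions for matrices and volumes.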

This combination of very basic primitives drives a wide range of different programming styles: map and reduce, scatter and gather, sorting, and stream filtering and stream scanning. All of these run at extreme throughput, as high-end GPUs such as those supporting the Tesla "Fermi" CUDA architecture have close to 500 cores, delivering over 500 Gigaflops per GPU.
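As an illustration of the "reduce" style mentioned above (a hypothetical sketch, not code from the post), each thread block can sum a slice of the input in fast per-block shared memory and emit one partial sum, which a second pass or the host then combines:

```cuda
// Illustrative reduce kernel: each block sums blockDim.x elements of 'in'
// in per-thread-block shared memory and writes one partial sum.
__global__ void blockSum(const float *in, float *partial, int n) {
    extern __shared__ float cache[];      // per-thread-block memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;  // load phase (the "map" step)
    __syncthreads();

    // Tree reduction within the block: halve the active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];   // one result per thread block
}
```

The kernel would be launched with the shared-memory size as the third launch parameter, e.g. `blockSum<<<blocks, 256, 256 * sizeof(float)>>>(in, partial, n)`, assuming a power-of-two block size.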

The NVIDIA "Fermi" architecture in the NVIDIA Tesla 20-series GPUs (the Tesla M2050 GPUs we offer are of this series) is a major step up from earlier GPUs in that it provides high-performance double precision floating point operations (FP64) and ECC GDDR5 memory.


The Amazon EC2 Cluster GPU instance

Last week it was revealed that the world's fastest supercomputer is now the Tianhe-1A, with a peak performance of 4.701 petaflops. The Tianhe-1A runs on 14,336 Xeon X5670 processors and 7,168 NVIDIA Tesla M2050 general purpose GPUs. Each node in the system consists of two Xeon processors and one GPU.

The EC2 Cluster GPU instance packs even more power into each instance: two Xeon X5570 processors are combined with two NVIDIA Tesla M2050 GPUs. This gives you over a Teraflops of processing power per instance. By default we allow any customer to instantiate clusters of up to 8 instances, putting an incredible 8 Teraflops of power at anyone's disposal. This instance limit is a default usage limit, not a technical limit. If you need larger clusters, we can make additional Amazon EC2 instances available on request through the Amazon EC2 instance request form. If you are willing to switch to single precision floating point, the Tesla M2050 will give you a Teraflop of performance per GPU, doubling the overall performance.

We have already seen early customers out of the life sciences, financial, oil & gas, movie studio and graphics industries becoming very excited about the power these instances give them. Although everyone in the industry has known for years that General Purpose GPU processing is a direction with amazing potential, making major investments has been seen as high-risk given how fast the technology and the programming models have been moving.

GPU programming in the cloud with Amazon Web Services changes all of this. The power of the world's most advanced GPUs is now available for everyone to use without any up-front investment, removing the risks and uncertainties that owning your own GPU infrastructure would involve. We have already seen how the EC2 Cluster Compute instances unlocked "traditional" HPC for everyone, and the Cluster GPU instances take this a step further, making a resource for innovation that until now was out of reach of even most professionals available to everyone at very low cost. An 8 Teraflops HPC cluster of GPU-enabled nodes now costs only about $17 per hour.

CPU and/or GPU

As exciting as it is to make GPU programming available for everyone to use, unlocking its amazing potential, it certainly doesn't mean that this is the start of the end of CPU based High Performance Computing. Both GPU and CPU architectures have their sweet spots and although I believe we will see a shift in the direction of GPU programming, CPU based HPC will remain very important.

GPUs work best on problem sets that are ideally solved using massive fine-grained parallelism, using, for example, at least 5,000 - 10,000 threads. To be able to build applications that exploit this level of parallelism one needs to enter a very specific mindset of kernels, kernel functions, thread blocks, grids of thread blocks, mapping to hierarchical memory, etc. Configuring kernel execution is not a trivial exercise and requires GPU device specific knowledge. There are a number of techniques that every programmer has grown up with, such as branching, that are not available, or should be avoided, on GPUs if one wants to truly exploit their power.
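A small hypothetical example (not from the post) of why branching should be avoided: threads in a warp that take different sides of a data-dependent branch are serialized, so where possible a branch-free formulation keeps all threads on the same instruction path.

```cuda
// Illustrative only. Threads within a warp that disagree on the branch
// condition execute both sides serially, losing throughput.
__global__ void branchy(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (v[i] > 0.0f)           // data-dependent branch: warp may diverge here
        v[i] = v[i] * 2.0f;
    else
        v[i] = -v[i];
}

// The same computation, expressed branch-free so every thread in the warp
// executes an identical instruction stream:
//   v > 0:  fabsf(v) + fmaxf(v, 0) = v + v = 2v
//   v <= 0: fabsf(v) + fmaxf(v, 0) = -v + 0 = -v
__global__ void branchFree(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    v[i] = fabsf(v[i]) + fmaxf(v[i], 0.0f);
}
```

The bounds check at the top of each kernel is uniform across almost all warps and is therefore cheap; it is the data-dependent branch in the body that costs throughput.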

HPC programming for CPUs is very convenient compared to GPU programming, as the power of traditional serial programming can be combined with that of using multiple powerful processors. Although efficient parallel programming on CPUs absolutely also requires a certain level of expertise, its models and capabilities are closer to those of traditional programming. Where kernel functions on GPUs are best written as simple data operations combined with specific math operations, CPU based HPC programming can take on any level of complexity without any of the restrictions of, for example, the GPU memory models. Applications, libraries and tools for CPU programming are plentiful and very mature, giving developers a wide range of options and programming paradigms.

One area where I expect progress will be made with the availability of the Cluster GPU instances is a combination of both HPC programming models, combining the power of CPUs and GPUs; after all, the Cluster GPU instances are based on the Cluster Compute Instances with their powerful quad core i7 processors.

Some good insight into the work that is needed to convert certain algorithms to run efficiently on GPUs can be found in the UCB/NVIDIA "Designing Efficient Sorting Algorithms for Manycore GPUs" paper.

Cluster Compute, Cluster GPU and Amazon EMR

Amazon Elastic MapReduce (EMR) makes it very easy to run Hadoop (MapReduce) for massively parallel processing tasks. Amazon EMR handles workload parallelization, node configuration and scaling, and cluster management, so that our customers can focus on writing the actual HPC programs.

Starting today, Amazon EMR can take advantage of the Cluster Compute and Cluster GPU instances, giving customers even more powerful components on which to base their large scale data processing and analysis. Programs that rely heavily on network I/O will also benefit from the low-latency, full bisection bandwidth 10 Gbps Ethernet network between the instances in the cluster.

Where to go from here?

For more information on the new Cluster GPU instances for Amazon EC2, visit the High Performance Computing with Amazon EC2 page. For more information on using the HPC cluster instances with Amazon Elastic MapReduce, see the Amazon EMR detail page. More details can also be found on the AWS Developer blog. James Hamilton has some insightful commentary on GPGPU.
