Massive Data Parallelism on the GPU with Microsoft's C++ AMP (Accelerated Massive Parallelism)

Over the last decade an exceedingly loud chorus of industry and academia experts has sounded the alarm bells: the free lunch is over! No longer can we, software developers, expect new hardware to benevolently run our code faster and faster, without us lifting a finger to help the hardware along. The era of speed gains that are transparent to the software is over. It became clear that CPUs were not going to get much faster: they tick today at roughly the same rate they did about five years ago (roughly 3 GHz, or 3 billion cycles per second). That CPU speeds would plateau in such an inconsiderate manner was not something we were really watching out for, and then it happened. It meant that we could no longer author slow, piggish applications and hope to be buoyed by a faster generation of hardware just around the corner.

How did that happen? Did chip makers simply fall asleep at the wheel? What happened to Moore’s Law, which predicted that the number of transistors in silicon chips would double every eighteen months? Were we too optimistic? The answer is that, remarkably, forty years after it was postulated, Moore’s Law continues to hold. Chip manufacturers keep increasing the number of transistors in the chips they make; they constantly come up with ingenious ways to make transistors smaller and are thus able to cram more and more of them into the same chip area. But what hardware vendors can no longer do is make those more numerous transistors tick any faster.

Let’s try to visualize this with an analogy from a somewhat surprising domain: let’s imagine that our processor is a tractor (yes, a tractor!) with which we want to plow a field. For a long time, the tractor makers of our world have been making speedier tractors that cut down the time it takes to do the task, freeing us up to have more time for fun in the sun. But now, those tractors can’t get any speedier, because they are already going so fast that making them any faster would require fitting them with a jet engine, which would overheat and explode. It’s not practical. So the tractor makers are telling us, "Look, instead of giving you a single tractor that runs faster, we could give you multiple tractors for the price of one. Together, they’ll get your job finished faster, but you have to do your part: you have to train more tractor drivers, figure out how to divvy up the work between them, and make sure they do not crash into each other. If you do all of that, then you can benefit from multiple tractors. What’s more, if you find a general way of planning out how to achieve the work with a variable number of tractors, then you’ll be ready to take advantage of an ever-increasing number of those machines as they become available."

The obvious analogy here is to the advent of multi-core chip designs. Instead of squeezing additional performance out of individual cores, chip makers are shrinking the size of individual cores and fitting more of them on the same chip. More tractors for us! Our part of the deal is to make our code efficient by feeding work into those parallel cores, while still keeping the code correct in the face of the new hazards of races and deadlocks.

Fortunately, platform companies such as Microsoft were there to help, with parallel programming models that take a lot of the guesswork out of making your code parallel and scalable (meaning it can adapt, without a rewrite, to increasing levels of parallelism). Starting with Visual Studio 2010, Microsoft offered the Parallel Patterns Library (PPL) to native Visual C++ developers, and the Task Parallel Library (TPL) and Parallel LINQ (PLINQ) to .NET developers. So, for example, in order to add two vectors into a third one using PPL, a Visual C++ developer can write the following code:

#include <vector>
#include <ppl.h>

void example(const std::vector<int>& v1, const std::vector<int>& v2, std::vector<int>& v3)
{
  // add pairs of elements in v1 and v2 into v3 sequentially
  for (int i = 0; i < static_cast<int>(v3.size()); ++i) 
  {
    v3[i] = v1[i] + v2[i]; 
  }

  // add pairs of elements in v1 and v2 into v3 in parallel 
  Concurrency::parallel_for(0, static_cast<int>(v3.size()), [&] (int i) 
  {
    v3[i] = v1[i] + v2[i]; 
  });
}

The above example shows how to add the vectors, first sequentially, and then in parallel, using PPL.

The parallel_for function is akin to a parallel loop which is invoked once for each integer in the half-open range [0, v3.size()). For each such integer, parallel_for calls the lambda which is passed as its third parameter. (Lambdas are a fairly recent addition to C++; they are somewhat similar to lambdas in .NET.) That is, if we assume the vectors each have a hundred elements, then the body of the loop is invoked a hundred times, pretty much like a normal for loop would be, except that the different iterations of the loop can execute in parallel. The PPL runtime manages OS threads in a smart manner to make sure the computation is spread across all cores, while also making sure that not too many threads are created, because OS threads are expensive, and switching control between them unnecessarily is an extra cost we want to avoid.
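
To give a feel for what the PPL runtime is doing on your behalf, here is a rough, simplified sketch, not PPL itself and written with C++11 std::thread, of how such a loop might be split into contiguous chunks, roughly one per hardware thread; the function name naive_parallel_for is purely illustrative:

#include <algorithm>
#include <thread>
#include <vector>

// Illustrative only: split [0, n) into one contiguous chunk per hardware thread.
// A real scheduler (such as PPL's) is far more sophisticated: it reuses threads,
// balances load dynamically, and avoids oversubscription.
template <typename Body>
void naive_parallel_for(int n, Body body)
{
  const int workers = std::max(1u, std::thread::hardware_concurrency());
  const int chunk = (n + workers - 1) / workers;
  std::vector<std::thread> threads;
  for (int begin = 0; begin < n; begin += chunk)
  {
    const int end = std::min(n, begin + chunk);
    threads.emplace_back([=] {
      for (int i = begin; i < end; ++i)
        body(i);                 // run this chunk's iterations sequentially
    });
  }
  for (auto& t : threads)
    t.join();                    // wait for all chunks to finish
}

Calling naive_parallel_for(static_cast<int>(v3.size()), [&](int i) { v3[i] = v1[i] + v2[i]; }); would mimic the PPL call above, minus all of the scheduling smarts.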

This is all good, and developers have been excited about the ability to harness multiple cores, but what was missing in Visual Studio 2010 was a specialization that would take advantage of the massive data parallelism available in large doses on Graphics Processing Units (GPUs). So what is massive data parallelism, and how is it different from the multi-core parallelism we took advantage of with PPL above?

The previous version of PPL has been very good at taking advantage of multi-core parallelism, but even when we have multiple cores on the same chip, each core is still fairly heavy-weight. Each core is, by design, independent and able to execute its own code, and synchronization between cores is expensive. In reality, though, there is often a more regular and homogeneous structure to the tasks we want to accomplish, which is not very well exploited by just spreading the work across cores. For example, going back to our quaint agricultural example, we note that the task at hand, plowing a field, is extremely repetitive: all grooves look the same and they run in parallel lines. This means that a more efficient way to speed up the task of plowing an entire field would be to simply add more blades to the plow, rather than adding more tractors. That is, up to a certain point, a whole new tractor is overkill for the job. We just need one engine, one steering wheel, and one driver, who controls a set of parallel blades, each applying the same logic and actions in order to plow many parallel grooves simultaneously. Of course, at some point such gigantic plows become unwieldy, and at that point we would want additional tractors, each with its own beefy parallel plow. So we need both types of parallelism: coarse-grained multi-core parallelism and fine-grained intra-core parallelism.

Our number-adding example has very similar considerations: we want to apply the same logic to many data elements. We don’t really need a separate core for each pair of numbers; instead, it would be more efficient to let each core add up multiple pairs of numbers simultaneously.

In order to achieve this goal, hardware vendors have built into each core the ability to do smaller-scale operations (such as adding pairs of numbers) using vector instructions, such as SSE instructions on the CPU, or SIMT (Single Instruction, Multiple Threads) execution modes on the GPU, which repeat the same operation across multiple data elements in parallel. This form of parallelism is much more efficient for handling large sets of data that require uniform or mostly-uniform processing.
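
To make the idea of vector instructions a bit more concrete, here is a minimal sketch (not from the original article) that uses SSE2 intrinsics to add four pairs of 32-bit integers per instruction; the helper name add_with_sse2 is purely illustrative:

#include <emmintrin.h>  // SSE2 intrinsics

// Illustrative helper: add pairs of ints four at a time using 128-bit SSE2 registers.
void add_with_sse2(const int* a, const int* b, int* out, int n)
{
  int i = 0;
  for (; i + 4 <= n; i += 4)
  {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi32(va, vb));
  }
  for (; i < n; ++i)  // handle any leftover elements one at a time
    out[i] = a[i] + b[i];
}

In practice you rarely write such intrinsics by hand; the point is simply that a single instruction operates on several data elements at once, which is exactly the kind of regularity that GPUs exploit on a much larger scale.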

GPUs, in particular, excel at this form of fine-grained parallel processing. The story of the evolution of GPUs is quite remarkable, and in many ways C++ AMP is a reaction to their success. GPUs, as their name clearly hints, were traditionally built to render graphics. But over time they have become more general purpose, to the point that we can now run a lot of normal C++ code on GPUs. A contemporary GPU core can execute as many as sixteen multiplications followed by sixteen additions, all in a single cycle! And we get many GPU cores in a GPU chip. All of that parallelism results in a lot of computing power. GPUs currently hold the records for efficiency on number-crunching problems, whether measured in computations per unit of electric power or computations per unit of hardware cost.

Unfortunately, programming GPUs has thus far been a bit of an adventure, requiring specialized tools and languages. Microsoft aims to simplify the situation for developers by bringing C++ AMP into Visual Studio and by working with industry leaders and standards bodies to make C++ AMP an open specification. C++ AMP is built on top of the DirectX platform, which supports a wide and ever-expanding selection of hardware. Once you build your C++ AMP code, it will be able to execute on any DirectX 11 device or newer (11 is the current version of DirectX, and Windows 8 will introduce version 11.1), from any major hardware vendor, and fall back to the CPU if necessary.

Typically, your code will contain a mixture of opportunities to parallelize both across cores and within cores, and now you also have a tool that lets you take advantage of these opportunities in an easy and cohesive manner: C++ AMP. This is how the vector addition example looks in C++ AMP:

1.  #include <vector>
2.  #include <amp.h>

3.  void example_amp(const std::vector<int>& v1, const std::vector<int>& v2, std::vector<int>& v3)
4.  {
5.    concurrency::array_view<const int> av1(v1.size(), v1);
6.    concurrency::array_view<const int> av2(v2.size(), v2);  
7.    concurrency::array_view<int> av3(v3.size(), v3);  

8.    // add pairs of elements in v1 and v2 into v3 in parallel 
9.    concurrency::parallel_for_each(av3.grid, [=] (concurrency::index<1> idx) restrict(direct3d)
10.   {
11.     av3[idx] = av1[idx] + av2[idx]; 
12.   });

13.   av3.synchronize();
14. }

Lines 5 through 7 create array views on top of the std::vectors which were passed into the function. GPUs typically have their own memory, and wrapping your CPU-side arrays or std::vectors in an array_view is required in order to make the data accessible on the GPU. Once you wrap your data in array views, C++ AMP copies data as necessary between the CPU and the GPU, in a mostly automatic fashion. Like std::vector and std::array, the class concurrency::array_view is a template parameterized on the element type, in our case int. An array_view<const int> provides read-only access to the data wrapped under it, while an array_view<int> supports both reading and writing through the view.

Lines 9 through 12 contain an invocation of parallel_for_each. This newly added overload of parallel_for_each is the method through which C++ AMP injects parallelism into your program. It is similar to the parallel_for we saw earlier in this article, but the main differences are:

  1. the code will execute on the fastest device you have in your system, which could very well be a GPU, rather than a CPU, and,
  2. the lambda that you pass in will be both parallelized across cores and vectorized within each core.

parallel_for_each takes two parameters:

  • The first parameter is of type grid, and it tells C++ AMP how many logical threads you want to launch and what their numerical thread IDs are going to be. In the example above, av3.grid results in an object of type grid<1>, i.e., a single-dimensional grid. The grid object expresses the shape of the parallel loop to execute. Here, we get a thread for every index in the view av3. av3 was created with its size identical to that of the std::vector v3, so we get a thread per index in the original vector.
  • The second parameter is the lambda to execute, and as in the case of parallel_for, it is invoked for each index in the grid. Both the index and the grid (and in fact, the array view too) have their rank recorded in their type. Thus the type of the parameter passed to the lambda is an index<1>, i.e., a single-dimensional index. The body of the lambda is comprised of a single line (number 11) in which we index the array views and perform the element-wise addition. (A two-dimensional variant is sketched just after this list.)
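
To see how the rank shows up in these types, here is a sketch of a two-dimensional variant (an element-wise matrix addition), written with the same preview-era syntax used in this article; the two-dimensional array_view constructor is assumed to take the two extents followed by the backing container, and exact signatures may differ between releases:

#include <vector>
#include <amp.h>

// Sketch: element-wise addition of two M x N matrices stored row-major in std::vectors.
void example_amp_2d(int M, int N, const std::vector<int>& m1,
                    const std::vector<int>& m2, std::vector<int>& m3)
{
  concurrency::array_view<const int, 2> av1(M, N, m1);
  concurrency::array_view<const int, 2> av2(M, N, m2);
  concurrency::array_view<int, 2> av3(M, N, m3);

  // av3.grid is now a grid<2>, so the lambda receives an index<2>.
  concurrency::parallel_for_each(av3.grid, [=] (concurrency::index<2> idx) restrict(direct3d)
  {
    av3[idx] = av1[idx] + av2[idx];
  });

  av3.synchronize();
}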

Note that the lambda is annotated with new syntax: restrict(direct3d). This annotation tells the C++ compiler to ensure that the code inside the lambda can really be translated to finely-parallel execution on a DirectX device. We discussed earlier how the key to efficient data-parallel execution is regularity of processing. What, then, is the opposite of regularity? Divergence. To a first approximation, the constraints placed on restrict(direct3d) functions ensure that there is no excessive divergence in the code. For example, function pointers cannot be used, and neither can their more advanced relatives, virtual functions. Any function called from a function that has the direct3d restriction must also be restricted. This ensures that all of the code which is supposed to execute on the accelerated device adheres to the restriction rules. Other restrictions concern primitive data types which are not supported by DirectX, such as short integers.
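
As a small illustration of the rule that restricted code can only call restricted code, a helper used inside the kernel must carry the same annotation; the helper below is hypothetical and exists only to show the syntax:

// A helper called from restrict(direct3d) code must itself be marked restrict(direct3d).
int add_elements(int a, int b) restrict(direct3d)
{
  return a + b;  // the body must also obey the restriction rules
}

// Inside the kernel from the earlier example, line 11 could then read:
//   av3[idx] = add_elements(av1[idx], av2[idx]);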

While, like the existing parallel_for* methods in PPL, the new parallel_for_each appears to be synchronous in terms of its side effects, it actually performs much of the computation asynchronously and only synchronizes lazily, when the results are required. This is important since the GPU is often remote from the CPU, and synchronizing every access to it would reduce overall system throughput. In the example above, we call av3.synchronize() in order to explicitly synchronize the contents of the vector v3 backing the array_view av3 with the results of the GPU computation. We could have omitted this call, and it would have been taken care of by the array_view destructor. However, it is a good idea to call synchronize explicitly, since the destructor swallows exceptions, which you generally wouldn’t want to suppress.
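
For instance, line 13 of the example could be written with explicit error handling, as in the following sketch; concurrency::runtime_exception is the exception type exposed by the C++ AMP runtime headers (it derives from std::exception), though names may still change between the Developer Preview and the final release:

// Synchronize explicitly so that any failure surfaces here as an exception,
// rather than being swallowed by the array_view destructor.
try
{
  av3.synchronize();  // flush GPU results back into the vector v3
}
catch (const concurrency::runtime_exception& ex)
{
  std::cerr << "C++ AMP error: " << ex.what() << std::endl;  // requires <iostream>
}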

Now that we have covered the historical context, the motivation, and the simplest possible C++ AMP example, let’s briefly describe what else is included in C++ AMP.

Accelerators

GPUs today come in two flavours. The first kind is the discrete GPU, one that is attached to your computer as a sort of peripheral. The second kind is embedded on the same chip as your CPU; these are called embedded or fused GPUs, and the systems containing them are frequently referred to as a System on a Chip (SoC). C++ AMP allows you to use both of these kinds of GPUs, the CPU, and, in the future, perhaps other computing resources, such as FPGA cards and on-premises or cloud-based clusters.

Since these computing resources are varied, they may present different capabilities in terms of compute power, functionality, availability and reliability. To help you be successful in the face of this heterogeneity, C++ AMP gives you a spectrum of control options over accelerators. You can simply omit all mention of accelerators from your program, as the example above did, and in that case you are signalling to C++ AMP to do its best to execute the code in the most efficient way possible. In its first version, C++ AMP will in such a case place your data and execute your code on the best accelerator available, but it will not try to automatically spread work across multiple accelerators. Alternatively, you can take matters into your own hands and code against accelerators explicitly: you can enumerate them, place data and computations on specific ones, copy data between them, and so on.

Here is an example showing how to spread the above array addition computation across multiple GPUs.

#include <assert.h>
#include <vector>
#include <amp.h>

void example_amp_multiple_accelerators(const std::vector<int>& v1, 
  const std::vector<int>& v2, std::vector<int>& v3)
{
  using namespace concurrency;

  const std::vector<accelerator> all_accelerators = get_accelerators();
  std::vector<accelerator> selected_accelerators;

  std::for_each(all_accelerators.begin(), all_accelerators.end(), [&](const accelerator& acc) {
    if (!acc.is_emulated)
      selected_accelerators.push_back(acc);
  });

  const int series_length = static_cast<int>(v3.size());
  const int number_of_accelerators = static_cast<int>(selected_accelerators.size());

  // the example is only interesting when there is more work than accelerators
  assert(series_length > number_of_accelerators);

  const int target_chunk_size = 
    series_length/number_of_accelerators == 0 ? 1 : series_length/number_of_accelerators;
  std::vector<array<int>*> result_arrays(number_of_accelerators);

  // Distribute the array addition across selected GPU's.
  parallel_for(0, number_of_accelerators, [&] (int i) {
    const int begin = target_chunk_size * i;
    const int end = i == (number_of_accelerators -1) ? series_length : begin + target_chunk_size;
    const int chunk_size = end - begin;

    const accelerator_view& acc = selected_accelerators[i].get_default_view();
    const array_view<const int> av1(chunk_size, v1.data()+begin);
    const array_view<const int> av2(chunk_size, v2.data()+begin);

    array<int> * array_out = new array<int>(chunk_size, acc);
    result_arrays[i] = array_out;

    array_view<int> av3(*array_out);

    parallel_for_each(av3.grid, [=] (index<1> idx) restrict(direct3d) {
      av3[idx] = av1[idx] + av2[idx]; 
    });
  });

  // Copy the results from the GPUs into the target vector
  parallel_for(0, number_of_accelerators, [&] (int i) {
    copy(*result_arrays[i], v3.data() + target_chunk_size * i);
    delete result_arrays[i];
  });
}

Describing this example in detail is outside the scope of this article, but do note these three items of import:

  1. parallel_for_each is invoked from within parallel_for; i.e., each CPU parallel activity drives, in parallel, the GPU that was assigned to it.
  2. Each parallel activity constructs a concurrency::array to hold the results computed by it, and it places this results array on the specific accelerator that was assigned to it.
  3. Likewise, each parallel_for_each executes on a specific accelerator: because the output array captured by the lambda (through av3) resides on that accelerator’s accelerator_view, the computation is directed there. (parallel_for_each can also be pointed at an accelerator explicitly by passing an accelerator_view as its first argument.)

Taking advantage of the GPU’s programmable caches and thread tiles

Unlike CPUs, where caches are mostly transparent to the developer, in the GPU world developers can take advantage of programmable caches, i.e., caches in which the developer can explicitly tell the system which variables should reside and stay. Because GPUs are data-parallel processors, it wouldn’t be interesting to have a cache per data stream. Instead, we would like to share the programmable cache between the data lanes processed by the same GPU core. In order to express that, C++ AMP allows you to group threads together such that they share the same set of programmable cache variables.

The concept of a programmable cache is exposed through the tile_static storage modifier, which you can attach to a local variable declaration. It tells the system that the variable is to be shared by all threads in a thread tile and should be placed in the fast programmable cache. Thread tiles are the C++ AMP concept that supports grouping threads together so that they can share the state in tile_static variables; in addition, threads within a tile can synchronize with one another using barriers.
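
To give a flavor of what this looks like, here is a rough sketch patterned after the preview-era syntax used elsewhere in this article; the tiling-related names (tile<>, tiled_index, barrier.wait) reflect the C++ AMP design, but their exact spellings may differ between the Developer Preview and the final release. Each tile of 256 threads stages its input in tile_static memory, waits on a barrier, and then reads a peer’s element from the fast cache rather than from global memory:

// Rough sketch: combine each element with the one loaded by the "mirror" thread in its tile.
concurrency::parallel_for_each(av3.grid.tile<256>(),
  [=] (concurrency::tiled_index<256> tidx) restrict(direct3d)
{
  tile_static int cache[256];                // shared by all 256 threads in this tile

  cache[tidx.local[0]] = av1[tidx.global];   // each thread stages one element
  tidx.barrier.wait();                       // wait until the whole tile has finished staging

  // Every thread can now read any of the 256 staged elements cheaply.
  av3[tidx.global] = cache[tidx.local[0]] + cache[255 - tidx.local[0]];
});

Note that the grid being tiled generally needs to be evenly divisible by the tile size; the numbers above assume that is the case.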

You can find several more complete examples taking advantage of tile_static variables, thread tiles and tile barriers on our team’s blog; for example, here is a recent binomial options example.

Debugger support for C++ AMP

The Visual Studio debugger is being enhanced in a number of innovative ways to support data-parallel debugging. In addition to the usual debugging facilities which every Visual C++ developer is already familiar with and thankful for, support has been added to visualize and drill into large data sets and large numbers of concurrently executing C++ AMP threads. The debugger lets you tabulate, search and filter variables across the many threads active in a parallel_for_each call, as well as pick individual threads, examine their peers in the same thread tile, and much more. I encourage you to watch this demo of the upcoming parallel debugger.

To conclude…

The advent of massive parallelism poses challenges to the developer community unlike any seen before. However, given the unprecedented processing power offered by the hardware, those who are first to hone their skills and realize the potential of the hardware in their software will be able to create intelligent, rich and almost magical experiences. Microsoft is offering C++ AMP in order to ease entry into the world of massive parallelism by hiding current and future hardware differences, by making it a first-class citizen of Visual C++, and by working with industry partners to make it an open specification.

What will you build, with the power of C++ AMP at your hands?

For additional C++ AMP resources, please visit the C++ AMP team blog. You can also download the Visual Studio Developer Preview, try out C++ AMP, and ask a question in the C++ AMP forum.


About the author

Yossi Levanoni


Yossi is a principal architect in Microsoft’s Developer Division. He has been working on Microsoft’s Parallel Computing Platform since 2006 and is responsible for the architecture of C++ AMP.
