CppCon 2017: Dietmar Kühl “C++17 Parallel Algorithms”

and that's good. So, the reason some things are not supported: for example, these guys are all one-element operations, so it doesn't make any sense to support those. The next group are the logarithmic operations: even if you have a tremendous range, you do not get enough operations to make it worthwhile to do these things in parallel. These are basically all the log(n) algorithms, like finding something with a binary search or inserting into a heap. Then we get a number of heap algorithms. These are linear or n-log-n algorithms, I think, but the thing here is that the heap operations mess with the entire range of the heap; trying to do that in parallel, I don't see how you would do it. It's probably also something for which I cannot see a use case. Similarly, the permutations: how do you permute these things in parallel? It's relatively unclear, and although some people have claimed they have used these, I've never seen any of these algorithms used in production; I've seen them used in test drivers. Then there are some algorithms which are not supported because they overlap, that is, copy_backward and move_backward. The copy and move operations are actually not allowed to overlap in the parallel versions, because it doesn't make sense to overlap them. It's okay if you overlap them sequentially because, assuming you move in the right direction, you copy an element before you reach it as a destination; but in parallel it doesn't make any sense, so copy_backward and move_backward are out. Then we have accumulate and partial_sum, which are taken out because they're renamed: accumulate is basically called reduce. And then we get these three oddballs, for which I have no reason why they're not there. I think sample in particular could be done; shuffle probably accesses everything, so maybe that's not a good idea,
and iota is a generator kind of thing. I believe these are not there, in particular shuffle and iota, because they just came too late in the standardization of C++17 and didn't make it into the paper. That's my personal guess. Okay, with that said, these are the supported algorithms. Now, not all of the supported algorithms are necessarily super-efficient, so let's go the other way around and see how I would imagine these things can be implemented, and which ones are expected to be fast. The obvious mass of parallelism, the embarrassingly parallel stuff, is basically a variation of map. This is a whole host of algorithms which can just execute in parallel; they have no real interaction with each other, although, as was pointed out, something like transform or copy writes to a destination, so there is actually an interaction between the operations and they may need to fall back on a different algorithm. But if you execute them on random-access ranges, which do exist, they should all be map, and should be really fairly efficient, assuming you have enough work in the operations. The idea of map is very simple: sequentially, you just execute one operation at a time; in parallel, you execute things in parallel and get a nice speedup. With par_unseq you can basically execute multiple operations at once within each thread. The way this is displayed, almost certainly these things won't run exactly in lockstep, but rather slightly offset; but roughly speaking there should be a nice speedup if that is actually attainable. There are some additional constraints, as I mentioned: for_each does not return the function object as the sequential version does, and copy and move are not allowed to overlap. I didn't find a constraint that copy_n is not allowed to overlap, so apparently you can copy n overlapping elements; I don't think that is intentional, I think that is more like an error. And then some of these algorithms may do something more complicated.
Maybe it's something like a reduce, maybe something like a gather/scatter arrangement, if you execute them on non-random-access iterators. But this set of algorithms should be expected to be really fast. Then there's a whole host of algorithms which can essentially be mapped to something like reduce. Reduce can also execute things in parallel on the sequence, but after the pieces are executed,

you basically need to combine the results to achieve the final result. There's a whole bunch of these, for example the find operations. I don't know whether find can be done more nicely in general; probably there are other parallel approaches which can deal with it, but reduce can certainly deal with it as well. One thing which is kind of obvious from this idea: if you use one of the find operations in parallel, you probably only want to do that if you're looking for the needle in a haystack. If you expect to find something really soon, you don't want the parallel find: you will find one of the elements quickly, and if you kick the search off in parallel, it will search a number of subranges and possibly abandon later ranges, but it's almost certainly less efficient than doing things sequentially. So if you really search a huge range for the one element, if it exists, parallel find may be reasonable; otherwise you're probably better off with the sequential version. Reduce is similar: the sequential version does things sequentially, with a few combining steps at the end; in parallel, you execute combinations within each thread's subrange and then a combining step at the end. So these are, again, expected to be really fast. There are some changes: accumulate becomes reduce, as I mentioned. The operation needs to be associative, otherwise you get a wrong result, because there's no promise that the operations are done in the same order at all; they can be in a different order. And there's a little bit of a bummer, right?
Basically the expectation, when we think of accumulate, is that in most cases we accumulate things like doubles, maybe in a matrix multiplication; and double addition is clearly not associative. Now, the algorithm requires that the operation be associative. I don't think it says that it's entirely disallowed; basically the net effect of using a non-associative operation may just be that you get a wrong result. On the other hand, if you do double additions, you already know that you get a wrong result; you just get a different wrong result. So there's a little bit of a tradeoff: you don't necessarily get the result you expect, you get some other result, which should still be fine. However, formally, the operations should be associative. And I already mentioned that find may be tricky. So, another bunch of algorithms: exclusive_scan and inclusive_scan. These are basically the same as partial_sum. The idea of partial_sum is that you compute not just, as in accumulate or reduce, the sum of all the elements with only the final result, but also all the intermediate sums, so you basically get an entire range. inclusive_scan and exclusive_scan, as well as partial_sum, seem to be algorithms you don't use very often. Interestingly, they turn out to be rather useful for implementing other parallel algorithms; inside the parallel algorithms there is actually a good use for them, in particular if you try to do a filter. If you filter things in different subranges, each subrange yields a new range with a different number of elements, and then you can do a partial sum to figure out where things really need to go. The algorithm is a little bit more complicated, but it's still a decent parallel algorithm, and with enough processors you can do nice tricks to run in logarithmic instead of linear time. It's a little bit of a special case.
There are two of these, inclusive_scan and exclusive_scan, and they're slightly annoying. The difference between the inclusive scan and the exclusive scan is: what is the first element of the result? For the exclusive scan, the first result element is the initial value you pass into the algorithm. For the inclusive scan, it is the first element of the sequence, combined with the initial value if you pass one. Depending on what you try to do, one or the other is what you really want, so it actually makes sense to expose both of these algorithms. You can, in theory, implement one in terms of the other, in either direction, but you lose a little bit of efficiency in both directions.

To make that a little bit more annoying, these two algorithms unfortunately also have slightly different parameter orders. Similar to partial_sum, they take a range, and then an operation and an initial value. For the inclusive scan, typically the initial value is defaulted and you're mostly interested in the operation. For the exclusive scan, the initial value becomes the first result element, so you're probably very interested in passing that, while often you have addition as the operation. So the last two parameters, the initial value and the operation, are just reversed, and they're correspondingly defaulted. So much for these. Then there is a bunch of entirely new algorithms: transform_exclusive_scan, transform_inclusive_scan, and transform_reduce. Essentially, the idea here is that you want to do as much work on an element as you can at a time; you don't want to run over the same sequence multiple times and apply multiple operations. These algorithms fuse the operations into one: they fuse a transformation into whatever other thing they do. I could imagine that in the future, with the ranges stuff, we get a lot more of that: we would produce a range which fuses many operations, so you'd have a pipeline of operations and, a bit like an expression template, you could build up a big operation on the individual elements and then execute it element by element. For now, we get these algorithms. And then we get a number of gather/scatter algorithms, which basically determine where things go and then distribute them. You could imagine that these make sense if any of the operations, like the condition for copy_if, is relatively expensive: you could run the predicates in parallel, then find out where things go, and then distribute them, so as to reduce the overall work. These are relatively complicated algorithms; I have not experimented
with those, but I think they can be executed in parallel and should show a benefit. And then there are a number of remaining algorithms which I have marked as specialized. The intention here is that for all of the others, you have a few core algorithms and you map everything onto those; then there's a number of algorithms left over which the implementer has to implement individually. But despite the relatively big starting set, there are not that many parallel algorithms which, in the end, really have to be implemented; most of them just map onto other algorithms. Okay, which leads me to availability. When I proposed this talk originally, the idea was: all right, cool, I will whip out the compiler, run things, and things should work. It turns out that none of the compilers, none of the standard libraries, currently ships the parallel algorithms. According to P0024, the paper which proposed adopting the parallel algorithms from the Parallelism TS, there are multiple implementations of all of these algorithms. I checked all of the implementations which were listed; at the point where I first checked, all of them were partial. None of them implemented any of the interesting unseq stuff, and as far as I know there's still nobody who implements the unseq stuff, because the only way I can imagine these things being implemented does require compiler support, and I haven't seen any compiler actually getting around to implementing the unseq stuff. We basically pass in a function object, and the compiler somehow has to figure out how to create a new function object which can be executed on multiple elements at once, or something like that. There's an expectation that this will happen, and some of the players in that field seem to be working on it, but I haven't seen any of it. So the proposal was a little bit optimistic, in my opinion,
but nonetheless we got the algorithms. So the current implementation status is: if you look at that paper, you find, I think, six or seven implementations listed, all of them partial. None of them supports par_unseq. Most of the implementations don't even try to fall back to a sequential implementation, to say: all right, we have at least all of the signatures, it compiles, it just runs sequentially, and the rest is covered as quality of implementation, which is what I would have hoped for at the very least. The only implementation which, for now,