CppCon 2017: Dietmar Kühl “C++17 Parallel Algorithms”

Okay, let’s get started. I want to talk about parallel algorithms. This is not the first talk about parallel algorithms, but I hope I give it a different spin: most of the others talked about synchronization or about how to implement this stuff. I don’t want to talk about how these things are implemented. I come primarily from a user perspective, and I try to use these things based on the specification, so I want to talk a little bit about how they are used, what requirements there are, what the experience is so far, and things like that. So yeah, a little bit of copyright.

We all know there is lots of concurrency around, unfortunately also in different forms. So far C++ has basically standardized concurrency across different cores on the same system, and other kinds of concurrency, using things like SIMD units or GPUs or FPGAs, is so far left uncovered. However, the idea of the parallel algorithms is partly that implementations have a hook where they could plug in a hint that things should be executed maybe on a GPU or on some other device. So that is something these things hopefully go one step towards unleashing. That said, C++ is unfortunately specified entirely sequentially. If you write something, there is no concurrency in there; we basically have to expose the concurrency to the compiler so the compiler has a chance to do something. That goes a number of ways. The first thing is that if you just provide some code to the compiler, the compiler often has a hard time detecting whether it can possibly make things parallel, because it doesn’t see all of the implementation. It may not be able to analyze whether things would avoid introducing data races, whether they are race-free, and it may also not be able to detect whether it is worthwhile to try to do these things in parallel. So we basically need to help the compiler out, at least for the time being. Maybe some of this goes away in the future when we have a better idea how to do things, similar to inlining, where compilers nowadays have a much better idea how inlining works; in the beginning you needed to specify it, and that goes away more and more. But for now we need to express the opportunity for asynchronicity, and the parallel algorithms are in that space.

So here is a simple example. Everybody has written loops which do a lot of work, and the thing can be made nicely parallel if the function is well behaved. First of all, it typically shouldn’t have side effects; if it has side effects, it probably has data races, so it can have side effects to some extent, but you possibly need to synchronize on those. Then accesses through the iterators shouldn’t introduce data races, and there is another requirement over here, which is that either the size is reasonably large or the function takes a lot of time. If neither of these is given, if the size is small or the function is relatively fast, trying to do these things in parallel almost certainly doesn’t really pay off. And we may be able to tell the compiler to do that; there are already ways to do that, for example OpenMP. I’ve never really used OpenMP, but there is a way to use OpenMP. There is a subtle change over here: this loop uses not-equals, which implies it would work for any kind of iterator, whereas this one uses less-than, indicating that it is actually a random access iterator. And OpenMP, as far as I know, only works with random access iterators, whereas the parallel algorithms also target non-random-access iterators.
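To make the contrast concrete, here is a minimal sketch of the two loop shapes being compared. The identifiers are mine, not from the slides, and the pragma only has an effect when OpenMP is enabled.

```cpp
#include <cstddef>
#include <vector>

double f(double x);  // some per-element work, defined elsewhere (illustrative)

void plain_loop(std::vector<double>& v) {
    // != works for any iterator category.
    for (auto it = v.begin(); it != v.end(); ++it)
        *it = f(*it);
}

void openmp_loop(std::vector<double>& v) {
    // OpenMP's canonical loop form uses an index and <, which effectively
    // assumes random access; compiled without OpenMP the pragma is ignored.
    #pragma omp parallel for
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = f(v[i]);
}
```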
So we can use them on bidirectional iterators or forward iterators; input iterators are currently not required to be supported, but they can possibly work on other iterators. Whether that is a good idea is a separate question: of course, the system still needs to iterate over those elements, it cannot jump ahead and batch things up. The thing with OpenMP is, well, it doesn’t make a promise that things do execute in parallel, and it doesn’t make any promises about things being able to nest; in particular, the nesting behaviour is unspecified, that is, if you have an algorithm or a loop which does work and inside that you may be able to parallelize more, OpenMP doesn’t cover that. So there are a number of constraints, and also, it doesn’t really fit the language; slipping these pragmas on top of it is not a really nice approach.

We could do something like the following instead. That is rather hacky code, but it works, right? We all know there are threads, so just kick off a thread for everything. This uses a batch size.
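A rough reconstruction of what such hand-rolled batching looks like; this is my sketch of the idea being described, not the speaker’s slide code.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// One thread per batch of elements, all joined at the end.
template <typename Fn>
void batched_threads(std::vector<double>& v, Fn f, std::size_t batch = 8000) {
    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < v.size(); begin += batch) {
        std::size_t end = std::min(begin + batch, v.size());
        threads.emplace_back([&v, f, begin, end] {
            for (std::size_t i = begin; i != end; ++i)
                v[i] = f(v[i]);
        });
    }
    // Without this join loop the std::thread destructors would call
    // std::terminate.
    for (auto& t : threads)
        t.join();
}
```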
The idea over here is that doing every function call as one individual task is probably not a good idea. We probably want to batch things up and process, whatever, 1000 calls or 8000 calls or however many calls you have in each run. We stick these threads into a loop, and then at the bottom we eventually need to synchronize and get rid of them. If we didn’t have the for loop where we join all these threads, we would terminate when destroying them. That looks horrible; I don’t write any of that stuff, because normally my functions and my loops are not as nicely structured as that. There is normally a bit more inside the function, and it becomes messy if I start putting anything extra in there. And it is also not necessarily efficient: if you try to use that on a different system, it may actually execute no faster than sequential, sometimes even slower, even if you have a lot of work to do, which comes down to some systems really creating a thread and destroying a thread every time. So it gets slow, whereas some other systems effectively keep a pool of operating system threads and just schedule onto them. So this is something we could do, but we probably don’t want to do. Of course, we all get told we can use async. Well, that’s not a big difference, right? We can use async, we would call get; I also don’t want to do that.

Something which is more in the direction of what I actually want is what Threading Building Blocks does. I don’t know whether anybody here has used Threading Building Blocks; it has basically a parallel_for, where you stick in a function and you specify a range. One of the things the parallel_for does is call the function object you stick in with a range itself. The range is adapted to some grain size, where the algorithm somehow figures out how big the grain size should be so that it can, on the one hand, schedule the work to avoid imbalance, while on the other hand not create so much overhead that things become inefficient. There are some algorithms like that, and that makes it somewhat easier to use than the thread stuff, but there is still quite a bit of boilerplate around it, and it would be nice if the inner function block could be done inside an algorithm. So what I would like to write is more something along these lines, which says transform, and I stick in what is called an execution policy, and otherwise nothing changes: I have my transform, which I had before, with the range, destination, and function, and I just execute it. And that is the direction which the C++17 standard took for the parallel algorithms.

There are still problems with these. One of them is that these algorithms return only after, in effect, a synchronization on all of these threads; once the work has finished, the call returns on the same thread, instead of possibly continuing elsewhere. So there is a synchronization at the end, similar to the OpenMP stuff. But it is a reasonable direction, which is expected to be extended and built upon. And actually, if you went to Hartmut’s talk, he already talked about these things, saying, all right, maybe transform shouldn’t really synchronize and return the result, but rather return a future, so you can schedule the continuation elsewhere and keep your system busy. This is currently not part of the C++ standard, but that is an expected direction in which things are expected to go.
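For reference, a minimal sketch of the C++17-style call just described; the container and lambda are placeholders, and note that the call is still synchronous, returning only once the whole range is done.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

void example(std::vector<double> const& in, std::vector<double>& out) {
    out.resize(in.size());
    // Same transform as before, with an execution policy as the first argument.
    std::transform(std::execution::par,
                   in.begin(), in.end(), out.begin(),
                   [](double x) { return x * x; });
    // The call returns only when all the work has completed, i.e. it
    // synchronizes at the end, much like the OpenMP loop above.
}
```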
So there are some expectations once we start passing in an execution policy, about what our algorithm is actually constrained to, and in particular the assumptions are that the functions, whatever exactly they are, do not introduce any data races, and the other part is that the arguments can typically be copied, so that you can stick them onto different function objects and send them off to different threads. So hopefully everybody is using algorithms a lot. If people use algorithms a lot, the change from the status quo to just getting things faster can be very easy, because all we have to do is take what is there and just stick in a couple of execution policies, and everything hopefully just works. (audience member speaking indistinctly) Using, sorry, what? Oh, okay, okay. So the question is, what happens if these things do not write to an existing destination range, like into a vector, but rather write into an empty vector, basically back-inserting at the end. The answer is that the algorithm has to somehow work that out. I am relatively sure that transform will not perform particularly well in parallel if you actually build the range that way. It can possibly do something reasonably smart if it understands what a back inserter is, to say, all right, I know how that works: if I batch these things up, I know how many batches will be produced, and I can coerce the vector and the back inserter into not really doing back insertion, basically tricking the back inserter into, well, pre-allocating all the memory and then passing in the buffers. The algorithms are relatively flexible, but it is to some extent, or to a large extent, the responsibility of the user to use them in a reasonable way; otherwise the parallel algorithms still won’t amount to much.
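One way the user can side-step that problem, sketched by me rather than taken from the talk: give the parallel overload a pre-sized destination rather than a back inserter. As far as I can tell, the parallel overloads are specified in terms of forward iterators for the destination, which a back_insert_iterator is not.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

std::vector<int> square_all(std::vector<int> const& in) {
    std::vector<int> out(in.size());   // pre-allocate the destination range
    std::transform(std::execution::par,
                   in.begin(), in.end(), out.begin(),
                   [](int x) { return x * x; });
    // Something like
    //   std::transform(std::execution::par, in.begin(), in.end(),
    //                  std::back_inserter(out), ...);
    // would serialize the insertions and is unlikely to parallelize well,
    // if it qualifies for the parallel overload at all.
    return out;
}
```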
So that is the easy approach, which obviously is probably not so easy. I don’t know about your place, but where I work, hardly anybody ever uses algorithms. So just slapping an execution policy in doesn’t quite work, because typically people actually write the loops, which is rather annoying, and over here is where we really could do things differently. The other thing is, just sprinkling execution policies over your code also probably won’t work, because you actually need to understand whether the loops execute a reasonable amount of work or not. It does not pay off, and I hope to show that a little bit later, to try to parallelize small, fast-executing loops. Parallelizing heavy loops, which actually do a lot of work, certainly does pay off, but if they are too small, it won’t. And the other thing is that we need to be careful that the execution of the function, for example, doesn’t introduce data races. If it introduces data races, all bets are off; that is basically undefined behavior as far as the compiler is concerned. The system cannot necessarily tell you ahead of time whether it would actually work or whether there are issues. So just sprinkling things over the code is not the thing to do.

Which gets me to the concurrency model. In general, the standard is specified to work on a broad range of hardware, and all of these algorithms, all of the functionality, should be available on all of that hardware. Some of that hardware doesn’t have any concurrency at all, and trying to emulate concurrency on systems like that would probably just make things slower. As a result, there is no promise that a parallel execution policy does anything in parallel, even if you pass in the execution policy which says run in parallel. It gives the algorithm the opportunity to execute in parallel, but it doesn’t make a guarantee that it will execute in parallel. You can go, well, that’s really, really bad, so implementers can do a bad job. Well, this is what the marketplace is for: it is quality of implementation, and you hopefully choose a decent implementation and tell the others to improve or you won’t buy their product. But there is a full expectation that there will be decent implementations. The other thing is that, depending on which execution policy you pass in, there will be execution-policy-specific constraints, or requirements, on what the operations have to do. I will go into a little more detail when talking about the specific execution policies. To actually talk about the constraints, it is important to understand what an element access function is. This is a new term introduced with the parallel algorithms.
Basically, an element access function is any function which is passed in as a parameter and is specified by the algorithm to be used, or is specified to be part of what the algorithm uses for some purpose. So the typical element access functions are the operations on the iterators being passed in, and the function calls on any operation or function object being passed in. There may sometimes be other element access functions, like swap, where the algorithm explicitly says it uses swap, and the constraints of the corresponding execution policy apply to all of the element access functions.
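A small illustration of what that constraint means in practice for the callable you pass in; this is my example, not one from the talk.

```cpp
#include <algorithm>
#include <atomic>
#include <execution>
#include <vector>

void count_positives(std::vector<int> const& v) {
    // The lambda below is an element access function; under execution::par it
    // may run on several threads at once.

    // long hits = 0;
    // std::for_each(std::execution::par, v.begin(), v.end(),
    //               [&](int x) { if (x > 0) ++hits; });   // data race: UB

    std::atomic<long> hits{0};
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [&](int x) { if (x > 0) ++hits; });      // race-free
    (void)hits;
}
```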
What is important over here is that the iterator operations are specified as well: most of the algorithms are specified in terms of relatively relaxed iterator categories, for example they take forward iterators. However, the element access functions are not just the forward iterator operations; they are the operations of the iterator category you actually pass in. So if you pass in a random access iterator, all of the random access operations qualify as element access functions, as do all the operations of the function objects. What does not qualify as an element access function is anything the algorithm may want to use by trying to sniff out whether maybe some additional function exists. We can do that quite easily with metaprogramming; however, the parallel algorithms are not allowed to sniff out whether you have some other operation which they might consider useful, because there is no guarantee that that function would be data-race free. The promises you are making about your arguments only apply to the functions which are specified, or to the iterators passed in according to their respective category. Okay, so that is a little bit tricky.

And based on that, we get a number of execution policies. Currently, the standard defines three: sequenced, parallel, and parallel unsequenced. There is also an is_execution_policy type trait, which is intended for detection. It does not work, and it is unspecified what happens, if you define your own execution policy and try to pass it into the standard library algorithms. The reason is that each one of these execution policies actually does something entirely different; this is basically just a way to give the same interface to essentially entirely different functions, with a relatively easy hook for switching between them without putting them in a different namespace or under a different name. So the execution policies are specified by the implementation, and you cannot really hook in your own. You can create your own algorithms and pass in whatever execution policy you want over there, including the standard library ones, but that algorithm is your business. The standard library algorithms only work with the policies which are specified, plus whatever the implementation offers, in which case you become dependent on the implementation. The expectation is that implementations will prototype additional execution policies, for example to execute on an FPGA, or maybe to execute on a remote machine, or whatever crazy idea people have and think they can do. These are the ones which the standard currently specifies, and future standards will hopefully extend on that.

The execution policies are objects. One of the expected changes in the future is that the execution policy gets replaced by something like an executor interface, where you can pass in an executor and the implementation will be expressed in terms of that executor. So there is some future expectation that you can change how the parallel algorithms execute in parallel, but that is not there yet, partly because the concurrency group has not finalized an executor design. In fact, as far as I can tell from the outside, there is still a lot of contention between different ideas of what executors should do and how they should look. So these are the three execution policies, and we will quickly go over the constraints. There is std::execution::seq, which is basically for sequential execution. That is a little bit weird. Why would I have an execution policy for sequential execution?
One of the answers is that it is for debugging. If you want something which behaves roughly the same, with respect to all the other constraints, as the parallel execution policies, you can use seq. Another reason may be that you want a fallback: you have some generic way to determine how big the range is, and based on that you decide, maybe even at compile time, which execution policy this particular part of the algorithm should use, sometimes sequential and sometimes parallel. That is what you can use it for. The execution policy, although it is basically just the same as the sequential algorithm, still imposes some changes in what the algorithm does. In particular, any exception which escapes from any of the, what do they call them, sorry, the element access functions results in terminate. There is no way, in the parallel algorithms, to really report exceptions. There may be multiple exceptions thrown from different threads, and once things get into something like GPU or similar execution, there is no way to deal with exceptions anyway. So exceptions are, at least for now, left open; the result is terminate. There is also no requirement for input iterator support. You can clearly implement something like a sequential transform on input and output iterators; however, there is no real way we could actually
deal with input iterators which yield a move-only type and make that parallel. There are some things where it is very unclear how it could be done. Implementations, if they figure out how to do it, are free to support input iterators, but there is no requirement for them to. So if you want portable code using the parallel algorithms, you had best use forward iterators; to some extent I think these considerations are relatively academic, because most of the time you want to execute on random access iterators anyway. And finally, there are a number of algorithms which have slightly changed constraints or slightly changed return types. In particular for_each: the normal for_each returns the function object, but in a parallel context you actually want for_each to just execute as fast as it can. If you run the function object in parallel and want to return it, you would need to combine the copies. There is an algorithm which combines results, to some extent, and that is called reduce; rather than for_each, you would use that. So there are some changes in the interfaces and in the requirements on the arguments, and I will try to point them out. So these are the constraints for sequenced execution. That is not particularly interesting in itself, but the interface changes are common to all three standard execution policies.

What is way more interesting, and certainly in the space which currently works for some implementations, and something which I am interested in, is par. This is basically executing things in parallel, thread parallelism, where the constraint is that none of the element access functions may introduce a data race. If you copy any of these things or execute any of these functions, there shall be no data race. In theory you can use locks inside any of these functions if you have to: I need to synchronize, so I take a lock. It is probably a bad idea, because you want to execute things in parallel, and locks, well, you could call them slowdowns instead of locks, that is more descriptive. Every time you acquire a lock you basically synchronize things and everybody else has to wait. But you can. The main thing, in comparison to the next one, is that there is no interleaved execution: within a thread, it is just a normal execution of your element access functions. And then there is something called par_unseq. The idea over here is that now the element access functions can execute interleaved. You may have multiple additions happening at the same time, or multiple element access functions being called, in theory, concurrently or in parallel on the same thread. So that is targeting things like SIMD parallelism or GPU parallelism rather than thread parallelism. Obviously, over here, in addition to not having any data races, you cannot take any lock: the thread could deadlock with itself, because two executions on the same thread could both try to acquire the lock and you would have a deadlock.
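A rough illustration of how the three policies might be used and what the lock constraint means; the size threshold and the callables are made up for the example.

```cpp
#include <algorithm>
#include <execution>
#include <mutex>
#include <vector>

void process(std::vector<double>& v) {
    auto work = [](double& x) { x = x * x; };

    if (v.size() < 10'000)
        std::for_each(std::execution::seq, v.begin(), v.end(), work);  // debugging / fallback
    else
        std::for_each(std::execution::par, v.begin(), v.end(), work);  // thread parallelism

    // Under par you may, if you really must, take a lock inside the callable
    // (it will serialize you); under par_unseq you must not, because calls can
    // be interleaved on one thread and the lock could deadlock with itself.
    std::mutex m;
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [&](double& x) { std::lock_guard<std::mutex> g(m); x += 1.0; });
    std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                  [](double& x) { x += 1.0; });   // vectorization-safe: no locks
}
```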
So these are the execution policies and their requirements. Moving on quickly to which algorithms are supported. Yes? (audience member speaking indistinctly) What is the default execution policy? Well, there is no default execution policy. It is actually the first argument you pass into an algorithm, so if you do not pass in anything, you get the existing sequential algorithms and none of this applies. If you pass in an execution policy, you know which execution policy you passed in. Does that make sense? Okay.

So the next question is, which algorithms are supported for parallelism? There is a huge number of parallel operations supported, but not all of the algorithms from the standard library are. Roughly speaking, it is all the things which make any sense to run in parallel; anything sub-linear, like anything which is logarithmic or just a constant-time operation, is probably not viable, or not reasonable, to do in parallel. Some algorithms seem to be unsupported but are really supported under a different name. Then there are some algorithms which are relatively rarely used and fairly complicated, and I am not sure we can make them parallel, so they are not there. And then there are some oddballs which are not supported and for which I have not found a good reason why they are not there. But I will go over those in detail. You are not expected to read this slide; it is just, qualitatively, everything which is green is supported and everything which is red is unsupported. So actually the majority of things are supported,
and that’s good. So the reasons things are not supported are, for example, these ones: they are all single-element operations, so it doesn’t make any sense to support those. The next group are logarithmic operations. Even if you have a tremendous range, you do not get enough operations to make it worthwhile to do these things in parallel; that is basically all the log-n algorithms, like finding something in a sorted range or inserting into a heap. Then we get a number of heap algorithms. These are linear, or n-log-n, algorithms, I think they are both linear, but the thing over here is that the heap operations mess with the entire range of the heap. Trying to do that in parallel, I don’t see how you would do it, and it is probably also something I cannot see a use case for. Similarly with the permutations: how do you compute permutations in parallel? It is relatively unclear, and although some people have claimed they have used these, I have never seen any of these algorithms used in production; I have seen them used in test drivers. Then there are some algorithms which are not supported because they overlap, and that is copy_backward and move_backward. The copy and move operations are actually not allowed to overlap in the parallel versions, because it doesn’t make sense to overlap them. It is okay if you are sequentially overlapping them because, assuming you move through in the right direction, you copy an element before you reach it as a destination, so that is okay, but in parallel it doesn’t make any sense, so copy_backward and move_backward don’t make any sense. Then we have accumulate, and likewise partial_sum: they are taken out because they are renamed; accumulate is basically called, what is it, reduce. And then we get these three oddballs, for which I have no reason why they are not there. I think sample in particular could be done; shuffle probably accesses everything, so maybe that is not a good idea; and iota is kind of a generator thing. I believe these things are not there, in particular shuffle and iota, because they just came too late into the standard, C++17 or even 14, and didn’t make it into the paper. That is my personal guess.

Okay, with that said, these are the supported algorithms. Now, not all of these supported algorithms are necessarily super-efficient, so let’s go the other way around and see how I would imagine these things can be implemented and which ones are expected to be fast. The obvious mass of parallelism, the embarrassingly parallel stuff, will basically be a variation of map. This is a whole host of algorithms which can just execute in parallel; they have no real interaction with each other, although, as was pointed out, something like transform or copy may have a destination that does not yet exist, in which case there actually is an interaction and they may need to fall back on a different algorithm. But if you execute them on random access ranges which do exist, they should all be map, and should be really fairly efficient, assuming you have enough operations in them. The idea of map is very simple: in the sequential case you execute one operation at a time, in parallel you execute things in parallel and just get a nice speedup, and with par_unseq you can execute multiple operations in parallel on each thread. As the slide shows, these things almost certainly won’t be entirely in parallel but rather slightly offset, but roughly speaking there should be a nice speedup if that is actually attainable.
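A couple more map-shaped calls, beyond transform, to make the category concrete; again these are my illustrations, not code from the talk.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Each element is processed independently, so the policy can simply split the range.
void map_examples(std::vector<int>& v) {
    std::fill(std::execution::par, v.begin(), v.end(), 0);
    std::replace_if(std::execution::par, v.begin(), v.end(),
                    [](int x) { return x < 0; }, 0);
    std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                  [](int& x) { x += 1; });   // also vectorizable: no locks, no races
}
```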
There are some additional constraints, as I mentioned: the for_each overloads with an execution policy do not return a function object as the sequential one does, and copy and move are not allowed to overlap. I didn’t find a constraint that copy_n is not allowed to overlap, so apparently you can copy n overlapping elements; I don’t think that is intentional, I think that is more likely an error. And then some of these algorithms may do something more complicated, maybe something like a reduce, maybe something like a gather-scatter thing, if you execute them on non-random-access iterators. But this set of algorithms should be expected to be really fast. And then there is a whole host of algorithms which can essentially be mapped to something like reduce. Reduce, essentially, can also execute things on parallel sequences, but after the sequences are processed,
you basically need to combine the results to achieve the final result. There is a whole bunch of these, including the find operations. I don’t know if the find operations can be done in a nicer way in general; probably there are other parallel algorithms which could deal with them, but reduce can certainly deal with them as well. And one thing which is kind of obvious from the idea of doing that: if you use one of the find operations in parallel, you probably only want to do that if you are looking for the needle in the haystack. If you expect to find something really soon, you don’t want the parallel find. It will find one of the matches, but if you kick the search off in parallel, it will search a number of subranges, and possibly omit later subranges, but it is almost certainly less efficient than doing it sequentially. So if you really are searching a huge range for the one element, if it exists, that may be reasonable; otherwise you are probably better off doing the sequential thing. So reduce: the sequential version does it sequentially, the parallel version executes, on each thread, a combination over a subrange and then has a combine step at the end. Again, this is expected to be really fast.

There are some changes: accumulate becomes reduce, as I mentioned. The operations in there need to be associative, otherwise you get a wrong result, because there is no promise that the operations are done in the same order at all; they can be done in a different order. And that is a little bit of a bummer, right? When we think of accumulate, probably in most cases we accumulate things like in a matrix multiplication, where all of these things are doubles, and double addition is clearly not associative. Now, the algorithm requires that the operation is associative. I don’t think it says that non-associative operations are entirely disallowed, and the net effect of using a non-associative operation may just be that you get a wrong, sorry, a different result. On the other hand, if you do double additions, you already know that you get a rounded result; you just get a differently rounded result. So there is a little bit of a tradeoff: you don’t necessarily get the result you expect, you get some other result which should still be fine. However, formally, these operations should be associative. And I already mentioned that find may be tricky.
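A small sketch contrasting accumulate with reduce, plus transform_reduce, which the talk comes back to below and which fuses the per-element operation with the combine step; the identifiers are mine.

```cpp
#include <execution>
#include <numeric>
#include <vector>

double sums(std::vector<double> const& a, std::vector<double> const& b) {
    // accumulate is strictly left-to-right; reduce may regroup and reorder,
    // so the operation should be associative (and commutative).
    double s1 = std::accumulate(a.begin(), a.end(), 0.0);
    double s2 = std::reduce(std::execution::par, a.begin(), a.end(), 0.0);
    // For doubles the two can differ slightly: floating-point addition is not
    // associative, so you get a differently rounded result, as discussed above.

    // transform_reduce fuses a per-element operation with the reduction,
    // e.g. a dot product in one pass (assumes a.size() == b.size()):
    double dot = std::transform_reduce(std::execution::par,
                                       a.begin(), a.end(), b.begin(), 0.0);
    return s1 + s2 + dot;
}
```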
So another bunch of algorithms: exclusive_scan and inclusive_scan. These are basically the same as the partial_sum stuff. The idea of partial_sum is that you compute not just, as in accumulate or reduce, the sum of all the elements with only the final result, but also the sums of all the intermediate prefixes, so you get an entire range back. Inclusive scan, exclusive scan, as well as partial_sum, seem to be algorithms you don’t use very often. Interestingly, they are rather useful for implementing other parallel algorithms, so there is actually, inside the parallel algorithms, a good use for these things. In particular, if you try to do a filter: if you filter things in different subranges, you get new subranges with different numbers of elements, and then you can do the partial sum to figure out where things really need to go. The algorithm is a little bit more complicated, but it is still a decent parallel algorithm, and you can do nice tricks so that, if you have enough processors, it runs in logarithmic time instead of linear time. But yeah, it is a bit more of a special case. There are two of them, inclusive_scan and exclusive_scan, and they are slightly annoying. The difference between the inclusive scan and the exclusive scan is: which value is used as the first element? For the exclusive scan, the first output element is the initial value you pass into the algorithm. For the inclusive scan, it is the first element of the sequence, combined with the initial value. Depending on what you are trying to do, one or the other is what you really want. As a result, it actually makes sense to expose both of these algorithms. You can, in theory, implement one in terms of the other, in either direction really, but you lose a little bit of efficiency in both directions.
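A tiny example of that difference, with numbers made up by me:

```cpp
#include <execution>
#include <numeric>
#include <vector>

void scans() {
    std::vector<int> v{1, 2, 3, 4};
    std::vector<int> incl(v.size()), excl(v.size());

    // inclusive_scan: output element i includes v[i].
    //   result: {1, 3, 6, 10}
    std::inclusive_scan(std::execution::par, v.begin(), v.end(), incl.begin());

    // exclusive_scan: output element i excludes v[i]; the first output element
    // is the initial value.
    //   result with initial value 0: {0, 1, 3, 6}
    std::exclusive_scan(std::execution::par, v.begin(), v.end(), excl.begin(), 0);
}
```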
To make that a little bit more annoying, these algorithms unfortunately have a slightly different parameter order. Similar to partial_sum, they take a range, and then they have an operation and an initial value. For the inclusive scan, the initial value is typically defaulted and you are mostly interested in the operation. For the exclusive scan, the initial value becomes the first element, so you are probably very interested in passing that, and instead you often have addition as the operation. So the last parameters, the initial value and the operation, are just reversed, and they are correspondingly defaulted. So that is what there is to say about these.

Then there is a bunch of entirely new algorithms: transform_exclusive_scan, transform_inclusive_scan, and transform_reduce. Essentially, the idea over here is that you want to do as much work on an element as you can at a time; you don’t want to run over the same sequence multiple times applying multiple operations. These algorithms fuse the operations into one: they fuse a transformation with whatever else they do. I could imagine that, in the future, with the ranges stuff, we get a lot more of that: we would produce a range which fuses many operations, and you would have a pipeline of operations, a bit like an expression template thing, where we build up a big operation on the individual elements and then execute those individually. For now, we get these. And then we get a number of gather or scatter style algorithms, which basically determine where things go and then distribute them. So you could imagine that over here, if these things make sense, if any of these operations, like the condition for copy_if, is relatively expensive, then you could run the copy_if conditions in parallel, figure out where things go, and then distribute them. This is a relatively complicated class of algorithms; I have not experimented with those, but I think they can be executed in parallel and should have a benefit. And then there are a number of remaining algorithms which I have left as specialized. The intention over here is that for all of the others, you have a few core algorithms and you map them onto those, and then there is a number of algorithms left over which the implementer probably has to implement individually. But despite the relatively big starting set, there are not that many parallel algorithms which, in the end, really have to be implemented; most of them just map onto other algorithms.

Okay, which leads me to availability. When I proposed this talk originally, the idea was: all right, cool, I will whip out the compiler, run things, and things should work. It turns out that none of the compilers, none of the standard libraries, currently ships with the parallel algorithms. According to P0024R1, the paper which proposed adoption of the parallel algorithms from the Parallelism TS, there are multiple implementations of all of these algorithms. I checked all of the implementations which were listed; at the point where I first checked, all of them were partial. None of them implemented any of the interesting unseq stuff; as far as I know, there is still nobody who implements the unseq stuff, because, most of the time, the only way I can imagine these being implemented does require compiler support, and I haven’t seen any compiler actually getting around to
implementing the unseq stuff. The idea there is that we pass in a function object and the compiler somehow figures out how to create a new function object which can be executed on multiple elements in parallel, or something like that. There is an expectation that these things will happen, and some of the players in that field seem to be working on it, but I haven’t seen any of it. So the proposal was a little bit optimistic, in my opinion, but nonetheless we got the algorithms. So the current implementation status is: if you look at that paper, you find, I think, six or seven implementations listed, all of them partial. None of them supports par_unseq. Most of the implementations don’t even try to fall back to a sequential implementation, to say, all right, we have at least all of the signatures and it compiles, it just runs sequentially, which I would have hoped for at the very least, to say you have tried to do these things; even that is not covered as quality of implementation. The only implementation which, for now,
seems to be reasonably complete, at least for all the algorithms I have tried, is HPX. HPX does not live in the correct standard namespace; it is an implementation, an open-source library you can download and compile. It provides all the interfaces with the correct semantics, and it seems to be complete on x86, or rather the 64-bit version, on Linux. I have not managed to get it to work on macOS, and it outright says it cannot work on ARM. So the implementations are around, but they are incomplete. If you try to use HPX, it tries to set up its own environment, and you need to do a little bit of faffing about, like main needs to delegate, after initialization, to a different kind of HPX main, and the algorithms live in hpx::parallel instead of the normal std namespace. Okay, with that, that is where things actually stand, considering all that stuff.

Yes? (audience member speaking indistinctly) Okay, so the question over there is, how do you configure how many threads are to be used? The answer is that, at the moment, there is no customization point to specify how many threads to use. In the future, the expectation is, as I said, that we hook these algorithms up with executors, and you would control the number of threads, or the available parallelization resources, via the executor. The current standard doesn’t speak about that at all; it just says, right, here are the algorithms, here are the execution policies, this is what you get.

Okay, so I tried these things on a couple of machines. One of the machines has plenty of cores: it is a Xeon Phi, 64 cores, four-way hyperthreaded, lots of memory, so that is a surrogate for something server-like. The middle one is a reasonably sized desktop machine, actually a Mac with four cores, two-way hyperthreaded, plenty of memory. And then the last one is an architecture which is always too slow: that is just a Raspberry Pi, but it has four cores, right?
Why not try it over there and see whether things work. And the first obvious test I wanted to do is: let’s try a map. So basically the moral equivalent of this loop, just using different parallel implementations. Given the state of the available implementations, I cooked up my own, plus used some of these interfaces as wrappers around TBB or OpenMP. So for all of these, this is not what the standard library necessarily does: it uses the standard interface, or something very close to it, just living in a different namespace, and a standard library implementation should hopefully outperform all of them. But this is what I could test with. All of these plots have a blue line which is the benchmark, the sequential implementation, in this case of for_each. There is a thread implementation and an async implementation, which are roughly speaking the code which I showed you earlier; the code earlier was for transform, this is just for for_each, but it does roughly the same. There is the par implementation around a home-grown thread pool, and a TBB implementation, which does basically the same thing, trying to wrap TBB. So the results look something like that on the Xeon Phi, and the red line down there is actually the TBB implementation. The green and yellow lines, async and thread, are the ones using raw threads, and then my implementation, trying to do a thread pool, is in the middle. The cool thing over here is that a good implementation easily outperforms the thread stuff. In fact, if I had run that not with GCC but with the Intel compiler on that machine, or with Clang, and their respective standard libraries, the results for the green and yellow lines would be even worse: they actually seem to create a new thread every time, and they do not outperform, they are actually slower than, the sequential implementation. But over here you can see there is actually quite a nice benefit already, although there is not a lot of work per element over here; we don’t really compute much, most of the time is just travelling through memory.
So on a desktop machine, none of these things work. Basically, if you look at the sizes, they are already fairly large, and for anything bigger there is not enough memory to make any sense of it; it would start going to disk, making things yet slower. So on these desktop machines, these things don’t help. Interestingly, on a Raspberry Pi, doing things in parallel does speed it up a little bit; at least we get about double the speed. At the end, where these lines go up, I think it probably hits the memory limit already; these ranges are substantially smaller, but it doesn’t have much memory.

Now, doing something a little bit more interesting: this is a nice computation using very little data. If you look at what it is, it is computing the Mandelbrot set. In the past, when I grew up, I computed a screen full of dots on a bad printer, and it took hours; this program now, with a huge number of points, still takes just seconds, but it is basically a nice big workload for each element. It is still the same for_each, still the same implementations, and the results look a lot more promising, as there is a huge speedup. You almost don’t see the line at the bottom; it looks as if it takes zero time. It is not 256 times faster, but it is, roughly speaking, in the best case something like 100 times faster. So it actually benefits from the hyperthreading, because it doesn’t access much memory; it just executes nice and fast. I have not been able to try to use SIMD operations to do the same thing; I would expect that to get things faster still. So if you have a nice workload, there is the potential of improving performance dramatically. The same is true for the desktop machine: if you make it not memory bound, you get all four cores busy and you execute four times as fast, which is nice. On the Raspberry Pi it is a little bit more confused, there is no consistent result, but it is still faster, and it keeps the machine busy.

Another interesting result may be sort, if you ever need to sort huge ranges. Unfortunately I didn’t manage to get any other parallel sorting algorithm to work; the HPX stuff does work, but only on the Linux machine. But there is a nice speedup here. Again, this is using 64 cores, but it only gets, I think, in the end, not quite 16 times faster, it seems to be eight-times-ish, which is still a nice speedup: if you need to sort huge arrays, it does benefit. Likewise on the desktop machine, and on the Raspberry Pi it is, again, a little bit more confused, but you get a speedup. And then finally, reduce, where you need to compute first and then combine the results. Again, relatively simple: it just computes the sum of a range. Again, there is a number of implementations which I tried: accumulate as the benchmark, par using the home-grown thread pool, the OpenMP implementation, which is basically the standard interface around a loop labelled with an OMP pragma, and the Threading Building Blocks version. They also get a nice speedup, even though it is just a relatively simple operation. On the desktop machine not as much, but it still gets faster, and on the Raspberry Pi it is a little bit confused. So all of these things do make sense, they do get faster, but you need to actually give them nice workloads and relatively large ranges.
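To make the nice-workload point concrete, here is a sketch of a compute-heavy, memory-light per-element job in the spirit of the Mandelbrot example; this is my illustration, not the speaker’s benchmark code.

```cpp
#include <algorithm>
#include <complex>
#include <execution>
#include <vector>

// Mandelbrot-style iteration count for one point: lots of arithmetic,
// hardly any memory traffic.
int iterations(std::complex<double> c, int limit = 1000) {
    std::complex<double> z = 0;
    int i = 0;
    while (i != limit && std::norm(z) <= 4.0) { z = z * z + c; ++i; }
    return i;
}

void render(std::vector<std::complex<double>> const& points,
            std::vector<int>& out) {
    out.resize(points.size());
    // Heavy, independent per-element work: the kind of loop where the
    // parallel policy actually pays off.
    std::transform(std::execution::par,
                   points.begin(), points.end(), out.begin(),
                   [](std::complex<double> c) { return iterations(c); });
}
```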
One of the things which I didn’t show is that most of these plots start at already fairly large sizes. If you make the ranges just a bit smaller, the parallel algorithms tend to be way slower; even my own implementation is sometimes a few thousand times slower. Sure, at that size the algorithm doesn’t take long at all, but it is still slower, and the OpenMP and TBB versions also tend to be quite substantially slower, some just a bit, some massively. So I would hope that a proper standard library implementation covers this area more nicely, to basically
decide that, oh, this range is too small, I will do a sequential version rather than trying to schedule things.

Okay, so as general guidance: I have tried some of these things with the HPX implementation using forward and bidirectional iterators, and it is a lot worse; none of them get nearly as much speedup. So in general, try to use random access iterators if you want to use these things, and your operations need to be worth parallelizing: you need a sequence with a large range or really expensive operations. There are a number of future directions, and I already mentioned some of them. We do expect that the algorithms will be augmented to take an executor as the first parameter, where the executor is a user-specifiable customization point with which you say how and where things are executed. Hartmut already mentioned that in HPX there is often a version which implements future support: instead of waiting on the calling thread, you get back a future, by passing another parameter, or a different execution policy, to say, all right, I want a future back; then you don’t wait on that thread, you just combine these things and wait somewhere else, which allows for much nicer balancing and utilization of the cores. I would expect that there will also eventually be some control over chunking, so you can decide what grain size should be used, but none of these things exist yet; there is work on the executors, but I haven’t seen proposals in the other spaces. One of the things which I am personally interested in is saying, yes, I would like to use an interesting execution policy which says constexpr. A lot of people ask for constexpr algorithms. Now that we have a hook, maybe sticking in a constexpr-like execution policy could be a nice way to avoid having to make the general-purpose algorithms constexpr, which would mean giving up certain optimizations. Of course, if you look at something like sort, it may actually want to invoke memcpy on integers rather than trying to swap them, but a constexpr version would not allow that. So that is a potential idea for, let’s say, abusing the execution policy. And yeah. So, in my opinion, using STL algorithms is good, you can then sprinkle in the execution policies, and the parallel algorithms work best on large, random access ranges. And with that, I think we have some time for questions.

Yes. (audience member speaking indistinctly) Yes, okay. So the question is about my mention, just on the sequential policy, that it may call terminate, and about the other constraints, like the interface changes. All of the changes I mentioned on the sequential policy, to get them out of the way, apply to all of the standard library execution policies. There is potential in the future that we may get execution policies which say, oh, if you throw an exception it won’t terminate but does something else, but at the moment the element access functions are constrained not to throw exceptions; otherwise you get terminate. (audience member speaking indistinctly) The question is, when will all these features be available in the standard compilers and libraries? The answer to that is, I actually have no idea. I could imagine that some implementations decide to actually use HPX, which would at least give them the parallel stuff and potentially some level of par_unseq where they can
figure out some SIMD stuff. As for compiler support, I have no idea; the compiler vendors tend to be a little bit, well, I cannot look at what they are working on in general. Maybe I could, maybe we can ask Richard, but I don’t think Richard is working on that area, and he is shaking his head wildly, so he isn’t. So I hope that implementations will emerge soon. Part of the reason for doing this presentation is so that people ask the compiler vendors and drum up interest, because there is one thing: if nobody is interested in this stuff, most vendors will probably go,
well, there is no user interest; we will implement it once there is user interest. But if there is no user interest because nothing is implemented, it is a little bit of a chicken-and-egg problem. You have to be interested in this stuff to give the vendors an incentive to produce it, and the vendors need to produce it so that you get interested. (audience member speaking indistinctly) Okay, so the question is about the set of guarantees the caller can give to the algorithm so that the algorithm by itself doesn’t decide to throw an exception. I think none of the standard library algorithms currently specify that they ever throw an exception. Any exception thrown would come down to undefined behavior, and there is no guarantee about what happens under undefined behavior, for example if something accesses out-of-bounds data as part of a function being called. Otherwise, I think the standard would actually have to list it if an algorithm throws an exception willingly, so I would expect that no exception is thrown. I would need to go through all the standard library algorithms to figure out what exactly they say, but I don’t think the standard library algorithms are allowed to throw exceptions just because they feel like it. Okay, any other questions? Yes. (audience member speaking indistinctly) The question is whether a corresponding feature is available in other languages. I am sure that there are some languages which provide some parallel algorithms, but I am not an expert on any of those, so I am not aware of any mainstream language which provides a corresponding set of parallel algorithms. Okay? If there are no more questions, then thank you very much. (applause)