Cuch, you are obsessed with GPUs. You do realise that GPUs are the tiniest niche (though growing) of the HPC market? It's all about MPI, and to a lesser extent OpenMP.
Side note: one should not think about accelerators (GPU, FPGA, etc.) in "SPMD" and similarly archaic terms - this whole classification, invented by Michael J. Quinn, is way outdated
Well, no. The book by Mattson et al. discusses many parallel design patterns.
NO.
First C. Then C++.
// It's Michael J. Flynn, BTW :D
While you can utilize MIMD parallelism in the SIMT model, it is not the same thing (see my above post). And you need to know what SIMD is to explain what SIMT is and how it differs (again, see above). I'd call that quite helpful.

Edit: I see now that Daniel was faster in mentioning where the Michael J. Quinn name came from. But while we're at it, let me quote what Quinn had to say about Flynn (note that my copy of the book was published in 2004):
"Most contemporary parallel computers fall into Flynn's MIMD category. Hence the MIMD designation is not particularly helpful when describing modern parallel architectures".
CUDA, on the other hand, is basically SIMD at its top level: You issue an instruction, and many units execute that same instruction. There is an ability to partition those units into separate collections, each of which runs its own instruction stream, but there aren’t a lot of those (4, 8, or so). Nvidia calls that SIMT, where the “T” stands for “thread” and I refuse to look up the rest because this has a perfectly good term already existing: MSIMD, for Multiple SIMD.
Unfortunately, some of the folks there appear to have been confused by Nvidia-speak. But you don’t have to take my word for that. See this excellent tutorial from SIGGRAPH 09 by Kayvon Fatahalian of Stanford. On pp. 49-53, he explains that in “generic-speak” (as opposed to “Nvidia-speak”) the Nvidia GeForce GTX 285 does have 30 independent MIMD cores, but each of those cores is effectively 1024-way SIMD: It works with groups of 32 “fragments” running as I described above, multiplied by 32 contexts also sharing the same instruction stream for memory stall overlap. So, to get performance you have to think SIMD out to 1024 if you want to get the parallel performance that is theoretically possible. Yes, then you have to use MIMD (SPMD) 30-way on top of that, but if you don’t have a lot of SIMD you just won’t exploit the hardware.
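Just to make the arithmetic in that excerpt concrete, here is a minimal back-of-the-envelope sketch (plain host-side C++; the figures 30, 32 and 32 are taken straight from the quote above, not queried from any real device):

#include <cstdio>

int main() {
    // Figures from Pfister's GTX 285 example (quoted above).
    const int mimdCores       = 30;  // independent cores (the MIMD/SPMD level)
    const int simdWidth       = 32;  // "fragments" sharing one instruction
    const int contextsPerCore = 32;  // extra contexts used to hide memory stalls

    const int perCoreParallelism = simdWidth * contextsPerCore;    // 1024-way per core
    const int threadsInFlight    = mimdCores * perCoreParallelism; // 30720 for the whole chip

    printf("Per-core parallelism: %d-way\n", perCoreParallelism);
    printf("Threads in flight across the chip: %d\n", threadsInFlight);
    return 0;
}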
The GPU itself replaced the SIMD array with MIMD-capable Compute Units (CU), which bring support for C++ in the same way NVIDIA did with Fermi, but AMD went beyond Fermi's capabilities with the aforementioned IOMMU.
Graphics Core Next: A True MIMD
AMD adopted a smart compute approach. Graphics Core Next is a true MIMD (Multiple-Instruction, Multiple-Data) architecture. With the new design, the company opted for "fat and rich" processing cores that occupy more die space, but can handle more data. AMD cites loading the CU with multiple command streams, instead of the conventional GPU load: "fire a billion instructions off, wait until they all complete". A single Compute Unit can handle 64 FMAD (Fused Multiply-Add) or 40 SMT (Simultaneous Multi-Threading) waves. Wonder how many MIMD instructions GCN can take? Four threads. Four-thread MIMD or 64 SIMD instructions, your call. As Eric explained, Southern Islands is a "MIMD architecture with a SIMD array".
// BTW, 2004 would be before CUDA 1.0 (2006), right?
Right, and this is exactly why I mentioned the publication year in my previous post: even 8 years ago, many parallel programming practitioners considered Flynn's categorization obsolete.
I claim that for GPU programming in particular, this categorization is not helpful in clarifying the programming model (albeit I have to admit that Patterson/Hennessy are pretty much right on in placing GPUs in this scheme of things), and that on the other hand there exist much more important characteristics of the GPU programming model when one has to decide whether it would be worth trying to implement a given algorithm on a GPU architecture or not. Or, to put it simply and get back to where this whole excursion started: for GPUs and other accelerators (as well as for any type of parallel architecture today), things are much more complex than "is it SIMD, or MIMD, or whatever".
Oh, and - you definitely shouldn't cite Greg Pfister about anything GPU-related: this guy has some good info to share from time to time (like about MIC, etc.), but is mostly an Intel fanboy, and a poster example of clueless writing about GPU architecture. For example (albeit the blog post you cited is rather old), the statement that "...to get performance you have to think SIMD out to 1024..." is completely wrong - as mentioned above, the "SIMD granularity" for CUDA is 32 at the moment, if you want to extract full performance (you could still get parallelism with smaller granularity, but at the expense of sacrificing performance).
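To illustrate the granularity point with a minimal (and entirely hypothetical) CUDA sketch - the kernel names and sizes are made up, but the mechanics are standard: threads of the same warp that take different branches get serialized, while branching only at whole-warp boundaries costs nothing extra:

// Diverges *within* each warp (lanes 0..15 vs 16..31): the warp executes
// both paths one after the other, so you pay for both.
__global__ void branchyKernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x % 32) < 16)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

// Branches only at warp granularity (each warp of 32 takes a single path,
// assuming blockDim.x is a multiple of 32), so no intra-warp serialization.
__global__ void warpUniformKernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}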
All those refinements build on and can be contrasted with SIMD and MIMD -- and, frankly, I wouldn't trust someone who doesn't know the basic building blocks and their role to be competent with anything "much more complex."

Architectural Design Patterns for Data-Parallel Accelerators
- MIMD Architectural Design Pattern
- Vector-SIMD Architectural Design Pattern
- Subword-SIMD Architectural Design Pattern
- SIMT Architectural Design Pattern
- VT Architectural Design Pattern
We first introduce a set of five architectural design patterns for DLP cores in Section 2, qualitatively comparing their expected programmability and efficiency. The MIMD pattern [10] flexibly supports mapping data-parallel tasks to a collection of simple scalar or multithreaded cores, but lacks mechanisms for efficient execution of regular DLP. The vector-SIMD [28, 31] and subword-SIMD [8] patterns can significantly reduce the energy on regular DLP, but can require complicated programming for irregular DLP. The single-instruction multiple-thread (SIMT) [17] and vector-thread (VT) [15] patterns are hybrids between the MIMD and vector-SIMD patterns that attempt to offer alternative tradeoffs between programmability and efficiency.
. . .
The single-instruction multiple-thread (SIMT) pattern is a hybrid pattern with a programmer’s logical view similar to the MIMD pattern but an implementation similar to the vector-SIMD pattern.
. . .
The NVIDIA Fermi graphics processor is a good example of this pattern, with 32 SIMT cores each with 16 lanes, suitable for graphics as well as more general data-parallel applications [23].
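To give a concrete feel for that "logical view similar to MIMD, implementation similar to vector-SIMD" description, here is a minimal, hypothetical CUDA kernel (names and launch sizes are just for illustration): the programmer writes ordinary scalar, per-thread code, and the hardware executes it in fixed-width lockstep groups (warps of 32 on NVIDIA parts):

// SAXPY in the SIMT style: the source reads like scalar per-thread code,
// but the hardware runs it 32 threads at a time in lockstep (a warp).
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread handles one element
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Illustrative host-side launch:
//   int threads = 256;                        // 8 warps per block
//   int blocks  = (n + threads - 1) / threads;
//   saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);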
Further, I can't really see how "this blog post that you cited is rather old" (2008) is relevant to the fact that the warp size (a warp being a thread of SIMD instructions) has been fixed at 32 since CUDA compute capability 1.0 (2006) and remains fixed at 32 in CUDA compute capability 3.5 (Kepler GK110; 2012). If anything, comments like that look a bit silly and don't make for a very convincing argument ;-)
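For what it's worth, nobody even has to hard-code the 32 - the CUDA runtime reports it per device (a minimal sketch, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // query device 0
    printf("warpSize = %d\n", prop.warpSize); // 32 from compute capability 1.0 through 3.5
    return 0;
}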
When I mentioned that Pfister's blog post is rather old, it was a sort of apology on his behalf - I guess he has come to understand the CUDA architecture better in the meantime, and that these days he wouldn't make factual errors like the one I pointed out in the excerpt you cited. Maybe I should have been more clear about that, but otherwise I don't get what exactly it is that you found silly in my previous message...
Other than that, I really had no intention to comment more on this topic, mostly because I said what I had to say, and because I'm not sure any more that I understand what your exact point is. Mine was, from the beginning, that putting too much weight on categorizations of parallel architectures (in particular according to the Flynn scheme), which it seemed to me Daniel was doing in his comments, is not very helpful or productive. I'm of course drawing on my personal experience for this opinion: I've been doing mostly HPC work (in oil & gas, nuclear engineering, a little bit of quant finance, etc.) for my living for about 15 years - MPI all the time, OpenMP for about 10 years, and CUDA for about 5 years - and never throughout that time have I seen any particular use for Flynn's or any other categorization. Indeed, for many of these projects I had no say in the architecture/API to be used, but for some I actually did have to make this type of decision. But I would never start from, or use, categorizations in the decision process, as they are just too broad; instead, I'd typically examine the algorithm and the programming models for the architectures under consideration, then do some sort of performance analysis, maybe build prototypes, then of course do a cost analysis, etc. - and then make my decision.
I don't know - maybe your experience is different, maybe you have found taking categorizations into account very important in your work? In any case, I certainly wasn't claiming that categorizations have no value at all - for example, I guess at least for novices in the field they may have educational value - but I still think that at least some sort of agreement on an extension of Flynn's categorization would be needed as a starting point, and I don't see that happening in the field (which probably speaks for itself about the perceived importance of all of this among practitioners).
What is the point of arguing what to learn first?
Because: get the fundamentals right; later is too late.
I use C# a lot, but it is not all that useful if you ever want to learn C++.
I think the problem with quant wannabes is that they want to skip through the essential parts to gain immediate entry.
Quant does not have a monopoly in this regard; it's endemic.
Besides all the previous posts: if you want to be part of HFT, there is no option, it must be C/C++.
First of all, you don't know what I use C# for... then you say it is not useful. Don't you think you are prejudiced about something with no relevant facts to support your claim?
Secondly, C/C++ is the core of most FE stuff, but there are also companies using Java and C#.
I think the key point is that it's easier to land a job in FE areas if someone knows how to use C++ in the most efficient way.
With due respect, you mentioned that C# is not that useful - or is it because you have not mastered it yet?
I think we live in the modern internet era. Think about it: if C# had been born in the same year as C++, perhaps history would be different. Because the old-school and current quants are mostly using C++, they have resistance to replacing C++.
[...] If you want good parallel code, you can wave goodbye to objects. C++ doesn't even have a proper MPI library.