Best first programming language

  • Thread starter: Dtm
Cuch, you are obsessed with GPUs. You do realise that GPUs are the tiniest niche (though a growing one) of the HPC market. It's all about MPI, and to a lesser extent OpenMP. If you want good parallel code, you can wave goodbye to objects. C++ doesn't even have a proper MPI library.
 
Cuch, you are obsessed with GPUs. You do realise that GPUs are the tiniest niche (though a growing one) of the HPC market. It's all about MPI, and to a lesser extent OpenMP.

Thanks, Barny. Now that you mention it you might have a point.
 
Side note: One should not think about accelerators (GPU, FPGA, etc.) in "SPMD" and similarly archaic terms - this whole classification invented by Michael J. Quinn is way outdated

Well, no. The book by Mattson et al. discusses many parallel design patterns.

Strictly speaking, it's SIMT -- http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html -- a bit more flexible than SIMD, but it's not exactly a revolutionary shift that "changes everything." And Flynn's taxonomy is still useful and applicable in explaining how much we can achieve (relating SIMT to SIMD, among others) and in cutting through NVIDIA's idiosyncratic terminology (reinventing the terms for the concepts that were well-known in the computer architecture community well before anyone had even heard of GPUs) -- for more on that, see the paragraph that starts with "NVIDIA decided that the unifying theme of all these forms of parallelism is the CUDA Thread": http://books.google.com/books?id=v3-1hVwHnHwC&pg=PA289
In particular: http://books.google.com/books?id=v3-1hVwHnHwC&pg=PA292
A "CUDA Thread" is nothing else than a sequence of SIMD lane operations. So, yeah, you still need the same taxonomy to describe what SIMT is.

// A useful higher-level concept in this context is data-level parallelism.
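
// To make that concrete, here is a minimal sketch in plain standard C++ (the names and numbers are illustrative, nothing CUDA-specific is assumed): the same operation is applied independently to every element, which is exactly the shape of computation that maps naturally onto SIMD lanes or SIMT threads.

#include <algorithm>
#include <vector>

int main() {
    std::vector<float> x(1024, 1.0f), y(1024, 2.0f), z(1024);
    // One operation, many independent elements: a vectorizing compiler can
    // map this onto SIMD lanes; a GPU would map it onto SIMT threads.
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [](float a, float b) { return 2.0f * a + b; });
    return 0;
}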

// It's Michael J. Flynn, BTW :D

NO.
First C. Then C++.

Well, here I'd disagree. IMHO you shouldn't even know about stuff like "new" (let alone "malloc") before you've become proficient at smart pointers -- it's not 1998 anymore, we have C++11 now. Right now, the only book that I think does it right is C++ Primer (5th Edition); here's the relevant chapter: http://www.informit.com/articles/article.aspx?p=1944072
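
To illustrate (a minimal C++11 sketch; the Instrument type is made up for the example, it's not an excerpt from the Primer): a beginner can allocate dynamically and share ownership without ever writing new or delete.

#include <memory>
#include <string>
#include <vector>

struct Instrument {          // illustrative type, nothing domain-specific
    std::string name;
    double price;
};

int main() {
    // Dynamic allocation without new/delete: make_shared builds the object
    // and hands back a reference-counted owner.
    auto bond = std::make_shared<Instrument>(Instrument{"10Y bond", 101.5});

    std::vector<std::shared_ptr<Instrument>> portfolio;
    portfolio.push_back(bond);      // shared ownership: the count goes to 2

    // The Instrument is destroyed automatically when the last shared_ptr
    // (bond or the vector element) goes away -- no leaks, no delete.
    return 0;
}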
 
// It's Michael J. Flynn, BTW :D

Indeed. This is discussed in the book by Michael J. Quinn :D What's in a name?

NVIDIA's idiosyncratic terminology (reinventing the terms for the concepts that were well-known in the computer architecture community well before anyone had even heard of GPUs) -- for more on that, see the paragraph that starts with "NVIDIA decided that the unifying theme of all these forms of parallelism is the CUDA Thread":

A little knowledge is a dangerous thing (Alexander Pope).
 
// It's Michael J. Flynn, BTW :D

Indeed, thanks for the correction. I guess it was a subconscious typo, as I find Michael J. Quinn's work more useful exactly in the context discussed above - for example, his book is great both as a tutorial and a reference for MPI (and, to a somewhat lesser extent, for OpenMP), but also surprisingly useful for its overview of parallel program design and analysis (following Ian Foster's methodology).

Edit: I see now that Daniel was faster in mentioning where the Michael J. Quinn name came from. But while we're at it, let me quote what Quinn had to say about Flynn (note that my copy of the book was published in 2004):
"Most contemporary parallel computers fall into Flynn's MIMD category. Hence the MIMD designation is not particularly helpful when describing modern parallel architectures".
 
Edit: I see now that Daniel was faster in mentioning where the Michael J. Quinn name came from. But while we're at it, let me quote what Quinn had to say about Flynn (note that my copy of the book was published in 2004):
"Most contemporary parallel computers fall into Flynn's MIMD category. Hence the MIMD designation is not particularly helpful when describing modern parallel architectures".
While you can utilize MIMD parallelism in the SIMT model, it is not the same thing (see my above post). And you need to know what SIMD is to explain what SIMT is and how it differs (again, see above). I'd call that quite helpful.

Some would use the "MSIMD" term here:
CUDA, on the other hand, is basically SIMD at its top level: You issue an instruction, and many units execute that same instruction. There is an ability to partition those units into separate collections, each of which runs its own instruction stream, but there aren’t a lot of those (4, 8, or so). Nvidia calls that SIMT, where the “T” stands for “thread” and I refuse to look up the rest because this has a perfectly good term already existing: MSIMD, for Multiple SIMD.

Unfortunately, some of the folks there appear to have been confused by Nvidia-speak. But you don’t have to take my word for that. See this excellent tutorial from SIGGRAPH 09 by Kayvon Fatahalian of Stanford. On pp. 49-53, he explains that in “generic-speak” (as opposed to “Nvidia-speak”) the Nvidia GeForce GTX 285 does have 30 independent MIMD cores, but each of those cores is effectively 1024-way SIMD: It has groups of 32 “fragments” running as I described above, multiplied by 32 contexts also sharing the same instruction stream for memory stall overlap. So, to get performance you have to think SIMD out to 1024 if you want to get the parallel performance that is theoretically possible. Yes, then you have to use MIMD (SPMD) 30-way on top of that, but if you don’t have a lot of SIMD you just won’t exploit the hardware.

For comparison, consider the Graphics Core Next (GCN) architecture of the recent HD 77xx-79xx Southern Islands family of cards:
http://www.brightsideofnews.com/new...-mix-gcn-with-vliw4--vliw5-architectures.aspx
The GPU itself replaced SIMD array with MIMD-capable Compute Units (CU), which bring support for C++ in the same way NVIDIA did with Fermi, but AMD went beyond Fermi's capabilities with aforementioned IOMMU.

Graphics Core Next: A True MIMD
AMD adopted a smart compute approach. Graphic Core Next is a true MIMD (Multiple-Instruction, Multiple Data) architecture. With the new design, the company opted for "fat and rich" processing cores that occupy more die space, but can handle more data. AMD is citing loading the CU with multiple command streams, instead of conventional GPU load: "fire a billion instructions off, wait until they all complete". Single Compute Unit can handle 64 FMAD (Fused Multiply Add) or 40 SMT (Simultaneous Multi-Thread) waves. Wonder how much MIMD instructions can GCN take? Four threads. Four thread MIMD or 64 SIMD instructions, your call. As Eric explained, Southern Islands is a "MIMD architecture with a SIMD array".

The first card from this series was released in February 2012 -- so it's a relatively recent development -- and, as you can see, there's still a trade-off between SIMD and MIMD utilization.

// BTW, 2004 would be before CUDA 1.0 (2006), right?
 
// BTW, 2004 would be before CUDA 1.0 (2006), right?

Right, and this is exactly why I mentioned the publication year in my previous post: even 8 years ago, many parallel programming practitioners considered the Flynn categorization obsolete.

I claim that for GPU programming in particular, this categorization is not helpful in clarifying the programming model (although I have to admit that Patterson/Hennessy are pretty much right on in placing GPUs within this scheme of things), and that there are far more important characteristics of the GPU programming model to weigh when deciding whether it is worth trying to implement a given algorithm on a GPU architecture. Or, to put it simply and bring it back to where this whole excursion started: for GPUs and other accelerators (as indeed for any type of parallel architecture today), things are much more complex than "is it SIMD, or MIMD, or whatever".

Oh, and - you definitely shouldn't cite Greg Pfister about anything GPU related: this guy has some good info to share from time to time (about MIC, etc.), but he is mostly an Intel fanboy, and a poster example of clueless writing about GPU architecture. For example (although this blog post that you cited is rather old), the statement that "...to get performance you have to think SIMD out to 1024..." is completely wrong - as mentioned above, "SIMD granularity" for CUDA is 32 at the moment if you want to extract full performance (you can still get parallelism with smaller granularity, but at the expense of performance).
 
Right, and this is exactly why I mentioned the publication year in my previous post: even 8 years ago, many parallel programming practitioners considered the Flynn categorization obsolete.

I claim that for GPU programming in particular, this categorization is not helpful in clarifying the programming model (although I have to admit that Patterson/Hennessy are pretty much right on in placing GPUs within this scheme of things), and that there are far more important characteristics of the GPU programming model to weigh when deciding whether it is worth trying to implement a given algorithm on a GPU architecture. Or, to put it simply and bring it back to where this whole excursion started: for GPUs and other accelerators (as indeed for any type of parallel architecture today), things are much more complex than "is it SIMD, or MIMD, or whatever".

Oh, and - you definitely shouldn't cite Greg Pfister about anything GPU related: this guy has some good info to share from time to time (about MIC, etc.), but he is mostly an Intel fanboy, and a poster example of clueless writing about GPU architecture. For example (although this blog post that you cited is rather old), the statement that "...to get performance you have to think SIMD out to 1024..." is completely wrong - as mentioned above, "SIMD granularity" for CUDA is 32 at the moment if you want to extract full performance (you can still get parallelism with smaller granularity, but at the expense of performance).

I think you're reading too much into Quinn's 2004 statement and (wrongly) extrapolating to parallel architectures not described in his book (which indeed focuses mostly on the MIMD architectures popular at that time); in particular, today there's even less evidence of this "obsolescence", with further refinements in heterogeneous computing architectures (see below).

Yeah, I definitely prefer Hennessy & Patterson terminology. "Much more complex" is a relative thing, but I think you still need to know about SIMD and MIMD in order to understand SIMT and, consequently, strengths and limitations of CUDA-style GPGPU. So, there's nothing obsolete about it. If anything, you may consider even further refinements along the data-parallel line, as in the ones described by Lee and others:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-129.html
http://www.icsi.berkeley.edu/pubs/arch/exploringthetradeoffs11.pdf

Architectural Design Patterns for Data-Parallel Accelerators
- MIMD Architectural Design Pattern
- Vector-SIMD Architectural Design Pattern
- Subword-SIMD Architectural Design Pattern
- SIMT Architectural Design Pattern
- VT Architectural Design Pattern

We first introduce a set of five architectural design patterns for DLP cores in Section 2, qualitatively comparing their expected programmability and efficiency. The MIMD pattern [10] flexibly supports mapping data-parallel tasks to a collection of simple scalar or multithreaded cores, but lacks mechanisms for efficient execution of regular DLP. The vector-SIMD [28, 31] and subword-SIMD [8] patterns can significantly reduce the energy on regular DLP, but can require complicated programming for irregular DLP. The single-instruction multiple-thread (SIMT) [17] and vector-thread (VT) [15] patterns are hybrids between the MIMD and vector-SIMD patterns that attempt to offer alternative tradeoffs between programmability and efficiency.

. . .

The single-instruction multiple-thread (SIMT) pattern is a hybrid pattern with a programmer’s logical view similar to the MIMD pattern but an implementation similar to the vector-SIMD pattern.

. . .

The NVIDIA Fermi graphics processor is a good example of this pattern with 32 SIMT cores each with 16 lanes suitable for graphics as well as more general data-parallel applications [23].

All those refinements build on and can be contrasted with SIMD and MIMD -- and, frankly, I wouldn't trust someone who doesn't know the basic building blocks and their role to be competent with anything "much more complex."

As for your comment "as mentioned above, "SIMD granularity" for CUDA is 32 at the moment if you want to extract full performance (you can still get parallelism with smaller granularity, but at the expense of performance)" -- yeah, exactly, as in: NOT a MIMD.

BTW, one more note -- the fact that you feel that "Greg Pfister is . . . mostly an Intel fanboy" doesn't change the fact that MSIMD is a good descriptive alternative to SIMT (although we may prefer to stay with H&P terminology), and I do agree that the "think SIMD out to 1024" part isn't well formulated. Further, I can't really see how "this blog post that you cited is rather old" (2008) is relevant to the fact that a warp (thread of SIMD instructions) size has been fixed at 32 since CUDA compute capability version 1.0 (2006) and remains fixed at 32 in CUDA compute capability version 3.5 (Kepler GK110; 2012). If anything, comments like that look a bit silly and don't make for a very convincing argument ;-)
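
For what it's worth, you don't even have to hard-code the 32: the warp size is exposed through the device properties (a minimal CUDA runtime sketch; error checking omitted for brevity), and it reports 32 on every compute capability released so far.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // properties of device 0
    std::printf("warpSize = %d\n", prop.warpSize);  // 32 from CC 1.0 through 3.5
    return 0;
}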
 
Further, I can't really see how "this blog post that you cited is rather old" (2008) is relevant to the fact that a warp (thread of SIMD instructions) size has been fixed at 32 since CUDA compute capability version 1.0 (2006) and remains fixed at 32 in CUDA compute capability version 3.5 (Kepler GK110; 2012). If anything, comments like that look a bit silly and don't make for a very convincing argument ;-)

When I mentioned that Pfister's blog post is rather old, it was a sort of apology on his behalf - I guess he has come to understand the CUDA architecture better in the meantime, and that these days he wouldn't make factual errors like the one I pointed out in the excerpt you cited. Maybe I should have been clearer about that, but otherwise I don't get what exactly it is that you found silly in my previous message...

Other than that, I really had no intention to comment further on this topic, mostly because I've said what I had to say, and because I'm no longer sure I understand what your exact point is. Mine was, from the beginning, that putting too much weight on categorizations of parallel architectures (in particular according to the Flynn scheme), which is what Daniel seemed to me to be doing in his comments, is not especially helpful or productive. I'm of course drawing on my personal experience for this opinion: I've been doing mostly HPC work (in oil & gas, nuclear engineering, a little bit of quant finance, etc.) for my living for about 15 years, MPI all the time, OpenMP for about 10 years, and CUDA for about 5 years, and never once throughout that time have I seen any particular use for Flynn's or any other categorization. Indeed, for many of these projects I had no say in the architecture/API to be used, but for some I actually did have to make this type of decision. But I would never start from, or use, categorizations in the decision process, as they are just too broad; instead, I'd typically examine the algorithm and the programming models for the architectures under consideration, then do some sort of performance analysis, maybe build prototypes, then of course do a cost analysis, etc. - and then make my decision.

I don't know - maybe your experience is different, maybe you have found taking categorizations into account very important in your work? In any case, I certainly wasn't claiming that categorizations have no value at all - for example, I guess that at least for novices in the field they may have educational value. But I still think that some sort of agreement on an extension of the Flynn categorization would be needed to start with, and I don't see that happening in the field (which probably speaks for itself about the perceived importance of all of this among practitioners).

Finally, I'd also like to stress that I'm certainly not advocating that GPUs are some sort of "best" parallel architecture today. It's just that GPU programming specialists are in high demand and well paid, and admittedly this is the main reason I'm doing GPU programming these days. What matters for me is just that the programming model is bearable (what matters much more is that the work is interesting and challenging), and I'd happily switch to MIC or any other accelerator or whatever architecture may come up as a better alternative in the future.
 
When I mentioned that Pfister's blog post is rather old, it was a sort of apology on his behalf - I guess he has come to understand the CUDA architecture better in the meantime, and that these days he wouldn't make factual errors like the one I pointed out in the excerpt you cited. Maybe I should have been clearer about that, but otherwise I don't get what exactly it is that you found silly in my previous message...

I think you might have a pretty good idea it wasn't just the date comment -- if you really need it spelled out, however, I'd helpfully point out that the ad hominem parts ("mostly an Intel fanboy," "a poster example of clueless writing") are somewhat unbefitting a respectable level of discourse. Especially coming from a source apparently confusing MIMD with SIMT (I also found it a bit curious, if not amusing, that you'd focus on one badly formulated sentence in an excerpt explicitly included as a comment on the terminology / Multiple SIMD -- especially given that the correctness or incorrectness of that sentence had no bearing on the (correct) architectural designation). Oh well, it is an Internet forum after all; perhaps I should just lower my expectations w.r.t. the discussion level...

Other than that, I really had no intention to comment further on this topic, mostly because I've said what I had to say, and because I'm no longer sure I understand what your exact point is. Mine was, from the beginning, that putting too much weight on categorizations of parallel architectures (in particular according to the Flynn scheme), which is what Daniel seemed to me to be doing in his comments, is not especially helpful or productive. I'm of course drawing on my personal experience for this opinion: I've been doing mostly HPC work (in oil & gas, nuclear engineering, a little bit of quant finance, etc.) for my living for about 15 years, MPI all the time, OpenMP for about 10 years, and CUDA for about 5 years, and never once throughout that time have I seen any particular use for Flynn's or any other categorization. Indeed, for many of these projects I had no say in the architecture/API to be used, but for some I actually did have to make this type of decision. But I would never start from, or use, categorizations in the decision process, as they are just too broad; instead, I'd typically examine the algorithm and the programming models for the architectures under consideration, then do some sort of performance analysis, maybe build prototypes, then of course do a cost analysis, etc. - and then make my decision.

"Too much weight" -- sure, that's a bad idea by definition (otherwise it wouldn't be "too much"). I'm not sure if that's what Daniel thinks, though (let's not put words into his mouth) -- what I think is that it is still true that for certain styles of parallelism GPUs are more ideal than for the others. So, "it's all MIMD anyway" would be just another extreme position from my POV (and also factually incorrect) which I don't find very helpful at all.

I don't know - maybe your experience is different, maybe you have found taking categorizations into account very important in your work? In any case, I certainly wasn't claiming that categorizations have no value at all - for example, I guess that at least for novices in the field they may have educational value. But I still think that some sort of agreement on an extension of the Flynn categorization would be needed to start with, and I don't see that happening in the field (which probably speaks for itself about the perceived importance of all of this among practitioners).

It is useful, e.g., for ruling certain algorithms out and ruling certain others in. To give an example, take the ziggurat algorithm: it's a pretty good choice for a CPU, but not a good choice for a GPU. The reason is the cost of branching, which in turn comes from the trade-offs the SIMT architecture makes. This also makes the Wallace algorithm an interesting choice to consider. "Very important" is perhaps too strong a way to put it -- but I think it's fair to call it "good to know."
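
To make the branching point concrete, here is a hedged CUDA sketch (an illustrative kernel, not actual ziggurat code): when lanes of the same 32-wide warp take different sides of a data-dependent branch, the hardware serializes the two paths and masks off the inactive lanes, which is exactly why rejection-style control flow is so much more expensive on a GPU than on a CPU.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: a data-dependent branch inside a warp forces both
// paths to be executed one after the other, with inactive lanes masked off.
__global__ void divergent(const float* u, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (u[i] < 0.01f)
        out[i] = expf(-u[i]);   // rare, expensive "tail" path
    else
        out[i] = 2.0f * u[i];   // common fast path
}

int main()
{
    const int n = 1 << 20;
    float *u = 0, *out = 0;
    cudaMalloc((void**)&u, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));
    cudaMemset(u, 0, n * sizeof(float));          // placeholder for real uniform draws
    divergent<<<(n + 255) / 256, 256>>>(u, out, n);
    cudaDeviceSynchronize();
    cudaFree(u);
    cudaFree(out);
    return 0;
}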
 
Strictly speaking, it's SIMT -- http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html -- a bit more flexible than SIMD, but it's not exactly a revolutionary shift that "changes everything." And Flynn's taxonomy is still useful and applicable in explaining how much we can achieve (relating SIMT to SIMD,

I like the term SIMT; it is at task level, and it could be at the highest taxonomy level? Special cases are then shared data, replicated data (thread-local storage), data decomposition, and loop parallelism. A special case is SIST?


Does the term MIMT exist? Special cases are then Blackboard, master-worker, pipes and filters, and Producer-Consumer. This is more difficult than SIMT. PPL, TBB and TPL support these (BTW, will Boost have concurrent containers?)
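
For those "MIMT"-style task-level patterns, plain C++11 already gives you the building blocks (a minimal producer-consumer sketch with std::thread; PPL and TBB wrap the same idea in higher-level concurrent queues):

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

int main()
{
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread producer([&] {
        for (int i = 0; i < 10; ++i) {
            { std::lock_guard<std::mutex> lk(m); q.push(i); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });

    std::thread consumer([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || done; });
            if (q.empty() && done) break;
            int v = q.front(); q.pop();
            lk.unlock();
            std::printf("consumed %d\n", v);   // different instruction stream, different data
        }
    });

    producer.join();
    consumer.join();
    return 0;
}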
 
I've been doing mostly HPC work (in oil & gas, nuclear engineering, a little bit of quant finance, etc.) for my living for about 15 years, MPI all the time, OpenMP for about 10 years, and CUDA for about 5 years, and never once throughout that time have I seen any particular use for Flynn's or any other categorization.


Most of the FEM/FDM problems in oil and gas (at least in the past) boil down to solving large sparse matrix systems, and MPI supports these well. In very global terms, I would say that these are SIMT with domain decomposition (with some common surface data). The data is homogeneous and can be decomposed and/or replicated. Maybe oil and gas apps are different now.
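
The domain-decomposition shape of those solvers is easy to sketch (a hypothetical, minimal 1-D halo exchange using the plain MPI C API from C++; a real FEM/FDM code would decompose in 2-D/3-D, use a proper sparse solver, and overlap communication with computation):

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_local = 100;                       // cells owned by this rank
    std::vector<double> u(n_local + 2, rank);      // +2 ghost cells at the ends

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Halo exchange: each rank sends its boundary values to its neighbours
    // and receives their boundary values into its ghost cells.
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[n_local + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[n_local],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // ... local stencil / sparse matrix-vector work on u[1..n_local] here ...

    MPI_Finalize();
    return 0;
}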

Other application areas are not so easy to parallelise, for example a large graphics design file with 10 layers, where the algorithm has to do hidden-line removal. The data must be shared, and we must improve responsiveness while avoiding race conditions.

With heterogeneous data it's not as easy.

I like OpenMP, but it does not scale and load balancing is problematic (e.g. sections are hard-coded). It is very useful for loop parallelism - that's what it was built for. PPL does not have this scalability restriction.
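
Agreed on the loop parallelism point; for completeness, here is a minimal sketch of the kind of loop OpenMP was built for (illustrative arrays; compile with -fopenmp or your compiler's equivalent):

#include <vector>

int main()
{
    const long n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // Classic OpenMP loop parallelism: independent iterations are divided
    // among the team of threads (statically, by default).
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        c[i] = a[i] + 2.0 * b[i];

    return 0;
}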
 
But I would never start from, or use, categorizations in the decision process, as they are just too broad; instead, I'd typically examine the algorithm and the programming models for the architectures under consideration, then do some sort of performance analysis, maybe build prototypes, then of course do a cost analysis, etc. - and then make my decision.

This process represents knowledge acquisition before it has been codified and formalised in a form that others can use. At some stage, patterns and theories will gel.

I suppose that's the way GOF patterns emerged.
 
What is the point of arguing what to learn first? Knowing one programming language can lead to another one.

The key issue is - can you master the programming language that you decide to learn?

I started with C#. I have been using C# for a few years and am still learning. Now I am learning F# and trying to integrate it into the .NET framework in an efficient way. I know C/C++ and Java, but I am not a master... does this matter?

It all depends on what you need and whether you can master the programming language!!! I think the problem with quant wannabes is that they want to skip the essential parts to gain immediate entry. There is no easy pathway to success. At the end of the day, you need to pick one. Remember that someone in this forum started C++ in 1988. My two bits.
 
What is the point of arguing what to learn first?

Because: get the fundamentals right; later is too late.

I use C# a lot, but it is not all that useful if you ever want to learn C++.

I think the problem with quant wannabes is that they want to skip the essential parts to gain immediate entry.

Quant does not have a monopoly in this regard; it's endemic.
 
What is the point of arguing what to learn first?

Because: get the fundamentals right; later is too late.

I use C# a lot, but it is not all that useful if you ever want to learn C++.

I think the problem with quant wannabes is that they want to skip the essential parts to gain immediate entry.

Quant does not have a monopoly in this regard; it's endemic.


First of all, you don't know what I use C# for... then you say it is not useful. Don't you think you are prejudiced about something with no relevant facts to support your claim?

Secondly, C/C++ is the core of most FE stuff but there are companies also using Java and C#.

I think the key point is that it's easier to land a job in FE areas if someone knows how to use C++ in the most efficient way.

With due respect, you mentioned that C# is not that useful - or is it because you have not mastered it yet?

I think we live in the modern internet era. Think about it: if C# had been born in the same year as C++, perhaps history would be different. Because the old school and the current quants are mostly using C++, they are resistant to replacing it.
 
Besides all the previous posts: if you want to be part of HFT, there is no option, it must be C/C++.


Memory is getting cheaper and cheaper... and with the advancement of computer storage architecture, the difference between using C++ vs C# vs Java or even Python is minimal. The difference is about willingness to "change" and the cost of replacing legacy code because it is written in C++.

Today, if the market discovers that using programming language "X" in HFT will make more money, trust me, it will change.

Given that everyone is rushing into the FE field, do you guys ever think about your own "time value" of money, since FE is part of the finance world? C++ is great, but the learning curve is steep for someone with no experience or minimal experience in programming. It takes years to use it properly. Why not pick something that is easy to learn and can land you a job first? We can always learn C++ programming in any time frame if we are willing to invest the time.

You can go out on the street, ask the new FE graduates, and find out how many of them are still hunting for jobs??? Why??? They claim to know C++ on their CV, but they don't even know how to use pointers properly. Do you think they can get a job? Taking a C++ programming course just shows that someone is enthusiastic and willing to learn; it takes years to learn it well. The same holds in many other programming language worlds. Many keep saying that Java, C#, Python, etc. are easier to pick up, but can you use them well before you claim they are easy?? Be realistic, folks.
 
First of all, you don't know what I use C# for... then you say it is not useful. Don't you think you are prejudiced about something with no relevant facts to support your claim?

Secondly, C/C++ is the core of most FE stuff but there are companies also using Java and C#.

I think the key point is that it's easier to land a job in FE areas if someone knows how to use C++ in the most efficient way.

With due respect, you mentioned that C# is not that useful - or is it because you have not mastered it yet?

I think we live in the modern internet era. Think about it: if C# had been born in the same year as C++, perhaps history would be different. Because the old school and the current quants are mostly using C++, they are resistant to replacing it.

Maybe my post was not clear; I'm saying C# does not help if you want to learn C++ after having learned C#.
C# is useful, as stated in my post. But it is not the best first language IMO, which after all is the current topic.

And I use C# a lot - see this to 'support my claim' :) Since you mention 'relevant facts', here goes:

http://www.amazon.com/Financial-Markets-Wiley-Finance-Series/dp/0470030089


Think about it: if C# had been born in the same year as C++, perhaps history would be different.
Then C# would have disappeared (MS-DOS era). C++ became popular because it was C with some "PLUSPLUS" stuff.

hth
 
[...] If you want good parallel code, you can wave goodbye to objects. C++ doesn't even have a proper MPI library.

I agree, but only in the OOP way that people have been using C++ to date, i.e. in a sequential programming way. More generally, pure OOP is the wrong paradigm for parallel programming. The data must take centre stage; with OOP it's hidden all over the place!
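
One concrete way to see the "data must take centre stage" point is the array-of-structs versus struct-of-arrays choice (a hedged, illustrative C++ sketch; the names are made up): the OOP-style layout buries each value inside an object, while the data-oriented layout keeps like data contiguous, which is what caches, vectorizers and GPUs want.

#include <cstddef>
#include <vector>

// OOP-style layout: array of structs -- each price sits inside an object.
struct Trade { double price; double qty; /* ...many other members... */ };

// Data-oriented layout: struct of arrays -- all prices are contiguous.
struct TradeBook {
    std::vector<double> price;
    std::vector<double> qty;
};

double notional_aos(const std::vector<Trade>& trades) {
    double sum = 0.0;
    for (const Trade& t : trades) sum += t.price * t.qty;  // strided access
    return sum;
}

double notional_soa(const TradeBook& book) {
    double sum = 0.0;
    for (std::size_t i = 0; i < book.price.size(); ++i)
        sum += book.price[i] * book.qty[i];                // unit stride, vectorizes well
    return sum;
}

int main() {
    std::vector<Trade> trades(4, Trade{100.0, 2.0});
    TradeBook book{std::vector<double>(4, 100.0), std::vector<double>(4, 2.0)};
    return notional_aos(trades) == notional_soa(book) ? 0 : 1;
}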

The fact that MPI is not C++ is neither here nor there. My auto can't fly but it does get me to work.
 