Further, I can't really see how "this blog post that you cited is rather old" (2008) is relevant to the fact that the warp size (a warp being a thread of SIMD instructions) has been fixed at 32 since CUDA compute capability 1.0 (2006) and remains fixed at 32 in CUDA compute capability 3.5 (Kepler GK110; 2012). If anything, comments like that look a bit silly and don't make for a very convincing argument ;-)
When I mentioned that Pfister's blog post is rather old, it was meant as a sort of apology on his behalf - I'd guess he has come to understand the CUDA architecture better in the meantime, and that these days he wouldn't make factual errors like the one I pointed out in the excerpt you cited. Maybe I should have been clearer about that, but otherwise I don't get what exactly it is that you found silly in my previous message...
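As an aside, the fixed warp size is easy to confirm from the CUDA runtime itself. Here is a minimal sketch (my own illustration, not anything from Pfister's post) that queries the device properties via cudaGetDeviceProperties; on every device from compute capability 1.0 through Kepler's 3.5 it reports warpSize = 32:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    // Query properties of device 0 and print its compute capability
    // and warp size (32 on all architectures from 1.0 through 3.5).
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("compute capability %d.%d, warpSize = %d\n",
           prop.major, prop.minor, prop.warpSize);
    return 0;
}

Compiled with nvcc, this needs no kernel at all - the warp size is simply a device property reported by the runtime.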
Other than that, I really had no intention to comment more on this topic, mostly because I've said what I had to say, and because I'm no longer sure I understand what your exact point is. Mine was, from the beginning, that putting too much weight on categorizations of parallel architectures (in particular according to Flynn's scheme), which is what Daniel seemed to me to be doing in his comments, is not very helpful or productive. I'm of course drawing on my personal experience for this opinion: I've been doing mostly HPC work (in oil & gas, nuclear engineering, a little bit of quant finance, etc.) for a living for about 15 years - MPI the whole time, OpenMP for about 10 years, and CUDA for about 5 years - and never in all that time have I seen any particular use for Flynn's or any other categorization. In fact, for many of these projects I had no say in the architecture/API to be used, but for some I actually did have to make that kind of decision. Even then, I would never start from, or use, categorizations in the decision process, as they are just too broad; instead, I'd typically examine the algorithm and the programming models for the architectures under consideration, do some sort of performance analysis, maybe build prototypes, then of course do a cost analysis, etc. - and only then make my decision.
I don't know - maybe your experience is different, maybe you've found taking categorizations into account very important in your work? In any case, I certainly wasn't claiming that categorizations have no value at all - for novices in the field, for example, they may have educational value. Still, I think some sort of agreement on how to extend Flynn's categorization would be needed to start with, and I don't see that happening in the field (which probably speaks for itself about the perceived importance of all of this among practitioners).
Finally, I'd also like to stress that I'm certainly not advocating that GPUs are some sort of "best" parallel architecture today. It's just that GPU programming specialists are in high demand and well paid, and admittedly that is the main reason I'm doing GPU programming these days. What matters to me is simply that the programming model is bearable (what matters much more is that the work is interesting and challenging), and I'd happily switch to MIC or any other accelerator or whatever architecture may come along as a better alternative in the future.