Matrix Multiplication: C++ Multithreading Without Thread Synchronization (With Source Code)

Implementing BLAS takes time, so if you are limited in time then you should use already-optimized code. ==>> This is the point you made, and I think we all agree. Good suggestions above. I was considering learning the kind of things that would make algorithms faster. I deal mostly with matrix manipulation and simulation techniques. Thanks, I understood you.
 
Just to quote: "What are the maximum dimensions of a matrix you can allocate?"

I have benchmarked matrix multiplication for "big" matrices; unfortunately my numbers are not so good:
C(n,n) = A(n,n) * B(n,n)

I just noticed your modified code reporting its execution time, and it seems very long. I ran a similar matrix multiplication algorithm yesterday on a much less capable PC, and at every matrix size it was faster. I haven't tried more than 1000, though.
 
I followed the link and got to Armadillo (a C++ interface to BLAS). But it is not supported on Windows 64-bit. Any suggestions?

To be honest, this is the first time I've heard about Armadillo - from the description, it doesn't seem like a C++ wrapper for BLAS to me, and I also really don't know about its performance. So I'd still suggest you go with a BLAS implementation for your machine (either Intel MKL for Intel processors, or AMD ACML for AMD processors) - the C interface for BLAS is admittedly somewhat tricky, but we're talking about calling a single function here.
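Just to illustrate the "calling a single function" point, here's a minimal sketch of the standard CBLAS call (the function and flag names are standard CBLAS; the header name differs per implementation, e.g. <mkl.h> for MKL):

```cpp
#include <cblas.h>  // header name varies by implementation, e.g. <mkl.h> for MKL

// C = A * B for square n-by-n matrices in row-major storage.
void multiply(int n, const double* A, const double* B, double* C)
{
    // dgemm computes C = alpha * op(A) * op(B) + beta * C;
    // here alpha = 1, beta = 0, and neither input is transposed.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                B, n,
                0.0, C, n);
}
```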

But if you really can't live without "Matrix A, B, C; ... C = A * B;" syntax, then I'd strongly recommend trying the Eigen library that I've mentioned above - it's really fast, and it also consists almost entirely of header files, so it's trivial to install and use from any C++ compiler or operating system.
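For illustration, a minimal sketch of what that looks like with Eigen (the size and the Random initializer are arbitrary choices here):

```cpp
#include <Eigen/Dense>
#include <iostream>

int main()
{
    const int n = 1000;  // arbitrary size for the sketch
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd C = A * B;  // dispatches to Eigen's optimized GEMM kernel
    std::cout << C(0, 0) << '\n';  // touch the result so it isn't optimized away
}
```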


Finally: if someone is really determined to do BLAS-like work on their own and push it hard enough to reach optimal levels of performance, then for some motivation when it gets tough I'd suggest reading about a guy named Kazushige Goto (start here: http://en.wikipedia.org/wiki/Kazushige_Goto) - it takes lots of dedication and effort, but amazingly, if you know computer architecture and assembly language well, you could eventually beat the libraries.
 
I'll put in a shameless plug here for my group (NVIDIA CUDA group): use CUBLAS on the GPU!
CUDA 3.2 performance report
The report has some slides comparing the performance of MKL against CUBLAS. If you need fast matrix multiplies, you really want to look into GPUs.
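For anyone curious, here's a rough sketch of the equivalent CUBLAS call using the modern cublas_v2 API (error checking omitted; note that CUBLAS expects column-major storage):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// C = A * B for n-by-n matrices in column-major storage.
void gpu_multiply(int n, const double* hA, const double* hB, double* hC)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    // Same dgemm semantics as the CPU BLAS call: C = alpha*A*B + beta*C.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);

    // The copy back runs after the gemm completes on the default stream.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```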

The BLAS interface was originally a FORTRAN interface, so it's a little off-putting for C++ developers. But once you get used to it, you start to appreciate how it works. My experience has been that if I want to do something simple that's not in BLAS, I'm doing something wrong (e.g. transpose a matrix in place, add two matrices). Conversely, once I understand the best way to compute some numerical thing I find out there's generally a specific BLAS function that does exactly what I want.
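For instance, "add two matrices" has no dedicated BLAS routine, but once you view the matrices as contiguous vectors it is just the BLAS-1 axpy - a sketch of that reading (assuming dense, identically-sized, contiguously stored matrices):

```cpp
#include <cblas.h>

// B += A for dense m-by-n matrices in contiguous storage: viewed as
// length m*n vectors, this is exactly the BLAS-1 axpy, y = a*x + y.
void matrix_add(int m, int n, const double* A, double* B)
{
    cblas_daxpy(m * n, 1.0, A, 1, B, 1);
}
```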

It's also nice because it's a standard interface. You learn it once then you can use it for CUBLAS, MKL, ACML, etc. Sometimes papers even have explicit BLAS notation for numerical algorithms, which makes implementing the algorithms easy.
 
If you want to code matrix multiplication to "learn and study", by all means roll your own. However, for any other purpose, use one of the tried and true libraries out there: MKL (Intel), ACML (AMD), ATLAS, GotoBLAS, Eigen, GSL, Boost uBLAS, etc.

Great resources! I have one question. What prevents us from optimizing these kinds of libraries ourselves? We know how to implement the models mathematically. Do we need to learn from some other source that is focused on making the algorithms faster, more reliable, and so on? Is development time the only concern people cannot afford, or are there some pure programming complexities that stop us?
 
AFAIK, MKL and ACML are hand-optimized by Intel and AMD for the processors they manufacture. You will be hard-pressed to do better. You could, but good luck.

Atlas is also hand optimized, but not to the same level; good luck with that too.

GotoBLAS: follow the link that somebody posted already.

The others have gone through different levels of optimization for the different needs of their authors. You are welcome to try to do better but, IMHO, your energy might be better spent elsewhere.
 
Atlas is also hand optimized, but not to the same level; good luck with that too.

Atlas is not hand optimized. On the contrary, the whole idea behind Atlas is to perform, for each machine it is going to be installed on, an extensive search at compile time through the space of relevant algorithm implementation parameters, in order to find the parameter values that give the fastest calculations on that specific machine. (For optimum performance, Atlas should always be compiled from source - never use a pre-built Atlas installation, as it is probably not tuned for the specific configuration of your machine.) For example, matrix-matrix multiplication is typically implemented through block multiplication, and there the most important implementation parameter is the block size.
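To make that parameter concrete, here is a minimal sketch of block multiplication in plain C++ (the BLOCK constant and the value 64 are illustrative placeholders - the whole point is that Atlas searches for this value rather than fixing it):

```cpp
#include <algorithm>

// The kind of parameter Atlas searches over at build time;
// 64 is just an illustrative placeholder, not a tuned value.
constexpr int BLOCK = 64;

// C += A * B for n-by-n row-major matrices, one cache-sized tile at a time.
// Caller must zero-initialize C.
void blocked_multiply(int n, const double* A, const double* B, double* C)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
                    for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
                        const double a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```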
 
Atlas is not hand optimized. On the contrary, the whole idea behind Atlas is to perform, for each machine it is going to be installed on, an extensive search at compile time through the space of relevant algorithm implementation parameters, in order to find the parameter values that give the fastest calculations on that specific machine. (For optimum performance, Atlas should always be compiled from source - never use a pre-built Atlas installation, as it is probably not tuned for the specific configuration of your machine.) For example, matrix-matrix multiplication is typically implemented through block multiplication, and there the most important implementation parameter is the block size.

IIRC, Atlas BLAS will test different kernels at build time and do some sort of empirical optimization. The kernels are supplied by contributors to the project. I think (I might be wrong) that some of the kernels are hand optimized for a given platform. That was the reason for my comment.
 