Matrix Multiplication: C++ Multithreading Without Thread Synchronization (With Source Code)

In my previous blog, I mentioned that “we may be better off writing multithreaded programs for our problems without explicitly using these (thread synchronization) mechanisms, if and whenever possible.” There we saw C++ multithreaded code in action using a Boost.Asio strand. In this blog, I continue that topic and show you another working C++ multithreaded program: matrix multiplication without thread synchronization.

Here is the list of subtitles:
1. Multithreading Without Thread Synchronization: What Kind of Problems Can We Solve With This Approach?

2. A C++ Multithreaded Program for Matrix Multiplication Without Thread Synchronization

3. Thread Function: multiply() in the Class

4. What Is Going On Under the Hood?

5. Play With Matrices of Different Dimensions (SIZE)

For the details and its source code, please read my blog at:
http://zhaowuluo.wordpress.com/2011/01/02/cplusplus_threading_matrix_multiplication/


Happy reading!
 
I am just wondering why you use threads with compile-time ('tiny') matrices. All the data is on the stack, so the matrix size will be quite small. What is the speed-up compared to serial execution time? There is a chance that the parallel solution will be slower.

Normally, matrices are created on the heap and that's when all the L1, L2 caching problems kick in.

Comparing libraries: in OpenMP one can parallelize matrix multiplication by inserting one line of code into the serial source program.
 
in OpenMP one can parallelize matrix multiplication by inserting one line of code into the serial source program.

and that, my friends, is the beauty of OpenMP. :)
 
Just to ask: "What are the maximum dimensions of a matrix you can allocate?"

I have benchmarked matrix multiplication for "big" matrices; unfortunately my numbers are not so good.
C(n,n) = A(n,n) * B(n,n)

My machine: Pentium Dual Core 2.7 GHz, 3 GB RAM.
The attached zip file contains the executable exsan_matrix_multiply.


Code:
                (n)         time (s)
         <-------------------------->
     1:          100            0
     2:          200            1
     3:          300            4
     4:          400           10
     5:          500           19
     6:          600           34
     7:          700           54
     8:          800           81
     9:          900          116
    10:         1000          160
    11:         1100          214
    12:         1200          281
    13:         1300          364
    14:         1400          464
    15:         1500          598
    16:         1600          741
    17:         1700          920
    18:         1800         1119
    19:         1900         1363
    20:         2000         1658
    21:         2100         2000
    22:         2200         2393
    23:         2300         2852
    24:         2400         3386
    25:         2500         3986
    26:         2600         4669
    27:         2700         5460
    28:         2800         6350
    29:         2900         7250
    30:         3000         8449
             >---------------------<
 


Yes, C++ is my choice.
Have you benchmarked your matrix multiplication algorithm?
What are the biggest 3 matrices you can allocate?
My machine's 3 GB of RAM allows me to do C(4000, 4000) = A(4000, 4000) * B(4000, 4000), with the matrices declared as double.
 
Yes, C++ is my choice.
Have you benchmarked your matrix multiplication algorithm?
What are the biggest 3 matrices you can allocate?
My machine's 3 GB of RAM allows me to do C(4000, 4000) = A(4000, 4000) * B(4000, 4000), with the matrices declared as double.

I am checking it now. And what is the execution time?
 
As in the report attached to my previous post, the execution time for n = 3000 is 8449 seconds, which makes me so happy... :(
I couldn't wait for n = 4000, but I will extrapolate the results later and rework my algorithm to see if I can improve it.
 
As in the report attached to my previous post, the execution time for n = 3000 is 8449 seconds, which makes me so happy... :(
I couldn't wait for n = 4000, but I will extrapolate the results later and rework my algorithm to see if I can improve it.

8449 seconds??? :eek::eek: How could you wait so long?!

I am still waiting for C(2000,2000) = A(2000,2000) * B(2000,2000). For 1000-square matrices it ran pretty fast, a matter of seconds.
I am on a 2.41 GHz, 2 GB RAM, 4800+ core processor machine. I don't even want to hear about 3000 or 4000, since I am very afraid of it on this machine. I'll do it again on my laptop when I get home. ;)
 
http://people.sc.fsu.edu/~jburkardt/presentations/matmul.pdf

In particular, take a look at "Using Pointers to Speed Up Array Access", p. 18.

I am doing it in C#. And see the machine properties above; it is not powerful. I will switch to my laptop later, but for now, when I run the 1000x1000 matrix multiplication algorithm, it takes 1.25 minutes on average. In MATLAB you can multiply 2000x2000 matrices in about 2 seconds, almost regardless of the machine you are running on.
 
For the sake of Christ: use your vendor-supplied BLAS (http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms). MATLAB does it, as does every other piece of numerical software under the sun. Nobody should be writing their own code for matrix multiplication.

Haah...
Math has already been invented, so don't learn math?
Almost all mathematical algorithms have already been implemented in some programming language. So why learn programming at all?! And how are you supposed to learn it while avoiding all the code that has already been written?!
 
As said above: MATLAB does it in 2 seconds because it uses some sort of vendor-supplied BLAS under the hood. So your C++ or C# code could do it too.

None of the previous posts mentions that any of you is doing this for learning purposes. In that case it indeed makes sense to try; still, from your numbers it is clear you're doing it wrong. For a matrix multiplication implementation on an x86-like architecture, one has to care most about using SSE instructions and the cache hierarchy properly. But the architecture is tricky, and getting these things right is a major undertaking (I'm talking about weeks of work here). There are other interesting architectures where the task would be less cumbersome: I remember, a couple of years ago, getting close to the vendor-supplied BLAS matrix multiplication timings on NVIDIA GPUs in two days of work.

There are other possible approaches too. OpenMP was already mentioned above as a quick way to get to fast matrix multiplication code. On the other hand, if you're some kind of C++ wizard, you may wish to try a template-based approach, like the Eigen (http://eigen.tuxfamily.org/) library. I have to admit I didn't believe it until I tried, but this way it is actually possible to get close to optimal timings.

It is far from true that all mathematical algorithms are implemented. But some building blocks are, and in those cases one should stick with them, as this sort of work is very complex and takes a lot of knowledge and experience to get right.
 