Matrix Inverse Algorithm

I just used LU.X = B and solved for X with forward and backward substitution; I found it faster than Gaussian elimination. It's implemented with the XLW library in C++ so the result can be computed in Excel. Whereas Excel took 0.28 s to invert a 201x201 matrix, my single-threaded Gaussian elimination took 14.97 s and LU (Doolittle, also single-threaded) took 8.4 s. Part of my overhead against Excel is of course the use of XlfOpers and my MyMatrix class to store the matrices in RAM - these are too slow - and I've not yet tried converting to plain std::vector<std::vector<double>>.
 
Oh, and these are not optimized. For a start, the forward/backward substitution used to solve L.L^-1 = I and U.U^-1 = I can be shortened a lot - just work through the algebra and a tidy algorithm falls out. Then A^-1 = U^-1.L^-1 is just a matrix multiply, which can be done with a naive parallel-for or divide-and-conquer.
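For any future Googler who wants to see the shape of this: here's a minimal sketch of LU-based inversion, assuming Doolittle factorization without pivoting and plain nested-vector storage. The function names are mine and this only illustrates the method being described, not the XLW/MyMatrix code from the post.

[CODE]
#include <vector>

typedef std::vector<std::vector<double> > Matrix;   // dense, row-major

// Doolittle LU: A = L*U with unit diagonal on L. No pivoting, for brevity.
void doolittleLU(const Matrix& A, Matrix& L, Matrix& U)
{
    int n = static_cast<int>(A.size());
    L.assign(n, std::vector<double>(n, 0.0));
    U.assign(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {                // row i of U
            double s = 0.0;
            for (int k = 0; k < i; ++k) s += L[i][k] * U[k][j];
            U[i][j] = A[i][j] - s;
        }
        L[i][i] = 1.0;
        for (int j = i + 1; j < n; ++j) {            // column i of L
            double s = 0.0;
            for (int k = 0; k < i; ++k) s += L[j][k] * U[k][i];
            L[j][i] = (A[j][i] - s) / U[i][i];
        }
    }
}

// Invert A by solving L*U*x = e_c for each unit vector e_c (column c of I).
Matrix invertViaLU(const Matrix& A)
{
    int n = static_cast<int>(A.size());
    Matrix L, U, Inv(n, std::vector<double>(n, 0.0));
    doolittleLU(A, L, U);
    std::vector<double> y(n), x(n);
    for (int c = 0; c < n; ++c) {
        for (int i = 0; i < n; ++i) {                // forward: L*y = e_c
            double s = (i == c) ? 1.0 : 0.0;
            for (int k = 0; k < i; ++k) s -= L[i][k] * y[k];
            y[i] = s;                                // L has unit diagonal
        }
        for (int i = n - 1; i >= 0; --i) {           // backward: U*x = y
            double s = y[i];
            for (int k = i + 1; k < n; ++k) s -= U[i][k] * x[k];
            x[i] = s / U[i][i];
        }
        for (int i = 0; i < n; ++i) Inv[i][c] = x[i];
    }
    return Inv;
}
[/CODE]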
 
OK, I know it's an old thread, but I'm putting this up for any future Googler's benefit.

Here's my system:
Core i5-3470 @ 3.2 GHz
4 GB RAM
32 bit operating system

My code was compiled for use with Excel (XLW) - using VC++ 10 Express, with compiler optimization flags /O2 and /Ot (Maximize speed and favor fast code).

For matrix inversion I've tried a few methods; here are their times for a 300x300 matrix:
1. Gaussian elimination (about the same time as Gauss-Jordan, so I'll lump them together): 22.5 s
2. SVD - slow, but good for getting the pseudoinverse: 63.2424 s
3. A^-1 = U^-1.L^-1 - a bit slower than Gaussian elimination: 29.02734 s. I have optimized the triangular matrix inversion to run via forward substitution (see the sketch after this post); Doolittle was the chosen LU algorithm, Crout was slightly slower, and Cholesky fails unless the input is positive definite (it would really be the fastest whenever the decomposition exists).
4. L.U.(A^-1) = I: 12.64453 s --- the winner of my algorithm trials - Doolittle LU, then forward and backward substitution.

But here's the disheartening part: Excel took only 0.7421875 s! HOW DO THEY DO THIS?!

Any ideas? I want to do matrix inversion in under a second as well!
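Regarding method 3, here is roughly what the triangular inversion looks like - a sketch under the same assumptions as the earlier snippet (unit diagonal on L from Doolittle, nested-vector storage), not the poster's actual code. Each column of L^-1 comes from one forward substitution confined to the lower triangle; the analogous routine for U^-1 runs backwards and divides by the diagonal.

[CODE]
#include <vector>

typedef std::vector<std::vector<double> > Matrix;   // same typedef as the earlier sketch

// Invert a unit lower-triangular matrix; columns are independent of each other.
Matrix invertUnitLower(const Matrix& L)
{
    int n = static_cast<int>(L.size());
    Matrix Inv(n, std::vector<double>(n, 0.0));
    for (int c = 0; c < n; ++c) {
        Inv[c][c] = 1.0;                             // diagonal of L is 1
        for (int i = c + 1; i < n; ++i) {            // forward substitution below the diagonal
            double s = 0.0;
            for (int k = c; k < i; ++k) s -= L[i][k] * Inv[k][c];
            Inv[i][c] = s;
        }
    }
    return Inv;
}
[/CODE]

After both triangular inverses are formed, A^-1 = U^-1 * L^-1 still needs a full matrix multiply, which is why method 4 (solving directly for the columns of A^-1) comes out ahead.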
 

Did you vectorize your code?
 
I am using <ppl.h> for parallel processing, but I monitored Excel's thread count and Excel only used a single thread for its inversion. So it's really a question of algorithm choice.

Also, my parallel invokes made the code run slower due to overhead, so I reckon a 300x300 matrix isn't large enough for parallelization to pay off.
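For anyone following along, this is the flavour of PPL usage being discussed - an illustrative sketch with made-up tasks, not the code that was timed. concurrency::parallel_invoke runs a fixed set of independent jobs concurrently; for small inputs the scheduling overhead can easily outweigh the work.

[CODE]
#include <ppl.h>
#include <vector>
#include <cstddef>

// Two independent jobs run concurrently with PPL (illustrative tasks only).
void scaleAndShift(std::vector<double>& a, std::vector<double>& b)
{
    concurrency::parallel_invoke(
        [&] { for (std::size_t i = 0; i < a.size(); ++i) a[i] *= 2.0; },
        [&] { for (std::size_t i = 0; i < b.size(); ++i) b[i] += 1.0; }
    );
}
[/CODE]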
 
About parallelization...

In fact, multiplying a 300x300 matrix by another of the same size, Excel's single-threaded matrix multiply took 0.47 s versus 0.71 s for my parallel_for version on 4 threads. Again, Excel used only 1 thread while my parallelized code used 4; my single-threaded version took around 1.68 s.

So vectorization (which I've been equating with parallelization) doesn't seem to help with the matrix inversion problem... someone please help before I pull all my hair out! ^_^ Especially since Excel apparently isn't even using it. It could be memory-access overhead, but I don't see how that alone should slow things down by this many times.

Does anyone know of an algorithm that is simply far faster than L.U.(A^-1) = I? I reiterate that I have optimized the forward and backward substitution to skip the known zeros and work only within the relevant triangles. Parallelizing over columns (the only independent dimension) only increased the time due to overhead, which shocked me too.
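For reference, the parallel_for matrix multiply mentioned above looks roughly like this - a sketch assuming nested-vector storage, not the exact code that was timed. Rows of the result are independent, so the outer loop is the natural place to split; at 300x300 each task is small, which is why thread overhead eats much of the gain.

[CODE]
#include <ppl.h>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// C = A * B, parallelized over rows of C with PPL. The i-k-j loop order streams
// through row k of B, which is friendlier to the cache than the naive i-j-k order.
Matrix parallelMultiply(const Matrix& A, const Matrix& B)
{
    int n = static_cast<int>(A.size());
    Matrix C(n, std::vector<double>(n, 0.0));
    concurrency::parallel_for(0, n, [&](int i) {
        for (int k = 0; k < n; ++k) {
            double aik = A[i][k];
            for (int j = 0; j < n; ++j)
                C[i][j] += aik * B[k][j];
        }
    });
    return C;
}
[/CODE]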
 

You're confusing multithreading with vectorization. Excel's single thread is undoubtedly using vectorized operations.
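To make the distinction concrete: multithreading spreads work across cores, while vectorization makes a single thread do several arithmetic operations per instruction using SIMD registers. As an illustration (my own sketch, not a claim about what Excel does internally), here is the elimination/substitution inner loop y[j] -= f * x[j] written with SSE2 intrinsics so each instruction handles two doubles:

[CODE]
#include <emmintrin.h>   // SSE2 intrinsics

// y[j] -= f * x[j] for j in [0, n): two doubles per instruction, plus a scalar tail.
void axpySubSSE2(double* y, const double* x, double f, int n)
{
    __m128d vf = _mm_set1_pd(f);                     // broadcast f into both lanes
    int j = 0;
    for (; j + 1 < n; j += 2) {
        __m128d vy = _mm_loadu_pd(y + j);
        __m128d vx = _mm_loadu_pd(x + j);
        _mm_storeu_pd(y + j, _mm_sub_pd(vy, _mm_mul_pd(vx, vf)));
    }
    for (; j < n; ++j)                               // leftover element when n is odd
        y[j] -= f * x[j];
}
[/CODE]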
 
Do you know of any examples, or which library I could use, to vectorize my code?
 
OK, I see what you mean... but unfortunately there's no way to auto-vectorize my code with my current Visual Studio Express version - I'll have to upgrade to 2012 :-(

http://stackoverflow.com/questions/11424075/sse2-visual-studio-2010-and-debug-build
http://msdn.microsoft.com/en-us/library/hh872235(v=vs.110).aspx

OK... I'm going to try VS Express 2012 and let the compiler auto-vectorize for me (because I know next to nothing about vectorization, SSE2 instructions and the like, and I'm really not patient enough to learn assembly and work with vector registers by hand). I'll recompile and post the times.
 
Thanks ExSan, I've been a great fan of that book for more than a decade now :) But specifics, please! Which algorithm? Implemented in which way? Parallelizable? How do I write a vectorized version of the code? Back in the Numerical Recipes days, neither parallelization nor vectorization nor GPUs were really explored, or even options for the most part - unless newer editions of the book now cover these things...
 
OK, I reviewed Numerical Recipes - correct me if I'm wrong - are you perhaps suggesting Strassen's algorithm?

But on p. 104 it mentions at the end that LU decomposition is still the way to go... to exploit Strassen's O(N^2.8), N has to be pretty large, not to mention the memory required and the practical overhead of address accesses. I think C S is on the right track here: vectorization is the way to go... except I don't know assembly, so how do I write code that can exploit the vector registers?

Side note - I had implemented Strassen for matrix multiplication before, but for the 300x300 case it took an extremely long time, partly because of my possibly poor implementation, but also because of the zero-padding required: a 300x300 input forced me to work on a 512x512 matrix for Strassen, which means an extra 172,144 zeros... yikes!
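The padding figure checks out: the plain power-of-two Strassen scheme rounds 300 up to 512, and 512^2 - 300^2 = 262,144 - 90,000 = 172,144 extra zeros. A trivial check (the helper name is made up):

[CODE]
#include <cstdio>

// Next power of two >= n, as used for naive Strassen padding.
int nextPow2(int n) { int p = 1; while (p < n) p <<= 1; return p; }

int main()
{
    int n = 300, p = nextPow2(n);                        // p == 512
    std::printf("%d extra zeros\n", p * p - n * n);      // prints 172144
    return 0;
}
[/CODE]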
 

OK - my results after recompiling with VS Express 2012, with the /Qpar auto-parallelizer flag enabled (the auto-vectorizer itself runs by default at /O2): absolutely no improvement in speed... :-( I give up...
 

AHA! I knocked the time down to 2 s for 300x300 using L.U.(A^-1) = I - without vectorization - simply by optimizing how the matrix classes are called and using parallel_for in well-chosen locations. There's surely room for more optimization: I'm still running at only about a third of Excel's speed. :-S

A^-1 = U^-1.L^-1 took 2.5 s, with parallel_for used to invert U and L simultaneously and for the matrix multiply. So L.U.(A^-1) = I clearly wins algorithmically.

I'm sure I could vectorize it by hand-editing the generated assembly to use the vector registers... but I shudder at the thought.
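For anyone curious what "optimizing the calling of the matrix classes" can amount to, one common change is to keep the hot loops on contiguous storage and raw row pointers rather than going through per-element accessors. A sketch of that idea (my own illustration, not the actual MyMatrix changes):

[CODE]
#include <vector>
#include <cstddef>

// Contiguous row-major storage: one allocation, cache-friendly, and the inner
// loops work on raw pointers instead of repeated element-accessor calls.
struct DenseMatrix {
    int n;
    std::vector<double> a;                           // n*n doubles
    explicit DenseMatrix(int size)
        : n(size), a(static_cast<std::size_t>(size) * size, 0.0) {}
    double*       row(int i)       { return &a[static_cast<std::size_t>(i) * n]; }
    const double* row(int i) const { return &a[static_cast<std::size_t>(i) * n]; }
};

// Example hot loop: subtract f * (pivot row) from row i via cached row pointers.
inline void eliminateRow(DenseMatrix& m, int pivot, int i, double f)
{
    double* ri = m.row(i);
    const double* rp = m.row(pivot);
    for (int j = pivot; j < m.n; ++j)
        ri[j] -= f * rp[j];
}
[/CODE]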
 
Sorry, please explain what you mean by profiling? And yes, I know about Eigen etc., but I'm building my own library because I prefer to have full control, and I'm testing out different algorithms to satisfy curiosity - this isn't production work.
 
Meaning: @Gene Boo , how did you profile your code?

Ah, OK - I just looked up profiling on Wikipedia. I didn't do it; I only analysed my code by eyeballing it to see where I could apply parallel invokes, then trial and error, timing the results in Excel. How do I profile in VS Express 2010?

Sorry, I'm not that IT-savvy... I'm trying to learn - so thanks, guys, for all your input! Happy 2014!
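On profiling: if your Express edition doesn't include the Visual Studio profiler, a common fallback is a poor man's scoped timer around the suspect regions - a sketch, not anyone's code from this thread. Resolution is coarse, but it's enough to compare algorithm variants.

[CODE]
#include <cstdio>
#include <ctime>

// Prints the elapsed seconds for the enclosing scope when it is destroyed.
struct ScopedTimer {
    const char*  label;
    std::clock_t start;
    explicit ScopedTimer(const char* l) : label(l), start(std::clock()) {}
    ~ScopedTimer() {
        double s = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::printf("%s: %.3f s\n", label, s);
    }
};

// Usage (hypothetical function names):
//   { ScopedTimer t("Doolittle LU");       doolittleLU(A, L, U); }
//   { ScopedTimer t("forward/backward");   Inv = invertViaLU(A); }
[/CODE]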
 
Here are the NEW TIMES!

Here's my system:
Core i5-3470 @ 3.2 GHz
4 GB RAM
32 bit operating system

My code was compiled for use with Excel (XLW) - using VC++ 10 Express, with compiler optimization flags /O2 and /Ot (Maximize speed and favor fast code).

For matrix inversion, here are the updated times for the same methods (300x300):
1. Gaussian elimination (about the same time as Gauss-Jordan, so I'll lump them together): 8 s
2. SVD - still slow, but good for getting the pseudoinverse: 15 s
3. A^-1 = U^-1.L^-1: 2.5 s. The triangular matrix inversion runs via forward substitution; Doolittle was the chosen LU algorithm, Crout was slightly slower, and Cholesky fails unless the input is positive definite (it would really be the fastest whenever the decomposition exists).
4. L.U.(A^-1) = I: 2 s in a debug build, 1 s in a release build --- still the winner of my algorithm trials - Doolittle LU, then forward and backward substitution.

But here's the still-disheartening part: Excel took only 0.7421875 s! HOW DO THEY DO THIS?!

Any ideas? I want to do matrix inversion in under a second as well! (I STILL DO!)
 