Matrix Inverse Algorithm

I just used LU.X = B and solved for X with forward and backward substitution; I found it faster than Gaussian elimination. It's implemented with the XLW library in C++ so the result can be computed in Excel. Whereas Excel took 0.28 s to invert a 201x201 matrix, my single-threaded Gaussian elimination took 14.97 s and LU (Doolittle, also single-threaded) took 8.4 s. Part of my overhead against Excel is of course the use of XlfOpers and my MyMatrix class to store the matrices in RAM - these are too slow - and I've not yet tried converting to plain std::vector<std::vector<double>>.
 
Oh, and these are not optimized. For a start, the forward/backward substitution used to solve L.L^-1 = I and U.U^-1 = I can be shortened a lot - just work through the algebra and a tidy algorithm falls out. Then A^-1 = U^-1.L^-1 is just a matrix multiply, which can be done with a naive parallel-for or divide-and-conquer.
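For any future Googler who wants to see the shape of this: here's a minimal sketch of LU-based inversion, assuming Doolittle factorization without pivoting and plain nested-vector storage. The function names are mine and this only illustrates the method being described, not the XLW/MyMatrix code from the post.

[CODE]
#include <vector>

typedef std::vector<std::vector<double> > Matrix;   // dense, row-major

// Doolittle LU: A = L*U with unit diagonal on L. No pivoting, for brevity.
void doolittleLU(const Matrix& A, Matrix& L, Matrix& U)
{
    int n = static_cast<int>(A.size());
    L.assign(n, std::vector<double>(n, 0.0));
    U.assign(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {                // row i of U
            double s = 0.0;
            for (int k = 0; k < i; ++k) s += L[i][k] * U[k][j];
            U[i][j] = A[i][j] - s;
        }
        L[i][i] = 1.0;
        for (int j = i + 1; j < n; ++j) {            // column i of L
            double s = 0.0;
            for (int k = 0; k < i; ++k) s += L[j][k] * U[k][i];
            L[j][i] = (A[j][i] - s) / U[i][i];
        }
    }
}

// Invert A by solving L*U*x = e_c for each unit vector e_c (column c of I).
Matrix invertViaLU(const Matrix& A)
{
    int n = static_cast<int>(A.size());
    Matrix L, U, Inv(n, std::vector<double>(n, 0.0));
    doolittleLU(A, L, U);
    std::vector<double> y(n), x(n);
    for (int c = 0; c < n; ++c) {
        for (int i = 0; i < n; ++i) {                // forward: L*y = e_c
            double s = (i == c) ? 1.0 : 0.0;
            for (int k = 0; k < i; ++k) s -= L[i][k] * y[k];
            y[i] = s;                                // L has unit diagonal
        }
        for (int i = n - 1; i >= 0; --i) {           // backward: U*x = y
            double s = y[i];
            for (int k = i + 1; k < n; ++k) s -= U[i][k] * x[k];
            x[i] = s / U[i][i];
        }
        for (int i = 0; i < n; ++i) Inv[i][c] = x[i];
    }
    return Inv;
}
[/CODE]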
 
OK, I know it's an old thread, but I'm putting this up for any future Googler's benefit.

Here's my system:
Core i5-3470 @ 3.2 GHz
4 GB RAM
32 bit operating system

My code was compiled for use with Excel (XLW) - using VC++ 10 Express, with compiler optimization flags /O2 and /Ot (Maximize speed and favor fast code).

For matrix inversion I've tried a few methods; here are their times for a 300x300 matrix:
1. Gaussian elimination (about the same time as Gauss-Jordan, so I'll lump them together): 22.5 s
2. SVD - slow, but good for getting the pseudoinverse: 63.2424 s
3. A^-1 = U^-1.L^-1 - a bit slower than Gaussian elimination: 29.02734 s. I have optimized the triangular matrix inversion to run via forward substitution (see the sketch after this post); Doolittle was the chosen LU algorithm, Crout was slightly slower, and Cholesky fails unless the input is positive definite (it would really be the fastest whenever the decomposition exists).
4. L.U.(A^-1) = I: 12.64453 s --- the winner of my algorithm trials - Doolittle LU, then forward and backward substitution.

But here's the disheartening part: Excel took only 0.7421875 s! HOW DO THEY DO THIS?!

Any ideas? I want to do matrix inversion in under a second as well!
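Regarding method 3, here is roughly what the triangular inversion looks like - a sketch under the same assumptions as the earlier snippet (unit diagonal on L from Doolittle, nested-vector storage), not the poster's actual code. Each column of L^-1 comes from one forward substitution confined to the lower triangle; the analogous routine for U^-1 runs backwards and divides by the diagonal.

[CODE]
#include <vector>

typedef std::vector<std::vector<double> > Matrix;   // same typedef as the earlier sketch

// Invert a unit lower-triangular matrix; columns are independent of each other.
Matrix invertUnitLower(const Matrix& L)
{
    int n = static_cast<int>(L.size());
    Matrix Inv(n, std::vector<double>(n, 0.0));
    for (int c = 0; c < n; ++c) {
        Inv[c][c] = 1.0;                             // diagonal of L is 1
        for (int i = c + 1; i < n; ++i) {            // forward substitution below the diagonal
            double s = 0.0;
            for (int k = c; k < i; ++k) s -= L[i][k] * Inv[k][c];
            Inv[i][c] = s;
        }
    }
    return Inv;
}
[/CODE]

After both triangular inverses are formed, A^-1 = U^-1 * L^-1 still needs a full matrix multiply, which is why method 4 (solving directly for the columns of A^-1) comes out ahead.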
 

Did you vectorize your code?
 
I am using <ppl.h> for parallel processing, but I monitored Excel's thread count and Excel only used a single thread for its inversion. So it's really a question of algorithm choice.

Also, my parallel invokes made the code run slower due to overhead, so I reckon a 300x300 matrix isn't large enough for parallelization to pay off.
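For anyone following along, this is the flavour of PPL usage being discussed - an illustrative sketch with made-up tasks, not the code that was timed. concurrency::parallel_invoke runs a fixed set of independent jobs concurrently; for small inputs the scheduling overhead can easily outweigh the work.

[CODE]
#include <ppl.h>
#include <vector>
#include <cstddef>

// Two independent jobs run concurrently with PPL (illustrative tasks only).
void scaleAndShift(std::vector<double>& a, std::vector<double>& b)
{
    concurrency::parallel_invoke(
        [&] { for (std::size_t i = 0; i < a.size(); ++i) a[i] *= 2.0; },
        [&] { for (std::size_t i = 0; i < b.size(); ++i) b[i] += 1.0; }
    );
}
[/CODE]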
 
About parallelization...

In fact, multiplying a 300x300 matrix by another of the same size, Excel's single-threaded matrix multiply took 0.47 s versus 0.71 s for my parallel_for version on 4 threads. Again, Excel used only 1 thread while my parallelized code used 4; my single-threaded version took around 1.68 s.

So vectorization (which I've been equating with parallelization) doesn't seem to help with the matrix inversion problem... someone please help before I pull all my hair out! ^_^ Especially since Excel apparently isn't even using it. It could be memory-access overhead, but I don't see how that alone should slow things down by this many times.

Does anyone know of an algorithm that is simply far faster than L.U.(A^-1) = I? I reiterate that I have optimized the forward and backward substitution to skip the known zeros and work only within the relevant triangles. Parallelizing over columns (the only independent dimension) only increased the time due to overhead, which shocked me too.
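For reference, the parallel_for matrix multiply mentioned above looks roughly like this - a sketch assuming nested-vector storage, not the exact code that was timed. Rows of the result are independent, so the outer loop is the natural place to split; at 300x300 each task is small, which is why thread overhead eats much of the gain.

[CODE]
#include <ppl.h>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// C = A * B, parallelized over rows of C with PPL. The i-k-j loop order streams
// through row k of B, which is friendlier to the cache than the naive i-j-k order.
Matrix parallelMultiply(const Matrix& A, const Matrix& B)
{
    int n = static_cast<int>(A.size());
    Matrix C(n, std::vector<double>(n, 0.0));
    concurrency::parallel_for(0, n, [&](int i) {
        for (int k = 0; k < n; ++k) {
            double aik = A[i][k];
            for (int j = 0; j < n; ++j)
                C[i][j] += aik * B[k][j];
        }
    });
    return C;
}
[/CODE]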
 

You're confusing multithreading with vectorization. Excel's single thread is undoubtedly using vectorized operations.
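To make the distinction concrete: multithreading spreads work across cores, while vectorization makes a single thread do several arithmetic operations per instruction using SIMD registers. As an illustration (my own sketch, not a claim about what Excel does internally), here is the elimination/substitution inner loop y[j] -= f * x[j] written with SSE2 intrinsics so each instruction handles two doubles:

[CODE]
#include <emmintrin.h>   // SSE2 intrinsics

// y[j] -= f * x[j] for j in [0, n): two doubles per instruction, plus a scalar tail.
void axpySubSSE2(double* y, const double* x, double f, int n)
{
    __m128d vf = _mm_set1_pd(f);                     // broadcast f into both lanes
    int j = 0;
    for (; j + 1 < n; j += 2) {
        __m128d vy = _mm_loadu_pd(y + j);
        __m128d vx = _mm_loadu_pd(x + j);
        _mm_storeu_pd(y + j, _mm_sub_pd(vy, _mm_mul_pd(vx, vf)));
    }
    for (; j < n; ++j)                               // leftover element when n is odd
        y[j] -= f * x[j];
}
[/CODE]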
 
Do you know of any examples, or which library I could use, to vectorize my code?
 
OK, I see what you mean... but unfortunately there's no way to auto-vectorize my code with my current Visual Studio Express version - I'll have to upgrade to 2012 :-(

http://stackoverflow.com/questions/11424075/sse2-visual-studio-2010-and-debug-build
http://msdn.microsoft.com/en-us/library/hh872235(v=vs.110).aspx

OK... I'm going to try VS Express 2012 and let the compiler auto-vectorize for me (because I know next to nothing about vectorization, SSE2 instructions and the like, and I'm really not patient enough to learn assembly and work with vector registers by hand). I'll recompile and post the times.
 
Thanks ExSan, I've been a great fan of that book for more than a decade now :) But specifics, please! Which algorithm? Implemented in which way? Parallelizable? How do I write a vectorized version of the code? Back in the Numerical Recipes days, neither parallelization nor vectorization nor GPUs were really explored, or even options for the most part - unless newer editions of the book now cover these things...
 
OK, I reviewed Numerical Recipes - correct me if I'm wrong - are you perhaps suggesting Strassen's algorithm?

But on p. 104 it mentions at the end that LU decomposition is still the way to go... to exploit Strassen's O(N^2.8), N has to be pretty large, not to mention the memory required and the practical overhead of address accesses. I think C S is on the right track here: vectorization is the way to go... except I don't know assembly, so how do I write code that can exploit the vector registers?

Side note - I had implemented Strassen for matrix multiplication before, but for the 300x300 case it took an extremely long time, partly because of my possibly poor implementation, but also because of the zero-padding required: a 300x300 input forced me to work on a 512x512 matrix for Strassen, which means an extra 172,144 zeros... yikes!
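The padding figure checks out: the plain power-of-two Strassen scheme rounds 300 up to 512, and 512^2 - 300^2 = 262,144 - 90,000 = 172,144 extra zeros. A trivial check (the helper name is made up):

[CODE]
#include <cstdio>

// Next power of two >= n, as used for naive Strassen padding.
int nextPow2(int n) { int p = 1; while (p < n) p <<= 1; return p; }

int main()
{
    int n = 300, p = nextPow2(n);                        // p == 512
    std::printf("%d extra zeros\n", p * p - n * n);      // prints 172144
    return 0;
}
[/CODE]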
 

OK - my results after recompiling with VS Express 2012, with the /Qpar auto-parallelizer flag enabled (the auto-vectorizer itself runs by default at /O2): absolutely no improvement in speed... :-( I give up...
 

AHA! I knocked the time down to 2 s for 300x300 using L.U.(A^-1) = I - without vectorization - simply by optimizing how the matrix classes are called and using parallel_for in well-chosen locations. There's surely room for more optimization: I'm still running at only about a third of Excel's speed. :-S

A^-1 = U^-1.L^-1 took 2.5 s, with parallel_for used to invert U and L simultaneously and for the matrix multiply. So L.U.(A^-1) = I clearly wins algorithmically.

I'm sure I could vectorize it by hand-editing the generated assembly to use the vector registers... but I shudder at the thought.
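For anyone curious what "optimizing the calling of the matrix classes" can amount to, one common change is to keep the hot loops on contiguous storage and raw row pointers rather than going through per-element accessors. A sketch of that idea (my own illustration, not the actual MyMatrix changes):

[CODE]
#include <vector>
#include <cstddef>

// Contiguous row-major storage: one allocation, cache-friendly, and the inner
// loops work on raw pointers instead of repeated element-accessor calls.
struct DenseMatrix {
    int n;
    std::vector<double> a;                           // n*n doubles
    explicit DenseMatrix(int size)
        : n(size), a(static_cast<std::size_t>(size) * size, 0.0) {}
    double*       row(int i)       { return &a[static_cast<std::size_t>(i) * n]; }
    const double* row(int i) const { return &a[static_cast<std::size_t>(i) * n]; }
};

// Example hot loop: subtract f * (pivot row) from row i via cached row pointers.
inline void eliminateRow(DenseMatrix& m, int pivot, int i, double f)
{
    double* ri = m.row(i);
    const double* rp = m.row(pivot);
    for (int j = pivot; j < m.n; ++j)
        ri[j] -= f * rp[j];
}
[/CODE]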
 
Sorry, please explain what you mean by profiling? And yes, I know about Eigen etc., but I'm building my own library because I prefer to have full control, and I'm testing out different algorithms to satisfy curiosity - this isn't production work.
 
Meaning: @Gene Boo , how did you profile your code?

Ah, OK - I just looked up profiling on Wikipedia. I didn't do it; I only analysed my code by eyeballing it to see where I could apply parallel invokes, then trial and error, timing the results in Excel. How do I profile in VS Express 2010?

Sorry, I'm not that IT-savvy... I'm trying to learn - so thanks, guys, for all your input! Happy 2014!
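On profiling: if your Express edition doesn't include the Visual Studio profiler, a common fallback is a poor man's scoped timer around the suspect regions - a sketch, not anyone's code from this thread. Resolution is coarse, but it's enough to compare algorithm variants.

[CODE]
#include <cstdio>
#include <ctime>

// Prints the elapsed seconds for the enclosing scope when it is destroyed.
struct ScopedTimer {
    const char*  label;
    std::clock_t start;
    explicit ScopedTimer(const char* l) : label(l), start(std::clock()) {}
    ~ScopedTimer() {
        double s = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::printf("%s: %.3f s\n", label, s);
    }
};

// Usage (hypothetical function names):
//   { ScopedTimer t("Doolittle LU");       doolittleLU(A, L, U); }
//   { ScopedTimer t("forward/backward");   Inv = invertViaLU(A); }
[/CODE]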
 
Here are the NEW TIMES!

Here's my system:
Core i5-3470 @ 3.2 GHz
4 GB RAM
32 bit operating system

My code was compiled for use with Excel (XLW) - using VC++ 10 Express, with compiler optimization flags /O2 and /Ot (Maximize speed and favor fast code).

For matrix inversion, here are the updated times for the same methods (300x300):
1. Gaussian elimination (about the same time as Gauss-Jordan, so I'll lump them together): 8 s
2. SVD - still slow, but good for getting the pseudoinverse: 15 s
3. A^-1 = U^-1.L^-1: 2.5 s. The triangular matrix inversion runs via forward substitution; Doolittle was the chosen LU algorithm, Crout was slightly slower, and Cholesky fails unless the input is positive definite (it would really be the fastest whenever the decomposition exists).
4. L.U.(A^-1) = I: 2 s in a debug build, 1 s in a release build --- still the winner of my algorithm trials - Doolittle LU, then forward and backward substitution.

But here's the still-disheartening part: Excel took only 0.7421875 s! HOW DO THEY DO THIS?!

Any ideas? I want to do matrix inversion in under a second as well! (I STILL DO!)
 