Daniel Duffy
C++ author, trainer
- Joined
- 10/4/07
- Messages
- 10,435
- Points
- 648
In this blog we give an overview of the statistical univariate distributions in the Boost Math Toolkit. We discuss the functionality in the Toolkit, some examples of use and applications to computational finance.
The Math Toolkit has many applications and is easy to understand and to apply in applications.
Statistical Distributions in Toolkit
We give a detailed overview of the univariate statistical distributions and functions in the Math Toolkit. The emphasis is on discussing the functionality in the toolkit, in particular:
The discrete probability distributions in Boost are:
Some examples are:
These functions are:
Alternatively, if you are only using a specific distribution you can use its corresponding include file, for example, in the case of the normal distribution we use:
Similarly, if we wish to use the noncentral chi-squared distribution we use the following include file:
We now discuss the normal distribution in detail.
An Example
The normal (or Gaussian) distribution is one of the most important statistical distributions because of its ability to model many kinds of phenomena in diverse areas such as economics, computational finance, physics and the social sciences. In general, the normal distribution is used to describe variables that tend to cluster around a mean value. We now show how to implement the normal distribution in Boost and we show how to call the member and non-member functions associated with it. We present the code without further explanation as it should be clear from the current context:
The reason for including this code is that it is representative for many of the other distributions and examples when you create code. It is a template for other cases. You can copy the code and run it assuming you have a C++ compiler and the Boost library installed on your computer.
The Noncentral Chi-Squared Distribution
This distribution is a generalization of the chi-squared distribution and it has a number of applications in computational finance, for example when modeling the short rate or calculating bond option prices under the Cox-Ingersoll-Ross (CIR) model (Cox 1985).
In this section we shall create matrix lookup tables for the non-central chi-squared distribution. In particular, we reproduce the same results as in the Math Toolkit documentation (see the section “Worked Example, Non Central Chi Squared Example”). The objective is to create a table (two-dimensional associative matrix) of the power of the Chi squared test at the 5% significance level for various values of the degrees of freedom and non-centrality parameters. To this end, we employ the template classes in Duffy 2009 that model associative matrices on the one hand and the software for presenting the results in Excel on the other hand. An associative matrix is a generalization of the standard matrix class that we see in C++ and Matlab in that we access the elements of the matrix using arbitrary data types (for example, string or doubles) rather than integers. In our example the data type is double because we generate a set of increasing values of the two defining parameters of the non-central chi-squared distribution. We summarise the steps that allow us to create the lookup table (see Duffy 2009 for more details and source code):
[tr1][td1]2[/td1][td1]0.225545[/td1][td1]0.415427[/td1][td1]0.58404[/td1][td1]0.717564[/td1][td1]0.815421[/td1][td1]0.883163[/td1][/tr1][tr2][td1]3[/td1][td1]0.192238[/td1][td1]0.358534[/td1][td1]0.518079[/td1][td1]0.654111[/td1][td1]0.761063[/td1][td1]0.840227[/td1][/tr2][tr1][td1]4[/td1][td1]0.171467[/td1][td1]0.320074[/td1][td1]0.470102[/td1][td1]0.604725[/td1][td1]0.715986[/td1][td1]0.802426[/td1][/tr1]
[tr2][td1]5[/td1][td1]0.156993[/td1][td1]0.291756[/td1][td1]0.432876[/td1][td1]0.564449[/td1][td1]0.677439[/td1][td1]0.768598[/td1][/tr2][tr1][td1]6[/td1][td1]0.146212[/td1][td1]0.269796[/td1][td1]0.40283[/td1][td1]0.530652[/td1][td1]0.643848[/td1][td1]0.738025[/td1][/tr1]
[tr2][td1]7[/td1][td1]0.137813[/td1][td1]0.252152[/td1][td1]0.377911[/td1][td1]0.501722[/td1][td1]0.614188[/td1][td1]0.710197[/td1][/tr2][tr1][td1]8[/td1][td1]0.131052[/td1][td1]0.237603[/td1][td1]0.356823[/td1][td1]0.476586[/td1][td1]0.587734[/td1][td1]0.684726[/td1][/tr1][tr2][td1]9[/td1][td1]0.125473[/td1][td1]0.225361[/td1][td1]0.338694[/td1][td1]0.454489[/td1][td1]0.563949[/td1][td1]0.661305[/td1][/tr2][tr1][td1]10[/td1][td1]0.120777[/td1][td1]0.214894[/td1][td1]0.322908[/td1][td1]0.434875[/td1][td1]0.542418[/td1][td1]0.639683[/td1][/tr1][tr2][td1]11[/td1][td1]0.116762[/td1][td1]0.205827[/td1][td1]0.309018[/td1][td1]0.417326[/td1][td1]0.522819[/td1][td1]0.619652[/td1][/tr2]Exceptions
An important attention point when using C++ algorithms and other code is to discover and handle run-time errors. In other words, we wish to support a certain level of fault-tolerance in our code. Some of the possible errors occurring in a typical application are:
There are other ways to handle exceptional situations and customizable policies and these are discussed on www.boost.org.
Applications in Computational Finance
Statistical distributions and their applications are used in computational finance in areas such as stochastic processes, Monte Carlo simulation and jump-diffusion model, to name just a few. The advantage of having the Boost Math Toolkit at our disposal is that we do not have to develop the code for statistical distributions ourselves but we can use a peer-reviewed library. The library can also be integrated into Excel if we use the C++/CLI interoperability language. We can then use the functionality as worksheet function or as COM addins.
References
Cox, J., J. Ingersoll, S. Ross, “A theory of the term structure of interest rates”, Econometrica, 53(2) (1985):385–407
Duffy, Daniel J. and Kienitz, J. 2009 Monte Carlo frameworks, John Wiley and Sons, Chichester, UK
About the Author:
Daniel Duffy is an author and trainer. His company Datasim specializes in methods and techniques for solving problems in quantitative finance. He is the author of Monte Carlo Frameworks: Building Customisable High-performance C++ Applications and Introduction to C++ for Financial Engineers: An Object-Oriented Approach. For more information on the author, see QuantNet's interview with Daniel Duffy
The Math Toolkit has many applications and is easy to understand and to apply in applications.
Statistical Distributions in Toolkit
We give a detailed overview of the univariate statistical distributions and functions in the Math Toolkit. The emphasis is on discussing the functionality in the toolkit, in particular:
- Discrete and continuous distributions, their member functions and defining properties
- Other non-member functions, for example the probability and cumulative density functions, kurtosis and skewness
- Some examples to motivate how to use the classes in the toolkit
The discrete probability distributions in Boost are:
- Bernoulli (a single trial whose outcome is 0 (failure) or 1 (success))
- Binomial (used to obtain the probability of observing k successes in N trials)
- Negative Binomial (used to obtain the probability of k failures and r successes in k + r trials)
- Hypergeometric (describes the number of events k from a sample n drawn from a total population N without replacement)
- Poisson (expresses the probability of a number of events occurring in a fixed period of time)
- Beta (used in Bayesian statistics applications)
- Cauchy-Lorentz (used in physics, spectroscopy and to solve differential equations)
- Chi Squared (used in statistical tests)
- Exponential (models the time between independent events)
- Extreme Value (models rare events)
- F (The Fisher F distribution that tests if two samples have the same variance)
- Gamma (and Erlang) (used to model waiting times)
- Laplace (the distribution of differences between two independent variates with identical exponential distributions)
- Logistic (used in logistic regression and feedforward neural network applications)
- Log Normal (used when the logarithm of the random variable is normally distributed)
- Noncentral Beta (a generalization of the Beta Distribution)
- Noncentral Chi-Squared (a generalization of the Chi Squared Distribution)
- Noncentral F (a generalization of the Fisher F distribution)
- Noncentral T (generalization of Student’s t Distribution)
- Normal (Gaussian) (probably the best known distribution)
- Pareto (compare large and small numbers)
- Rayleigh (combine two orthogonal components having an absolute value)
- Student’s t (the ‘best’ approximate distribution to an unknown distribution)
- Triangular (used when a distribution is only vaguely known, for example I in software projects)
- Weibull (used in failure analysis models)
- Uniform (also known as the rectangular distribution and it models a probability distribution with a constant probability)
Some examples are:
- How to handle results from bad input arguments.
- Some distributions can return real or integer values; we also wish to determine how they should be rounded.
- Determine the working precision when we calculate results.
- How to control accuracy when data types are promoted to more precise types.
- The maximum number of iterations a special function or algorithm should execute before stopping the computation and throwing an exception.
- Calling a function on a distribution for which it is not defined; for example, the mean of the Cauchy distribution is not defined
These functions are:
- cdf (cumulative distribution function)
- cdf complement (this is 1 – cdf)
- hazard (the event rate at time t conditional on survival until time t or later; useful when modeling failure in mechanical systems)
- chf (cumulative hazard function which measures the accumulation of hazard over time)
- kurtosis (a measure of the ‘peakedness’ of a probability distribution)
- kurtosis_excess (has a distribution fatter tails than a normal distribution?)
- mean (the expected value)
- median (the value separating the lower and higher halves of a distribution)
- mode (the point at which the probability mass or density function takes its maximum)
- pdf (probability density function)
- range (the length of the smallest interval which contains all the data)
- quantile (points taken at regular intervals from the cdf)
- skewness (a measure of the asymmetry of a probability distribution)
- support (the smallest closed interval/set whose complement has probability zero)
- variance (how far do values differ from the mean)
Code:
#include <boost/math/distributions.hpp>
Code:
#include <boost/math/distributions/normal.hpp>
Code:
#include <boost/math/distributions/non_central_chi_squared.hpp>
An Example
The normal (or Gaussian) distribution is one of the most important statistical distributions because of its ability to model many kinds of phenomena in diverse areas such as economics, computational finance, physics and the social sciences. In general, the normal distribution is used to describe variables that tend to cluster around a mean value. We now show how to implement the normal distribution in Boost and we show how to call the member and non-member functions associated with it. We present the code without further explanation as it should be clear from the current context:
Code:
normal_distribution<> myNormal(1.0, 10.0);
cout << "Mean: " << myNormal.mean()
<< ", standard deviation: " << myNormal.standard_deviation() << endl;
// Distributional properties
double x = 10.25;
cout << "pdf: " << pdf(myNormal, x) << endl;
cout << "cdf: " << cdf(myNormal, x) << endl;
// Choose another data type and now a N(0,1) variate
normal_distribution<float> myNormal2;
cout << "Mean: " << myNormal2.mean() << ", standard deviation: " << myNormal2.standard_deviation() << endl;
cout << "pdf: " << pdf(myNormal2, x) << endl;
cout << "cdf: " << cdf(myNormal2, x) << endl;
cout << "cdf complement:" << cdf(complement(myNormal2, x));
// Choose precision
cout.precision(10); // Number of values behind the comma
// Other properties
cout << "n***normal distribution: n";
cout << "mean: " << mean(myNormal) << endl;
cout << "variance: " << variance(myNormal) << endl;
cout << "median: " << median(myNormal) << endl;
cout << "mode: " << mode(myNormal) << endl;
cout << "kurtosis excess: " << kurtosis_excess(myNormal; cout << "kurtosis: " << kurtosis(myNormal) << endl;
cout << "skewness: " << skewness(myNormal) << endl;
cout << "characteristic function: " << chf(myNormal, x);
cout << "hazard: " << hazard(myNormal, x) << endl;
cout << "cumulative hazard: " << chf(myNormal, x) << endl;
// Return the value of random variable
// s.t. dist(myNormal, x) = p
double p = 0.3;
cout << "quantile: " << quantile(myNormal, p) << endl;
// Other properties; these functions return a pair
cout << "range: (" << range(myNormal).first << ","
<< range(myNormal).second << ")" << endl;
cout << "support: (" << support(myNormal).first << ","
<< support(myNormal).second << ")" << endl;
The Noncentral Chi-Squared Distribution
This distribution is a generalization of the chi-squared distribution and it has a number of applications in computational finance, for example when modeling the short rate or calculating bond option prices under the Cox-Ingersoll-Ross (CIR) model (Cox 1985).
In this section we shall create matrix lookup tables for the non-central chi-squared distribution. In particular, we reproduce the same results as in the Math Toolkit documentation (see the section “Worked Example, Non Central Chi Squared Example”). The objective is to create a table (two-dimensional associative matrix) of the power of the Chi squared test at the 5% significance level for various values of the degrees of freedom and non-centrality parameters. To this end, we employ the template classes in Duffy 2009 that model associative matrices on the one hand and the software for presenting the results in Excel on the other hand. An associative matrix is a generalization of the standard matrix class that we see in C++ and Matlab in that we access the elements of the matrix using arbitrary data types (for example, string or doubles) rather than integers. In our example the data type is double because we generate a set of increasing values of the two defining parameters of the non-central chi-squared distribution. We summarise the steps that allow us to create the lookup table (see Duffy 2009 for more details and source code):
- Generate the two collections of values that serves as the indices of the rows (degrees of freedom) and columns (non-centrality parameters) based on a starting value, number of rows or columns and value increment:
Code:// Now create the row and column indices VectorCollectionGenerator<double, double> dofRows; dofRows.Start = 2.0; dofRows.Increment = 1.0; dofRows.Size = 10; Set<double> dofSet = createSet<double>(dofRows); VectorCollectionGenerator<double, double> nonCentralParameterColumns; nonCentralParameterColumns.Start = 2.0; nonCentralParameterColumns.Increment = 2.0; nonCentralParameterColumns.Size = 6; Set<double> nonCentralParameterSet = createSet<double>(nonCentralParameterColumns);
- Create the ‘data’ matrix by calling the quantile function at the 5% significance level:
Code:// Start values for rows and columns double r1 = dofRows.Start; double c1 = nonCentralParameterColumns.Start; // Lookup table dimensions long NRows = dofRows.Size; // Degrees of freedom long NColumns = nonCentralParameterColumns.Size; // Non-centrality parameter double incrementRow = dofRows.Increment; double incrementColumn = nonCentralParameterColumns.Increment; NumericMatrix<double, long> mat(NRows, NColumns); using namespace boost::math; // For convenience // Basic case, no associativity for (long r = mat.MinRowIndex(); r <= mat.MaxRowIndex(); ++r) { c1 = nonCentralParameterColumns.Start; for (long c = mat.MinColumnIndex(); c <= mat.MaxColumnIndex(); ++c) { cs = quantile (complement(chi_squared(r1), 0.05)); mat(r,c) = cdf(complement(non_central_chi_squared(r1,c1),cs)); c1 += incrementColumn; } r1 += incrementRow; }
- Create the associative matrix and display it in Excel:
Code:// Now create the associative matrix AssocMatrix<double, double, double> myAssocMat(dofSet, nonCentralParameterSet, mat); printAssocMatrixInExcel(myAssocMat, string("NCCQT"));
* | 2 | 4 | 6 | 8 | 10 | 12 |
---|
[tr2][td1]5[/td1][td1]0.156993[/td1][td1]0.291756[/td1][td1]0.432876[/td1][td1]0.564449[/td1][td1]0.677439[/td1][td1]0.768598[/td1][/tr2][tr1][td1]6[/td1][td1]0.146212[/td1][td1]0.269796[/td1][td1]0.40283[/td1][td1]0.530652[/td1][td1]0.643848[/td1][td1]0.738025[/td1][/tr1]
[tr2][td1]7[/td1][td1]0.137813[/td1][td1]0.252152[/td1][td1]0.377911[/td1][td1]0.501722[/td1][td1]0.614188[/td1][td1]0.710197[/td1][/tr2][tr1][td1]8[/td1][td1]0.131052[/td1][td1]0.237603[/td1][td1]0.356823[/td1][td1]0.476586[/td1][td1]0.587734[/td1][td1]0.684726[/td1][/tr1][tr2][td1]9[/td1][td1]0.125473[/td1][td1]0.225361[/td1][td1]0.338694[/td1][td1]0.454489[/td1][td1]0.563949[/td1][td1]0.661305[/td1][/tr2][tr1][td1]10[/td1][td1]0.120777[/td1][td1]0.214894[/td1][td1]0.322908[/td1][td1]0.434875[/td1][td1]0.542418[/td1][td1]0.639683[/td1][/tr1][tr2][td1]11[/td1][td1]0.116762[/td1][td1]0.205827[/td1][td1]0.309018[/td1][td1]0.417326[/td1][td1]0.522819[/td1][td1]0.619652[/td1][/tr2]
An important attention point when using C++ algorithms and other code is to discover and handle run-time errors. In other words, we wish to support a certain level of fault-tolerance in our code. Some of the possible errors occurring in a typical application are:
- “bad” input arguments in constructors and other member and non-member functions
- overflow, underflow and NaN (‘not a number’) phenomenon
- what to do when an iterative algorithm fails to converge in a given number of iterations
- and many more
Code:
#include <boost/math/distributions/students_t.hpp>
double dof = -10; // Degrees of freedom
double t = 3.0;
try
{
students_t myDist(dof);
double p = cdf(myDist, t);
}
catch(std::exception& e)
{
cout << endl << "Bad argument" << endl;
cout << "Message: " << e.what() << endl;
}
Applications in Computational Finance
Statistical distributions and their applications are used in computational finance in areas such as stochastic processes, Monte Carlo simulation and jump-diffusion model, to name just a few. The advantage of having the Boost Math Toolkit at our disposal is that we do not have to develop the code for statistical distributions ourselves but we can use a peer-reviewed library. The library can also be integrated into Excel if we use the C++/CLI interoperability language. We can then use the functionality as worksheet function or as COM addins.
References
Cox, J., J. Ingersoll, S. Ross, “A theory of the term structure of interest rates”, Econometrica, 53(2) (1985):385–407
Duffy, Daniel J. and Kienitz, J. 2009 Monte Carlo frameworks, John Wiley and Sons, Chichester, UK
About the Author:
Daniel Duffy is an author and trainer. His company Datasim specializes in methods and techniques for solving problems in quantitative finance. He is the author of Monte Carlo Frameworks: Building Customisable High-performance C++ Applications and Introduction to C++ for Financial Engineers: An Object-Oriented Approach. For more information on the author, see QuantNet's interview with Daniel Duffy