C++ Statistical Distributions in Boost

Daniel Duffy

C++ author, trainer
In this blog we give an overview of the univariate statistical distributions in the Boost Math Toolkit. We discuss the functionality in the Toolkit, give some examples of use and describe applications to computational finance.

The Math Toolkit has many applications and is easy to understand and to apply.

Statistical Distributions in the Toolkit

We give a detailed overview of the univariate statistical distributions and functions in the Math Toolkit. The emphasis is on discussing the functionality in the toolkit, in particular:
  • Discrete and continuous distributions, their member functions and defining properties
  • Other non-member functions, for example the probability density and cumulative distribution functions, kurtosis and skewness
  • Some examples to motivate how to use the classes in the toolkit
All distributions are defined in terms of random variables, which are mappings from a probability space into some other space, typically the real numbers. A discrete probability distribution is one in which the random variable takes discrete values, while a continuous probability distribution is one whose cumulative distribution function is continuous.

The discrete probability distributions in Boost are:
  1. Bernoulli (a single trial whose outcome is 0 (failure) or 1 (success))
  2. Binomial (used to obtain the probability of observing k successes in N trials)
  3. Negative Binomial (used to obtain the probability of observing k failures and r successes in k + r trials, the last trial being a success)
  4. Hypergeometric (describes the number of events k from a sample n drawn from a total population N without replacement)
  5. Poisson (expresses the probability of a number of events occurring in a fixed period of time)
The continuous probability distributions in Boost are:
  • Beta (used in Bayesian statistics applications)
  • Cauchy-Lorentz (used in physics, spectroscopy and to solve differential equations)
  • Chi Squared (used in statistical tests)
  • Exponential (models the time between independent events)
  • Extreme Value (models rare events)
  • F (The Fisher F distribution that tests if two samples have the same variance)
  • Gamma (and Erlang) (used to model waiting times)
  • Laplace (the distribution of differences between two independent variates with identical exponential distributions)
  • Logistic (used in logistic regression and feedforward neural network applications)
  • Log Normal (used when the logarithm of the random variable is normally distributed)
  • Noncentral Beta (a generalization of the Beta Distribution)
  • Noncentral Chi-Squared (a generalization of the Chi Squared Distribution)
  • Noncentral F (a generalization of the Fisher F distribution)
  • Noncentral T (generalization of Student’s t Distribution)
  • Normal (Gaussian) (probably the best known distribution)
  • Pareto (models heavy-tailed quantities where a small number of large values dominates, for example wealth or loss sizes)
  • Rayleigh (the distribution of the magnitude of a vector whose two orthogonal components are independent, zero-mean normal variates)
  • Student’s t (the ‘best’ approximate distribution to an unknown distribution)
  • Triangular (used when a distribution is only vaguely known, for example in software project estimation)
  • Weibull (used in failure analysis models)
  • Uniform (also known as the rectangular distribution and it models a probability distribution with a constant probability)
Each of the above distributions is implemented by a corresponding template class with two template parameters. The first parameter is the underlying data type used by the distribution (the default is double) and the second parameter is the so-called policy. In general, a policy is a fine-grained compile-time mechanism that we can use to customize the behaviour of a library. It allows us to change error handling or calculation precision at both program level and at the client site; we give a short sketch after the list below.
Some examples are:
  • How to handle results from bad input arguments.
  • Whether a distribution's functions should return real or integer values, and how they should be rounded.
  • Determine the working precision when we calculate results.
  • How to control accuracy when data types are promoted to more precise types.
  • The maximum number of iterations a special function or algorithm should execute before stopping the computation and throwing an exception.
  • Calling a function on a distribution for which it is not defined; for example, the mean of the Cauchy distribution is not defined.
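As a sketch of the last point, the following example (the typedef name is our own; the policy components are from boost::math::policies) makes the Cauchy distribution return NaN and set errno on a domain error, rather than throwing as the default policy does:
Code:
#include <boost/math/distributions/cauchy.hpp>
#include <boost/math/policies/policy.hpp>
#include <iostream>

using namespace boost::math;
using namespace boost::math::policies;

// A policy that sets errno and returns NaN on domain errors,
// instead of throwing std::domain_error (the default behaviour)
typedef policy<domain_error<errno_on_error> > no_throw_policy;

int main()
{
    cauchy_distribution<double, no_throw_policy> myCauchy(0.0, 1.0);

    // The mean of the Cauchy distribution is undefined; with this
    // policy the call returns NaN rather than throwing
    std::cout << "mean: " << mean(myCauchy) << std::endl; // prints nan
    return 0;
}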
Each distribution class has defining parameters that we design as private data members that are initialized in the distribution’s constructor and that subsequently can be accessed by const member functions to return these data members. Each class also has a number of member functions. Furthermore, the toolkit contains properties that are common to all distributions that we call using non-member functions. The advantage of this approach is that new functionality can be added to these distributions in the future.
These functions are:
  • cdf (cumulative distribution function)
  • cdf complement (this is 1 – cdf)
  • hazard (the event rate at time t conditional on survival until time t or later, i.e. pdf(x) / (1 - cdf(x)); useful when modeling failure in mechanical systems)
  • chf (cumulative hazard function which measures the accumulation of hazard over time)
  • kurtosis (a measure of the ‘peakedness’ of a probability distribution)
  • kurtosis_excess (does the distribution have fatter tails than a normal distribution?)
  • mean (the expected value)
  • median (the value separating the lower and higher halves of a distribution)
  • mode (the point at which the probability mass or density function takes its maximum)
  • pdf (probability density function)
  • range (the length of the smallest interval which contains all the data)
  • quantile (the inverse of the cdf: the value of the random variable corresponding to a given cumulative probability)
  • skewness (a measure of the asymmetry of a probability distribution)
  • support (the smallest closed interval/set whose complement has probability zero)
  • variance (how far do values differ from the mean)
Before we give an initial example of using Boost distributions and related functions we need to say a few words on configuration. First, all relevant code is in the namespace boost::math; second, we need to add include files to our test program. The easiest approach is to include all distributions using the convenience header:
Code:
#include <boost/math/distributions.hpp>
Alternatively, if you are only using a specific distribution you can use its corresponding include file, for example, in the case of the normal distribution we use:
Code:
#include <boost/math/distributions/normal.hpp>
Similarly, if we wish to use the noncentral chi-squared distribution we use the following include file:
Code:
#include <boost/math/distributions/non_central_chi_squared.hpp>
We now discuss the normal distribution in detail.

An Example

The normal (or Gaussian) distribution is one of the most important statistical distributions because of its ability to model many kinds of phenomena in diverse areas such as economics, computational finance, physics and the social sciences. In general, the normal distribution is used to describe variables that tend to cluster around a mean value. We now show how to create the normal distribution in Boost and how to call the member and non-member functions associated with it. We present the code without further explanation as it should be clear from the context:
Code:
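// Assumes #include <boost/math/distributions/normal.hpp> (or the convenience
// header above), plus using namespace std; and using namespace boost::math;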
normal_distribution<> myNormal(1.0, 10.0);
cout << "Mean: " << myNormal.mean()
<< ", standard deviation: " << myNormal.standard_deviation() << endl;
 
// Distributional properties
double x = 10.25;
 
cout << "pdf: " << pdf(myNormal, x) << endl;
cout << "cdf: " << cdf(myNormal, x) << endl;
 
// Choose another data type and now a N(0,1) variate
normal_distribution<float> myNormal2;
cout << "Mean: " << myNormal2.mean() << ", standard deviation: " << myNormal2.standard_deviation() << endl;
 
cout << "pdf: " << pdf(myNormal2, x) << endl;
cout << "cdf: " << cdf(myNormal2, x) << endl;
cout << "cdf complement:" << cdf(complement(myNormal2, x));
 
// Choose precision
cout.precision(10); // Number of significant digits displayed
 
// Other properties
cout << "n***normal distribution: n";
cout << "mean: " << mean(myNormal) << endl;
cout << "variance: " << variance(myNormal) << endl;
cout << "median: " << median(myNormal) << endl;
cout << "mode: " << mode(myNormal) << endl;
cout << "kurtosis excess: " << kurtosis_excess(myNormal; cout << "kurtosis: " << kurtosis(myNormal) << endl;
cout << "skewness: " << skewness(myNormal) << endl;
cout << "characteristic function: " << chf(myNormal, x);
cout << "hazard: " << hazard(myNormal, x) << endl;
cout << "cumulative hazard: " << chf(myNormal, x) << endl;
 
// Return the value x of the random variable
// s.t. cdf(myNormal, x) = p
double p = 0.3;
cout << "quantile: " << quantile(myNormal, p) << endl;
 
// Other properties; these functions return a pair
cout << "range: (" << range(myNormal).first << ","
<< range(myNormal).second << ")" << endl;
cout << "support: (" << support(myNormal).first << ","
<< support(myNormal).second << ")" << endl;
The reason for including this code is that it is representative of many of the other distributions; it serves as a template for other cases. You can copy the code and run it, assuming you have a C++ compiler and the Boost library installed on your computer.

The Noncentral Chi-Squared Distribution

This distribution is a generalization of the chi-squared distribution and it has a number of applications in computational finance, for example when modeling the short rate or calculating bond option prices under the Cox-Ingersoll-Ross (CIR) model (Cox 1985).
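For example, the CIR transition law is a scaled noncentral chi-squared distribution. The following minimal sketch (the parameter values are illustrative assumptions, not from this article) computes a transition probability for the short rate with Boost:
Code:
// Under CIR, 2*c*r(t) given r(0) follows a noncentral chi-squared law with
// df = 4*kappa*theta/sigma^2, non-centrality lambda = 2*c*r(0)*exp(-kappa*t),
// where c = 2*kappa / (sigma^2 * (1 - exp(-kappa*t)))
#include <boost/math/distributions/non_central_chi_squared.hpp>
#include <cmath>
#include <iostream>

int main()
{
    using boost::math::non_central_chi_squared;

    double kappa = 1.0, theta = 0.05, sigma = 0.1; // CIR parameters (assumed)
    double r0 = 0.04, t = 1.0;                     // initial short rate, horizon

    double c = 2.0 * kappa / (sigma * sigma * (1.0 - std::exp(-kappa * t)));
    double df = 4.0 * kappa * theta / (sigma * sigma);   // degrees of freedom
    double lambda = 2.0 * c * r0 * std::exp(-kappa * t); // non-centrality

    non_central_chi_squared dist(df, lambda);

    // Probability that the short rate is at or below 5% at time t
    std::cout << "P(r(t) <= 0.05) = " << cdf(dist, 2.0 * c * 0.05) << std::endl;
    return 0;
}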

In this section we shall create matrix lookup tables for the non-central chi-squared distribution. In particular, we reproduce the same results as in the Math Toolkit documentation (see the section “Worked Example, Non Central Chi Squared Example”). The objective is to create a table (two-dimensional associative matrix) of the power of the Chi squared test at the 5% significance level for various values of the degrees of freedom and non-centrality parameters. To this end, we employ the template classes in Duffy 2009 that model associative matrices on the one hand and the software for presenting the results in Excel on the other hand. An associative matrix is a generalization of the standard matrix class that we see in C++ and Matlab in that we access the elements of the matrix using arbitrary data types (for example, string or doubles) rather than integers. In our example the data type is double because we generate a set of increasing values of the two defining parameters of the non-central chi-squared distribution. We summarise the steps that allow us to create the lookup table (see Duffy 2009 for more details and source code):
  1. Generate the two collections of values that serve as the indices of the rows (degrees of freedom) and columns (non-centrality parameters) based on a starting value, a number of rows or columns and a value increment:
    Code:
    // Now create the row and column indices
    VectorCollectionGenerator<double, double> dofRows;
    dofRows.Start = 2.0;
    dofRows.Increment = 1.0;
    dofRows.Size = 10;
    Set<double> dofSet = createSet<double>(dofRows);
    
    VectorCollectionGenerator<double, double> nonCentralParameterColumns;
    nonCentralParameterColumns.Start = 2.0;
    nonCentralParameterColumns.Increment = 2.0;
    nonCentralParameterColumns.Size = 6;
    Set<double> nonCentralParameterSet = createSet<double>(nonCentralParameterColumns);
  2. Create the ‘data’ matrix by calling the quantile function at the 5% significance level:
    Code:
    // Start values for rows and columns
    double r1 = dofRows.Start;
    double c1 = nonCentralParameterColumns.Start;

    // Lookup table dimensions
    long NRows = dofRows.Size;                       // Degrees of freedom
    long NColumns = nonCentralParameterColumns.Size; // Non-centrality parameter
    double incrementRow = dofRows.Increment;
    double incrementColumn = nonCentralParameterColumns.Increment;

    NumericMatrix<double, long> mat(NRows, NColumns);
    using namespace boost::math; // For convenience

    // Basic case, no associativity
    for (long r = mat.MinRowIndex(); r <= mat.MaxRowIndex(); ++r)
    {
        c1 = nonCentralParameterColumns.Start;
        for (long c = mat.MinColumnIndex(); c <= mat.MaxColumnIndex(); ++c)
        {
            // Critical value of the central chi-squared test at the 5% level
            double cs = quantile(complement(chi_squared(r1), 0.05));
            // Power of the test under the noncentral alternative
            mat(r,c) = cdf(complement(non_central_chi_squared(r1,c1), cs));
            c1 += incrementColumn;
        }
        r1 += incrementRow;
    }
  3. Create the associative matrix and display it in Excel:
    Code:
    // Now create the associative matrix
    AssocMatrix<double, double, double> myAssocMat(dofSet, nonCentralParameterSet, mat);
    printAssocMatrixInExcel(myAssocMat, string("NCCQT"));
The output is shown in the table below; the rows are the degrees of freedom, the columns are the non-centrality parameters, and each entry is the power of the test at the 5% significance level.

dof \ ncp     2         4         6         8         10        12
  2       0.225545  0.415427  0.58404   0.717564  0.815421  0.883163
  3       0.192238  0.358534  0.518079  0.654111  0.761063  0.840227
  4       0.171467  0.320074  0.470102  0.604725  0.715986  0.802426
  5       0.156993  0.291756  0.432876  0.564449  0.677439  0.768598
  6       0.146212  0.269796  0.40283   0.530652  0.643848  0.738025
  7       0.137813  0.252152  0.377911  0.501722  0.614188  0.710197
  8       0.131052  0.237603  0.356823  0.476586  0.587734  0.684726
  9       0.125473  0.225361  0.338694  0.454489  0.563949  0.661305
 10       0.120777  0.214894  0.322908  0.434875  0.542418  0.639683
 11       0.116762  0.205827  0.309018  0.417326  0.522819  0.619652
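For readers without the Duffy 2009 classes, the same table can be reproduced with Boost.Math and the standard library alone; a minimal, self-contained sketch (the formatting choices are our own):
Code:
#include <boost/math/distributions/chi_squared.hpp>
#include <boost/math/distributions/non_central_chi_squared.hpp>
#include <iostream>
#include <iomanip>

int main()
{
    using namespace boost::math;

    for (double df = 2.0; df <= 11.0; df += 1.0)        // degrees of freedom (rows)
    {
        std::cout << std::setw(4) << (int)df;
        for (double ncp = 2.0; ncp <= 12.0; ncp += 2.0) // non-centrality (columns)
        {
            // Critical value of the central chi-squared test at the 5% level
            double cs = quantile(complement(chi_squared(df), 0.05));
            // Power of the test under the noncentral alternative
            double power = cdf(complement(non_central_chi_squared(df, ncp), cs));
            std::cout << std::fixed << std::setprecision(6)
                      << std::setw(12) << power;
        }
        std::cout << std::endl;
    }
    return 0;
}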
Exceptions

An important concern when using C++ algorithms and other code is how to discover and handle run-time errors. In other words, we wish to support a certain level of fault tolerance in our code. Some of the possible errors occurring in a typical application are:
  • “bad” input arguments in constructors and other member and non-member functions
  • overflow, underflow and NaN (‘not a number’) phenomena
  • what to do when an iterative algorithm fails to converge in a given number of iterations
  • and many more
In general, we use the standard C++ exception handling mechanism to discover and catch exceptions. As an example, we (deliberately) create an invalid instance of Student’s t distribution by passing a negative number of degrees of freedom:
Code:
#include <boost/math/distributions/students_t.hpp>
 
double dof = -10; // Degrees of freedom
double t = 3.0;
 
try
{
    students_t myDist(dof); // Throws std::domain_error: dof must be > 0
    double p = cdf(myDist, t);
}
catch (std::exception& e)
{
    cout << endl << "Bad argument" << endl;
    cout << "Message: " << e.what() << endl;
}
There are other ways to handle exceptional situations, for example the customizable policies mentioned earlier; these are discussed on www.boost.org.

Applications in Computational Finance
Statistical distributions are used in computational finance in areas such as stochastic processes, Monte Carlo simulation and jump-diffusion models, to name just a few. The advantage of having the Boost Math Toolkit at our disposal is that we do not have to develop the code for statistical distributions ourselves; instead we can use a peer-reviewed library. The library can also be integrated into Excel if we use the C++/CLI interoperability language. We can then use the functionality as worksheet functions or as COM add-ins.
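For instance, here is a minimal sketch (our own illustration, not from the article) of inverse-transform sampling for Monte Carlo, combining the toolkit's quantile function with a standard random number generator:
Code:
// Drawing normal variates by applying the Boost quantile (inverse cdf)
// to uniform random numbers (the inverse transform method)
#include <boost/math/distributions/normal.hpp>
#include <random>
#include <iostream>

int main()
{
    boost::math::normal_distribution<> N01; // standard normal N(0,1)

    std::mt19937 rng(42);
    // Avoid u == 0, where the normal quantile is -infinity
    std::uniform_real_distribution<double> U(1e-12, 1.0);

    const int NSim = 100000;
    double sum = 0.0;
    for (int i = 0; i < NSim; ++i)
        sum += boost::math::quantile(N01, U(rng)); // one N(0,1) draw

    std::cout << "Sample mean (should be near 0): " << sum / NSim << std::endl;
    return 0;
}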

References
Cox, J. C., Ingersoll, J. E. and Ross, S. A., “A theory of the term structure of interest rates”, Econometrica 53(2) (1985): 385–407.
Duffy, D. J. and Kienitz, J., Monte Carlo Frameworks: Building Customisable High-performance C++ Applications, John Wiley and Sons, Chichester, UK, 2009.

About the Author:
Daniel Duffy is an author and trainer. His company Datasim specializes in methods and techniques for solving problems in quantitative finance. He is the author of Monte Carlo Frameworks: Building Customisable High-performance C++ Applications and Introduction to C++ for Financial Engineers: An Object-Oriented Approach. For more information on the author, see QuantNet's interview with Daniel Duffy.
 
Can anyone contrast the math toolkit (on the stats side) vs accumulators in a nutshell? If math has all the distributions and features, why is the accumulators library there?

I used accumulators for a minute, but found it was more effective to just write it myself, for my purposes.
 
The Maths Toolkit stats side comprises 26 univariate distributions and their properties.

Accumulators is for incremental statistical computation. There is very little overlap AFAIK.
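For instance, a minimal sketch of the incremental style (our own example):
Code:
// Boost.Accumulators computes statistics incrementally, one sample
// at a time, without storing the whole data set
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/accumulators/statistics/variance.hpp>
#include <iostream>

int main()
{
    using namespace boost::accumulators;

    accumulator_set<double, stats<tag::mean, tag::variance> > acc;
    for (double x : {1.0, 2.0, 3.0, 4.0, 5.0})
        acc(x); // push one observation at a time

    std::cout << "mean: " << mean(acc)
              << ", variance: " << variance(acc) << std::endl;
    return 0;
}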

Why not just use them so you can concentrate on core stuff?

"I used accumulators for a minute"
How can you form an idea then? Have you compiled the code? That's the test !!
 
Thanks for the reply. I more 'did it myself' as a learning exercise; I should give them another look. I did run accumulators and compile the code, I meant 'minute' in a figurative sense. I use other Boost libraries currently though and like them.

Thanks for the article too!
 
You're welcome.

Another good library is Special Functions, which has lots of applications in finance. I remember implementing some of these in the past. It was a pain (not my area) but using them helps a lot.
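For example (a small sketch of our own):
Code:
// Two Boost special functions that turn up in finance, e.g. erf in
// Black-Scholes-type formulas and the gamma function
#include <boost/math/special_functions/erf.hpp>
#include <boost/math/special_functions/gamma.hpp>
#include <iostream>

int main()
{
    std::cout << "erf(1.0)    = " << boost::math::erf(1.0) << std::endl;
    std::cout << "tgamma(5.0) = " << boost::math::tgamma(5.0) << std::endl; // 4! = 24
    return 0;
}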

And John Maddock and colleagues have done a great job of the documentation imo.

//

There is extra code and examples here

http://www.datasimfinancial.com/forum/viewtopic.php?t=111

In particular, my posts of 13, 15 and 22 August 2009 are relevant to this blog.
 
Some new relevant developments in Boost (Random, Math Toolkit) might be worth having a look at.
 