ROOT 6.13/01 Reference Guide
Minimum Covariance Determinant Estimator: a fast algorithm invented by Peter J. Rousseeuw and Katrien Van Driessen, "A Fast Algorithm for the Minimum Covariance Determinant Estimator", Technometrics, August 1999, Vol. 41, No. 3.
What are robust estimators? "An important property of an estimator is its robustness. An estimator is called robust if it is insensitive to measurements that deviate from the expected behaviour. There are 2 ways to treat such deviating measurements: one may either try to recognise them and then remove them from the data sample; or one may leave them in the sample, taking care that they do not influence the estimate unduly. In both cases robust estimators are needed...Robust procedures compensate for systematic errors as much as possible, and indicate any situation in which a danger of not being able to operate reliably is detected." R.Fruhwirth, M.Regler, R.K.Bock, H.Grote, D.Notz "Data Analysis Techniques for High-Energy Physics", 2nd edition
What does this algorithm do? It computes a highly robust estimator of multivariate location and scatter. Then, it takes those estimates to compute robust distances of all the data vectors. Those with large robust distances are considered outliers. Robust distances can then be plotted for better visualization of the data.
How does this algorithm do it? The MCD objective is to find h observations (out of n) whose classical covariance matrix has the lowest determinant. The MCD estimate of location is then the average of those h points, and the MCD estimate of scatter is their covariance matrix. The minimum (and default) value is h = (n+nvariables+1)/2, so the algorithm is effective as long as at least h of the n observations are not outliers, that is, when at most n-h observations are contaminated. The algorithm also allows for exact-fit situations, that is, when h or more observations lie on a hyperplane. In that case the algorithm still yields the MCD location T and scatter matrix S, the latter being singular, as it should be. From (T,S) the program then computes the equation of the hyperplane.
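As a small worked example of the subset size described above, the following sketch computes the default h and the number of observations left outside the h-subset. The helper names `DefaultH` and `MaxOutliers` are hypothetical and not part of TRobustEstimator, and integer division is assumed in the formula.

```cpp
#include <stdexcept>

// Default (and minimum) subset size h = (n + nvar + 1) / 2.
// Hypothetical helper for illustration; integer division is an assumption.
int DefaultH(int n, int nvar)
{
   if (n <= 0 || nvar <= 0 || nvar >= n)
      throw std::invalid_argument("need 0 < nvar < n");
   return (n + nvar + 1) / 2;
}

// Number of observations outside the h-subset, i.e. the number of
// outliers that still leaves h clean observations.
int MaxOutliers(int n, int nvar)
{
   return n - DefaultH(n, nvar);
}
```

For n = 100 observations of 3 variables this gives h = 52, so up to 48 observations may be outliers.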
How can this algorithm be used? Whenever contamination of the data is suspected that might influence the classical estimates. Robust estimation of location and scatter is also a tool to robustify other multivariate techniques, such as principal-component analysis and discriminant analysis.
Technical details of the algorithm:
Definition at line 23 of file TRobustEstimator.h.
Public Member Functions | |
TRobustEstimator () | |
Default constructor, intended for the univariate case: call this constructor first, then the EvaluateUni(..) function. More... | |
TRobustEstimator (Int_t nvectors, Int_t nvariables, Int_t hh=0) | |
constructor More... | |
virtual | ~TRobustEstimator () |
void | AddColumn (Double_t *col) |
Adds a column to the data matrix. The column is assumed to have size fN; the variable fVarTemp keeps the number of columns already added. More... | |
void | AddRow (Double_t *row) |
Adds a vector to the data matrix. The vector is assumed to have size fNvar. More... | |
void | Evaluate () |
Finds the estimate of multivariate mean and variance. More... | |
void | EvaluateUni (Int_t nvectors, Double_t *data, Double_t &mean, Double_t &sigma, Int_t hh=0) |
For the univariate case, estimates of location and scatter are returned in the mean and sigma parameters. The algorithm works on the same principle as in the multivariate case: it finds a subset of size hh with the smallest sigma, and then returns the mean and sigma of this subset. More... | |
Int_t | GetBDPoint () |
returns the breakdown point of the algorithm More... | |
Double_t | GetChiQuant (Int_t i) const |
returns the chi2 quantiles More... | |
void | GetCorrelation (TMatrixDSym &matr) |
returns the correlation matrix More... | |
const TMatrixDSym * | GetCorrelation () const |
void | GetCovariance (TMatrixDSym &matr) |
returns the covariance matrix More... | |
const TMatrixDSym * | GetCovariance () const |
const TMatrixD & | GetData () |
returns a reference to the data matrix More... | |
void | GetHyperplane (TVectorD &vec) |
if the points are on a hyperplane, returns this hyperplane More... | |
const TVectorD * | GetHyperplane () const |
if the points are on a hyperplane, returns this hyperplane More... | |
void | GetMean (TVectorD &means) |
return the estimate of the mean More... | |
const TVectorD * | GetMean () const |
Int_t | GetNHyp () |
Int_t | GetNOut () |
returns the number of outliers More... | |
Int_t | GetNumberObservations () const |
Int_t | GetNvar () const |
const TArrayI * | GetOuliers () const |
void | GetRDistances (TVectorD &rdist) |
returns the robust distances (helps to find outliers) More... | |
const TVectorD * | GetRDistances () const |
Protected Member Functions | |
void | AddToSscp (TMatrixD &sscp, TVectorD &vec) |
update the sscp matrix with vector vec More... | |
void | Classic () |
called when h=n. More... | |
void | ClearSscp (TMatrixD &sscp) |
clear the sscp matrix, used for covariance and mean calculation More... | |
void | Correl () |
transforms covariance matrix into correlation matrix More... | |
void | Covar (TMatrixD &sscp, TVectorD &m, TMatrixDSym &cov, TVectorD &sd, Int_t nvec) |
calculates mean and covariance More... | |
void | CreateOrtSubset (TMatrixD &dat, Int_t *index, Int_t hmerged, Int_t nmerged, TMatrixD &sscp, Double_t *ndist) |
Creates a subset of hmerged vectors with the smallest orthogonal distances to the hyperplane hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0. This function is called when fewer than fH samples lie on a hyperplane. More... | |
void | CreateSubset (Int_t ntotal, Int_t htotal, Int_t p, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist) |
Creates a subset of htotal elements from ntotal elements. First, p+1 elements are drawn randomly (without repetitions). If their covariance matrix is singular, more elements are added one by one until their covariance matrix becomes regular or it becomes clear that htotal observations lie on a hyperplane. If the determinant of the covariance matrix is nonzero, the distances of all ntotal elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean and S_inv is the inverse of the covariance matrix. The htotal points with the smallest distances are included in the returned subset. More... | |
Double_t | CStep (Int_t ntotal, Int_t htotal, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist) |
from the input htotal-subset constructs another htotal subset with lower determinant More... | |
Int_t | Exact (Double_t *ndist) |
for the exact fit situations returns number of observations on the hyperplane More... | |
Int_t | Exact2 (TMatrixD &mstockbig, TMatrixD &cstockbig, TMatrixD &hyperplane, Double_t *deti, Int_t nbest, Int_t kgroup, TMatrixD &sscp, Double_t *ndist) |
This function is called if determinant of the covariance matrix of a subset=0. More... | |
Double_t | KOrdStat (Int_t ntotal, Double_t *arr, Int_t k, Int_t *work) |
Returns the k-th order statistic of arr; a separate version is kept here because an Int_t work array is needed. More... | |
Int_t | Partition (Int_t nmini, Int_t *indsubdat) |
Divides the elements into approximately equal subgroups. The number of elements in each subgroup is stored in indsubdat; the number of subgroups is returned. More... | |
Int_t | RDist (TMatrixD &sscp) |
Calculates robust distances. The samples with robust distances greater than a cutoff value (the 0.975 quantile of the chi2 distribution with fNvar degrees of freedom, multiplied by a correction factor) are then given weight 0, and new, reweighted estimates of location and scatter are calculated. The function returns the number of outliers. More... | |
void | RDraw (Int_t *subdat, Int_t ngroup, Int_t *indsubdat) |
Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case numbers are uniformly distributed from 1 to n. More... | |
Protected Attributes | |
TMatrixDSym | fCorrelation |
TMatrixDSym | fCovariance |
TMatrixD | fData |
Int_t | fExact |
Int_t | fH |
TVectorD | fHyperplane |
TMatrixDSym | fInvcovariance |
TVectorD | fMean |
Int_t | fN |
Int_t | fNvar |
TArrayI | fOut |
TVectorD | fRd |
TVectorD | fSd |
Int_t | fVarTemp |
Int_t | fVecTemp |
#include <TRobustEstimator.h>
TRobustEstimator::TRobustEstimator | ( | ) |
Default constructor, intended for the univariate case: call this constructor first, then the EvaluateUni(..) function.
Definition at line 124 of file TRobustEstimator.cxx.
TRobustEstimator::TRobustEstimator | ( | Int_t | nvectors, |
Int_t | nvariables, | ||
Int_t | hh = 0 |
||
) |
constructor
Definition at line 130 of file TRobustEstimator.cxx.
virtual TRobustEstimator::~TRobustEstimator | ( | ) |
inline virtual |
Definition at line 78 of file TRobustEstimator.h.
void TRobustEstimator::AddColumn | ( | Double_t * | col | ) |
Adds a column to the data matrix. The column is assumed to have size fN; the variable fVarTemp keeps the number of columns already added.
Definition at line 170 of file TRobustEstimator.cxx.
void TRobustEstimator::AddRow | ( | Double_t * | row | ) |
Adds a vector to the data matrix. The vector is assumed to have size fNvar.
Definition at line 191 of file TRobustEstimator.cxx.
void TRobustEstimator::AddToSscp | ( | TMatrixD & | sscp, |
TVectorD & | vec |
||
) |
protected |
update the sscp matrix with vector vec
Definition at line 778 of file TRobustEstimator.cxx.
void TRobustEstimator::Classic | ( | ) |
protected |
called when h=n.
Returns classic covariance matrix and mean
Definition at line 808 of file TRobustEstimator.cxx.
void TRobustEstimator::ClearSscp | ( | TMatrixD & | sscp | ) |
protected |
clear the sscp matrix, used for covariance and mean calculation
Definition at line 795 of file TRobustEstimator.cxx.
void TRobustEstimator::Correl | ( | ) |
protected |
transforms covariance matrix into correlation matrix
Definition at line 849 of file TRobustEstimator.cxx.
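The transformation named above can be sketched in a few lines. This is a minimal standalone illustration using `std::vector` instead of TMatrixDSym; the function name `CovToCorr` is hypothetical, and it assumes strictly positive variances on the diagonal.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Convert a covariance matrix into a correlation matrix:
// corr(i,j) = cov(i,j) / (sd_i * sd_j), with sd_i = sqrt(cov(i,i)).
// Assumes every diagonal element is strictly positive.
Matrix CovToCorr(const Matrix &cov)
{
   const std::size_t n = cov.size();
   std::vector<double> sd(n);
   for (std::size_t i = 0; i < n; ++i)
      sd[i] = std::sqrt(cov[i][i]);

   Matrix corr(n, std::vector<double>(n, 0.0));
   for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)
         corr[i][j] = cov[i][j] / (sd[i] * sd[j]);
   return corr;
}
```

For the covariance matrix {{4, 2}, {2, 9}} the off-diagonal correlation is 2 / (2 * 3) = 1/3.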
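void TRobustEstimator::Covar | ( | TMatrixD & | sscp, |
TVectorD & | m, | ||
TMatrixDSym & | cov, | ||
TVectorD & | sd, | ||
Int_t | nvec |
||
) |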
protected |
calculates mean and covariance
Definition at line 826 of file TRobustEstimator.cxx.
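void TRobustEstimator::CreateOrtSubset | ( | TMatrixD & | dat, |
Int_t * | index, | ||
Int_t | hmerged, | ||
Int_t | nmerged, | ||
TMatrixD & | sscp, | ||
Double_t * | ndist |
||
) |
protected |
Creates a subset of hmerged vectors with the smallest orthogonal distances to the hyperplane hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0. This function is called when fewer than fH samples lie on a hyperplane.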
Definition at line 967 of file TRobustEstimator.cxx.
void TRobustEstimator::CreateSubset | ( | Int_t | ntotal, |
Int_t | htotal, | ||
Int_t | p, | ||
Int_t * | index, | ||
TMatrixD & | data, | ||
TMatrixD & | sscp, | ||
Double_t * | ndist |
||
) |
protected |
Creates a subset of htotal elements from ntotal elements. First, p+1 elements are drawn randomly (without repetitions). If their covariance matrix is singular, more elements are added one by one until their covariance matrix becomes regular or it becomes clear that htotal observations lie on a hyperplane. If the determinant of the covariance matrix is nonzero, the distances of all ntotal elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean and S_inv is the inverse of the covariance matrix. The htotal points with the smallest distances are included in the returned subset.
Definition at line 877 of file TRobustEstimator.cxx.
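Double_t TRobustEstimator::CStep | ( | Int_t | ntotal, |
Int_t | htotal, | ||
Int_t * | index, | ||
TMatrixD & | data, | ||
TMatrixD & | sscp, | ||
Double_t * | ndist |
||
) |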
protected |
from the input htotal-subset constructs another htotal subset with lower determinant
As proven by Peter J. Rousseeuw and Katrien Van Driessen, if the distances of all elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean of the input htotal-subset and S_inv is the inverse of its covariance matrix, then the htotal elements with the smallest distances have a covariance matrix whose determinant is less than or equal to the determinant of the covariance matrix of the input subset.
determinant for this htotal-subset with smallest distances is returned
Definition at line 999 of file TRobustEstimator.cxx.
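To make the step concrete, here is a minimal two-dimensional sketch of one such concentration step. It is a standalone illustration, not the class's implementation: the types `Point` and `Fit` and the functions `FitSubset` and `CStep2D` are hypothetical, the covariance uses the divisor h-1, and the input subset is assumed to have a nonsingular covariance matrix.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct Point { double x, y; };
struct Fit { Point mean; double sxx, sxy, syy, det; };

// Classical mean and covariance (divisor h - 1) of the points selected by idx.
Fit FitSubset(const std::vector<Point> &data, const std::vector<std::size_t> &idx)
{
   Fit f{};
   const double h = static_cast<double>(idx.size());
   for (std::size_t i : idx) { f.mean.x += data[i].x; f.mean.y += data[i].y; }
   f.mean.x /= h;
   f.mean.y /= h;
   for (std::size_t i : idx) {
      const double dx = data[i].x - f.mean.x, dy = data[i].y - f.mean.y;
      f.sxx += dx * dx; f.sxy += dx * dy; f.syy += dy * dy;
   }
   f.sxx /= h - 1.0; f.sxy /= h - 1.0; f.syy /= h - 1.0;
   f.det = f.sxx * f.syy - f.sxy * f.sxy;
   return f;
}

// One step: compute the distances of all points relative to the mean and
// covariance of the input subset, then keep the h points with the smallest
// distances. Assumes det > 0 and idx.size() <= data.size().
std::vector<std::size_t> CStep2D(const std::vector<Point> &data,
                                 const std::vector<std::size_t> &idx)
{
   const Fit f = FitSubset(data, idx);
   std::vector<double> d2(data.size());
   for (std::size_t i = 0; i < data.size(); ++i) {
      const double dx = data[i].x - f.mean.x, dy = data[i].y - f.mean.y;
      // Squared distance using the closed-form inverse of a 2x2 matrix.
      d2[i] = (f.syy * dx * dx - 2.0 * f.sxy * dx * dy + f.sxx * dy * dy) / f.det;
   }
   std::vector<std::size_t> order(data.size());
   std::iota(order.begin(), order.end(), std::size_t{0});
   const auto mid = order.begin() + static_cast<std::ptrdiff_t>(idx.size());
   std::partial_sort(order.begin(), mid, order.end(),
                     [&d2](std::size_t a, std::size_t b) { return d2[a] < d2[b]; });
   order.resize(idx.size());
   return order;
}
```

Sorting by squared distance gives the same ordering as sorting by d_i, so the square root is skipped. Repeating the step until the determinant stops decreasing yields a local minimum.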
void TRobustEstimator::Evaluate | ( | ) |
Finds the estimate of multivariate mean and variance.
Definition at line 208 of file TRobustEstimator.cxx.
void TRobustEstimator::EvaluateUni | ( | Int_t | nvectors, |
Double_t * | data, | ||
Double_t & | mean, | ||
Double_t & | sigma, | ||
Int_t | hh = 0 |
||
) |
For the univariate case, estimates of location and scatter are returned in the mean and sigma parameters. The algorithm works on the same principle as in the multivariate case: it finds a subset of size hh with the smallest sigma, and then returns the mean and sigma of this subset.
Definition at line 608 of file TRobustEstimator.cxx.
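The principle described above can be sketched without ROOT as follows. In one dimension the subset of size h with the smallest spread is always a run of h consecutive values in the sorted sample, so it is enough to slide a window over the sorted data. The names `UniEstimate` and `UniMcd` are hypothetical, and the sketch returns the raw mean and standard deviation of the subset; any correction factors the class may apply are left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

struct UniEstimate { double mean; double sigma; };

// Among all windows of h consecutive values in the sorted sample, pick the
// one with the smallest variance and return its mean and standard deviation.
UniEstimate UniMcd(std::vector<double> data, std::size_t h)
{
   const std::size_t n = data.size();
   if (h < 2 || h > n)
      throw std::invalid_argument("need 2 <= h <= n");
   std::sort(data.begin(), data.end());

   double bestVar = std::numeric_limits<double>::infinity();
   double bestMean = 0.0;
   for (std::size_t start = 0; start + h <= n; ++start) {
      double sum = 0.0;
      for (std::size_t k = start; k < start + h; ++k)
         sum += data[k];
      const double mean = sum / static_cast<double>(h);
      double ss = 0.0;
      for (std::size_t k = start; k < start + h; ++k)
         ss += (data[k] - mean) * (data[k] - mean);
      const double var = ss / static_cast<double>(h - 1);
      if (var < bestVar) {
         bestVar = var;
         bestMean = mean;
      }
   }
   return {bestMean, std::sqrt(bestVar)};
}
```

For the sample {1, 2, 3, 4, 5, 100, 200} with h = 5, the selected subset is {1, 2, 3, 4, 5}, so the two gross outliers do not affect the result.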
Int_t TRobustEstimator::Exact | ( | Double_t * | ndist | ) |
protected |
for the exact fit situations returns number of observations on the hyperplane
Definition at line 1036 of file TRobustEstimator.cxx.
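Int_t TRobustEstimator::Exact2 | ( | TMatrixD & | mstockbig, |
TMatrixD & | cstockbig, | ||
TMatrixD & | hyperplane, | ||
Double_t * | deti, | ||
Int_t | nbest, | ||
Int_t | kgroup, | ||
TMatrixD & | sscp, | ||
Double_t * | ndist |
||
) |
protected |
This function is called if the determinant of the covariance matrix of a subset is 0.
If there are more than fH vectors on a hyperplane, returns this hyperplane and stops; otherwise stores the hyperplane coordinates in the hyperplane matrix.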
Definition at line 1071 of file TRobustEstimator.cxx.
Int_t TRobustEstimator::GetBDPoint | ( | ) |
returns the breakdown point of the algorithm
Definition at line 674 of file TRobustEstimator.cxx.
Double_t TRobustEstimator::GetChiQuant | ( | Int_t | i | ) | const |
returns the chi2 quantiles
Definition at line 684 of file TRobustEstimator.cxx.
void TRobustEstimator::GetCorrelation | ( | TMatrixDSym & | matr | ) |
returns the correlation matrix
Definition at line 705 of file TRobustEstimator.cxx.
const TMatrixDSym * TRobustEstimator::GetCorrelation | ( | ) | const |
inline |
Definition at line 94 of file TRobustEstimator.h.
void TRobustEstimator::GetCovariance | ( | TMatrixDSym & | matr | ) |
returns the covariance matrix
Definition at line 693 of file TRobustEstimator.cxx.
const TMatrixDSym * TRobustEstimator::GetCovariance | ( | ) | const |
inline |
Definition at line 92 of file TRobustEstimator.h.
const TMatrixD & TRobustEstimator::GetData | ( | ) |
inline |
returns a reference to the data matrix
Definition at line 89 of file TRobustEstimator.h.
void TRobustEstimator::GetHyperplane | ( | TVectorD & | vec | ) |
if the points are on a hyperplane, returns this hyperplane
Definition at line 730 of file TRobustEstimator.cxx.
const TVectorD * TRobustEstimator::GetHyperplane | ( | ) | const |
if the points are on a hyperplane, returns this hyperplane
Definition at line 717 of file TRobustEstimator.cxx.
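One common way to obtain hyperplane coefficients from a singular scatter matrix is to take the eigenvector that belongs to its smallest (zero) eigenvalue. The two-dimensional sketch below illustrates this; whether TRobustEstimator proceeds in exactly this way is an assumption, and the names `Line2D` and `SmallestEigenvector` are hypothetical.

```cpp
#include <cmath>

// Coefficients of the line a*(x - mean_x) + b*(y - mean_y) = 0.
struct Line2D { double a, b; };

// Unit eigenvector for the smallest eigenvalue of the symmetric 2x2 matrix
// [[sxx, sxy], [sxy, syy]]. If the matrix is singular, that eigenvalue is 0
// and the eigenvector is normal to the line containing the points.
Line2D SmallestEigenvector(double sxx, double sxy, double syy)
{
   const double half = 0.5 * (sxx - syy);
   const double lambda = 0.5 * (sxx + syy) - std::sqrt(half * half + sxy * sxy);
   double a, b;
   if (sxy != 0.0) {
      a = sxy;
      b = lambda - sxx;
   } else if (sxx <= syy) {
      a = 1.0;   // diagonal matrix, smallest eigenvalue is sxx
      b = 0.0;
   } else {
      a = 0.0;   // diagonal matrix, smallest eigenvalue is syy
      b = 1.0;
   }
   const double norm = std::hypot(a, b);
   return {a / norm, b / norm};
}
```

For points on the line y = 2x + 1 at x = 0, 1, 2, the covariance matrix is [[1, 2], [2, 4]]; the resulting vector is proportional to (2, -1), which is orthogonal to the line's direction (1, 2).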
void TRobustEstimator::GetMean | ( | TVectorD & | means | ) |
return the estimate of the mean
Definition at line 746 of file TRobustEstimator.cxx.
const TVectorD * TRobustEstimator::GetMean | ( | ) | const |
inline |
Definition at line 99 of file TRobustEstimator.h.
Int_t TRobustEstimator::GetNHyp | ( | ) |
inline |
Definition at line 97 of file TRobustEstimator.h.
Int_t TRobustEstimator::GetNOut | ( | ) |
returns the number of outliers
Definition at line 770 of file TRobustEstimator.cxx.
Int_t TRobustEstimator::GetNumberObservations | ( | ) | const |
inline |
Definition at line 102 of file TRobustEstimator.h.
Int_t TRobustEstimator::GetNvar | ( | ) | const |
inline |
Definition at line 103 of file TRobustEstimator.h.
const TArrayI * TRobustEstimator::GetOuliers | ( | ) | const |
inline |
Definition at line 104 of file TRobustEstimator.h.
void TRobustEstimator::GetRDistances | ( | TVectorD & | rdist | ) |
returns the robust distances (helps to find outliers)
Definition at line 758 of file TRobustEstimator.cxx.
const TVectorD * TRobustEstimator::GetRDistances | ( | ) | const |
inline |
Definition at line 101 of file TRobustEstimator.h.
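Double_t TRobustEstimator::KOrdStat | ( | Int_t | ntotal, |
Double_t * | arr, | ||
Int_t | k, | ||
Int_t * | work |
||
) |
protected |
Returns the k-th order statistic of arr; a separate version is kept here because an Int_t work array is needed.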
Definition at line 1267 of file TRobustEstimator.cxx.
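Judging by its name, this helper selects the k-th smallest element of an array. A minimal standard-library sketch of that operation is shown below; the function name `KthSmallest` and the zero-based meaning of k are assumptions of this example, not statements about the class.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Return the k-th smallest element (k = 0 gives the minimum).
// The vector is taken by value so the caller's data is not reordered.
double KthSmallest(std::vector<double> values, std::size_t k)
{
   if (k >= values.size())
      throw std::out_of_range("k must be smaller than the number of values");
   // nth_element partially sorts so that position k holds the element
   // that would be there in a fully sorted sequence.
   std::nth_element(values.begin(),
                    values.begin() + static_cast<std::ptrdiff_t>(k),
                    values.end());
   return values[k];
}
```

Selection of this kind is what picks the h smallest distances without sorting the whole array.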
Int_t TRobustEstimator::Partition | ( | Int_t | nmini, |
Int_t * | indsubdat |
||
) |
protected |
Divides the elements into approximately equal subgroups. The number of elements in each subgroup is stored in indsubdat; the number of subgroups is returned.
Definition at line 1118 of file TRobustEstimator.cxx.
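Splitting n elements into approximately equal subgroups can be sketched as below. The rule the class uses to choose the number of subgroups from nmini is not reproduced here; this hypothetical helper `EqualGroupSizes` takes the number of groups as an input and only shows how the sizes can be balanced.

```cpp
#include <stdexcept>
#include <vector>

// Split n elements into ngroup subgroups whose sizes differ by at most one.
// Returns the size of each subgroup.
std::vector<int> EqualGroupSizes(int n, int ngroup)
{
   if (n <= 0 || ngroup <= 0 || ngroup > n)
      throw std::invalid_argument("need 0 < ngroup <= n");
   std::vector<int> sizes(ngroup, n / ngroup);
   // Hand out the remainder one element at a time to the first groups.
   for (int g = 0; g < n % ngroup; ++g)
      ++sizes[g];
   return sizes;
}
```

For n = 10 and three groups the sizes are 4, 3 and 3.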
Int_t TRobustEstimator::RDist | ( | TMatrixD & | sscp | ) |
protected |
Calculates robust distances. The samples with robust distances greater than a cutoff value (the 0.975 quantile of the chi2 distribution with fNvar degrees of freedom, multiplied by a correction factor) are then given weight 0, and new, reweighted estimates of location and scatter are calculated. The function returns the number of outliers.
Definition at line 1172 of file TRobustEstimator.cxx.
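The flagging step can be illustrated in two dimensions, where the chi2 quantile has the closed form q_p = -2 ln(1 - p). This is a sketch under stated assumptions: the function name `FlagOutliers2D` is hypothetical, the squared distance is compared with the quantile, the correction factor is passed in by the caller, and the covariance matrix is assumed to be nonsingular. The reweighting step is not shown.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Flag the points whose squared robust distance exceeds the 0.975 quantile
// of the chi2 distribution with 2 degrees of freedom, times a correction.
// (mean, sxx, sxy, syy) are the robust location and scatter estimates.
std::vector<bool> FlagOutliers2D(const std::vector<Point> &data, Point mean,
                                 double sxx, double sxy, double syy,
                                 double correction = 1.0)
{
   const double det = sxx * syy - sxy * sxy;   // assumed > 0
   // chi2 quantile for 2 degrees of freedom, about 7.378 for p = 0.975.
   const double cutoff = -2.0 * std::log(1.0 - 0.975) * correction;

   std::vector<bool> isOutlier(data.size(), false);
   for (std::size_t i = 0; i < data.size(); ++i) {
      const double dx = data[i].x - mean.x, dy = data[i].y - mean.y;
      const double d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det;
      isOutlier[i] = d2 > cutoff;
   }
   return isOutlier;
}
```

With zero mean and the identity matrix as scatter, the point (3, 0) has squared distance 9 and is flagged, while (1, 1) with squared distance 2 is not.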
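void TRobustEstimator::RDraw | ( | Int_t * | subdat, |
Int_t | ngroup, | ||
Int_t * | indsubdat |
||
) |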
protected |
Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case numbers are uniformly distributed from 1 to n.
Definition at line 1235 of file TRobustEstimator.cxx.
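One simple way to draw nonoverlapping subdatasets is to shuffle the case numbers once and cut the shuffled sequence into consecutive pieces. The sketch below does this with the standard library; it is not the class's implementation, the function name `DrawGroups` is hypothetical, and case numbers are zero-based here rather than running from 1 to n.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <stdexcept>
#include <vector>

// Draw nonoverlapping groups with the given sizes from the case numbers
// 0 .. n-1. Every case number is equally likely to land in any position.
std::vector<std::vector<int>> DrawGroups(int n, const std::vector<int> &sizes,
                                         std::mt19937 &rng)
{
   const int total = std::accumulate(sizes.begin(), sizes.end(), 0);
   if (n < 0 || total > n)
      throw std::invalid_argument("group sizes exceed n");

   std::vector<int> perm(n);
   std::iota(perm.begin(), perm.end(), 0);
   std::shuffle(perm.begin(), perm.end(), rng);

   // Consecutive slices of a permutation cannot share an element.
   std::vector<std::vector<int>> groups;
   auto it = perm.begin();
   for (int s : sizes) {
      groups.emplace_back(it, it + s);
      it += s;
   }
   return groups;
}
```

Passing the random engine in by reference keeps the draw reproducible when the caller fixes the seed.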
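TMatrixDSym TRobustEstimator::fCorrelation |
protected |
Definition at line 39 of file TRobustEstimator.h.
TMatrixDSym TRobustEstimator::fCovariance |
protected |
Definition at line 37 of file TRobustEstimator.h.
TMatrixD TRobustEstimator::fData |
protected |
Definition at line 46 of file TRobustEstimator.h.
Int_t TRobustEstimator::fExact |
protected |
Definition at line 34 of file TRobustEstimator.h.
Int_t TRobustEstimator::fH |
protected |
Definition at line 28 of file TRobustEstimator.h.
TVectorD TRobustEstimator::fHyperplane |
protected |
Definition at line 43 of file TRobustEstimator.h.
TMatrixDSym TRobustEstimator::fInvcovariance |
protected |
Definition at line 38 of file TRobustEstimator.h.
TVectorD TRobustEstimator::fMean |
protected |
Definition at line 36 of file TRobustEstimator.h.
Int_t TRobustEstimator::fN |
protected |
Definition at line 29 of file TRobustEstimator.h.
Int_t TRobustEstimator::fNvar |
protected |
Definition at line 27 of file TRobustEstimator.h.
TArrayI TRobustEstimator::fOut |
protected |
Definition at line 42 of file TRobustEstimator.h.
TVectorD TRobustEstimator::fRd |
protected |
Definition at line 40 of file TRobustEstimator.h.
TVectorD TRobustEstimator::fSd |
protected |
Definition at line 41 of file TRobustEstimator.h.
Int_t TRobustEstimator::fVarTemp |
protected |
Definition at line 31 of file TRobustEstimator.h.
Int_t TRobustEstimator::fVecTemp |
protected |
Definition at line 32 of file TRobustEstimator.h.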