ROOT 6.13/01
Reference Guide
TRobustEstimator Class Reference

Minimum Covariance Determinant Estimator - a fast algorithm invented by Peter J. Rousseeuw and Katrien Van Driessen: "A Fast Algorithm for the Minimum Covariance Determinant Estimator", Technometrics, August 1999, Vol. 41, No. 3.

What are robust estimators? "An important property of an estimator is its robustness. An estimator is called robust if it is insensitive to measurements that deviate from the expected behaviour. There are 2 ways to treat such deviating measurements: one may either try to recognise them and then remove them from the data sample; or one may leave them in the sample, taking care that they do not influence the estimate unduly. In both cases robust estimators are needed...Robust procedures compensate for systematic errors as much as possible, and indicate any situation in which a danger of not being able to operate reliably is detected." R.Fruhwirth, M.Regler, R.K.Bock, H.Grote, D.Notz "Data Analysis Techniques for High-Energy Physics", 2nd edition

What does this algorithm do? It computes a highly robust estimator of multivariate location and scatter. Then, it takes those estimates to compute robust distances of all the data vectors. Those with large robust distances are considered outliers. Robust distances can then be plotted for better visualization of the data.

How does this algorithm do it? The MCD objective is to find h observations (out of n) whose classical covariance matrix has the lowest determinant. The MCD estimate of location is then the average of those h points, and the MCD estimate of scatter is their covariance matrix. The minimum (and default) value is h = (n+nvariables+1)/2, so the algorithm is effective when fewer than n-h+1 observations, that is, roughly half of the sample, are outliers. The algorithm also allows for exact fit situations, that is, when h or more observations lie on a hyperplane. Then the algorithm still yields the MCD location T and scatter matrix S, the latter being singular as it should be. From (T,S) the program then computes the equation of the hyperplane.

How can this algorithm be used? It can be used whenever contamination of the data is suspected that might influence the classical estimates. Robust estimation of location and scatter is also a tool to robustify other multivariate techniques such as principal-component analysis and discriminant analysis.

Technical details of the algorithm:

  1. The default is h = (n+nvariables+1)/2, but the user may choose any integer h with (n+nvariables+1)/2 <= h <= n. The program then reports the MCD's breakdown value (n-h+1)/n. If you are sure that the dataset contains less than 25% contamination, which is usually the case, a good compromise between breakdown value and efficiency is obtained by putting h = [.75*n].
  2. If h=n, the MCD location estimate is the average of the whole dataset, and the MCD scatter estimate is its covariance matrix. Report this and stop.
  3. If nvariables=1 (univariate data), compute the MCD estimate by the exact algorithm of Rousseeuw and Leroy (1987, pp. 171-172) in O(n log n) time and stop.
  4. From here on, h<n and nvariables>=2.
    1. If n is small:
      • repeat (say) 500 times:
        • construct an initial h-subset, starting from a random (nvar+1)-subset
        • carry out 2 C-steps (described in the comments of CStep function)
      • for the 10 results with lowest det(S):
        • carry out C-steps until convergence
      • report the solution (T, S) with the lowest det(S)
    2. If n is larger (say, n>600), then
      • construct up to 5 disjoint random subsets of size nsub (say, nsub=300)
      • inside each subset repeat 500/5 times:
        • construct an initial subset of size hsub=[nsub*h/n]
        • carry out 2 C-steps
        • keep the best 10 results (Tsub, Ssub)
      • pool the subsets, yielding the merged set (say, of size nmerged=1500)
      • in the merged set, repeat for each of the 50 solutions (Tsub, Ssub)
        • carry out 2 C-steps
        • keep the 10 best results
      • in the full dataset, repeat for those best results:
        • take several C-steps, using n and h
        • report the best final result (T, S)
  5. To obtain consistency when the data come from a multivariate normal distribution, the covariance matrix is multiplied by a correction factor.
  6. Robust distances for all elements are calculated using the final (T, S). Then the very final mean and covariance estimates are calculated only from the observations whose robust distances are less than a cutoff value (the 0.975 quantile of the chi2 distribution with nvariables degrees of freedom).
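
The choice of h in step 1 can be made concrete with a few lines of arithmetic. The following is a minimal sketch in plain C++ and is not part of TRobustEstimator; the function names are hypothetical, and it assumes that (n+nvariables+1)/2 uses integer division and that [.75*n] denotes the integer part.

```cpp
#include <cassert>

// Default subset size h = (n + nvariables + 1) / 2.
// Assumption: integer division, as is usual for Int_t arithmetic.
int DefaultH(int n, int nvariables) {
    assert(n > 0 && nvariables > 0);
    return (n + nvariables + 1) / 2;
}

// Breakdown value (n - h + 1) / n for a chosen subset size h.
double BreakdownValue(int n, int h) {
    assert(h >= 1 && h <= n);
    return static_cast<double>(n - h + 1) / n;
}

// Compromise choice h = [0.75 * n].
// Assumption: the brackets denote the integer part.
int CompromiseH(int n) {
    assert(n > 0);
    return (3 * n) / 4;
}
```

For n = 100 and nvariables = 2 this gives a default h of 51 with breakdown value 0.5, while the compromise h = 75 lowers the breakdown value to 0.26 in exchange for higher efficiency.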

Definition at line 23 of file TRobustEstimator.h.

Public Member Functions

 TRobustEstimator ()
 This constructor should be used in the univariate case: first call this constructor, then the EvaluateUni(..) function.
 
 TRobustEstimator (Int_t nvectors, Int_t nvariables, Int_t hh=0)
 Constructor.
 
virtual ~TRobustEstimator ()
 
void AddColumn (Double_t *col)
 Adds a column to the data matrix. It is assumed that the column has size fN. The variable fVarTemp keeps the number of columns already added.
 
void AddRow (Double_t *row)
 Adds a vector to the data matrix. It is assumed that the vector has size fNvar.
 
void Evaluate ()
 Finds the estimate of multivariate mean and variance.
 
void EvaluateUni (Int_t nvectors, Double_t *data, Double_t &mean, Double_t &sigma, Int_t hh=0)
 For the univariate case. Estimates of location and scatter are returned in the mean and sigma parameters. The algorithm works on the same principle as in the multivariate case: it finds a subset of size hh with the smallest sigma, and then returns the mean and sigma of this subset.
 
Int_t GetBDPoint ()
 Returns the breakdown point of the algorithm.
 
Double_t GetChiQuant (Int_t i) const
 Returns the chi2 quantiles.
 
void GetCorrelation (TMatrixDSym &matr)
 Returns the correlation matrix.
 
const TMatrixDSym * GetCorrelation () const
 
void GetCovariance (TMatrixDSym &matr)
 Returns the covariance matrix.
 
const TMatrixDSym * GetCovariance () const
 
const TMatrixD & GetData ()
 Returns a reference to the data matrix.
 
void GetHyperplane (TVectorD &vec)
 If the points are on a hyperplane, returns this hyperplane.
 
const TVectorD * GetHyperplane () const
 If the points are on a hyperplane, returns this hyperplane.
 
void GetMean (TVectorD &means)
 Returns the estimate of the mean.
 
const TVectorD * GetMean () const
 
Int_t GetNHyp ()
 
Int_t GetNOut ()
 Returns the number of outliers.
 
Int_t GetNumberObservations () const
 
Int_t GetNvar () const
 
const TArrayI * GetOuliers () const
 
void GetRDistances (TVectorD &rdist)
 Returns the robust distances (helps to find outliers).
 
const TVectorD * GetRDistances () const
 

Protected Member Functions

void AddToSscp (TMatrixD &sscp, TVectorD &vec)
 Updates the sscp matrix with vector vec.
 
void Classic ()
 Called when h=n.
 
void ClearSscp (TMatrixD &sscp)
 Clears the sscp matrix, used for covariance and mean calculation.
 
void Correl ()
 Transforms the covariance matrix into the correlation matrix.
 
void Covar (TMatrixD &sscp, TVectorD &m, TMatrixDSym &cov, TVectorD &sd, Int_t nvec)
 Calculates mean and covariance.
 
void CreateOrtSubset (TMatrixD &dat, Int_t *index, Int_t hmerged, Int_t nmerged, TMatrixD &sscp, Double_t *ndist)
 Creates a subset of hmerged vectors with the smallest orthogonal distances to the hyperplane hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0. This function is called when fewer than fH samples lie on a hyperplane.
 
void CreateSubset (Int_t ntotal, Int_t htotal, Int_t p, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist)
 Creates a subset of htotal elements from ntotal elements. First, p+1 elements are drawn randomly (without repetitions). If their covariance matrix is singular, more elements are added one by one, until their covariance matrix becomes regular or it becomes clear that htotal observations lie on a hyperplane. If the covariance matrix determinant != 0, distances of all ntotal elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean and S_inv is the inverse of the covariance matrix. The htotal points with the smallest distances are included in the returned subset.
 
Double_t CStep (Int_t ntotal, Int_t htotal, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist)
 From the input htotal-subset, constructs another htotal-subset with lower determinant.
 
Int_t Exact (Double_t *ndist)
 For exact fit situations, returns the number of observations on the hyperplane.
 
Int_t Exact2 (TMatrixD &mstockbig, TMatrixD &cstockbig, TMatrixD &hyperplane, Double_t *deti, Int_t nbest, Int_t kgroup, TMatrixD &sscp, Double_t *ndist)
 This function is called if the determinant of the covariance matrix of a subset is 0.
 
Double_t KOrdStat (Int_t ntotal, Double_t *arr, Int_t k, Int_t *work)
 Returns the k-th order statistic of arr. A class-local version is provided because an Int_t work array is needed.
 
Int_t Partition (Int_t nmini, Int_t *indsubdat)
 Divides the elements into approximately equal subgroups. The number of elements in each subgroup is stored in indsubdat. The number of subgroups is returned.
 
Int_t RDist (TMatrixD &sscp)
 Calculates robust distances. The samples with robust distances greater than a cutoff value (the 0.975 quantile of the chi2 distribution with fNvar degrees of freedom, multiplied by a correction factor) are then given weight 0, and new, reweighted estimates of location and scatter are calculated. The function returns the number of outliers.
 
void RDraw (Int_t *subdat, Int_t ngroup, Int_t *indsubdat)
 Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case numbers are uniformly distributed from 1 to n.
 

Protected Attributes

TMatrixDSym fCorrelation
 
TMatrixDSym fCovariance
 
TMatrixD fData
 
Int_t fExact
 
Int_t fH
 
TVectorD fHyperplane
 
TMatrixDSym fInvcovariance
 
TVectorD fMean
 
Int_t fN
 
Int_t fNvar
 
TArrayI fOut
 
TVectorD fRd
 
TVectorD fSd
 
Int_t fVarTemp
 
Int_t fVecTemp
 

#include <TRobustEstimator.h>


Constructor & Destructor Documentation

◆ TRobustEstimator() [1/2]

TRobustEstimator::TRobustEstimator ( )

This constructor should be used in the univariate case: first call this constructor, then the EvaluateUni(..) function.

Definition at line 124 of file TRobustEstimator.cxx.

◆ TRobustEstimator() [2/2]

TRobustEstimator::TRobustEstimator ( Int_t  nvectors,
Int_t  nvariables,
Int_t  hh = 0 
)

constructor

Definition at line 130 of file TRobustEstimator.cxx.

◆ ~TRobustEstimator()

virtual TRobustEstimator::~TRobustEstimator ( )
inline virtual

Definition at line 78 of file TRobustEstimator.h.

Member Function Documentation

◆ AddColumn()

void TRobustEstimator::AddColumn ( Double_t *  col)

Adds a column to the data matrix. It is assumed that the column has size fN. The variable fVarTemp keeps the number of columns already added.

Definition at line 170 of file TRobustEstimator.cxx.

◆ AddRow()

void TRobustEstimator::AddRow ( Double_t *  row)

Adds a vector to the data matrix. It is assumed that the vector has size fNvar.

Definition at line 191 of file TRobustEstimator.cxx.

◆ AddToSscp()

void TRobustEstimator::AddToSscp ( TMatrixD &  sscp,
TVectorD &  vec 
)
protected

update the sscp matrix with vector vec

Definition at line 778 of file TRobustEstimator.cxx.

◆ Classic()

void TRobustEstimator::Classic ( )
protected

called when h=n.

Returns classic covariance matrix and mean

Definition at line 808 of file TRobustEstimator.cxx.

◆ ClearSscp()

void TRobustEstimator::ClearSscp ( TMatrixD &  sscp)
protected

clear the sscp matrix, used for covariance and mean calculation

Definition at line 795 of file TRobustEstimator.cxx.

◆ Correl()

void TRobustEstimator::Correl ( )
protected

transforms covariance matrix into correlation matrix

Definition at line 849 of file TRobustEstimator.cxx.
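
As an illustration of what a covariance-to-correlation transformation involves, here is a minimal standalone sketch. It does not use ROOT's matrix classes; the `Matrix` alias and the function name are assumptions made for this example, and the standard definition corr(i,j) = cov(i,j) / sqrt(cov(i,i) * cov(j,j)) is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Converts a covariance matrix into a correlation matrix by dividing each
// entry by the product of the two corresponding standard deviations.
Matrix CovarianceToCorrelation(const Matrix &cov) {
    const std::size_t n = cov.size();
    for (std::size_t i = 0; i < n; ++i) {
        assert(cov[i].size() == n);  // matrix must be square
        assert(cov[i][i] > 0.0);     // variances must be strictly positive
    }
    Matrix corr(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            corr[i][j] = cov[i][j] / std::sqrt(cov[i][i] * cov[j][j]);
        }
    }
    return corr;
}
```

The diagonal of the result is 1 by construction, and the sketch rejects zero variances, which would arise in an exact fit situation.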

◆ Covar()

void TRobustEstimator::Covar ( TMatrixD &  sscp,
TVectorD &  m,
TMatrixDSym &  cov,
TVectorD &  sd,
Int_t  nvec 
)
protected

calculates mean and covariance

Definition at line 826 of file TRobustEstimator.cxx.
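
The idea of accumulating sums of squares and cross products (SSCP) and deriving the mean and covariance from them can be sketched as follows. This is a standalone illustration, not ROOT's implementation: the `Sscp` type, the matrix layout (count in entry (0,0), column sums in row and column 0, cross products in the remaining block) and the n-1 divisor are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical accumulator with an (nvar+1) x (nvar+1) layout.
struct Sscp {
    explicit Sscp(std::size_t nvar)
        : m(nvar + 1, std::vector<double>(nvar + 1, 0.0)) {}
    Matrix m;
};

// Adds one observation to the accumulated sums.
void AddVector(Sscp &s, const std::vector<double> &vec) {
    assert(vec.size() + 1 == s.m.size());
    s.m[0][0] += 1.0;  // number of observations so far
    for (std::size_t j = 0; j < vec.size(); ++j) {
        s.m[0][j + 1] += vec[j];  // column sums
        s.m[j + 1][0] += vec[j];
        for (std::size_t k = 0; k < vec.size(); ++k) {
            s.m[j + 1][k + 1] += vec[j] * vec[k];  // cross products
        }
    }
}

// Derives the mean vector and the sample covariance matrix from the sums.
void MeanAndCovariance(const Sscp &s, std::vector<double> &mean, Matrix &cov) {
    const std::size_t nvar = s.m.size() - 1;
    const double n = s.m[0][0];
    assert(n >= 2.0);
    mean.assign(nvar, 0.0);
    cov.assign(nvar, std::vector<double>(nvar, 0.0));
    for (std::size_t j = 0; j < nvar; ++j) {
        mean[j] = s.m[0][j + 1] / n;
    }
    for (std::size_t j = 0; j < nvar; ++j) {
        for (std::size_t k = 0; k < nvar; ++k) {
            cov[j][k] = (s.m[j + 1][k + 1] - n * mean[j] * mean[k]) / (n - 1.0);
        }
    }
}
```

Accumulating sums this way lets a subset's estimates be computed in one pass over its members, at the cost of some numerical cancellation when the mean is large relative to the spread.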

◆ CreateOrtSubset()

void TRobustEstimator::CreateOrtSubset ( TMatrixD &  dat,
Int_t *  index,
Int_t  hmerged,
Int_t  nmerged,
TMatrixD &  sscp,
Double_t *  ndist 
)
protected

Creates a subset of hmerged vectors with the smallest orthogonal distances to the hyperplane hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0. This function is called when fewer than fH samples lie on a hyperplane.

Definition at line 967 of file TRobustEstimator.cxx.

◆ CreateSubset()

void TRobustEstimator::CreateSubset ( Int_t  ntotal,
Int_t  htotal,
Int_t  p,
Int_t *  index,
TMatrixD &  data,
TMatrixD &  sscp,
Double_t *  ndist 
)
protected

Creates a subset of htotal elements from ntotal elements. First, p+1 elements are drawn randomly (without repetitions). If their covariance matrix is singular, more elements are added one by one, until their covariance matrix becomes regular or it becomes clear that htotal observations lie on a hyperplane. If the covariance matrix determinant != 0, distances of all ntotal elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean and S_inv is the inverse of the covariance matrix. The htotal points with the smallest distances are included in the returned subset.

Definition at line 877 of file TRobustEstimator.cxx.

◆ CStep()

Double_t TRobustEstimator::CStep ( Int_t  ntotal,
Int_t  htotal,
Int_t *  index,
TMatrixD &  data,
TMatrixD &  sscp,
Double_t *  ndist 
)
protected

From the input htotal-subset, constructs another htotal-subset with lower determinant.

As proven by Peter J. Rousseeuw and Katrien Van Driessen, if distances for all elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean of the input htotal-subset and S_inv is the inverse of its covariance matrix, then the htotal elements with the smallest distances will have a covariance matrix whose determinant is less than or equal to the determinant of the input subset's covariance matrix.

The determinant for this htotal-subset with the smallest distances is returned.

Definition at line 999 of file TRobustEstimator.cxx.
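
A single C-step can be sketched for two-dimensional data, where the covariance matrix can be inverted in closed form. This is a standalone illustration and not the ROOT code: `Point`, `Estimate`, `MeanCov` and `CStep2D` are hypothetical names, the n-1 divisor is an assumption, and the singular (exact fit) case is not handled. Squared distances are ranked instead of distances, which gives the same ordering.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Point = std::array<double, 2>;

// Mean and covariance entries of a 2D subset (illustrative type).
struct Estimate {
    Point mean;
    double sxx, sxy, syy;
    double Det() const { return sxx * syy - sxy * sxy; }
};

// Classical mean and covariance of the points selected by `subset`.
Estimate MeanCov(const std::vector<Point> &data,
                 const std::vector<std::size_t> &subset) {
    assert(subset.size() >= 2);
    const double n = static_cast<double>(subset.size());
    Estimate e{};  // all members start at zero
    for (std::size_t i : subset) {
        e.mean[0] += data[i][0];
        e.mean[1] += data[i][1];
    }
    e.mean[0] /= n;
    e.mean[1] /= n;
    for (std::size_t i : subset) {
        const double dx = data[i][0] - e.mean[0];
        const double dy = data[i][1] - e.mean[1];
        e.sxx += dx * dx;
        e.sxy += dx * dy;
        e.syy += dy * dy;
    }
    e.sxx /= n - 1.0;
    e.sxy /= n - 1.0;
    e.syy /= n - 1.0;
    return e;
}

// One C-step: rank all points by squared distance to the current subset's
// (mean, covariance) and return the indices of the h closest points.
std::vector<std::size_t> CStep2D(const std::vector<Point> &data,
                                 const std::vector<std::size_t> &subset) {
    assert(subset.size() <= data.size());
    const Estimate e = MeanCov(data, subset);
    const double det = e.Det();
    assert(det > 0.0);  // singular covariance is not handled in this sketch
    std::vector<double> d2(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        const double dx = data[i][0] - e.mean[0];
        const double dy = data[i][1] - e.mean[1];
        // (x - M)^T S_inv (x - M) using the closed-form 2x2 inverse
        d2[i] = (e.syy * dx * dx - 2.0 * e.sxy * dx * dy + e.sxx * dy * dy) / det;
    }
    std::vector<std::size_t> order(data.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    const auto h = static_cast<std::ptrdiff_t>(subset.size());
    std::partial_sort(order.begin(), order.begin() + h, order.end(),
                      [&d2](std::size_t a, std::size_t b) { return d2[a] < d2[b]; });
    order.resize(subset.size());
    return order;
}
```

Calling `CStep2D` repeatedly until the determinant stops decreasing corresponds to "carry out C-steps until convergence" in the algorithm outline.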

◆ Evaluate()

void TRobustEstimator::Evaluate ( )

Finds the estimate of multivariate mean and variance.

Definition at line 208 of file TRobustEstimator.cxx.

◆ EvaluateUni()

void TRobustEstimator::EvaluateUni ( Int_t  nvectors,
Double_t *  data,
Double_t &  mean,
Double_t &  sigma,
Int_t  hh = 0 
)

For the univariate case. Estimates of location and scatter are returned in the mean and sigma parameters. The algorithm works on the same principle as in the multivariate case: it finds a subset of size hh with the smallest sigma, and then returns the mean and sigma of this subset.

Definition at line 608 of file TRobustEstimator.cxx.
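
The principle of finding the size-h subset with the smallest sigma can be sketched for univariate data: after sorting, the optimal subset is a contiguous window, so it is enough to scan all windows of h consecutive values. This is a standalone illustration under stated assumptions, not the ROOT code: `UnivariateMcd` and `UniEstimate` are hypothetical names, the h-1 divisor is an assumption, and any correction factor or reweighting that EvaluateUni may apply is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct UniEstimate {
    double mean;
    double sigma;
};

// Scans every window of h consecutive sorted values and keeps the one with
// the smallest sum of squared deviations from its own mean.
UniEstimate UnivariateMcd(std::vector<double> data, std::size_t h) {
    const std::size_t n = data.size();
    assert(h >= 2 && h <= n);
    std::sort(data.begin(), data.end());

    // Prefix sums make each window's mean and spread an O(1) computation.
    std::vector<double> sum(n + 1, 0.0), sumSq(n + 1, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        sum[i + 1] = sum[i] + data[i];
        sumSq[i + 1] = sumSq[i] + data[i] * data[i];
    }

    const double hd = static_cast<double>(h);
    double bestSpread = std::numeric_limits<double>::infinity();
    double bestMean = 0.0;
    for (std::size_t start = 0; start + h <= n; ++start) {
        const double mean = (sum[start + h] - sum[start]) / hd;
        const double spread = (sumSq[start + h] - sumSq[start]) - hd * mean * mean;
        if (spread < bestSpread) {
            bestSpread = spread;
            bestMean = mean;
        }
    }
    // Clamp at zero to guard against tiny negative values from rounding.
    return {bestMean, std::sqrt(std::max(bestSpread, 0.0) / (hd - 1.0))};
}
```

The sort dominates the cost, which matches the O(n log n) running time quoted for the exact univariate algorithm.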

◆ Exact()

Int_t TRobustEstimator::Exact ( Double_t *  ndist)
protected

for the exact fit situations returns number of observations on the hyperplane

Definition at line 1036 of file TRobustEstimator.cxx.

◆ Exact2()

Int_t TRobustEstimator::Exact2 ( TMatrixD &  mstockbig,
TMatrixD &  cstockbig,
TMatrixD &  hyperplane,
Double_t *  deti,
Int_t  nbest,
Int_t  kgroup,
TMatrixD &  sscp,
Double_t *  ndist 
)
protected

This function is called if the determinant of the covariance matrix of a subset is 0.

If there are more than fH vectors on a hyperplane, it returns this hyperplane and stops; otherwise it stores the hyperplane coordinates in the hyperplane matrix.

Definition at line 1071 of file TRobustEstimator.cxx.

◆ GetBDPoint()

Int_t TRobustEstimator::GetBDPoint ( )

returns the breakdown point of the algorithm

Definition at line 674 of file TRobustEstimator.cxx.

◆ GetChiQuant()

Double_t TRobustEstimator::GetChiQuant ( Int_t  i) const

returns the chi2 quantiles

Definition at line 684 of file TRobustEstimator.cxx.

◆ GetCorrelation() [1/2]

void TRobustEstimator::GetCorrelation ( TMatrixDSym &  matr)

returns the correlation matrix

Definition at line 705 of file TRobustEstimator.cxx.

◆ GetCorrelation() [2/2]

const TMatrixDSym* TRobustEstimator::GetCorrelation ( ) const
inline

Definition at line 94 of file TRobustEstimator.h.

◆ GetCovariance() [1/2]

void TRobustEstimator::GetCovariance ( TMatrixDSym &  matr)

returns the covariance matrix

Definition at line 693 of file TRobustEstimator.cxx.

◆ GetCovariance() [2/2]

const TMatrixDSym* TRobustEstimator::GetCovariance ( ) const
inline

Definition at line 92 of file TRobustEstimator.h.

◆ GetData()

const TMatrixD& TRobustEstimator::GetData ( )
inline

returns a reference to the data matrix

Definition at line 89 of file TRobustEstimator.h.

◆ GetHyperplane() [1/2]

void TRobustEstimator::GetHyperplane ( TVectorD &  vec)

if the points are on a hyperplane, returns this hyperplane

Definition at line 730 of file TRobustEstimator.cxx.

◆ GetHyperplane() [2/2]

const TVectorD * TRobustEstimator::GetHyperplane ( ) const

if the points are on a hyperplane, returns this hyperplane

Definition at line 717 of file TRobustEstimator.cxx.

◆ GetMean() [1/2]

void TRobustEstimator::GetMean ( TVectorD &  means)

return the estimate of the mean

Definition at line 746 of file TRobustEstimator.cxx.

◆ GetMean() [2/2]

const TVectorD* TRobustEstimator::GetMean ( ) const
inline

Definition at line 99 of file TRobustEstimator.h.

◆ GetNHyp()

Int_t TRobustEstimator::GetNHyp ( )
inline

Definition at line 97 of file TRobustEstimator.h.

◆ GetNOut()

Int_t TRobustEstimator::GetNOut ( )

returns the number of outliers

Definition at line 770 of file TRobustEstimator.cxx.

◆ GetNumberObservations()

Int_t TRobustEstimator::GetNumberObservations ( ) const
inline

Definition at line 102 of file TRobustEstimator.h.

◆ GetNvar()

Int_t TRobustEstimator::GetNvar ( ) const
inline

Definition at line 103 of file TRobustEstimator.h.

◆ GetOuliers()

const TArrayI* TRobustEstimator::GetOuliers ( ) const
inline

Definition at line 104 of file TRobustEstimator.h.

◆ GetRDistances() [1/2]

void TRobustEstimator::GetRDistances ( TVectorD &  rdist)

returns the robust distances (helps to find outliers)

Definition at line 758 of file TRobustEstimator.cxx.

◆ GetRDistances() [2/2]

const TVectorD* TRobustEstimator::GetRDistances ( ) const
inline

Definition at line 101 of file TRobustEstimator.h.

◆ KOrdStat()

Double_t TRobustEstimator::KOrdStat ( Int_t  ntotal,
Double_t *  arr,
Int_t  k,
Int_t *  work 
)
protected

Returns the k-th order statistic of arr. A class-local version is provided because an Int_t work array is needed.

Definition at line 1267 of file TRobustEstimator.cxx.
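
Assuming, as its name suggests, that KOrdStat selects an order statistic, the same kind of selection can be sketched with the C++ standard library. This is not the ROOT implementation: `KthSmallest` is a hypothetical name, it needs no work array, and counting k from 0 is an assumption of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the k-th smallest value (k counted from 0) without a full sort.
// The input is taken by value so the caller's data is left untouched.
double KthSmallest(std::vector<double> values, std::size_t k) {
    assert(k < values.size());
    const auto kth = values.begin() + static_cast<std::ptrdiff_t>(k);
    std::nth_element(values.begin(), kth, values.end());
    return values[k];
}
```

Selection of this kind is what allows the h smallest distances to be found in linear average time rather than by sorting all n of them.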

◆ Partition()

Int_t TRobustEstimator::Partition ( Int_t  nmini,
Int_t *  indsubdat 
)
protected

Divides the elements into approximately equal subgroups. The number of elements in each subgroup is stored in indsubdat. The number of subgroups is returned.

Definition at line 1118 of file TRobustEstimator.cxx.
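
Dividing n elements into approximately equal subgroups can be sketched generically as below. This does not reproduce ROOT's own rule, which depends on nmini and chooses the number of subgroups itself; here the number of groups is passed in, and `SplitSizes` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Splits n elements into ngroup groups whose sizes differ by at most one.
std::vector<int> SplitSizes(int n, int ngroup) {
    assert(ngroup > 0 && n >= ngroup);
    std::vector<int> sizes(static_cast<std::size_t>(ngroup), n / ngroup);
    // Hand out the remainder one element at a time to the first groups.
    for (int i = 0; i < n % ngroup; ++i) {
        ++sizes[static_cast<std::size_t>(i)];
    }
    return sizes;
}
```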

◆ RDist()

Int_t TRobustEstimator::RDist ( TMatrixD &  sscp)
protected

Calculates robust distances. The samples with robust distances greater than a cutoff value (the 0.975 quantile of the chi2 distribution with fNvar degrees of freedom, multiplied by a correction factor) are then given weight 0, and new, reweighted estimates of location and scatter are calculated. The function returns the number of outliers.

Definition at line 1172 of file TRobustEstimator.cxx.
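
The cutoff-and-reweight idea can be sketched for the special case of two variables, where the chi2 quantile has a closed form: the chi2 distribution with 2 degrees of freedom has CDF 1 - exp(-x/2), so its p-quantile is -2 ln(1 - p). This is a standalone illustration, not the ROOT code. The function names are hypothetical, the comparison is made on squared distances (an assumption), and the correction factor mentioned above is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// p-quantile of the chi2 distribution with 2 degrees of freedom.
double Chi2Quantile2Dof(double p) {
    assert(p > 0.0 && p < 1.0);
    return -2.0 * std::log(1.0 - p);
}

// Assigns weight 0 to observations whose squared robust distance exceeds
// the cutoff and weight 1 to all others.
std::vector<int> OutlierWeights(const std::vector<double> &squaredDistances,
                                double cutoff) {
    std::vector<int> weights;
    weights.reserve(squaredDistances.size());
    for (double d2 : squaredDistances) {
        weights.push_back(d2 > cutoff ? 0 : 1);
    }
    return weights;
}

// Counts the observations that received weight 0.
int CountOutliers(const std::vector<int> &weights) {
    int count = 0;
    for (int w : weights) {
        if (w == 0) ++count;
    }
    return count;
}
```

The reweighted mean and covariance would then be computed from the observations with weight 1 only.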

◆ RDraw()

void TRobustEstimator::RDraw ( Int_t *  subdat,
Int_t  ngroup,
Int_t *  indsubdat 
)
protected

Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case numbers are uniformly distributed from 1 to n.

Definition at line 1235 of file TRobustEstimator.cxx.
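
Drawing nonoverlapping subdatasets can be sketched by shuffling the case numbers 1 to n once and cutting the shuffled sequence into consecutive pieces. This is a standalone illustration and not necessarily how RDraw proceeds; `DrawDisjointSubsets` and its `seed` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Returns one vector of case numbers (from 1 to n) per requested group size.
// No case number appears in more than one group.
std::vector<std::vector<int>> DrawDisjointSubsets(int n,
                                                  const std::vector<int> &sizes,
                                                  unsigned seed) {
    const int total = std::accumulate(sizes.begin(), sizes.end(), 0);
    assert(n > 0 && total <= n);
    std::vector<int> cases(static_cast<std::size_t>(n));
    std::iota(cases.begin(), cases.end(), 1);  // case numbers 1..n
    std::mt19937 gen(seed);
    std::shuffle(cases.begin(), cases.end(), gen);

    std::vector<std::vector<int>> groups;
    auto it = cases.begin();
    for (int s : sizes) {
        groups.emplace_back(it, it + s);  // next s shuffled case numbers
        it += s;
    }
    return groups;
}
```

Because every permutation is equally likely after the shuffle, each case number has the same chance of landing in any given group.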

Member Data Documentation

◆ fCorrelation

TMatrixDSym TRobustEstimator::fCorrelation
protected

Definition at line 39 of file TRobustEstimator.h.

◆ fCovariance

TMatrixDSym TRobustEstimator::fCovariance
protected

Definition at line 37 of file TRobustEstimator.h.

◆ fData

TMatrixD TRobustEstimator::fData
protected

Definition at line 46 of file TRobustEstimator.h.

◆ fExact

Int_t TRobustEstimator::fExact
protected

Definition at line 34 of file TRobustEstimator.h.

◆ fH

Int_t TRobustEstimator::fH
protected

Definition at line 28 of file TRobustEstimator.h.

◆ fHyperplane

TVectorD TRobustEstimator::fHyperplane
protected

Definition at line 43 of file TRobustEstimator.h.

◆ fInvcovariance

TMatrixDSym TRobustEstimator::fInvcovariance
protected

Definition at line 38 of file TRobustEstimator.h.

◆ fMean

TVectorD TRobustEstimator::fMean
protected

Definition at line 36 of file TRobustEstimator.h.

◆ fN

Int_t TRobustEstimator::fN
protected

Definition at line 29 of file TRobustEstimator.h.

◆ fNvar

Int_t TRobustEstimator::fNvar
protected

Definition at line 27 of file TRobustEstimator.h.

◆ fOut

TArrayI TRobustEstimator::fOut
protected

Definition at line 42 of file TRobustEstimator.h.

◆ fRd

TVectorD TRobustEstimator::fRd
protected

Definition at line 40 of file TRobustEstimator.h.

◆ fSd

TVectorD TRobustEstimator::fSd
protected

Definition at line 41 of file TRobustEstimator.h.

◆ fVarTemp

Int_t TRobustEstimator::fVarTemp
protected

Definition at line 31 of file TRobustEstimator.h.

◆ fVecTemp

Int_t TRobustEstimator::fVecTemp
protected

Definition at line 32 of file TRobustEstimator.h.


The documentation for this class was generated from the following files: