Minimum Covariance Determinant Estimator - a fast algorithm invented by Peter J. Rousseeuw and Katrien Van Driessen: "A Fast Algorithm for the Minimum Covariance Determinant Estimator", Technometrics, August 1999, Vol. 41, No. 3.

What are robust estimators? "An important property of an estimator is its robustness. An estimator is called robust if it is insensitive to measurements that deviate from the expected behaviour. There are 2 ways to treat such deviating measurements: one may either try to recognise them and then remove them from the data sample; or one may leave them in the sample, taking care that they do not influence the estimate unduly. In both cases robust estimators are needed...Robust procedures compensate for systematic errors as much as possible, and indicate any situation in which a danger of not being able to operate reliably is detected." R. Frühwirth, M. Regler, R.K. Bock, H. Grote, D. Notz, "Data Analysis Techniques for High-Energy Physics", 2nd edition.

What does this algorithm do? It computes a highly robust estimator of multivariate location and scatter. It then uses those estimates to compute robust distances for all the data vectors. Vectors with large robust distances are considered outliers. The robust distances can then be plotted for better visualization of the data.

How does this algorithm do it? The MCD objective is to find h observations (out of n) whose classical covariance matrix has the lowest determinant. The MCD estimate of location is then the average of those h points, and the MCD estimate of scatter is their covariance matrix. The minimum (and default) value is h = (n+nvariables+1)/2, so the algorithm is effective as long as at least h observations are uncontaminated, that is, when at most n-h observations are outliers. The algorithm also allows for exact-fit situations, that is, when h or more observations lie on a hyperplane. In that case the algorithm still yields the MCD location T and scatter matrix S, the latter being singular, as it should be. From (T,S) the program then computes the equation of the hyperplane.
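Written in symbols, the objective described above might look as follows. The notation (H for an index set of size h, x_i for the observations, T_H and S_H for the mean and covariance of the selected points) is introduced here only for illustration and does not appear in the class itself.

```latex
\[
H^{*} = \operatorname*{arg\,min}_{H \subset \{1,\dots,n\},\; |H| = h} \det S_H,
\qquad
T_H = \frac{1}{h} \sum_{i \in H} x_i,
\qquad
S_H = \frac{1}{h} \sum_{i \in H} (x_i - T_H)(x_i - T_H)^{\top}
\]
```

The MCD estimates are then T = T_{H*} and S = S_{H*}. The normalisation of S_H (1/h here, 1/(h-1) in other conventions) rescales every determinant by the same factor, so it does not change which subset attains the minimum.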

How can this algorithm be used? It can be used whenever the data are suspected to contain contamination that might influence the classical estimates. Robust estimation of location and scatter is also a tool to robustify other multivariate techniques such as principal-component analysis and discriminant analysis.

Technical details of the algorithm:

The default is h = (n+nvariables+1)/2, but the user may choose any integer h with (n+nvariables+1)/2<=h<=n. The program then reports the MCD's breakdown value (n-h+1)/n. If you are sure that the dataset contains less than 25% contamination, which is usually the case, a good compromise between breakdown value and efficiency is obtained by putting h=[0.75*n].
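As a small illustration of this arithmetic, the sketch below computes the default h, the breakdown value (n-h+1)/n, and the compromise choice h=[0.75*n]. The function names are hypothetical and not part of TRobustEstimator, and integer division is assumed for the default h.

```cpp
#include <cassert>

// Default (and minimum) subset size h = (n + nvar + 1) / 2.
// Assumption: the division is an integer division.
int DefaultH(int n, int nvar) {
    assert(n > 0 && nvar > 0);
    return (n + nvar + 1) / 2;
}

// Breakdown value (n - h + 1) / n for a chosen subset size h.
double BreakdownValue(int n, int h) {
    assert(h >= 1 && h <= n);
    return static_cast<double>(n - h + 1) / n;
}

// Compromise choice h = [0.75 * n], i.e. the integer part of 0.75 * n.
int CompromiseH(int n) {
    assert(n > 0);
    return (3 * n) / 4;
}
```

For n=100 and nvariables=3, for instance, the default is h=52 with breakdown value 0.49, while h=75 lowers the breakdown value to 0.26 in exchange for higher efficiency.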
If h=n, the MCD location estimate is the average of the whole dataset, and the MCD scatter estimate is its covariance matrix. Report this and stop.
If nvariables=1 (univariate data), compute the MCD estimate by the exact algorithm of Rousseeuw and Leroy (1987, pp.171-172) in O(n log n) time and stop.
From here on, h<n and nvariables>=2.
1. If n is small:
  - repeat (say) 500 times:
    - construct an initial h-subset, starting from a random (nvar+1)-subset
    - carry out 2 C-steps (described in the comments of the CStep function)
  - for the 10 results with lowest det(S):
    - carry out C-steps until convergence
  - report the solution (T, S) with the lowest det(S)
2. If n is larger (say, n>600), then
  - construct up to 5 disjoint random subsets of size nsub (say, nsub=300)
  - inside each subset repeat 500/5 times:
    - construct an initial subset of size hsub=[nsub*h/n]
    - carry out 2 C-steps
    - keep the best 10 results (Tsub, Ssub)
  - pool the subsets, yielding the merged set (say, of size nmerged=1500)
  - in the merged set, repeat for each of the 50 solutions (Tsub, Ssub)
    - carry out 2 C-steps
    - keep the 10 best results
  - in the full dataset, repeat for those best results:
    - take several C-steps, using n and h
    - report the best final result (T, S)
To obtain consistency when the data come from a multivariate normal distribution, the covariance matrix is multiplied by a correction factor.
Robust distances for all elements are then calculated using the final (T, S). The very final mean and covariance estimates are calculated only from the observations whose robust distances are less than a cutoff value (the 0.975 quantile of the chi2 distribution with nvariables degrees of freedom).
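The final reweighting step can be illustrated with a minimal two-variable sketch. It is not the TRobustEstimator implementation: the type and function names are hypothetical, the squared distance is assumed to be the quantity compared with the chi2 quantile, the consistency correction factor is left out, and the closed-form quantile used here is valid only for 2 degrees of freedom.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::array<double, 2>;

// Location T and symmetric 2x2 scatter S = [[sxx, sxy], [sxy, syy]].
struct Estimate2D {
    Point mean;
    double sxx, sxy, syy;
};

// Squared robust distance (x - T)^T S^{-1} (x - T), written out for the 2x2 case.
double SquaredDistance(const Point& x, const Estimate2D& e) {
    const double det = e.sxx * e.syy - e.sxy * e.sxy;
    assert(det > 0.0);  // S must be non-singular
    const double dx = x[0] - e.mean[0];
    const double dy = x[1] - e.mean[1];
    return (e.syy * dx * dx - 2.0 * e.sxy * dx * dy + e.sxx * dy * dy) / det;
}

// 0.975 quantile of the chi2 distribution with 2 degrees of freedom.
// For 2 degrees of freedom the CDF is 1 - exp(-x/2), so the quantile is -2*ln(0.025).
double Chi2Cutoff2D() {
    return -2.0 * std::log(0.025);
}

// Indices of the observations whose squared distance exceeds the cutoff.
std::vector<std::size_t> FlagOutliers(const std::vector<Point>& data, const Estimate2D& e) {
    std::vector<std::size_t> flagged;
    const double cutoff = Chi2Cutoff2D();
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (SquaredDistance(data[i], e) > cutoff) flagged.push_back(i);
    }
    return flagged;
}

// Mean of the observations that were not flagged (the reweighted location).
Point ReweightedMean(const std::vector<Point>& data, const Estimate2D& e) {
    const double cutoff = Chi2Cutoff2D();
    Point sum{0.0, 0.0};
    std::size_t kept = 0;
    for (const Point& x : data) {
        if (SquaredDistance(x, e) <= cutoff) {
            sum[0] += x[0];
            sum[1] += x[1];
            ++kept;
        }
    }
    assert(kept > 0);
    return {sum[0] / kept, sum[1] / kept};
}
```

A reweighted covariance would be computed from the same retained observations in the same way.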

Definition at line 23 of file TRobustEstimator.h.

Public Member Functions
	TRobustEstimator ()
	this constructor should be used in the univariate case: first call this constructor, then the EvaluateUni(..) function

	TRobustEstimator (Int_t nvectors, Int_t nvariables, Int_t hh=0)
	constructor

virtual	~TRobustEstimator ()

void	AddColumn (Double_t *col)
	adds a column to the data matrix; the column is assumed to have size fN, and fVarTemp keeps the number of columns already added

void	AddRow (Double_t *row)
	adds a vector to the data matrix; the vector is assumed to have size fNvar

void	Evaluate ()
	Finds the estimate of multivariate mean and variance.

void	EvaluateUni (Int_t nvectors, Double_t *data, Double_t &mean, Double_t &sigma, Int_t hh=0)
	for the univariate case: estimates of location and scatter are returned in the mean and sigma parameters; the algorithm works on the same principle as in the multivariate case, finding a subset of size hh with the smallest sigma and returning the mean and sigma of this subset

Int_t	GetBDPoint ()
	returns the breakdown point of the algorithm

Double_t	GetChiQuant (Int_t i) const
	returns the chi2 quantiles

void	GetCorrelation (TMatrixDSym &matr)
	returns the correlation matrix

const TMatrixDSym *	GetCorrelation () const

void	GetCovariance (TMatrixDSym &matr)
	returns the covariance matrix

const TMatrixDSym *	GetCovariance () const

const TMatrixD &	GetData ()
	returns a reference to the data matrix

void	GetHyperplane (TVectorD &vec)
	if the points are on a hyperplane, returns this hyperplane

const TVectorD *	GetHyperplane () const
	if the points are on a hyperplane, returns this hyperplane

void	GetMean (TVectorD &means)
	returns the estimate of the mean

const TVectorD *	GetMean () const

Int_t	GetNHyp ()

Int_t	GetNOut ()
	returns the number of outliers

Int_t	GetNumberObservations () const

Int_t	GetNvar () const

const TArrayI *	GetOuliers () const

void	GetRDistances (TVectorD &rdist)
	returns the robust distances (helps to find outliers)

const TVectorD *	GetRDistances () const

Protected Member Functions
void	AddToSscp (TMatrixD &sscp, TVectorD &vec)
	update the sscp matrix with vector vec

void	Classic ()
	called when h=n.

void	ClearSscp (TMatrixD &sscp)
	clear the sscp matrix, used for covariance and mean calculation

void	Correl ()
	transforms covariance matrix into correlation matrix

void	Covar (TMatrixD &sscp, TVectorD &m, TMatrixDSym &cov, TVectorD &sd, Int_t nvec)
	calculates mean and covariance

void	CreateOrtSubset (TMatrixD &dat, Int_t *index, Int_t hmerged, Int_t nmerged, TMatrixD &sscp, Double_t *ndist)
	creates a subset of hmerged vectors with the smallest orthogonal distances to the hyperplane hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0; this function is called when fewer than fH samples lie on a hyperplane

void	CreateSubset (Int_t ntotal, Int_t htotal, Int_t p, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist)
	creates a subset of htotal elements from ntotal elements: first, p+1 elements are drawn randomly (without repetitions); if their covariance matrix is singular, more elements are added one by one until the covariance matrix becomes regular or it becomes clear that htotal observations lie on a hyperplane; if the covariance matrix determinant is nonzero, the distances of all ntotal elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean and S_inv is the inverse of the covariance matrix, and the htotal points with the smallest distances are included in the returned subset

Double_t	CStep (Int_t ntotal, Int_t htotal, Int_t *index, TMatrixD &data, TMatrixD &sscp, Double_t *ndist)
	from the input htotal-subset constructs another htotal-subset with lower determinant

Int_t	Exact (Double_t *ndist)
	for the exact fit situations returns number of observations on the hyperplane

Int_t	Exact2 (TMatrixD &mstockbig, TMatrixD &cstockbig, TMatrixD &hyperplane, Double_t *deti, Int_t nbest, Int_t kgroup, TMatrixD &sscp, Double_t *ndist)
	This function is called if the determinant of the covariance matrix of a subset is 0.

Double_t	KOrdStat (Int_t ntotal, Double_t *arr, Int_t k, Int_t *work)
	returns the k-th order statistic of arr; a separate version is kept in this class because an Int_t work array is needed

Int_t	Partition (Int_t nmini, Int_t *indsubdat)
	divides the elements into approximately equal subgroups; the number of elements in each subgroup is stored in indsubdat, and the number of subgroups is returned

Int_t	RDist (TMatrixD &sscp)
	Calculates robust distances. Then the samples with robust distances greater than a cutoff value (0.975 quantile of the chi2 distribution with fNvar degrees of freedom, multiplied by a correction factor) are given weight=0, and new, reweighted estimates of location and scatter are calculated. The function returns the number of outliers.

void	RDraw (Int_t *subdat, Int_t ngroup, Int_t *indsubdat)
	Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case numbers are uniformly distributed from 1 to n.

Protected Attributes
TMatrixDSym	fCorrelation

TMatrixDSym	fCovariance

TMatrixD	fData

Int_t	fExact

Int_t	fH

TVectorD	fHyperplane

TMatrixDSym	fInvcovariance

TVectorD	fMean

Int_t	fN

Int_t	fNvar

TArrayI	fOut

TVectorD	fRd

TVectorD	fSd

Int_t	fVarTemp

Int_t	fVecTemp

#include <TRobustEstimator.h>

Constructor & Destructor Documentation

◆ TRobustEstimator() [1/2]

TRobustEstimator::TRobustEstimator ( )

This constructor should be used in the univariate case: first call this constructor, then the EvaluateUni(..) function.

Definition at line 124 of file TRobustEstimator.cxx.

◆ TRobustEstimator() [2/2]

TRobustEstimator::TRobustEstimator	(	Int_t	nvectors,
		Int_t	nvariables,
		Int_t	hh = 0
	)

constructor

Definition at line 130 of file TRobustEstimator.cxx.

◆ ~TRobustEstimator()

virtual TRobustEstimator::~TRobustEstimator ( )

inline virtual

Definition at line 78 of file TRobustEstimator.h.

Member Function Documentation

◆ AddColumn()

void TRobustEstimator::AddColumn ( Double_t * col )

Adds a column to the data matrix. The column is assumed to have size fN. The variable fVarTemp keeps the number of columns already added.

Definition at line 170 of file TRobustEstimator.cxx.

◆ AddRow()

void TRobustEstimator::AddRow ( Double_t * row )

Adds a vector to the data matrix. The vector is assumed to have size fNvar.

Definition at line 191 of file TRobustEstimator.cxx.

◆ AddToSscp()

void TRobustEstimator::AddToSscp	(	TMatrixD &	sscp,
		TVectorD &	vec
	)

protected

update the sscp matrix with vector vec

Definition at line 778 of file TRobustEstimator.cxx.

◆ Classic()

void TRobustEstimator::Classic ( )

protected

called when h=n.

Returns classic covariance matrix and mean

Definition at line 808 of file TRobustEstimator.cxx.

◆ ClearSscp()

void TRobustEstimator::ClearSscp ( TMatrixD & sscp )

protected

clear the sscp matrix, used for covariance and mean calculation

Definition at line 795 of file TRobustEstimator.cxx.

◆ Correl()

void TRobustEstimator::Correl ( )

protected

transforms covariance matrix into correlation matrix

Definition at line 849 of file TRobustEstimator.cxx.
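The transformation itself is the standard one, corr_ij = cov_ij / sqrt(cov_ii * cov_jj). A minimal standalone sketch, using a plain nested vector instead of TMatrixDSym and a hypothetical function name, might look like this:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Converts a covariance matrix into a correlation matrix:
// corr[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j]).
Matrix CovarianceToCorrelation(const Matrix& cov) {
    const std::size_t p = cov.size();
    for (std::size_t i = 0; i < p; ++i) {
        // The input must be square with strictly positive variances.
        assert(cov[i].size() == p && cov[i][i] > 0.0);
    }
    Matrix corr(p, std::vector<double>(p, 0.0));
    for (std::size_t i = 0; i < p; ++i) {
        for (std::size_t j = 0; j < p; ++j) {
            corr[i][j] = cov[i][j] / std::sqrt(cov[i][i] * cov[j][j]);
        }
    }
    return corr;
}
```

The sketch rejects zero variances through the assertion; how the class handles a singular covariance matrix is not described here.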

◆ Covar()

void TRobustEstimator::Covar	(	TMatrixD &	sscp,
		TVectorD &	m,
		TMatrixDSym &	cov,
		TVectorD &	sd,
		Int_t	nvec
	)

protected

calculates mean and covariance

Definition at line 826 of file TRobustEstimator.cxx.

◆ CreateOrtSubset()

void TRobustEstimator::CreateOrtSubset	(	TMatrixD &	dat,
		Int_t *	index,
		Int_t	hmerged,
		Int_t	nmerged,
		TMatrixD &	sscp,
		Double_t *	ndist
	)

protected

Creates a subset of hmerged vectors with the smallest orthogonal distances to the hyperplane hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0. This function is called when fewer than fH samples lie on a hyperplane.

Definition at line 967 of file TRobustEstimator.cxx.

◆ CreateSubset()

void TRobustEstimator::CreateSubset	(	Int_t	ntotal,
		Int_t	htotal,
		Int_t	p,
		Int_t *	index,
		TMatrixD &	data,
		TMatrixD &	sscp,
		Double_t *	ndist
	)

protected

Creates a subset of htotal elements from ntotal elements. First, p+1 elements are drawn randomly (without repetitions). If their covariance matrix is singular, more elements are added one by one until the covariance matrix becomes regular or it becomes clear that htotal observations lie on a hyperplane. If the covariance matrix determinant is nonzero, the distances of all ntotal elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean and S_inv is the inverse of the covariance matrix. The htotal points with the smallest distances are included in the returned subset.

Definition at line 877 of file TRobustEstimator.cxx.

◆ CStep()

Double_t TRobustEstimator::CStep	(	Int_t	ntotal,
		Int_t	htotal,
		Int_t *	index,
		TMatrixD &	data,
		TMatrixD &	sscp,
		Double_t *	ndist
	)

protected

From the input htotal-subset, constructs another htotal-subset whose covariance determinant is lower or equal.

As proven by Peter J. Rousseeuw and Katrien Van Driessen, if the distances of all elements are calculated using the formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean of the input htotal-subset and S_inv is the inverse of its covariance matrix, then the htotal elements with the smallest distances have a covariance matrix whose determinant is less than or equal to the determinant of the covariance matrix of the input subset.

The determinant for this htotal-subset with the smallest distances is returned.

Definition at line 999 of file TRobustEstimator.cxx.
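A single C-step can be sketched for two variables as follows. This is a standalone illustration with hypothetical names (FitSubset, CStepSketch), not the member function documented above; it works on plain arrays rather than on the sscp matrix and assumes the covariance is normalised by the subset size.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Point = std::array<double, 2>;

// Mean, covariance (normalised by the subset size) and determinant of a subset.
struct Fit {
    Point mean;
    double sxx, sxy, syy, det;
};

Fit FitSubset(const std::vector<Point>& data, const std::vector<int>& idx) {
    assert(!idx.empty());
    const double h = static_cast<double>(idx.size());
    Fit f{{0.0, 0.0}, 0.0, 0.0, 0.0, 0.0};
    for (int i : idx) {
        f.mean[0] += data[i][0] / h;
        f.mean[1] += data[i][1] / h;
    }
    for (int i : idx) {
        const double dx = data[i][0] - f.mean[0];
        const double dy = data[i][1] - f.mean[1];
        f.sxx += dx * dx / h;
        f.sxy += dx * dy / h;
        f.syy += dy * dy / h;
    }
    f.det = f.sxx * f.syy - f.sxy * f.sxy;
    return f;
}

// One C-step: fit the current subset, compute the squared distances of ALL
// points to that fit, and return the indices of the h smallest distances.
std::vector<int> CStepSketch(const std::vector<Point>& data, const std::vector<int>& idx) {
    const Fit f = FitSubset(data, idx);
    assert(f.det > 0.0);  // the exact-fit (singular) case is not handled here
    std::vector<double> d2(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        const double dx = data[i][0] - f.mean[0];
        const double dy = data[i][1] - f.mean[1];
        d2[i] = (f.syy * dx * dx - 2.0 * f.sxy * dx * dy + f.sxx * dy * dy) / f.det;
    }
    std::vector<int> order(data.size());
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + idx.size(), order.end(),
                      [&d2](int a, int b) { return d2[a] < d2[b]; });
    order.resize(idx.size());
    return order;
}
```

Repeating the step until the determinant stops decreasing corresponds to "C-steps until convergence" in the algorithm description.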

◆ Evaluate()

void TRobustEstimator::Evaluate ( )

Finds the estimate of multivariate mean and variance.

Definition at line 208 of file TRobustEstimator.cxx.

◆ EvaluateUni()

void TRobustEstimator::EvaluateUni	(	Int_t	nvectors,
		Double_t *	data,
		Double_t &	mean,
		Double_t &	sigma,
		Int_t	hh = 0
	)

For the univariate case. Estimates of location and scatter are returned in the mean and sigma parameters. The algorithm works on the same principle as in the multivariate case: it finds a subset of size hh with the smallest sigma, and then returns the mean and sigma of this subset.

Definition at line 608 of file TRobustEstimator.cxx.
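The principle can be sketched without ROOT as follows. In one dimension the subset of size h with the smallest variance is contiguous in sorted order, so it is enough to sort and slide a window of h values. The names below are hypothetical, and the sketch omits any consistency factor or reweighting that the class may apply.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

struct UniResult {
    double mean;
    double sigma;
};

// Finds the h sorted-contiguous values with the smallest variance and
// returns their mean and standard deviation (variance normalised by h).
UniResult UnivariateMcdSketch(std::vector<double> x, int h) {
    const int n = static_cast<int>(x.size());
    assert(h >= 1 && h <= n);
    std::sort(x.begin(), x.end());

    // Prefix sums of x and x^2 give each window's moments in O(1).
    // Note: this form can lose precision when values are large relative to their spread.
    std::vector<double> s(n + 1, 0.0), s2(n + 1, 0.0);
    for (int i = 0; i < n; ++i) {
        s[i + 1] = s[i] + x[i];
        s2[i + 1] = s2[i] + x[i] * x[i];
    }

    double bestVar = std::numeric_limits<double>::infinity();
    double bestMean = 0.0;
    for (int start = 0; start + h <= n; ++start) {
        const double mean = (s[start + h] - s[start]) / h;
        const double var = (s2[start + h] - s2[start]) / h - mean * mean;
        if (var < bestVar) {
            bestVar = var;
            bestMean = mean;
        }
    }
    return {bestMean, std::sqrt(std::max(bestVar, 0.0))};
}
```

Sorting dominates the cost, which matches the O(n log n) running time quoted for the exact univariate algorithm.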

◆ Exact()

Int_t TRobustEstimator::Exact ( Double_t * ndist )

protected

for the exact fit situations returns number of observations on the hyperplane

Definition at line 1036 of file TRobustEstimator.cxx.

◆ Exact2()

Int_t TRobustEstimator::Exact2	(	TMatrixD &	mstockbig,
		TMatrixD &	cstockbig,
		TMatrixD &	hyperplane,
		Double_t *	deti,
		Int_t	nbest,
		Int_t	kgroup,
		TMatrixD &	sscp,
		Double_t *	ndist
	)

protected

This function is called if the determinant of the covariance matrix of a subset is 0.

If there are more than fH vectors on a hyperplane, it returns this hyperplane and stops; otherwise it stores the hyperplane coordinates in the hyperplane matrix.

Definition at line 1071 of file TRobustEstimator.cxx.

◆ GetBDPoint()

Int_t TRobustEstimator::GetBDPoint ( )

returns the breakdown point of the algorithm

Definition at line 674 of file TRobustEstimator.cxx.

◆ GetChiQuant()

Double_t TRobustEstimator::GetChiQuant ( Int_t i ) const

returns the chi2 quantiles

Definition at line 684 of file TRobustEstimator.cxx.

◆ GetCorrelation() [1/2]

void TRobustEstimator::GetCorrelation ( TMatrixDSym & matr )

returns the correlation matrix

Definition at line 705 of file TRobustEstimator.cxx.

◆ GetCorrelation() [2/2]

const TMatrixDSym* TRobustEstimator::GetCorrelation ( ) const

inline

Definition at line 94 of file TRobustEstimator.h.

◆ GetCovariance() [1/2]

void TRobustEstimator::GetCovariance ( TMatrixDSym & matr )

returns the covariance matrix

Definition at line 693 of file TRobustEstimator.cxx.

◆ GetCovariance() [2/2]

const TMatrixDSym* TRobustEstimator::GetCovariance ( ) const

inline

Definition at line 92 of file TRobustEstimator.h.

◆ GetData()

const TMatrixD& TRobustEstimator::GetData ( )

inline

returns a reference to the data matrix

Definition at line 89 of file TRobustEstimator.h.

◆ GetHyperplane() [1/2]

void TRobustEstimator::GetHyperplane ( TVectorD & vec )

if the points are on a hyperplane, returns this hyperplane

Definition at line 730 of file TRobustEstimator.cxx.

◆ GetHyperplane() [2/2]

const TVectorD * TRobustEstimator::GetHyperplane ( ) const

if the points are on a hyperplane, returns this hyperplane

Definition at line 717 of file TRobustEstimator.cxx.

◆ GetMean() [1/2]

void TRobustEstimator::GetMean ( TVectorD & means )

return the estimate of the mean

Definition at line 746 of file TRobustEstimator.cxx.

◆ GetMean() [2/2]

const TVectorD* TRobustEstimator::GetMean ( ) const

inline

Definition at line 99 of file TRobustEstimator.h.

◆ GetNHyp()

Int_t TRobustEstimator::GetNHyp ( )

inline

Definition at line 97 of file TRobustEstimator.h.

◆ GetNOut()

Int_t TRobustEstimator::GetNOut ( )

returns the number of outliers

Definition at line 770 of file TRobustEstimator.cxx.

◆ GetNumberObservations()

Int_t TRobustEstimator::GetNumberObservations ( ) const

inline

Definition at line 102 of file TRobustEstimator.h.

◆ GetNvar()

Int_t TRobustEstimator::GetNvar ( ) const

inline

Definition at line 103 of file TRobustEstimator.h.

◆ GetOuliers()

const TArrayI* TRobustEstimator::GetOuliers ( ) const

inline

Definition at line 104 of file TRobustEstimator.h.

◆ GetRDistances() [1/2]

void TRobustEstimator::GetRDistances ( TVectorD & rdist )

returns the robust distances (helps to find outliers)

Definition at line 758 of file TRobustEstimator.cxx.

◆ GetRDistances() [2/2]

const TVectorD* TRobustEstimator::GetRDistances ( ) const

inline

Definition at line 101 of file TRobustEstimator.h.

◆ KOrdStat()

Double_t TRobustEstimator::KOrdStat	(	Int_t	ntotal,
		Double_t *	arr,
		Int_t	k,
		Int_t *	work
	)

protected

Returns the k-th order statistic of arr. A separate version is kept in this class because an Int_t work array is needed.

Definition at line 1267 of file TRobustEstimator.cxx.
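For readers unfamiliar with order statistics, the idea can be illustrated with the standard library. The function below is a hypothetical stand-in rather than the member function: it takes a copy of the data instead of a work array, and it assumes k is counted from 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the k-th smallest value (k = 0 is the minimum) in O(n) average time.
// The vector is taken by value so the caller's data is left untouched.
double KthSmallest(std::vector<double> values, std::size_t k) {
    assert(k < values.size());
    const auto kth = values.begin() + static_cast<std::ptrdiff_t>(k);
    std::nth_element(values.begin(), kth, values.end());
    return *kth;
}
```

A selection routine of this kind avoids a full sort when only the h smallest distances are needed.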

◆ Partition()

Int_t TRobustEstimator::Partition	(	Int_t	nmini,
		Int_t *	indsubdat
	)

protected

Divides the elements into approximately equal subgroups. The number of elements in each subgroup is stored in indsubdat, and the number of subgroups is returned.

Definition at line 1118 of file TRobustEstimator.cxx.
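An "approximately equal" split can be sketched as below. This shows only the even-split arithmetic under the assumption that every element is assigned to a group; it does not reproduce the rule the class uses to choose the number or size of the subgroups, and the function name is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Splits n elements into ngroup groups whose sizes differ by at most one.
std::vector<int> SplitSizes(int n, int ngroup) {
    assert(ngroup > 0 && n >= ngroup);
    std::vector<int> sizes(ngroup, n / ngroup);
    // Hand out the remainder one element at a time to the first groups.
    for (int g = 0; g < n % ngroup; ++g) {
        ++sizes[g];
    }
    return sizes;
}
```

With n=1003 and 5 groups this gives sizes 201, 201, 201, 200 and 200.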

◆ RDist()

Int_t TRobustEstimator::RDist ( TMatrixD & sscp )

protected

Calculates robust distances. Then the samples with robust distances greater than a cutoff value (0.975 quantile of the chi2 distribution with fNvar degrees of freedom, multiplied by a correction factor) are given weight=0, and new, reweighted estimates of location and scatter are calculated. The function returns the number of outliers.

Definition at line 1172 of file TRobustEstimator.cxx.

◆ RDraw()

void TRobustEstimator::RDraw	(	Int_t *	subdat,
		Int_t	ngroup,
		Int_t *	indsubdat
	)

protected

Draws ngroup nonoverlapping subdatasets out of a dataset of size n such that the selected case numbers are uniformly distributed from 1 to n.

Definition at line 1235 of file TRobustEstimator.cxx.

Member Data Documentation

◆ fCorrelation

TMatrixDSym TRobustEstimator::fCorrelation

protected

Definition at line 39 of file TRobustEstimator.h.

◆ fCovariance

TMatrixDSym TRobustEstimator::fCovariance

protected

Definition at line 37 of file TRobustEstimator.h.

◆ fData

TMatrixD TRobustEstimator::fData

protected

Definition at line 46 of file TRobustEstimator.h.

◆ fExact

Int_t TRobustEstimator::fExact

protected

Definition at line 34 of file TRobustEstimator.h.

◆ fH

Int_t TRobustEstimator::fH

protected

Definition at line 28 of file TRobustEstimator.h.

◆ fHyperplane

TVectorD TRobustEstimator::fHyperplane

protected

Definition at line 43 of file TRobustEstimator.h.

◆ fInvcovariance

TMatrixDSym TRobustEstimator::fInvcovariance

protected

Definition at line 38 of file TRobustEstimator.h.

◆ fMean

TVectorD TRobustEstimator::fMean

protected

Definition at line 36 of file TRobustEstimator.h.

◆ fN

Int_t TRobustEstimator::fN

protected

Definition at line 29 of file TRobustEstimator.h.

◆ fNvar

Int_t TRobustEstimator::fNvar

protected

Definition at line 27 of file TRobustEstimator.h.

◆ fOut

TArrayI TRobustEstimator::fOut

protected

Definition at line 42 of file TRobustEstimator.h.

◆ fRd

TVectorD TRobustEstimator::fRd

protected

Definition at line 40 of file TRobustEstimator.h.

◆ fSd

TVectorD TRobustEstimator::fSd

protected

Definition at line 41 of file TRobustEstimator.h.

◆ fVarTemp

Int_t TRobustEstimator::fVarTemp

protected

Definition at line 31 of file TRobustEstimator.h.

◆ fVecTemp

Int_t TRobustEstimator::fVecTemp

protected

Definition at line 32 of file TRobustEstimator.h.

The documentation for this class was generated from the following files:

math/physics/inc/TRobustEstimator.h
math/physics/src/TRobustEstimator.cxx
