Illustration of Spurious Signal

Prof. Alex Read, University of Oslo, Department of Physics, December, 2020

See https://www.mn.uio.no/fysikk/personer/vit/read/statistics/spurious-signal/spurioussignaldemo.mlx for the MATLAB live-script used to make this page.

In the context of a search for new physics in high energy physics, a "spurious signal" is a non-zero signal amplitude obtained when fitting the sum of a background model plus a signal model to data that is known to be signal-free, i.e. a pure background distribution. Although not appearing by name in publications, this concept was developed in the

working group of ATLAS in 2011, when the team was worrying about systematic uncertainties in background modeling that could lead to a false claim of observing the Higgs boson. The observation of the Higgs boson was established at the famous seminar at CERN on July 4, 2012, where the ATLAS and CMS collaborations each claimed a 5 standard deviations significant signal in their data consistent with the predicted properties of the Standard Model Higgs boson.

wiktionary says that spurious can mean "False, not authentic, not genuine." or "Extraneous; stray; not relevant or wanted." The combination of all these adjectives is a perfect description of the spurious signal!

Toys

By using Monte Carlo experiments ("toys"), examples of when spurious signals are or aren't expected to appear from fits to a background distribution will be shown. In general, if the background model used to fit the data is not able to describe the background distribution without bias, a spurious signal will result. The size of the spurious signal will tend to be a function of the signal mass hypothesis, with one or more zero-signal crossings and adjacent regions of false excess or false deficit. The amplitude of the spurious signal will also depend on details of the signal model such as the width (σ) of a gaussian signal.

Selected background models

For this study toys will be generated from an exponential distribution and it will be shown that fitting such data with an exponential distribution leads to zero bias and a spurious signal amplitude consistent with zero. Further, the same toys will be fitted with a low-order Bernstein polynomial, which is known to be a poor approximation of an exponential distribution, and regions of positive and negative bias, which correpond to regions of positive and negative spurious signal amplitude, will be seen.

Signal model and binned fits

For the signal model a Gaussian pdf will be used. Fits to binned data will be considered, i.e.

is fitted to a histogram of data, where s is the signal amplitude,

describes the distribution of the signal model, and

describes the background shape and normalization, and where

are the nuisance parameters of the fitted background model ("nuisance" parameters, because we really want to infer the presence of a signal and yet we have to estimate these background parameters as well, which can can be considered a nuisance). When fitting

to a distribution that is known to be signal-free, the signal amplitude (the size of s) is the spurious signal. When simply comparing a fitted background to an underlying background model, e.g. to estimate bias,

is fitted without the signal term.

Study a distribution with the model of that distribution

Set up an exponential background model (and various constants)

% The axis and signal model
alpha = 0.5; mmin = 100; mmax = 160; Nbins = 61; lumifac = 3000;
mH = 125; sigmaGeV = 1.5; sigmax = sigmaGeV/(mmax-mmin); mHx = (mH-mmin)/(mmax-mmin);
m145x = (145-mmin)/(mmax-mmin);
 
m = linspace(mmin,mmax,Nbins);
 

The background model: For

the background distribution

approximately reproduces the shape of the inclusive, unweighted background distribution of the ATLAS

analysis of Run 1 of the LHC (see Figure 8 in the published article and the comparison below).

% The background model: An exponential (with somewhat arbitrary

% normalization).

backModel = lumifac*1000*exppdf((m-mmin)/(mmax-mmin),alpha)/(mmax-mmin);

figure

plot(m,backModel,'g-','LineWidth',3)

xlabel('Mass (GeV)')

ylabel('Events/GeV')

title('Background model (exponential pdf)')

Show that the exponential distribution we study represents a plausible background distribution

The next figure shows this exponential distribution (the green curve) compared to the published distribution referred to above (the dashed blue curve, which is mostly hidden by the solid red curve, which includes the signal model in the fit).

im = imread('HyyBackgroundModel.001.png');

imshow(im)

Create a set of exponential background toys in memory

A "toy" is a single Monte Carlo simulation of an experiment. This is basically a differential counting experiment, i.e. the distribution of counts (over many experiments) we expect in each individual bin

is given by the Poisson distribution for the specified rate for that bin, which in our case is

Ntoys = 2000;
toys = zeros(Ntoys,Nbins);
for i=1:Nbins
toys(1:Ntoys,i) = poissrnd(backModel(i),Ntoys,1);
end

Fit one of the exponential toys to an exponential model and show the results.

guess = [1E4,-0.0002]; % Good starting values for the fits helps speed convergence.

model = @(p,x) p(1)*exp(p(2)*x); % The exponential model we are going fit.

options = optimset('Display','off'); % Supress most output of minimization.

x = (m-mmin)/(mmax-mmin); % Shift and re-scale the mass axis - helps the convergence of the fits.

% Do the fit and print/display some return values

y = toys(1,:); dy = sqrt(y); % Very simple estimate of the uncertainties

[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,model,guess,options);

figure

errorbar(m,y,dy,'Marker','.','MarkerSize',12,'Color','k')

yfit = model(params,x);

hold on

plot(m,yfit,'b-')

legend('Toy dataset','Exp. fit')

title('Fit to 1 toy')

xlabel('Mass (GeV')

ylabel('Events/GeV')

We might not be able to see small deviations between the fit and the data, so we look a plot of the differences or residuals.

figure

errorbar(m,y-yfit,dy,'Marker','.','MarkerSize',15,'Color','k','LineStyle','none')

title('Residuals for 1 toy')

yline(0,'b-');

xlabel('Mass (GeV)')

ylabel('Data-Fit (Events)')

legend('Residuals','Fit')

Fit all the toys and show the mean and RMS deviation from the model

Since we are fitting Monte Carlo (toy) data to a (background) model that was used to generate the toy data, we expect there to be no bias, i.e. the mean of many toys should have residuals of zero with respect to the generating model. There are, of course, fluctuations from toy to toy (MC experiment to MC experiment), but this results in a spread (root-mean-square or RMS) of the fits with respect to the model.

% Do the fit and print/display some return values

resid = zeros(length(toys),length(y));

options.MaxFunEvals = 400;

for i=1:length(toys)

y = toys(i,:);

[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,model,guess,options);

guess = params; % To speed up convergence

yfit = model(params,x);

resid(i,:) = yfit-backModel;

end

figure

errorbar(m,mean(resid,1),std(resid,1),'Marker','.','MarkerSize',15,'Color','k','LineStyle','none')

title('Fit residuals w.r.t. model (all toys)')

yline(0,'b-');

xlabel('Mass (GeV)')

ylabel('<Fit-Model> (Events)')

legend('Mean and RMS','Fits')

The trend of decreasing RMS with increasing mass is due to the falling trend of the exponential distribution and the nature of the statistical fluctuations per bin, which are given by the square root of the number of events per bin (for Pearson's

the square root of the number of fitted events). In other words, bins with more events have larger fluctuations. This influences the RMS of the fit as a function of mass correspondingly.

Show an example fit with and without a signal model (a Gaussian pdf).

Since there is no true signal in the toy data, the fitted signal amplitude is a "spurious" signal.

model = @(p,x) p(1)*exp(p(2)*x) + p(3)*normpdf(x,mHx,sigmax); % The S+b model.

Bmodel = @(p,x) p(1)*exp(p(2)*x) ; % The background component.

Smodel = @(p,x) p(1)*normpdf(x,mHx,sigmax);

guess = [1E4,-0.0002,0]; % Good starting values for the fits helps speed convergence.

% Do the fit and print/display some return values

y = toys(1,:); dy = sqrt(y); % Very simple estimate of the uncertainties

[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,model,guess,options);

figure

errorbar(m,y,dy,'Marker','.','MarkerSize',12,'Color','k')

yfit = model(params,x);

hold on

plot(m,yfit,'b-')

legend('Toy dataset','S+B fit')

title('Fit signal+background to 1 toy')

xlabel('Mass (GeV')

ylabel('Events/GeV')

In scenarios with large backgrounds (like the

background case studied here, and many others), but litte or no bias, the fitted amplitude for an individual experiment tends to be small with respect to its uncertainty. The fitted signal also tends to be hard to spot when looking at the raw distribution, so we had better look at the residuals in the next plot.

figure

errorbar(m,y-Bmodel(params(1:2),x),dy,'Marker','.','MarkerSize',15,'Color','k','LineStyle','none')

hold on

title('Residuals for 1 toy, B fit')

yline(0,'b-');

plot(m,Smodel(params(3),x),'r-')

xlabel('Mass (GeV)')

ylabel('Data-background (Events)')

legend('Data-background','Fitted background','Spurious signal')

There doesn't seem to be much difference between the background component of the fit in the plot above (the blue line) and the total fit result (which includes the red signal bump) with respect to the data points and their uncertainties. We can already suspect that there might not be a significant systematic bias in the signal amplitude when we study the mean signal ampitude over many toys below.

Fit all the toys and scan the entire mass spectrum.

We are interested in both the average spurious signal and its RMS. Since we are fitting all the toys while scanning over the signal mass hypothesis, this takes some time...

ss = zeros(length(toys),length(x)); % The spurious signal per toy per mass point
options.MaxFunEvals = 600;
for ix = 1:length(x)
model = @(p,x) p(1)*exp(p(2)*x) + p(3)*normpdf(x,x(ix),sigmax); 
for i=1:length(toys)
y = toys(i,:);
[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,model,guess,options);
guess = params; % To speed up convergence
ss(i,ix) = params(3);
end
fprintf('%d',ix)
end
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061
fprintf('\n')
 

Plot the results of the scan.

Since we are fitting with the background model used to generate the data, we expect to see a zero net spurious signal with some spread (RMS - the red error bars). Since we have a finite sample, there may be some scatter, indicated by the uncertainties on the sample means (

- the black error bars).

figure

errorbar(m,mean(ss,1),std(ss,1),'r-')

hold on

errorbar(m,mean(ss,1),std(ss,1)/sqrt(length(toys)),'k','LineStyle','none','Marker','none')

grid on

xlabel('Mass (GeV)')

ylabel('SS (Events/GeV)')

title('Spurious signal')

legend('SS A and \sigma_A','SS \delta{A}')

There is indeed some scatter in the mean spurious signal amplitude as a function of mass, but close to all the points are within 1-2 standard deviations of zero, which is what is expected for natural statistical fluctuations.

The reason this result is "bumpy" and the corresponding plot for the background-fit bias above is smooth is that due to the narrow width of the signal model, the correlation distance here is quite short, the order of a few bins, whereas in the background-only fits (a smooth function over the whole mass spectrum) the correlation distance essentially covers the whole mass spectrum.

Study a "wrong" background model: Bernstein(n=2)

We want to repeat all the plots, but now using a second order Bernstein polynomial, which is a very poor approximation of an exponential distribution, as the background model in the fits. We expect to see bias in the background-only fits and we expect to see non-zero spurious signals in the same regions where we see bias.

bern2 = @(p,x) p(1)*(1-x).^2 + p(2)*2*x.*(1-x) + p(3)*x.^2 ; % The background component.

guess = [1E5 2E4 1.5E4];

y = toys(1,:); dy = sqrt(y); % Very simple estimate of the uncertainties

[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,bern2,guess,options);

figure

errorbar(m,y,dy,'Marker','.','MarkerSize',12,'Color','k','LineStyle','none')

yfit = bern2(params,x);

hold on

plot(m,yfit,'b-')

legend('Toy dataset','Bernstein(n=2) fit')

title('Fit Bernstein (n=2) to 1 toy')

xlabel('Mass (GeV')

ylabel('Events/GeV')

Since the Bernstein polynomial gives a very poor approximation of an exponential distribution, we can already see systematic deviations by eye. The deviations will be even clearer in the following plot of the residuals.

figure

errorbar(m,y-yfit,dy,'Marker','.','MarkerSize',15,'Color','k','LineStyle','none')

title('Residuals for 1 toy, Bernstein(n=2)')

yline(0,'b-');

xlabel('Mass (GeV)')

ylabel('Data-Fit (Events)')

legend('Residuals','Fit')

Here we see very clearly that the second-order Bernstein polynomial is unable to describe the toy data. The deviations of the data from the best fit are many standard deviations over much of the spectrum. We can suspect that there will be a large systematic bias when we study the deviations for a large number of toys.

Fit all the toys and show the mean and RMS deviation from the Bernstein model.

Since we are fitting Monte Carlo (toy) data to a (background) model that was NOT used to generate the toy data, we could expect there to be significant bias, i.e. the mean of many toys should have residuals inconsistent with zero with respect to the generating model.

% Do the fit and print/display some return values

resid = zeros(length(toys),length(y));

for i=1:length(toys)

y = toys(i,:);

[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,bern2,guess,options);

guess = params; % To speed up convergence

yfit = bern2(params,x);

resid(i,:) = yfit-backModel;

end

figure

errorbar(m,mean(resid,1),std(resid,1),'Marker','.','MarkerSize',15,'Color','k','LineStyle','none')

title('Bernstein (n=2) fit residuals w.r.t. model (all toys)')

yline(0,'b-');

xlabel('Mass (GeV)')

ylabel('<Fit-Model> (Events)')

legend('Mean and RMS','Fits')

As we suspected, there is indeed an enormous systematic bias over almost the whole mass spectrum, the exceptions being the few crossing points at about 108, 134, and 156 GeV.

Show an example Bernstein(n=2) fit with and without a signal model (a Gaussian pdf).

Since there is no true signal in the toy data, the fitted signal amplitude is a spurious signal.

model = @(p,x) p(1)*(1-x).^2 + p(2)*2*x.*(1-x) + p(3)*x.^2 + p(4)*normpdf(x,mHx,sigmax); % The S+b model.

Bmodel = bern2;

Smodel = @(p,x) p(1)*normpdf(x,mHx,sigmax);

guess = [10E4, 2E4, 1.5E4, 0]; % Good starting values for the fits helps speed convergence.

y = toys(1,:); dy = sqrt(y); % Very simple estimate of the uncertainties

[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,model,guess,options);

figure

errorbar(m,y,dy,'Marker','.','MarkerSize',12,'Color','k','LineStyle','none')

yfit = model(params,x);

hold on

plot(m,yfit,'b-')

legend('Toy dataset','S+B fit')

title('Fit signal + Bernstein(n=2) to 1 toy')

xlabel('Mass (GeV')

ylabel('Events/GeV')

If we compare this plot to the background-only fit (3 plots above), we can see that the region around 125 GeV, at the mass of the signal being fitted, is somewhat better described by the fit. In order to zoom in on what happens, the residuals plot below is even more instructive.

figure

errorbar(m,y-Bmodel(params(1:3),x),dy,'Marker','.','MarkerSize',15,'Color','k','LineStyle','none')

hold on

title('Residuals for 1 toy, S + Bernstein(n=2) fit')

yline(0,'b-');

plot(m,Smodel(params(4),x),'r-')

xlabel('Mass (GeV)')

ylabel('Data-background (Events)')

legend('Data-background','Fitted background','Spurious signal')

Here we see that, although there is no evidence of a real bump in the data at 125 GeV, the signal model "assists" the background model in improving the description of the data, i.e. if we plotted the residuals with respect to the fitted sum of the background and signal models, there would be more points consistent with zero residual.

Fit to a 145 GeV signal as a second example

m145x = (145-mmin)/(mmax-mmin);

model = @(p,x) p(1)*(1-x).^2 + p(2)*2*x.*(1-x) + p(3)*x.^2 + p(4)*normpdf(x,m145x,sigmax); % The S+b model.

Smodel = @(p,x) p(1)*normpdf(x,m145x,sigmax);

y = toys(1,:); dy = sqrt(y); % Very simple estimate of the uncertainties

[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,model,guess,options);

figure

errorbar(m,y,dy,'Marker','.','MarkerSize',12,'Color','k','LineStyle','none')

yfit = model(params,x);

hold on

plot(m,yfit,'b-')

legend('Toy dataset','S+B fit')

title('Fit signal + Bernstein(n=2) to 1 toy')

xlabel('Mass (GeV')

ylabel('Events/GeV')

We see that testing for a spurious signal at 145 GeV allows for a better description of the background in that mass region.

figure

errorbar(m,y-Bmodel(params(1:3),x),dy,'Marker','.','MarkerSize',15,'Color','k','LineStyle','none')

hold on

title('Residuals for 1 toy, S + Bernstein(n=2) fit')

yline(0,'b-');

plot(m,Smodel(params(4),x),'r-')

xlabel('Mass (GeV)')

ylabel('Data-background (Events)')

legend('Data-background','Fitted background','Spurious signal')

In this region the spurious signal has a positive amplitude.

Fit all the toys with S + Bernstein(n=2) and scan the entire mass spectrum.

We are interested in both the average spurious signal and its RMS. Since we are fitting all the toys while scanning over the signal mass hypothesis this takes some time...

ss = zeros(length(toys),length(x)); % The spurious signal per toy per mass point
for ix = 1:length(x)
model = @(p,x) p(1)*(1-x).^2 + p(2)*2*x.*(1-x) + p(3)*x.^2 + p(4)*normpdf(x,x(ix),sigmax); % The S+b model.
for i=1:length(toys)
y = toys(i,:);
[params,perrors,chisqmin,exitflag,errormatrix ] = fitpchisq(x,y,model,guess,options);
guess = params; % To speed up convergence
ss(i,ix) = params(4);
end
fprintf('%d',ix) % Simple progress indicator
end
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061
fprintf('\n')
 

Plot the results of the scan

Since we are fitting with the a background model that was not used to generate the data, we may expect to see a zero net spurious signal with some spread (RMS). Since we have a finite sample, there may be some scatter, indicated by the uncertainty on the sample mean (RMS/sqrt(Ntoys)).

For the case of fitting an exponential distribution with a second-order Bernstein polynomial the spurious signal will be huge apart from a few points that cross zero. If such a bias were to go unnoticed and uncorrected, in regions where the spurious signal is negative, exclusions (upper limits) would be too strong, and in regions where the spurious signal is positive, excesses will be too significant, leading to false claims of discovery.

figure

errorbar(m,mean(ss,1),std(ss,1),'r-')

hold on

errorbar(m,mean(ss,1),std(ss,1)/sqrt(length(toys)),'k','LineStyle','none','Marker','none')

grid on

xlabel('Mass (GeV)')

ylabel('SS (Events/GeV)')

title('Spurious signal using Bernstein(n=2)')

legend('SS A and \sigma_A','SS \delta{A}')

Here we have confirmed that the large bias caused by the second order Bernstein polynomial results in a large spurious signal over nearly the entire mass range, except for the 3 points between the adjacent regions of deficit and excess.

Summary

We have seen the connection between the bias that occurs when fitting a distribution with a different parametric model than the one used to generate it and a spurious signal, i.e. a fake signal inferred in a pure background distribution: Large bias using a "wrong" background model leads to large spurious signals and using a "correct" background model eliminates background bias and spurious signals.

Bias-variance tradeoff

One of the goals of model interpretation is to keep this bias below an acceptable level while minizing the statistical variance of fitted results - this is the infamous "bias-variance tradeoff". Since we don't in general know the underlying model of the background distribution, there may not be any "one-size-fits-all" approach to optimizing the bias-variance tradeoff. Since a spurious signal amplitude is often used as a nuisance parameter to constrain a signal-like background component, one potential optimization criterion is the sum in quadrature of the spurious signal amplitude and its uncertainty.

False discoveries and exclusions

Why is the spurious signal important to take into account? If undetected and not accounted for, the false significance of positive spurious signals may lead to false claims of discovery, while negative spurious signals will tend to exaggerate the significance of the exclusion of new physics.