An outlier is an abnormal and extreme observation in a statistical sample or time series of data that can potentially affect the estimation of its parameters.
In simpler words, an outlier would be an observation within a sample or a time series of data that is not consistent with the rest. Imagine, for example, that we are measuring the height of the students in a class.
Let's imagine a sample of 10 students. The height of each is as follows:
|Pupil||Height in meters|
The average height of the class would be 1.73 m. The maximum height lies 0.113 m above the mean and the minimum height lies 0.117 m below it. Since the mean sits approximately in the middle of the range, it can be considered a fairly good estimate.
The outlier effect
Now let's think about another sample of 10 students, their heights being the following:
|Pupil||Height in meters|
In this case, the average height of the class would be 1.81 m. The maximum height now lies 0.39 m above the mean, while the minimum height lies only 0.18 m below it. The mean is no longer approximately in the middle of the range.
The two most extreme observations (2.18 and 2.20) have shifted the arithmetic mean towards the maximum value of the distribution.
With this example, we see the effect that outliers have and how they can distort the calculation of an average.
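The shift described above can be reproduced with a few lines of Python. This is only a minimal sketch: the heights below are hypothetical values chosen for illustration (they are not the values from the tables above), and `distances_to_mean` is a helper name introduced here.

```python
from statistics import mean

def distances_to_mean(heights):
    """Return (mean, max - mean, mean - min) for a list of heights."""
    m = mean(heights)
    return m, max(heights) - m, m - min(heights)

# Hypothetical heights in meters (illustrative only, not the article's tables).
typical = [1.62, 1.66, 1.70, 1.71, 1.72, 1.75, 1.76, 1.78, 1.80, 1.84]

# Same class, but the two tallest pupils are replaced by extreme values.
with_outliers = typical[:8] + [2.18, 2.20]

for label, sample in [("typical", typical), ("with outliers", with_outliers)]:
    m, above, below = distances_to_mean(sample)
    print(f"{label}: mean={m:.3f}, max-mean={above:.3f}, mean-min={below:.3f}")
```

In the first sample the two distances are similar, so the mean sits near the middle of the range. In the second, the distance to the maximum is roughly twice the distance to the minimum, which is the asymmetry the example describes.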
How to correct the effect of outliers
In situations like this, where some values differ substantially from the rest, the median is a better estimate of where most observations are concentrated.
Both samples have an even number of values, so no single observation splits the distribution exactly in half. Instead, after ordering the values from lowest to highest, we take the fifth and sixth observations (which leave four observations on each side) and average them to obtain the median:
Sample 1: (1.75 + 1.72) / 2 ≈ 1.73
Sample 2: (1.79 + 1.71) / 2 = 1.75
As we can see, in sample 1, where there are no outliers or abnormal observations, the median is 1.73 and coincides with the mean. In sample 2, by contrast, the median is 1.75. This value is noticeably lower than the mean height of 1.81 and gives us a better point estimate of where most observations are concentrated.
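The median rule for an even number of observations can be sketched as follows. As before, the heights are hypothetical illustration values rather than the data from the tables above, and `sample_median` is a name introduced here.

```python
from statistics import mean

def sample_median(values):
    """Median of a non-empty list; averages the two central values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single central observation exists.
        return ordered[mid]
    # Even n: average the two central observations (5th and 6th when n = 10).
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical heights in meters (illustrative only, not the article's tables).
typical = [1.62, 1.66, 1.70, 1.71, 1.72, 1.75, 1.76, 1.78, 1.80, 1.84]
with_outliers = typical[:8] + [2.18, 2.20]

for label, sample in [("typical", typical), ("with outliers", with_outliers)]:
    print(f"{label}: mean={mean(sample):.3f}, median={sample_median(sample):.3f}")
```

With these values the median is the same in both samples, while the mean moves upward once the two extreme heights are included. In practice, `statistics.median` from the Python standard library applies the same rule.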