The Art of Descriptive Statistics

Pushpak Ruhil
6 min read · Dec 17, 2022

By convention, every analysis should begin with descriptive statistics.

Descriptive statistics is the branch of statistics used to summarise data: it condenses a dataset into a few numbers that describe its centre and its spread.

I will try to give a quick summary of the types of data before diving into the details of descriptive statistics.

Types of data

Figure: Types of data (drawn on Excalidraw)

Summary Statistics for Quantitative Data

Summary statistics for quantitative (numerical) data fall mainly into 2 groups:

  1. Measure of Centrality
  2. Measure of Spread

1) Measure of Centrality

There are 3 main measures of centrality:

  • Mean
  • Mode
  • Median

Mean -

  • The mean is simply the arithmetic average of the data points.
  • Let’s say we have n data points x₁, x₂, …, xₙ from a SAMPLE*

* There’s a difference between a sample and a population: a population is the entire group we care about, while a sample is the subset of it that we actually observe. The formula for the mean is the same for both, but the distinction matters for other statistics.

Then the sample mean would be →

x̄ = (x₁ + x₂ + … + xₙ) / n
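As a minimal sketch, the mean can be computed by hand in Python and checked against the standard library. The function name `sample_mean` and the marks are made up for illustration.

```python
import statistics


def sample_mean(values):
    """Arithmetic average: the sum of the points divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical data, not taken from the article
marks = [4, 8, 15, 16, 23, 42]

print(sample_mean(marks))        # 18.0
print(statistics.mean(marks))    # the standard library gives the same value
```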

Mode -

  • The mode is the most frequently occurring value in the data set.
  • The mode is not calculated for a dataset that has been grouped into bins* of size more than 1.

* Bins are used when we have continuous data with lots of distinct values. To ease our work, we group the values into bins of a chosen size.

  • Data can be uni-modal, bi-modal, tri-modal, … or multi-modal.
  • If every value occurs exactly once, we say that the mode doesn’t exist.
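Here is a rough sketch of these rules in Python. The helper name `modes` and the sample lists are hypothetical; it returns every value tied for the highest count, and an empty list when each value occurs only once.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most often; an empty list if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once, so the mode doesn't exist
        return []
    return [value for value, count in counts.items() if count == highest]


print(modes([2, 3, 3, 5, 7]))      # [3]      uni-modal
print(modes([2, 3, 3, 5, 5, 7]))   # [3, 5]   bi-modal
print(modes([2, 3, 5, 7]))         # []       no mode
```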

Median -

  • The median is the value that appears at the centre of the data WHEN THE DATA IS SORTED.
  • Let’s say we have n data points, sorted in ascending order: x(1) ≤ x(2) ≤ … ≤ x(n)
X is the set of all the data points, and x(i) is the i-th smallest value.

CASE 1: n is odd

If n is odd, the median is the single middle value:
Median = x((n+1)/2)

CASE 2: n is even
In this case there are 2 middle values, and the median is their average:
Median = [x(n/2) + x(n/2 + 1)] / 2
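The two cases can be sketched in Python as follows. This assumes the usual convention of averaging the two middle values when n is even; the function name `median` is my own, and the result is compared with `statistics.median` from the standard library.

```python
import statistics


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    data = sorted(values)          # the data MUST be sorted first
    n = len(data)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value, which sits at index n // 2 (0-indexed)
        return data[middle]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th values
    return (data[middle - 1] + data[middle]) / 2


print(median([7, 1, 5]))       # 5
print(median([7, 1, 5, 3]))    # 4.0
```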

The mean is sensitive to outliers, while the median and mode aren’t.

An outlier is a data point that lies far away from the other values in the data.
A formal definition is given later, in the discussion of IQR.
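To see this sensitivity in action, here is a small sketch with made-up scores. Replacing one value with an extreme one moves the mean a lot, while the median and mode stay where they were.

```python
import statistics

# Hypothetical scores; the second list swaps the largest value for an extreme one
scores = [10, 12, 12, 13, 15]
with_outlier = [10, 12, 12, 13, 100]

for label, data in (("original", scores), ("with outlier", with_outlier)):
    print(
        label,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "mode =", statistics.mode(data),
    )
```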

2) Measure of Spread

There are many measures of spread. Here are a few:

  • Percentile
  • Range & IQR
  • Variance
  • Standard Deviation

Percentile -

The p-th percentile of a sample is a value such that p percent of the values in the dataset are less than or equal to it.

As an example, being at the 90th percentile in an exam means that 90% of people scored equal to or less than you, and 10% of people scored more than you.

How to calculate the value at p-th percentile? There are 3 steps to it

Step 1: Sort the data
Step 2: Compute the location of the p-th percentile
Step 3: Calculate the value of the p-th percentile.

After sorting the data, use the following formula to calculate the location.

Lp = (p / 100) × (n + 1)

Here, n is the number of data points.

After calculating the location, we will have a real number with an integer part and a fractional part.
Eg: if Lp = 18.2, then the integer part is 18 and the fractional part is 0.2

Store the integer part (call it i) and the fractional part (call it f) separately.

As a final step, calculate the value of the p-th percentile by interpolating between the i-th and (i+1)-th sorted values:

Value of p-th percentile = x(i) + f × (x(i+1) - x(i))

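Let’s look at a quick example.
Say there are 25 students who appeared for an exam and we store their marks. We want to see the value for the 70th percentile.

Here n = 25 and p = 70, so the location is (70 / 100) × (25 + 1) = 18.2, giving an integer part of 18 and a fractional part of 0.2. The 70th percentile therefore lies 0.2 of the way from the 18th to the 19th sorted mark.

So, the 70th percentile score would be 56.6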
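The three steps can be sketched in Python as below. This assumes the location is computed as (p / 100) × (n + 1), which is consistent with the Lp = 18.2 example for 25 data points; other tools may use slightly different conventions. The function name `percentile` and the marks are hypothetical (the article's own marks are not reproduced here), and clamping to the smallest or largest value at the extremes is my own assumption.

```python
def percentile(values, p):
    """Value at the p-th percentile, using location = (p / 100) * (n + 1)."""
    if not values:
        raise ValueError("the percentile of an empty dataset is undefined")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")

    data = sorted(values)              # Step 1: sort the data
    n = len(data)
    location = p * (n + 1) / 100       # Step 2: location of the p-th percentile
    i = int(location)                  # integer part
    f = location - i                   # fractional part

    # Assumption: outside positions 1..n there is no neighbour to interpolate
    # with, so fall back to the smallest or largest value.
    if i < 1:
        return data[0]
    if i >= n:
        return data[-1]

    # Step 3: data[i - 1] is the i-th sorted value because lists are 0-indexed
    return data[i - 1] + f * (data[i] - data[i - 1])


# 25 hypothetical marks: 32, 34, ..., 80
marks = list(range(32, 82, 2))

# Location is 18.2, so the result sits 0.2 of the way from the 18th mark (66)
# to the 19th mark (68)
print(round(percentile(marks, 70), 1))   # 66.4
```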

Range -

Range = max_value - min_value

If the range is high, we say that the spread is high.

But, the range isn’t the best measure because it is highly sensitive to outliers.
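A quick sketch with made-up scores shows how a single extreme value inflates the range. The helper name `value_range` is hypothetical.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# Hypothetical scores
scores = [10, 12, 12, 13, 15]

print(value_range(scores))            # 5
print(value_range(scores + [100]))    # 90, driven entirely by one outlier
```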

So, what’s the solution? IQR!!!

Inter Quartile Range (IQR) -

  • IQR is not sensitive to outliers, unlike range
  • IQR is defined as the difference between the 75th percentile and the 25th percentile.
  • IQR = Q3 - Q1

Why do we call it the inter-QUARTILE range, though?

Quartile -

  • Quartiles divide the sorted data into 4 equal parts, giving us 3 quartiles.
  • Each part holds 25% of the data, and the 3 cut points are Q1 (25th percentile), Q2 (50th percentile, i.e. the median) and Q3 (75th percentile).
  • Similarly, we have quintiles (5 equal parts, each with 20% of the data) and deciles (10 equal parts, each with 10% of the data).

Now that we know what IQR is, let’s formally define outliers.
Outliers: Any data point x is considered to be an outlier if…
x < Q1 - 1.5*IQR
OR
x > Q3 + 1.5*IQR
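As a sketch, the rule above can be applied with `statistics.quantiles` from Python's standard library (Python 3.8+). Its default `method='exclusive'` is one of several quartile conventions, so other tools may give slightly different Q1 and Q3 for the same data. The function name `find_outliers` and the data are hypothetical.

```python
import statistics


def find_outliers(values):
    """Return the points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)   # the 3 quartiles
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


# Hypothetical data with one suspicious value
data = [12, 15, 17, 19, 20, 22, 23, 25, 28, 30, 95]

q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)             # 17.0 22.0 28.0
print(q3 - q1)                # IQR = 11.0
print(find_outliers(data))    # [95]
```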

Variance -

  • Variance measures how far the data is spread out around the mean.
  • It is calculated by taking the average of the squared deviations from the mean.
  • Variance can also be thought of as a measure of CONSISTENCY. Low variance data are more consistent.
  • Variance is calculated differently for a sample and for a population.

Sample variance: S² = Σ(xᵢ - x̄)² / (n - 1)
Population variance: σ² = Σ(xᵢ - μ)² / n

Here, n is the size of the sample/population, x̄ is the sample mean and μ is the population mean.

The difference between 1/n and 1/(n-1) can be explained using the concepts of inferential statistics, which are out of the scope of this article.

  • Also, note that the unit of variance is not the same as that of the data. Here, the unit is squared.
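Here is a minimal sketch of both calculations, dividing by n - 1 for a sample and by n for a population. The function name `variance`, its `sample` flag and the data are my own choices for illustration; the results are compared with `statistics.variance` and `statistics.pvariance`.

```python
import statistics


def variance(values, sample=True):
    """Average squared deviation from the mean (n - 1 for a sample, n for a population)."""
    n = len(values)
    divisor = n - 1 if sample else n
    if divisor < 1:
        raise ValueError("not enough data points to compute the variance")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / divisor


# Hypothetical data with mean 5 and squared deviations summing to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(data, sample=False))   # 4.0, population variance (32 / 8)
print(variance(data, sample=True))    # sample variance (32 / 7), slightly larger
print(statistics.pvariance(data), statistics.variance(data))   # library equivalents
```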

Standard Deviation -

  • SD is, again, a measure that tells us how much the data varies around the mean.
  • The only difference is that SD is the square root of the variance.
  • S = √S²
  • σ = √σ²
  • Here, the unit is the same as the data.
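A short sketch with made-up data: taking the square root of each variance gives the corresponding SD, which matches `statistics.stdev` (sample) and `statistics.pstdev` (population) from the standard library.

```python
import math
import statistics

# Hypothetical data
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = math.sqrt(statistics.variance(data))        # S = √S²
population_sd = math.sqrt(statistics.pvariance(data))   # σ = √σ²

print(population_sd)                  # 2.0
print(sample_sd)                      # a bit larger, because of the n - 1 divisor
print(statistics.pstdev(data), statistics.stdev(data))  # library equivalents
```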

Effects of Transformation

Finally, as I wind up, it is also important to understand the effects of transformation on the summary statistics, when each data point is scaled and shifted.

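Let’s say that each data point x is scaled by a factor of a and shifted by a constant c, so the new data point is y = a·x + c. Then:

  • Mean, median and mode: new value = a × (old value) + c
  • Range, IQR and standard deviation: new value = |a| × (old value); the shift c has no effect
  • Variance: new value = a² × (old variance); the shift c has no effect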
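These effects are easy to check numerically. The sketch below uses made-up data with a = -3 and c = 10 (a negative scale, to show that the spread measures depend on |a|), and compares each statistic of the transformed data with the value predicted from the original data. The helper `value_range` and the `checks` dictionary are hypothetical names.

```python
import math
import statistics

a, c = -3, 10                          # hypothetical scale and shift
x = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical data
y = [a * value + c for value in x]     # transformed data


def value_range(values):
    return max(values) - min(values)


# Each entry: (statistic of the transformed data, prediction from the original data)
checks = {
    "mean": (statistics.mean(y), a * statistics.mean(x) + c),
    "median": (statistics.median(y), a * statistics.median(x) + c),
    "mode": (statistics.mode(y), a * statistics.mode(x) + c),
    "range": (value_range(y), abs(a) * value_range(x)),
    "variance": (statistics.pvariance(y), a ** 2 * statistics.pvariance(x)),
    "sd": (statistics.pstdev(y), abs(a) * statistics.pstdev(x)),
}

for name, (observed, predicted) in checks.items():
    print(name, observed, predicted, math.isclose(observed, predicted))
```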

Wrap up

Thanks for reading the article.

I had been wanting to resume writing articles for a long time, but my busy schedule didn’t let me. Hopefully, I’ll be writing more articles, gradually increasing the complexity.

This was a fairly easy topic to start with, but an important one!
People who are just entering this field should benefit from it.

