The Pearson correlation coefficient is one of the best-known measures of correlation in data science. It describes the linear correlation between two variables X and Y. As the name suggests, it was developed by Karl Pearson, an English mathematician and statistician.
The Pearson correlation coefficient can assume values ranging from -1 to +1, where -1 indicates perfect negative linear correlation, +1 indicates perfect positive linear correlation, and 0 indicates no linear correlation.
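The formula for the correlation coefficient is given by:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
where:
- $\operatorname{cov}(X, Y)$ is the covariance of X and Y
- $\sigma_X$ is the standard deviation of X, and
- $\sigma_Y$ is the standard deviation of Y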
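To make the definition concrete, the following minimal sketch computes the coefficient directly as the covariance divided by the product of the standard deviations. The sample values and the helper name `pearson_from_cov` are made up for illustration. Population statistics (ddof=0) are used for both the covariance and the standard deviations; any consistent choice gives the same ratio.

```python
import numpy as np


def pearson_from_cov(x, y):
    """Correlation as cov(X, Y) / (sigma_X * sigma_Y), using population statistics."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # covariance of X and Y
    return cov_xy / (x.std() * y.std())  # np.std defaults to ddof=0


# Illustrative values only (not taken from the Jena dataset)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = pearson_from_cov(x, [2 * v + 1 for v in x])   # exact increasing line
r_neg = pearson_from_cov(x, [-3 * v + 7 for v in x])  # exact decreasing line
print(r_pos, r_neg)  # close to +1 and -1 respectively
```

An exact increasing line gives a value of (numerically) +1 and an exact decreasing line gives -1, matching the interpretation of the two extremes.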
Goal and Motivation
The purpose of this exercise was to use the correlation coefficient to quantify whether two variables are correlated, i.e. whether a change in one variable shows a linear relationship with a change in another variable. Such a method is useful for comparing variables such as temperature and air pressure.
I was searching the Kaggle dataset repository for a dataset where I could apply the correlation coefficient and found a dataset from the Weather Archive in Jena. The full dataset and its description can be found here: Weather Archive Jena
The dataset contains 10-minute values from a number of different sensors, recorded by a weather mast at the Max Planck Institute for Biogeochemistry in Jena, Germany. The recorded parameters are shown in Figure 2.
As can be seen in Figure 2, a number of different weather-related parameters are recorded. It is of interest to find out which of these parameters change in relation to each other, and whether any two of them behave similarly.
The above data is available as a .csv file, which was downloaded to a local hard disk and imported into an SQLite Database. For data analysis, I am using the Anaconda Distribution with Python 3.6.1 and Pandas 0.20.1.
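The import step itself is not shown here. One possible way to do it with pandas is sketched below; the helper name `csv_to_sqlite` and the tiny in-memory CSV standing in for the downloaded file (two made-up rows, three columns) are assumptions for illustration only.

```python
import io
import sqlite3

import pandas as pd


def csv_to_sqlite(csv_source, conn, table_name):
    """Read a CSV file (path or file-like object) and write it to an SQLite table."""
    frame = pd.read_csv(csv_source)
    # index=False keeps the DataFrame index out of the table
    frame.to_sql(table_name, conn, if_exists="replace", index=False)
    return len(frame)


# Stand-in for the downloaded file: made-up rows, not real measurements
sample_csv = io.StringIO(
    "Date Time,p(mbar),T(degC)\n"
    "01.01.2009 00:10:00,996.5,-8.0\n"
    "01.01.2009 00:20:00,996.6,-8.4\n"
)
conn = sqlite3.connect(":memory:")  # in-memory database for the example
rows_written = csv_to_sqlite(sample_csv, conn, "jena_climate")
loaded = pd.read_sql("select * from jena_climate", conn)
print(rows_written, list(loaded.columns))
```

For the real file, a path to the downloaded .csv and a connection to the on-disk database would take the place of `sample_csv` and the in-memory connection.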
The data was stored in a table named “jena_climate”. To load the dataset, the following command was used:
import sqlite3
import pandas as pd

conn = sqlite3.connect('D:\\Data Mining\\Datasets\\datasets.db')
data = pd.read_sql(sql='select * from jena_climate', con=conn)
The above command stores the data in a pandas DataFrame called “data”. This makes it easy to manipulate the data and to plot one variable against another as a scatter plot. The following command draws a scatter plot with the temperature T (°C) on the x-axis and the air pressure p (mbar) on the y-axis:
data.plot(x = 'T(degC)', y = 'p(mbar)', kind = 'scatter')
This results in the following figure.
The scatter plot gives us a rough idea about whether the two variables are related or not. Evidently, there is not much correlation to be found between temperature and air pressure, as both are spread over a wide range with no apparent relationship between them. However, instead of a visual analysis, we want to determine the quantitative relationship between these two variables. This is where the Pearson correlation coefficient is utilized.
A Python function was written that takes two vectors as input and returns their correlation as output. The function is based on the equation for the Pearson correlation coefficient of a sample.
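$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where:
- $r_{xy}$ is the Pearson correlation coefficient of a sample
- $x_i$ is the ith element in the vector X
- $\bar{x}$ is the mean of the vector X
- $y_i$ is the ith element in the vector Y
- $\bar{y}$ is the mean of the vector Y
- $n$ is the total number of elements in X and Y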
The above formula was coded in Python as follows:
def pearson(x, y):
    x_hat = sum(x)/len(x)
    y_hat = sum(y)/len(y)
    x_minus_x_hat = [xi - x_hat for xi in x]
    y_minus_y_hat = [yi - y_hat for yi in y]
    numerator = [xi*yi for xi, yi in zip(x_minus_x_hat, y_minus_y_hat)]
    numerator = sum(numerator)
    x_minus_x_hat_sq = [pow(xi, 2) for xi in x_minus_x_hat]
    y_minus_y_hat_sq = [pow(yi, 2) for yi in y_minus_y_hat]
    sum_x_den = sum(x_minus_x_hat_sq)
    sum_y_den = sum(y_minus_y_hat_sq)
    denominator = pow(sum_x_den, 0.5)*pow(sum_y_den, 0.5)
    return numerator/denominator
Let us use this function to calculate the correlation coefficient between the vectors temperature and air pressure.
pearson(data['T(degC)'], data['p(mbar)'])
# -> -0.044999999999999998
As expected, the Pearson correlation coefficient between temperature and air pressure is approximately -0.05, which indicates practically no linear correlation.
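A hand-written function is easy to get subtly wrong, so it can be worth cross-checking the formula against library routines. The sketch below does that on synthetic data (the Jena table is not assumed to be available here); `pearson_np` is a hypothetical vectorised variant of the same sample formula, compared with `numpy.corrcoef` and `pandas.Series.corr`.

```python
import numpy as np
import pandas as pd


def pearson_np(x, y):
    """Vectorised form of the sample Pearson formula."""
    dx = np.asarray(x, dtype=float) - np.mean(x)
    dy = np.asarray(y, dtype=float) - np.mean(y)
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))


# Synthetic data, not the Jena measurements
rng = np.random.default_rng(0)
a = pd.Series(rng.normal(size=1000))
b = 0.5 * a + pd.Series(rng.normal(size=1000))

r_manual = pearson_np(a, b)
r_numpy = np.corrcoef(a, b)[0, 1]
r_pandas = a.corr(b)  # Pearson is the default method
print(r_manual, r_numpy, r_pandas)  # the three values should agree closely
```

On a table with several hundred thousand rows, the vectorised form is also likely to be noticeably faster than looping over Python lists.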
The Correlation Matrix
The correlation matrix is a useful construct for examining correlations when we have a large number of variables (in this case, 14). It is an n x n matrix in which the entry in row i and column j is the correlation coefficient between variable i and variable j; the diagonal entries are therefore the correlation of each variable with itself. For our current dataset, the matrix will have a size of 14 x 14.
import numpy as np

data = data[data.columns[1:]]  # drop timestamps column
_, num_columns = data.shape

# Create a matrix of zeros of size n x n.
# The zeros will be replaced by the correlation coefficients.
cor_mat = np.zeros((num_columns, num_columns))
for i in range(num_columns):
    for j in range(num_columns):
        cor_mat[i][j] = pearson(data[data.columns[j]], data[data.columns[i]])
When the above code is run on the dataset, the following correlation matrix is obtained.
As can be seen from the above figure, all the diagonal entries are 1, as expected. In addition, some pairs of different variables show strong positive or negative correlation with each other.
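pandas also offers `DataFrame.corr()`, which computes this kind of matrix in a single call, and the strongly correlated pairs can then be read off programmatically rather than by eye. The following sketch uses a small synthetic DataFrame; the column names, the 0.8 threshold and the helper `strong_pairs` are illustrative assumptions, not part of the analysis above.

```python
import numpy as np
import pandas as pd


def strong_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for distinct column pairs with |r| >= threshold."""
    cor = frame.corr()  # Pearson correlation matrix
    cols = list(cor.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = cor.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs


# Synthetic columns, chosen so that some pairs are clearly related
rng = np.random.default_rng(1)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "a": base,
    "b": 2.0 * base + rng.normal(scale=0.1, size=500),  # strongly tied to a
    "c": -base + rng.normal(scale=0.1, size=500),       # strongly negative
    "d": rng.normal(size=500),                          # unrelated noise
})
for col_a, col_b, r in strong_pairs(demo):
    print(col_a, col_b, round(r, 3))
```

Only the upper triangle is scanned because the matrix is symmetric, so each pair is reported once.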
Some of the interesting data are plotted and illustrated in a gallery. It is left up to the reader to contemplate if the Pearson correlation coefficient is correctly identifying the relationship between two variables or not.
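One point worth keeping in mind when making that judgement is that the coefficient only measures linear association. The made-up example below shows a variable that is completely determined by another (y = x²) and still has a coefficient of essentially zero, because the relationship is symmetric around zero rather than linear.

```python
import numpy as np

# Illustrative data: x is symmetric around zero, y depends on x exactly
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

r_quadratic = np.corrcoef(x, y)[0, 1]
r_linear = np.corrcoef(x, 3.0 * x - 2.0)[0, 1]
print(r_quadratic)  # essentially 0 despite the exact dependence
print(r_linear)     # essentially 1 for an exact increasing line
```

A coefficient near zero therefore rules out a linear relationship, not a relationship of any kind, which is why looking at the scatter plots remains useful.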
 Wikipedia: Pearson correlation coefficient. Web: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient Last access: 18.02.2018
 Kaggle: Weather Archive Jena. Web: https://www.kaggle.com/pankrzysiu/weather-archive-jena Last access: 18.02.2018