Principal Component Analysis

S.No  Lesson Title
1     Introduction
1.1   Challenges Faced By High-Dimensional Feature Sets
2     Principal Component Analysis (PCA)
2.1   Steps in PCA
3     PCA On MNIST Data
3.1   Standardization
3.2   Obtaining The Covariance Matrix
3.3   Calculating Eigenvectors and Eigenvalues
3.4   Computing the Principal Components
4     Practical Example
4.1   Importing Data
4.2   Exploratory Data Analysis
4.3   PCA
4.4   Obtaining New Dataset
5     Conclusion

"We prefer to listen to a concise explanation of a story than a lengthy and complicated one. Even the predictive models expect the same, the curated low dimensional data than having sparse high dimensional data."


1. Introduction

Dimension refers to the number of input features, attributes, or variables. Dimensionality reduction involves reducing the number of features of a dataset to form a subset that still represents the original high-dimensional data. High-dimensional data imposes numerous challenges on predictive models, generally referred to as the "curse of dimensionality", a term coined by the mathematician Richard Bellman. To be more specific, as the dimension of the data increases, the complexity of analyzing, organizing, and visualizing the data increases. In the next section, we will discuss the shortcomings of high-dimensional data, followed by a method to efficiently reduce the dimension of the data.

1.1 Challenges Faced By High-Dimensional Feature Sets

Some of the challenges posed by high-dimensional feature sets are as follows:

  1. Large datasets cause high computation time and memory overhead during the training process.
  2. Making the model account for so many features or variables when making a decision can lead to poor real-time performance.
  3. The complexity of the model needs to be increased to handle high-dimensional data, which leads to overfitting and poor generalizability.
  4. The features may be correlated, so the model can become biased towards correlated features and the results may be unreliable.

"When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the "essence" of the data. This is called dimensionality reduction."

- Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series)


Now that we know the challenges associated with high-dimensional data, we need a method that summarizes the dataset and generates a subset that represents the original high-dimensional data and can be analyzed easily. This is where Principal Component Analysis (PCA) comes into play. In the next section, we will walk through the working of PCA.

"Dimensionality reduction yields a more compact, more easily interpretable representation of the target concept, focusing the user's attention on the most relevant variables"

- Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems)


2. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a method from linear algebra for significantly reducing the number of dimensions in a dataset while retaining most of the important information. PCA is widely used in machine learning, pattern recognition, and signal processing to create a lower-dimensional projection of a dataset before fitting a model. In PCA, we create a set of new features, called Principal Components, from the input dataset by solving an eigenvector/eigenvalue problem. The principal components capture the most important information in the dataset, so that as little information as possible is lost while reducing dimensions. In PCA, we project our data points onto a suitable hyperplane to reduce the dimensions. For example, if we want to reduce the dimensions from 3 to 2, we project the data points from the 3D coordinate system onto a 2D hyperplane. We choose the hyperplane that preserves the most variance in the data, so that we retain as much information as possible.
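To make the 3D-to-2D example concrete, here is a minimal sketch using scikit-learn's PCA on a small, randomly generated dataset (the data and variable names are purely illustrative, not part of the examples that follow):

# A minimal sketch of the 3D-to-2D projection described above, using a small
# randomly generated dataset (hypothetical data, for illustration only)

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
points_3d = rng.randn(100, 3)            # 100 points in a 3D coordinate system

pca = PCA(n_components=2)                # project onto the best 2D hyperplane
points_2d = pca.fit_transform(points_3d)

print(points_2d.shape)                        # (100, 2)
print(pca.explained_variance_ratio_.sum())    # fraction of the variance retained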

2.1 Steps in PCA

Let us first get a brief idea of the steps involved in PCA; after that, we will take a closer look at each step.

  1. We compute a matrix that summarizes how our variables relate to each other.
  2. We then break this matrix down into directions and magnitudes, which helps us understand the importance of each direction.
  3. Imagine two candidate directions, say a red line and a green line, drawn through the data. We will transform our data so that it is projected onto the better of the two lines.
  4. Now we have to decide which line is the better fit. We need to reduce the features of our data in such a way that the new, reduced dataset retains the maximum amount of variance. We can check this by projecting the data onto both lines and seeing how much variance the projected data has in each case.
  5. If we project our points onto the red line, the spread of the projected points is greater than it is on the green line. So, the red line is the better line to project our data onto.
  6. Finally, we transform our data so that it is aligned with the red line.

Now let us take an in-depth look at the steps involved in performing PCA:

  1. Standardization of Data
  2. Obtaining the covariance matrix
  3. Calculating the Eigenvectors and Eigenvalues
  4. Computing the Principal Components

3. PCA On MNIST Data

To understand the working of PCA, we will perform every step on the MNIST dataset, loaded through the Keras framework. Each MNIST image has 784 features (28 x 28 pixels); we will work with 10000 training images and reduce the number of features from 784 to 2 using PCA.

3.1 Standardization

Standardization is a feature scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of each attribute becomes 0 and the new distribution has a standard deviation of 1.
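For intuition, here is a minimal sketch of the computation a standard scaler performs, applied to a small random matrix in place of the MNIST pixels (the variable names and values are illustrative only):

#A minimal sketch of what standardization computes, using a small random
#matrix in place of the MNIST pixels (illustrative values only)

import numpy as np

X_demo = np.random.rand(5, 3)            # 5 samples, 3 features

mu = X_demo.mean(axis=0)                 # per-feature mean
sigma = X_demo.std(axis=0)               # per-feature standard deviation
sigma[sigma == 0] = 1.0                  # guard against constant features

X_demo_std = (X_demo - mu) / sigma

print(X_demo_std.mean(axis=0))           # approximately 0 for every feature
print(X_demo_std.std(axis=0))            # approximately 1 for every feature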

Let us perform standardization on the MNIST dataset.

	
#Here we are importing the MNIST dataset from keras and loading the train and test data from it

from keras.datasets import mnist

(X_train,Y_train),(X_test,Y_test)=mnist.load_data()

#We are only performing PCA on the train data right now.
#The full training set has 60000 images; to keep the computation light (and to match
#the shapes shown below) we work with the first 10000 training samples.

# Now let us reshape our train data from (10000, 28, 28) to (10000, 784)

X=X_train[:10000].reshape(-1,28*28)
Y=Y_train[:10000]

#Now let us import StandardScaler from Sklearn and standardize our dataset

from sklearn.preprocessing import StandardScaler

sc=StandardScaler()

#We are standardizing our train data and storing it in a variable X_

X_=sc.fit_transform(X)
	

The training data is now standardized: it has 10000 rows (training samples) and 784 columns (features).

3.2 Obtaining The Covariance Matrix

Now our task is to obtain the covariance matrix for our dataset.

The covariance matrix tells us how our features relate to one another. It is obtained by multiplying the transpose of the standardized matrix with the standardized matrix itself (X_.T dot X_). Strictly speaking, the sample covariance also divides by n − 1, but this scaling does not affect the eigenvectors, so we skip it here. Since our dataset has 784 features, our covariance matrix will be of shape (784, 784).

	
#Let us import numpy and compute the covariance matrix

import numpy as np

#Now let us calculate the covariance matrix.

covar=np.dot(X_.T,X_)

covar.shape

Output: (784, 784)

As we can see, our covariance matrix has the shape (784, 784).
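As an optional check, we can compare this with NumPy's built-in covariance routine, which computes the same matrix up to the conventional 1/(n − 1) normalization (this snippet assumes X_ and covar from the cells above):

#Optional sanity check: np.cov gives the same matrix up to the 1/(n - 1) factor.
#np.cov expects variables as rows, so we pass the transpose of X_.

n = X_.shape[0]

cov_np = np.cov(X_.T)

print(np.allclose(cov_np, covar / (n - 1)))    # should print True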

3.3 Calculating Eigenvectors and Eigenvalues

Now we have to calculate the eigenvectors and the corresponding eigenvalues of our covariance matrix. Calculating eigenvectors and eigenvalues is important for the following reason:

We need to find the best-fit line onto which we can project our dataset while keeping the maximum variance. An eigenvector indicates the direction of such a line, and the corresponding eigenvalue tells us how spread out the data is along that line.

We will decompose our covariance matrix X_ᵀX_ into PDP⁻¹, where P is the matrix of eigenvectors and D is a diagonal matrix with the eigenvalues on the diagonal and 0 at every other position.

The task of calculating the eigenvectors can be carried out with Singular Value Decomposition (SVD) from NumPy. The svd function returns three values; the first of them contains our eigenvectors.

	
#Let us import svd from numpy

from numpy.linalg import svd

#Here we are feeding our covariance matrix to the SVD function. SVD returns 3 values u, s and v; our eigenvectors are the columns of u.

u,s,v=svd(covar)

u.shape

Output: (784, 784)

The matrix u is an n x n matrix, where n is the total number of dimensions/principal components. We choose the first k columns of the matrix, which represent the k dimensions we need.
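Since the covariance matrix is symmetric and positive semi-definite, the columns of u are exactly its eigenvectors and the entries of s are the corresponding eigenvalues. A quick sketch of how this can be verified (assuming covar, u and s from the cells above):

#Quick check: each column u[:, i] should satisfy covar . u[:, i] = s[i] . u[:, i]

i = 0
print(np.allclose(np.dot(covar, u[:, i]), s[i] * u[:, i]))    # should print True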

Now, since we want to reduce our 784 features to 2, we will take the first 2 columns of the u matrix.

	
# Here we are extracting the first 2 columns of our u matrix and storing them in a variable called ured

ured=u[:,:2]

ured.shape

Output: (784, 2)

As we can see, we have extracted the first 2 columns, i.e. the top 2 principal directions.

3.4 Computing the Principal Components

In this step, we will convert our train data from shape (10000, 784) to (10000, 2), reducing the number of features from 784 to 2. This is achieved by multiplying our train data with the eigenvectors computed above.

	
z=np.dot(X_,ured)

z.shape

Output: (10000, 2)

As we can see, the dimensions of our dataset have been reduced from (10000, 784) to (10000, 2). Now let us look at the reduced dataset.

	
# We will make a dataframe of our reduced data in order to see it better.

import pandas as pd

#Here we are horizontally stacking our dataset along with our target values

new_data=np.hstack((z,Y.reshape(-1,1)))

#Here we are labelling each column. Feature 1 is PC1 and Feature 2 is PC2.

dataframe=pd.DataFrame(new_data,columns=['PC1','PC2','label'])

dataframe.head()
       PC1      PC2  label
0  5.45831   -6.414      7
1  -2.8044  8.02885      2
2  7.41124  3.86404      1
3  -8.7512   -0.046      0
4  0.06576  -6.2963      4

As we can see, our dataset has now been reduced to 2 features.
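To see what these two principal components capture, a minimal plotting sketch such as the following can be used (it assumes the dataframe built above and uses matplotlib, which has not been imported earlier in this walkthrough):

#A minimal visualization sketch: plotting the two principal components,
#coloured by the digit label stored in the dataframe built above.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
points = plt.scatter(dataframe['PC1'], dataframe['PC2'],
                     c=dataframe['label'], cmap='tab10', s=4)
plt.colorbar(points, label='digit')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()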

4. Practical Example

To demonstrate the working of PCA on a real problem, we will take the House Prices dataset and perform a house price prediction study. The model will train on various features of past transactions and the associated prices. The Linear Regression algorithm can then be used to build a model that predicts future property prices.

However, this input comes with 81 fields. That is a very large number of predictor features and will make the model complex. We intend to perform Principal Component Analysis on this dataset to create a small set of derived features that still retains most of the information required to build an effective model. This set of features can then be fed into the Linear Regression algorithm to build the prediction model.

4.1 Importing Data

The first step is to import all the libraries we require.

  # Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import sklearn
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
  

Let us now import the dataset into a DataFrame.

  
# Here we are importing our dataset into a variable called houseprice

houseprice = pd.read_csv("HousePrice.csv", encoding ="utf-8")
houseprice
  

Output:

Unnamed: 0 Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig
0 1 60 RL 65 8450 Pave Reg Lvl AllPub Inside
1 2 20 RL 80 9600 Pave Reg Lvl AllPub FR2
2 3 60 RL 68 11250 Pave IR1 Lvl AllPub Inside
3 4 70 RL 60 9550 Pave IR1 Lvl AllPub Corner
4 5 60 RL 84 14260 Pave IR1 Lvl AllPub FR2

5 rows × 62 columns

As we can see, our dataset has 62 columns.

4.2 Exploratory Data Analysis

We need to understand the feature types in the DataFrame, since the kind of preprocessing we have to undertake depends on them.


houseprice.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 62 columns):
Unnamed: 0       1460 non-null object
Unnamed: 0.1     1460 non-null int64
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64

There are a lot of textual fields (objects) in our dataset. They cannot be fed into the model directly and hence must be converted into meaningful numeric fields.

Let us now check for Missing values:


# Check variables for Nulls

houseprice.isnull().sum().sort_values(ascending=False).head(15)

Output:


PoolQC          1453
Alley           1369
FireplaceQu      690
LotFrontage      259
GarageFinish      81
GarageQual        81
GarageCond        81
GarageYrBlt       81
MasVnrArea         8
ExterQual          0
Neighborhood       0
RoofStyle          0
ExterCond          0
BsmtFinSF1         0
YearRemodAdd       0
dtype: int64

The features PoolQC, Alley, FireplaceQu, and LotFrontage have a lot of missing values, which reduces their usefulness. Hence we will drop these features.


# Drop fields with High Null values

houseprice = houseprice.drop(['PoolQC', 'Alley', 'FireplaceQu', 'LotFrontage'], axis=1)

Drop the 'Id' field as it does not influence the house price; it is only a record identifier:


houseprice = houseprice.drop(['Id'], axis=1)

The next step is to convert the textual/categorical features into numeric ones. There are two methods that we use:

  1. For categorical values that have a degree (an inherent order) associated with them, we encode them with increasing numeric values.
  2. For categorical values that do not have any order associated with them, we perform 'one-hot encoding', creating corresponding dummy variables.

Converting Features with Degree:


# Map ordinal variables to discrete numeric values

houseprice['MSZoning'] = houseprice['MSZoning'].map({'C (All)': 0, 'FV': 1, 'RH': 2, 'RL': 3, 'RM': 4})

houseprice['LandSlope'] = houseprice['LandSlope'].map({'Gtl': 0, 'Mod': 1, 'Sev': 2})

houseprice['ExterQual'] = houseprice['ExterQual'].map({'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

houseprice['ExterCond'] = houseprice['ExterCond'].map({'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

houseprice['HeatingQC'] = houseprice['HeatingQC'].map({'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

houseprice['CentralAir'] = houseprice['CentralAir'].map({'Y': 0, 'N': 1})

houseprice['KitchenQual'] = houseprice['KitchenQual'].map({'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

houseprice['GarageFinish'] = houseprice['GarageFinish'].map({'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3})

houseprice['GarageQual'] = houseprice['GarageQual'].map({'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

houseprice['GarageCond'] = houseprice['GarageCond'].map({'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

For the categorical features without a degree, we will create one-hot encodings.


# Create Dummy Variables for Categorical-Nominal Variables

varsforDummy = ['Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'Neighborhood', 'RoofStyle', 'Heating', 'PavedDrive']

houseprice = pd.get_dummies(houseprice, columns = varsforDummy, drop_first = True)

Once the transformation of the textual categorical features is sorted out, there is one more preprocessing step to perform: feature scaling. The original dataset has features on very different scales, and features with a larger scale would have an undesirably high influence on the derivation of the principal components. The features fed into the PCA process must therefore be brought onto a common scale; here we use Min-Max normalization, which maps every feature into the 0-1 range, as shown in the sketch below and in the scaling code that follows.
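For intuition, here is what Min-Max scaling computes for a single column (a toy example with made-up values, not taken from the house-price data):

#A minimal sketch of what Min-Max scaling computes for a single column
#(toy values, purely for illustration)

import numpy as np

col = np.array([10.0, 20.0, 35.0, 50.0])

col_scaled = (col - col.min()) / (col.max() - col.min())

print(col_scaled)    # every value now lies between 0 and 1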


# Scale the Numeric Variables - use the Min-Max / Normalization method of Scaling

all_cols = houseprice.columns

scaler = MinMaxScaler()

houseprice[all_cols] = scaler.fit_transform(houseprice)

Now that the scaling of the data is done, we are ready to perform PCA. Before we go ahead, we have to split the DataFrame into the target and predictor features.


# Create the X_train (predictor variable set) and y_train (target variable set)

y = houseprice.pop('SalePrice')

X = houseprice

X

Output:

MSSubClass MSZoning LotArea LandSlope LotFrontage OverallQual OverallCond YearBuilt YearRemodAdd LandContour
0 0.235294 0.666667 0.03342 0 0.666667 0.5 0.949275 0.883333 0.1225
1 0 0.666667 0.038795 0 0.555556 0.875 0.753623 0.433333 0
2 0.235294 0.666667 0.046507 0 0.666667 0.5 0.934783 0.866667 0.10125
3 0.294118 0.038561 0.038561 0 0.666667 0.5 0.311594 0.333333 0
4 0.235294 0.666667 0.060576 0 0.777778 0.5 0.927536 0.833333 0.21875

The above data shows that all our values are now in the range 0-1.

4.3 PCA

As the first step towards performing PCA, we create an object of type PCA and fit it on the predictor features. We are limiting the PCA to 40 components. The rationale behind 40 components is that we have already iterated a few times with different values of n_components, and 40 components explain roughly 90% of the variance in the data, which is decent.


pca = PCA(random_state=100, n_components=40)

pca.fit(X)

Output:


PCA(copy=True, iterated_power='auto', n_components=40, random_state=100,

svd_solver='auto', tol=0.0, whiten=False)

Let us print the principal components. These are the principal axes in feature space, representing the directions of maximum variance in the data; the components are sorted by explained variance.


pca.components_

Output:


array([[ 1.64838067e-02,  8.02437311e-02, -1.11982823e-02, ...,
6.12264874e-03,  3.41843357e-02, -1.46312291e-01],
[-8.44601563e-02,  2.81142939e-02,  8.34655678e-03, ...,
1.15928788e-04,  1.75158544e-03, -2.16444039e-02],
[-7.23189984e-02,  3.95209530e-02,  2.02969279e-02, ...,
8.14608176e-04,  4.42426183e-03, -2.22358428e-02],
...,
[-1.78650232e-02, -1.02373406e-01, -1.08968597e-02, ...,
1.30213359e-02,  2.40800538e-01,  2.56673770e-02],
[-3.50197536e-02, -3.21968452e-02, -1.78078591e-02, ...,
1.38228502e-02,  3.87850780e-02,  8.35435156e-02],
[ 2.25941593e-02,  3.89966300e-02, -1.94527083e-03, ...,
-2.18023111e-03,  8.96040700e-02,  5.01139594e-02]])

Printing the Explained Variance Ratio: Percentage of variance explained by each of the selected components.

pca.explained_variance_ratio_
array([0.14808837, 0.09124268, 0.07671718, 0.04850863, 0.04604035,
0.03457759, 0.0316386 , 0.02983203, 0.02584193, 0.02497396,
0.02412199, 0.02212304, 0.02017013, 0.01867444, 0.01810104,
0.01622929, 0.01578623, 0.01536094, 0.01440941, 0.01382469,
0.01279342, 0.01259371, 0.01210324, 0.01136639, 0.01126208,
0.01032889, 0.01018668, 0.00966271, 0.00883392, 0.00860221,
0.00800755, 0.00761721, 0.00721252, 0.00697055, 0.00639642,
0.00597645, 0.00589985, 0.00562679, 0.0053806 , 0.00519324])

Print other Attributes to understand the PCA better.


print("n_components_", pca.n_components_)

print("n_features_", pca.n_features_)

print("n_samples_", pca.n_samples_)

Output:


n_components_ 40

n_features_ 93

n_samples_ 1460

Now we plot a bar plot showing how much of the variance in the training data is explained by each principal component, in decreasing order of explained variance.


# Bar Plot to display how much percentage of overall (100%) variance is explained by each Principal Component

plt.bar(range(1,len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_)


#Here we are taking the cumulative sum of the explained_variance_ratio in order to see how much of the total variance is explained by the first k components taken together

var_cumu = np.cumsum(pca.explained_variance_ratio_)

The X-axis is the sequence of principal components and the Y-axis is the percentage of variance explained by the corresponding principal component.

All Principal Components together explain 100% variance.

We also plot a 'Scree Plot' to show more clearly that the 40 components we have selected do explain roughly 90% of the variance.


# Scree plot

plt.plot(range(1,len(var_cumu)+1), var_cumu)

Output: a scree plot of the cumulative explained variance against the number of principal components retained.
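Instead of reading the threshold off the plot by eye, we can also compute it directly from var_cumu; a small sketch (assuming the arrays defined above):

#The smallest number of retained components whose cumulative explained
#variance reaches 90%. If the threshold were never reached within the
#retained components, searchsorted would return len(var_cumu).

n_for_90 = int(np.searchsorted(var_cumu, 0.90)) + 1

print("Components needed for ~90% variance:", n_for_90)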

Once the PCA exploration is complete, we would like to obtain the dataset expressed in terms of the new principal components, to be used for further model building.

4.4 Obtaining New Dataset

For this, we apply fit_transform on a new PCA object. In this case, we have kept the n_components value at 6 to keep our code simple. Ideally, we should have taken 40, as that is the number that gave us about 90% of the variance explained.


pc2 = PCA(n_components=6, random_state=100)

newdata = pc2.fit_transform(X)

newdata.shape

Output:

(1363, 6)

Here, 'newdata' is a NumPy array.

Let us convert this NumPy array into a DataFrame with named columns.


# Convert the array into a DataFrame with named column headers

df = pd.DataFrame(newdata, columns=["PC1", "PC2", "PC3", "PC4", "PC5", "PC6"])

df.head()

Output:

        PC1       PC2       PC3       PC4       PC5       PC6
0  -0.39287  -0.54288  -0.67529  -0.18972  0.128308  0.380955
1  0.038777  -0.20136  0.304561  -0.69406  0.362831   -0.0886
2   -0.7441  -0.58105  -0.03557  0.266747  -0.22866  0.409709
3  -0.01995  -0.24946  0.932655  -0.01506  0.163128  -0.18449
4  -0.97228  -0.42467  0.679669  -0.22462  0.339433  -0.39399

This is the reduced dataset, held in the 'df' DataFrame.

Thus, we were able to reduce the number of features in our dataset from 81 to 6 using PCA.
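As a closing illustration of the workflow described at the beginning of this example, here is a minimal sketch of how the reduced feature set could be fed into a Linear Regression model. It assumes df (the 6 principal components) and y (SalePrice) from above are aligned row for row; in a real workflow, any rows dropped because of missing values must be removed from both before this step.

#A minimal sketch of the downstream modelling step: train Linear Regression
#on the principal components. Assumes df and y are aligned row for row.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.2, random_state=100)

lr = LinearRegression()
lr.fit(X_tr, y_tr)

print("R^2 on the held-out data:", lr.score(X_te, y_te))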

5. Conclusion

In this article, we have seen the challenges posed by high-dimensional data and the necessity of dimensionality reduction. We have discussed how Principal Component Analysis addresses the curse of dimensionality through an orthogonal transformation of the data. Furthermore, we have demonstrated the working and efficacy of PCA on the MNIST and House Prices datasets. We hope this article helps you understand the concept and implementation of PCA.

References

  1. https://www.cs.cmu.edu/~elaw/papers/pca.pdf
  2. https://keras.io/
  3. http://yann.lecun.com/exdb/mnist/
