
PCA and LDA

Today we look at a breast cancer dataset and try to classify it. The primary application of this dataset is binary classification: machine learning models are trained to predict whether a breast tumor is malignant (cancerous) or benign (non-cancerous), based on features extracted from digitized images of fine needle aspirate (FNA) samples.

Import important packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# all the machine learning stuff
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
data_handler = load_breast_cancer()
df = pd.DataFrame(data_handler.data, columns=data_handler.feature_names)
# df['target'] = data_handler.target

To get a feeling for the dataset, let’s print the first five rows:

print(df.head())
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0                 0.07871  ...         25.38          17.33           184.60   
1                 0.05667  ...         24.99          23.41           158.80   
2                 0.05999  ...         23.57          25.53           152.50   
3                 0.09744  ...         14.91          26.50            98.87   
4                 0.05883  ...         22.54          16.67           152.20   

   worst area  worst smoothness  worst compactness  worst concavity  \
0      2019.0            0.1622             0.6656           0.7119   
1      1956.0            0.1238             0.1866           0.2416   
2      1709.0            0.1444             0.4245           0.4504   
3       567.7            0.2098             0.8663           0.6869   
4      1575.0            0.1374             0.2050           0.4000   

   worst concave points  worst symmetry  worst fractal dimension  
0                0.2654          0.4601                  0.11890  
1                0.1860          0.2750                  0.08902  
2                0.2430          0.3613                  0.08758  
3                0.2575          0.6638                  0.17300  
4                0.1625          0.2364                  0.07678  

[5 rows x 30 columns]

It is good practice to work with pandas DataFrames. They behave very similarly to numpy arrays. To avoid confusion for beginners, we continue with a numpy array.

data = df.values
print(data.shape)
(569, 30)

The array has 569 samples with 30 features.

Data preprocessing

Recall: We need every feature to be mean-free and to have variance 1.
TODO: Check that the data is mean-free and has a standard deviation of 1.

scaler = StandardScaler()  # create object
normalized_df = scaler.fit_transform(
    data
)  # calculate new values (fit) and apply it (transform)
# The mean is taken such that every feature has mean 0
# If you see an array with 30 entries, you chose the right axis
print(np.mean(normalized_df, axis=0))
print(np.std(normalized_df, axis=0))
[-3.16286735e-15 -6.53060890e-15 -7.07889127e-16 -8.79983452e-16
  6.13217737e-15 -1.12036918e-15 -4.42138027e-16  9.73249991e-16
 -1.97167024e-15 -1.45363120e-15 -9.07641468e-16 -8.85349205e-16
  1.77367396e-15 -8.29155139e-16 -7.54180940e-16 -3.92187747e-16
  7.91789988e-16 -2.73946068e-16 -3.10823423e-16 -3.36676596e-16
 -2.33322442e-15  1.76367415e-15 -1.19802625e-15  5.04966114e-16
 -5.21317026e-15 -2.17478837e-15  6.85645643e-16 -1.41265636e-16
 -2.28956670e-15  2.57517109e-15]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
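Under the hood, StandardScaler simply subtracts each column's mean and divides by its standard deviation. A minimal sketch on made-up data (the random toy matrix is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy data, not the cancer set

# equivalent of StandardScaler().fit_transform(X):
# subtract the column mean, divide by the column standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```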

PCA

Let’s do the PCA:

pca = PCA(n_components=10)  # create object
data_pca = pca.fit_transform(
    normalized_df
)  # calculate new values (fit) and apply it (transform)
print(data_pca.shape)
print(pca.explained_variance_ratio_)
(569, 10)
[0.44272026 0.18971182 0.09393163 0.06602135 0.05495768 0.04024522
 0.02250734 0.01588724 0.01389649 0.01168978]

Make sure that the retained components explain at least 80% of the variance in your data set.

print(np.sum(pca.explained_variance_ratio_))
0.9515688143366668
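Instead of summing the ratios by hand, scikit-learn's PCA also accepts a float `n_components` between 0 and 1 and then keeps just enough components to reach that fraction of explained variance. A sketch reproducing the preprocessing above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

# a float n_components asks PCA for the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.80)
X_pca = pca.fit_transform(X)
print(X_pca.shape[1])  # 5
```

With the explained-variance ratios printed above, the first four components reach about 79%, so five are needed to pass 80%.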
def plot_pca(scores, label_components=np.array([]), n_first=3):
    """
    Plots the n_first components against each other.
    Arguments:
    scores: PCA-transformed data
    label_components: this information is written to the axis labels,
        e.g. pca.explained_variance_ratio_
    n_first: number of components which should be plotted against each other
        (number of plots = n_first*(n_first-1)/2)
    """
    if label_components.shape[0] == 0:
        label_components = np.arange(n_first) + 1
    else:
        label_components = np.round(label_components, 2)
    if n_first > label_components.shape[0]:
        n_first = label_components.shape[0]
    for i in range(n_first):
        for j in range(i + 1, n_first):
            plt.figure()
            plt.scatter(scores[:, i], scores[:, j])
            plt.xlabel(f"PC{i + 1} {label_components[i]}")
            plt.ylabel(f"PC{j + 1} {label_components[j]}")
            plt.title(f"PC{i + 1} vs PC{j + 1}")
    plt.show()
plot_pca(data_pca, label_components=pca.explained_variance_ratio_, n_first=3)

So far, not much structure can be seen: the classes do not separate in these plots. For that, an LDA is helpful.

LDA

Recall that the classification labels are stored in

print(data_handler["target"][0:23])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0]
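Before classifying, it is worth checking how balanced the two classes are. A small sketch counting the labels (in this loader, 0 = malignant and 1 = benign, as listed in `target_names`):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data_handler = load_breast_cancer()
# count how many samples fall into each class
counts = np.bincount(data_handler.target)
print(counts)  # [212 357]  -> 212 malignant, 357 benign
```

The classes are moderately imbalanced, which is one reason plain accuracy should be read with care.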

Because we are doing classification, we subdivide our dataset into a training set and a test set. (Usually, you would do this BEFORE the PCA.)

X_train, X_test, y_train, y_test = train_test_split(
    data_pca, data_handler["target"], test_size=0.5, random_state=42
)
# How many components do we need?
lda = LinearDiscriminantAnalysis(n_components=1)
data_lda = lda.fit_transform(X_train, y_train)
print(data_lda.shape)
samples = np.arange(data_lda.shape[0])
plt.scatter(samples, data_lda, c=y_train)
plt.title("result LDA training data")
plt.xlabel("samples")
plt.ylabel("LDA_calculation")
plt.show()
(284, 1)

Where is the decision boundary?

# look at the decision boundary
lda_prediction = lda.predict(X_train)
mask = np.argsort(lda_prediction)
lda_prediction = lda_prediction[mask]
data_lda = data_lda[mask]

# now both arrays are sorted
mask2 = lda_prediction < 0.5
bdry = np.max(data_lda[mask2])
print(bdry)
plt.figure()
plt.plot(data_lda, lda_prediction, "o")
plt.title("Decision boundary")
plt.xlabel("LDA_calculation")
plt.ylabel("LDA_prediction")
plt.show()
6.195117075332792

Explain the last plot. What is the difference between the transformed data and the prediction?
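The transformed scores and the predicted labels can also be related through the classifier's `decision_function`: it returns one signed score per sample, and the decision boundary is exactly where that score crosses zero. A minimal sketch on made-up toy data (the two Gaussian blobs and the seed are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
# two 2-D Gaussian blobs as a made-up two-class problem
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)

lda = LinearDiscriminantAnalysis().fit(X, y)

# decision_function gives a signed score; for a binary problem,
# predict() is equivalent to thresholding this score at 0
scores = lda.decision_function(X)
print(np.array_equal(lda.predict(X), (scores > 0).astype(int)))  # True
```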

Apply to test data

accuracy = lda.score(X_test, y_test)
print("Accuracy:", accuracy)

lda_test_data = lda.transform(X_test)
samples = np.arange(lda_test_data.shape[0])
plt.scatter(samples, lda_test_data, c=y_test)
plt.hlines(bdry, 0, lda_test_data.shape[0])
plt.title("result LDA test data")
plt.xlabel("samples")
plt.ylabel("LDA_calculation")
plt.show()
Accuracy: 0.9578947368421052
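Accuracy alone hides which kinds of errors occur, and in this application a missed malignant tumor is worse than a false alarm. A confusion matrix separates the two error types; a sketch that rebuilds the same pipeline as above (standardize, PCA with 10 components, 50/50 split with `random_state=42`):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_handler = load_breast_cancer()
normalized = StandardScaler().fit_transform(data_handler.data)
data_pca = PCA(n_components=10).fit_transform(normalized)
X_train, X_test, y_train, y_test = train_test_split(
    data_pca, data_handler.target, test_size=0.5, random_state=42
)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# rows = true class (0 = malignant, 1 = benign), columns = predicted class
cm = confusion_matrix(y_test, lda.predict(X_test))
print(cm)
```

The off-diagonal entries are the misclassifications; the entry in row 0, column 1 counts malignant tumors predicted as benign.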

How do the results change when we change the number of PCA components? Find a good measure to check this.

accuracies = []
for i in range(1, 30):
    pca = PCA(n_components=i)  # create object
    data_pca = pca.fit_transform(
        normalized_df
    )  # calculate new values (fit) and apply it (transform)
    lda = LinearDiscriminantAnalysis(n_components=1)
    data_lda = lda.fit_transform(data_pca, data_handler["target"])
    # note: scoring on the same data the model was fitted on
    # gives the training accuracy, which is optimistic
    accuracy = lda.score(data_pca, data_handler["target"])
    accuracies.append(accuracy)
plt.figure()
plt.plot(np.arange(1, 30), accuracies)  # x-axis starts at 1 component
plt.xlabel("number of components")
plt.ylabel("accuracy")
plt.show()
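The loop above scores the model on the same data it was fitted on, so it measures training accuracy. A better measure is the accuracy on a held-out test set for each component count. A sketch of that variant (the 50/50 split and the seed are arbitrary choices; note the PCA is fitted on the training data only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_handler = load_breast_cancer()
X = StandardScaler().fit_transform(data_handler.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, data_handler.target, test_size=0.5, random_state=42
)

test_accuracies = []
for i in range(1, 31):
    # fit PCA on the training data only, then apply it to the test data
    pca = PCA(n_components=i).fit(X_train)
    lda = LinearDiscriminantAnalysis().fit(pca.transform(X_train), y_train)
    test_accuracies.append(lda.score(pca.transform(X_test), y_test))

print(max(test_accuracies))
```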

What is the influence of the division of the dataset?

pca = PCA(n_components=10)
data_pca = pca.fit_transform(normalized_df)


n = 10
test_sizes = np.linspace(0.1, 1, n, endpoint=False)

accuracies = []
for i in range(n):
    X_train, X_test, y_train, y_test = train_test_split(
        data_pca, data_handler["target"], test_size=test_sizes[i], random_state=42
    )
    lda = LinearDiscriminantAnalysis(n_components=1)
    data_lda = lda.fit_transform(X_train, y_train)
    accuracy = lda.score(X_test, y_test)
    accuracies.append(accuracy)

plt.figure()
plt.plot(test_sizes, accuracies)
plt.xlabel("test size")
plt.ylabel("accuracy")
plt.show()

Look at the axes. Is the effect relevant?
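One way to judge relevance: a single `random_state` is only one realization of the split. Repeating the split with several seeds and averaging gives a more stable picture of how much the split itself matters. A sketch (the ten seeds and the fixed 50% test size are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_handler = load_breast_cancer()
data_pca = PCA(n_components=10).fit_transform(
    StandardScaler().fit_transform(data_handler.data)
)

accs = []
for seed in range(10):  # repeat the split with different seeds
    X_tr, X_te, y_tr, y_te = train_test_split(
        data_pca, data_handler.target, test_size=0.5, random_state=seed
    )
    lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    accs.append(lda.score(X_te, y_te))

print(np.mean(accs), np.std(accs))  # mean accuracy and its spread over seeds
```

If the standard deviation over seeds is of the same size as the variation seen in the plot above, the apparent dependence on the test size is largely noise from the random split.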