The purpose of applying PCA to a data set is ultimately to reduce its dimensionality by replacing the n original variables with a smaller set of m new variables, m < n, that retains most of the information in the data, i.e. the variation in the data. Since the principal components (PCs) resulting from PCA are sorted by decreasing variance, keeping the first m PCs retains most of the data information while reducing the dimensionality of the data set.
Notice that the PCA transformation is sensitive to the relative scaling of the original variables. The data column ranges therefore need to be normalized before applying PCA. Also notice that the new coordinates (PCs) are no longer real system-produced variables, so a data set transformed with PCA loses its interpretability. If interpretability of the results is important for your analysis, PCA is not the transformation for your project.
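To see why scaling matters, the short sketch below compares how much of the total variance one column accounts for before and after min-max normalization. The column names and values are hypothetical, and `min_max_normalize` is an illustrative helper, not a KNIME function.

```python
from statistics import pvariance

def min_max_normalize(column):
    """Rescale a list of numbers linearly into [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to rescale
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical columns measured in very different units.
income_usd = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]
age_years = [30.0, 22.0, 61.0, 45.0, 38.0]

# Share of the total variance carried by the income column, raw units.
raw_share = pvariance(income_usd) / (pvariance(income_usd) + pvariance(age_years))

# Same share after both columns are normalized into [0, 1].
income_norm = min_max_normalize(income_usd)
age_norm = min_max_normalize(age_years)
norm_share = pvariance(income_norm) / (pvariance(income_norm) + pvariance(age_norm))

print(f"raw share: {raw_share:.6f}, normalized share: {norm_share:.3f}")
```

On the raw values the income column carries practically all of the variance, so the first PC would simply follow it. After normalization the two columns contribute on a comparable footing.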
KNIME provides two nodes to implement the PCA transformation: PCA Compute and PCA Apply.
The PCA Compute node calculates the covariance matrix of the input data columns and its eigenvectors, identifying the directions of maximal variance in the data space. A high eigenvalue indicates a high variance of the data along the corresponding eigenvector. Eigenvectors can therefore be sorted by decreasing eigenvalue, i.e. by decreasing variance. The PCA Compute node outputs the covariance matrix, the PCA spectral decomposition of the original data columns along the eigenvectors, and the PCA model. The PCA model is produced at the last output port and contains the eigenvalues and the eigenvector projections necessary to transform each data row from the original space into the new PC space.
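The computation described above can be sketched outside KNIME with NumPy. This is a minimal illustration of the covariance and eigenvector step, not the node's actual implementation; the function name `pca_compute` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_compute(data):
    """Return (column means, eigenvalues, eigenvectors), sorted by decreasing eigenvalue.

    data: 2-D array with one row per observation and one column per numerical variable.
    """
    data = np.asarray(data, dtype=float)
    means = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)                  # n x n covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]             # reorder to decreasing variance
    return means, eigenvalues[order], eigenvectors[:, order]

# Synthetic data: the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2.0 * x + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.5, size=200),
])

means, eigenvalues, eigenvectors = pca_compute(data)
print(eigenvalues)  # variance along each eigenvector, largest first
```

Each column of `eigenvectors` is one direction in the original space, and the matching entry of `eigenvalues` is the variance of the data along that direction.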
The PCA Apply node transforms a data row from the original space into the new PC space, using the eigenvector projections in the PCA model. A point from the original data set is converted into the new set of PC coordinates by subtracting the column means from the data row and multiplying the resulting zero-mean row by the matrix of eigenvectors from the spectral decomposition.
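As a rough sketch of this projection, the snippet below centers a row and multiplies it by the first m eigenvectors. The function `pca_apply`, the means, and the eigenvector matrix are hypothetical values chosen so the result can be checked by hand; they do not come from a real PCA model.

```python
import numpy as np

def pca_apply(rows, means, eigenvectors, m):
    """Project rows onto the first m eigenvectors (stored as columns)."""
    centered = np.asarray(rows, dtype=float) - means   # zero-mean data rows
    return centered @ eigenvectors[:, :m]

# Hypothetical model for 2-D data spread along the diagonal.
means = np.array([1.0, 1.0])
eigenvectors = np.array([[1.0, -1.0],
                         [1.0,  1.0]]) / np.sqrt(2.0)

row = np.array([[3.0, 3.0]])
pcs_all = pca_apply(row, means, eigenvectors, m=2)   # both PC coordinates
pcs_one = pca_apply(row, means, eigenvectors, m=1)   # reduced to one coordinate
print(pcs_all, pcs_one)
```

The centered row (2, 2) lies exactly on the first eigenvector, so its second PC coordinate is zero and dropping it loses nothing in this constructed case.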
By reducing the number of eigenvectors, we effectively reduce the dimensionality of the new data set. Usually, only a subset of all PCs is necessary to keep close to 100% of the information from the original data set. The more information loss we tolerate, the greater the dimensionality reduction of the data space. The configuration settings of the PCA Apply node allow you to define the maximum tolerable information loss; the node then calculates the necessary number of PCs and the consequent dimensionality reduction.
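The relation between tolerated information loss and number of PCs can be illustrated with a few lines of Python. This is a sketch under the assumption that "information" means the share of total variance, i.e. the sum of the kept eigenvalues divided by the sum of all eigenvalues; `components_needed` and the eigenvalues are hypothetical.

```python
def components_needed(eigenvalues, max_information_loss):
    """Smallest number of leading PCs that keeps at least
    (1 - max_information_loss) of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    target = (1.0 - max_information_loss) * sum(ordered)
    kept = 0.0
    for m, value in enumerate(ordered, start=1):
        kept += value
        if kept >= target:
            return m
    return len(ordered)  # fallback if rounding keeps the sum just below the target

eigenvalues = [4.0, 3.0, 2.0, 1.0]  # hypothetical spectrum, total variance 10

m_no_loss = components_needed(eigenvalues, 0.0)    # all 4 PCs
m_loss_15 = components_needed(eigenvalues, 0.15)   # 3 PCs keep 9 of 10
m_loss_35 = components_needed(eigenvalues, 0.35)   # 2 PCs keep 7 of 10
print(m_no_loss, m_loss_15, m_loss_35)
```

With this spectrum, tolerating no loss requires all four PCs, while accepting a 35% loss halves the dimensionality.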
Notice that PCA is a numerical technique. This means the reduction process only affects the numerical columns and does not act on the nominal columns. Also notice that PCA skips missing values, so on a data set with many missing values PCA will be less effective.
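The effect of missing values can be made concrete with a small sketch. It assumes that skipping means dropping every row in which a selected numerical column is missing; the helper `usable_numeric_rows` and the sample rows are hypothetical.

```python
import math

def usable_numeric_rows(rows, numeric_columns):
    """Keep only the numeric columns and drop rows where any of them is missing."""
    kept = []
    for row in rows:
        values = [row.get(name) for name in numeric_columns]
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
            continue  # assumption: a row with a missing value is skipped entirely
        kept.append([float(v) for v in values])
    return kept

rows = [
    {"height": 1.80, "weight": 75.0, "city": "Zurich"},
    {"height": None, "weight": 62.0, "city": "Berlin"},
    {"height": 1.65, "weight": float("nan"), "city": "Austin"},
    {"height": 1.72, "weight": 68.0, "city": "Konstanz"},
]

usable = usable_numeric_rows(rows, ["height", "weight"])
print(len(usable), "of", len(rows), "rows remain")
```

Half of the rows are lost here, and the nominal column is ignored altogether, which is why heavily incomplete data leaves PCA with little to work on.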
The figure below shows a PCA sub-workflow. Here the training set is used to build the covariance matrix, after dealing with missing values and normalizing all data columns to fall into [0,1]. The first m eigenvectors of the PCA model are then applied to transform the data set and reduce its dimensionality from the original n coordinates to the m selected PCs, with m < n.
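The same sequence of steps can be approximated in a few lines of NumPy. This is only a sketch of the idea, not a translation of the KNIME workflow: the functions `fit_pipeline` and `apply_pipeline` are hypothetical, missing values are handled by dropping incomplete training rows (an assumption, since the workflow may treat them differently), and the data is synthetic.

```python
import numpy as np

def fit_pipeline(train, m):
    """Learn min-max ranges and the first m eigenvectors from the training set."""
    train = np.asarray(train, dtype=float)
    train = train[~np.isnan(train).any(axis=1)]     # assumption: drop incomplete rows
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)          # guard against constant columns
    scaled = (train - lo) / span                    # every column now in [0, 1]
    means = scaled.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(scaled, rowvar=False))
    order = np.argsort(eigenvalues)[::-1][:m]       # m directions of largest variance
    return {"lo": lo, "span": span, "means": means,
            "components": eigenvectors[:, order]}

def apply_pipeline(model, data):
    """Normalize with the training ranges, center, and project onto the m PCs."""
    data = np.asarray(data, dtype=float)
    scaled = (data - model["lo"]) / model["span"]
    return (scaled - model["means"]) @ model["components"]

# Synthetic data: n = 4 columns driven by 2 latent factors, on different scales.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = np.array([[1000.0, 1.0, 0.0, 0.5],
                   [0.0,    0.0, 5.0, 0.5]])
data = latent @ mixing + rng.normal(scale=0.05, size=(300, 4))

train, rest = data[:200], data[200:]
model = fit_pipeline(train, m=2)
train_reduced = apply_pipeline(model, train)
rest_reduced = apply_pipeline(model, rest)
print(rest_reduced.shape)
```

The normalization ranges, the means, and the eigenvectors are all learned on the training set only and then reused unchanged on any other data, mirroring the split between computing the model and applying it.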