Principal Component Analysis (PCA) is helping a new generation of researchers to view their results as detailed 3D images that can be explored with the naked eye, says Carl Johan Ivarsson. But how does PCA work?

Are you suffering data overload?

During the last decade, research into molecular biology has helped to identify a large number of genes associated with human disease, and is therefore helping researchers to unpick the fundamental biology of major illnesses. Complex gene expression experiments, in particular, are helping to support this process, as they are able to create a global picture of cellular function by measuring the activity (often called the ‘expression’) of tens of thousands of genes at once.

Findings from these experiments can then be used, among other things, to identify cells that are actively dividing or to show how cells react to a particular treatment. As part of this process, researchers will often compare sub-groups (such as patients who are in remission versus patients who have suffered a relapse), whilst also examining the different types of cell abnormalities related to clinical conditions such as diabetes and cancer.

Difficulties arise, however, as a result of the vast amount of data that is created by experiments like these. This “data overload” can present a serious problem for researchers, since it is essential to capture, explore, and analyse this kind of data effectively in order to obtain the most meaningful results from their experiments. The good news is that a new generation of data visualisation tools is now able to make sense of this complex data by taking full advantage of the most powerful pattern recogniser that exists: the human brain.

In recent years, researchers have begun to use powerful software engines that enable them to visualise their data as full-colour 3D images that can be easily manipulated on a computer screen. With this approach, not only can scientists identify hidden structures and patterns more easily, but they can also pick out interesting or significant results by themselves, without having to rely on specialist bioinformaticians or biostatisticians.

This kind of data visualisation works by projecting high-dimensional data down to lower dimensions, so that it can be plotted in 3D on a computer screen, rotated (either manually or automatically) and examined with the naked eye. With instant feedback on all of these actions, scientists studying human disease can analyse their findings in real time and in an easy-to-interpret graphical form.

When used to support research in this way, the ability to visualise data in 3D is a very powerful tool for scientists, since the human brain is very good at detecting structures and patterns. The idea behind this approach is that highly complex data becomes easier to understand once it is given a graphic form. Information visualisation thus offers a way to transform raw data into a comprehensible graphical format, so that scientists can base their decisions on information that they can identify and understand more easily.

New imaging functions contained within the latest data analysis applications are allowing scientists to study very large data sets by using a combination of different visualisation techniques. To begin the visualisation process, however, researchers must first reduce their high dimensional data down to lower dimensions so that it can be plotted in 3D. This is where Principal Component Analysis (PCA) comes in.

PCA is often used for this purpose because it applies a well-established mathematical procedure to transform a set of possibly correlated variables into a set of uncorrelated variables (called principal components). This transformation is defined in such a way that the first principal component has as high a variance as possible for a weight vector of unit length, and each succeeding component in turn has the highest variance possible under the constraint that it be uncorrelated with the preceding components.
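In symbols, the definition above can be sketched as follows. The notation is an assumption introduced here for illustration and does not come from the article: X is the data matrix with one row per sample and one mean-centred column per variable, w is a vector of weights, and Xw is the resulting linear combination.

```latex
w_1 = \arg\max_{\|w\| = 1} \operatorname{Var}(Xw),
\qquad
w_k = \arg\max_{\substack{\|w\| = 1 \\ \operatorname{Cov}(Xw,\, Xw_j) = 0 \ \text{for}\ j < k}} \operatorname{Var}(Xw)
```

The k-th principal component is then the new variable X w_k. The unit-length constraint on w is what makes the maximisation meaningful, since the variance could otherwise be increased without limit simply by scaling the weights.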

One of the key breakthroughs in the latest generation of bioinformatics software is the ability to combine PCA with immediate user interaction. This approach allows scientists to manipulate different PCA plots directly on the computer screen, interactively and in real time, with all of their annotations and other links preserved. Researchers are thus free to explore many different versions of the presented view, and can visualise, analyse, and explore a large dataset easily.

As previously mentioned, the very high-dimensional nature of many data sets makes direct visualisation impossible, since we can only perceive a maximum of three spatial dimensions at once. The solution is to work with dimension reduction techniques like PCA.

However, when using PCA to reduce the dimensions of such valuable data, it’s important not to lose too much information in the process. Here, the variation in a data set can be seen as representing the information that researchers would like to keep. PCA works well in this regard, as it is a well-established technique for reducing the dimensionality of data while keeping as much variation as possible.

A meaningful reduction of the dimensionality is possible with PCA because the data is usually not spread evenly across all dimensions. Instead, there are often strong correlations among groups of variables, indicating a certain amount of redundancy in the variable set. Thus, the true number of underlying factors (representing most of the information in the data) is usually much smaller than the number of measured variables.

PCA achieves this dimension reduction by creating new, artificial variables called principal components. Each principal component is a linear combination of the observed variables. If the data has been centred by subtracting the mean value of each variable, the first principal component can be interpreted as the linear combination (with weights normalised to unit length) that has maximal variance. Each subsequent component maximises the variance while being uncorrelated with the previously extracted components. Because the different principal components are uncorrelated, they represent different characteristics of the original data set.
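The construction described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the method used by any particular software package: the function name `principal_components` and the toy data are hypothetical, and the components are obtained from an eigendecomposition of the covariance matrix, which is one common way of computing them.

```python
import numpy as np

def principal_components(data):
    """Return (weights, scores, variances) for a samples-by-variables array."""
    centred = data - data.mean(axis=0)          # subtract each variable's mean
    cov = np.cov(centred, rowvar=False)         # variables-by-variables covariance matrix
    variances, weights = np.linalg.eigh(cov)    # eigh returns eigenvalues in ascending order
    order = np.argsort(variances)[::-1]         # reorder so the largest variance comes first
    variances, weights = variances[order], weights[:, order]
    scores = centred @ weights                  # column k holds the k-th principal component
    return weights, scores, variances

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 samples and 4 variables, the last of which
# is almost a copy of the first (i.e. strongly correlated with it).
base = rng.normal(size=(50, 3))
data = np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=50)])

weights, scores, variances = principal_components(data)
```

Each column of `weights` holds the unit-length weights of one linear combination, and the covariance matrix of `scores` is diagonal, which is the "uncorrelated" property described in the text.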

PCA is known as an “unsupervised” method, since no information concerning the class labels of the samples is used in the dimension reduction. This also means that PCA is not optimised for class separation, but instead provides a visual representation of the dominant patterns in the data set. As such, PCA can be a very good tool for exploratory data analysis, where the aim is hypothesis generation rather than hypothesis verification.

Since PCA is sensitive to the measurement scale of the individual variables, it is common to standardise each variable by dividing the observed values by their standard deviation. This ensures that all variables participate on equal terms in the extraction of the principal components. By studying the variance accounted for by each principal component, the number of underlying factors in the data can then be estimated.
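As a rough sketch of the standardisation step and of reading off the variance accounted for by each component, consider the following. The function name, the toy data and the 95% cut-off are all assumptions made for illustration; the article does not prescribe a particular threshold for estimating the number of underlying factors.

```python
import numpy as np

def explained_variance_ratio(data, standardise=True):
    """Fraction of the total variance accounted for by each principal component."""
    centred = data - data.mean(axis=0)
    if standardise:
        # Divide each variable by its standard deviation so all share the same scale
        centred = centred / centred.std(axis=0, ddof=1)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variances = singular_values**2 / (data.shape[0] - 1)
    return variances / variances.sum()

rng = np.random.default_rng(1)
# Hypothetical data: two hidden factors drive six measured variables
# that are recorded on very different measurement scales.
factors = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.8, 0.6, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 0.7, 0.5]])
scales = np.array([1.0, 10.0, 100.0, 1.0, 10.0, 100.0])
data = (factors @ mixing + 0.05 * rng.normal(size=(200, 6))) * scales

ratio = explained_variance_ratio(data)
cumulative = np.cumsum(ratio)
# Smallest number of components that reaches an assumed 95% of the variance
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
```

Without the standardisation step, the variables recorded on the largest scale would dominate the leading components regardless of how informative they are.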

Computing the values of the first three principal components, for example, provides a three-dimensional artificial observation vector for each sample. This means that the samples can now be represented in a three-dimensional space, making visual exploratory analysis possible. The first three principal components contain more of the variance in the original data than any other trio of orthogonal, unit-length linear combinations, so in this sense PCA provides the optimal three-dimensional sample representation.
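A minimal sketch of this projection, assuming the data is held as a samples-by-variables NumPy array, might look as follows. The function name `project_to_3d` and the random data are hypothetical, and the singular value decomposition is used here simply as a convenient way to obtain the principal directions when there are far more variables than samples.

```python
import numpy as np

def project_to_3d(data):
    """Return one three-dimensional coordinate vector per sample."""
    centred = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:3].T

rng = np.random.default_rng(2)
data = rng.normal(size=(30, 1000))   # hypothetical: 30 samples, 1,000 variables each
coords = project_to_3d(data)         # array with 30 rows and 3 columns
```

The three columns of `coords` can then be handed to any 3D scatter-plot routine as the x, y and z positions of the samples.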

One of the keys behind the success of PCA is that in addition to the low-dimensional sample representation, it also provides a synchronised low-dimensional representation of the variables. This representation is obtained by depicting the weights of the original variables in the extracted principal components. A variable with a high weight in a principal component is therefore located far from the origin in the corresponding direction. The synchronised sample and variable representations thus provide a visual way to find variables that are characteristic of a group of samples.
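The idea of reading variable weights off a principal component can be sketched as below. Everything named here is an assumption for illustration: the function `top_variables`, the variable names `gene_0` to `gene_7`, and the toy data in which two variables are deliberately shifted for one group of samples.

```python
import numpy as np

def top_variables(data, names, component=0, k=3):
    """Return the k variables with the largest absolute weight in one component."""
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    weights = vt[component]                          # one weight per original variable
    order = np.argsort(np.abs(weights))[::-1][:k]    # largest absolute weights first
    return [(names[i], float(weights[i])) for i in order]

rng = np.random.default_rng(3)
group = np.repeat([0.0, 1.0], 20)    # hypothetical: two groups of 20 samples
data = rng.normal(size=(40, 8))
data[:, 2] += 5.0 * group            # "gene_2" is higher in the second group
data[:, 5] -= 5.0 * group            # "gene_5" is lower in the second group
names = [f"gene_{i}" for i in range(8)]

top = top_variables(data, names, component=0, k=2)
```

In this toy example the two variables that separate the groups receive the largest weights in the first component, with opposite signs, so in a variable plot they would sit far from the origin on opposite sides.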

Even though the exploration and analysis of large data sets can be challenging, tools like PCA can provide a powerful way of identifying important structures and patterns very quickly, since data visualisation can provide the user with instant feedback, and with results that present themselves as they are being generated.

Larger studies, especially those that include multiple samples that need to be analysed on comprehensive array platforms, have traditionally been very time-consuming, and have also required a considerable amount of computer power. As humans, however, we are used to interpreting 3D pictures in our environment, and so our brains are able to find structures in complex 3D figures very quickly. Therefore, it’s no wonder that a 3D presentation of complex mathematical/statistical data makes it much easier for us to interpret.

The latest technological advances in data visualisation are already making it much easier for scientists to compare the vast quantity of data generated by their studies and to test different hypotheses very quickly. As a result, the latest generation of data analysis tools is helping scientists to regain control of this analysis, and to realise the true potential of the important research being conducted in this area.

Author: Carl Johan Ivarsson, CEO of Qlucore

Contact:
t: +46 46 286 3110
e: carl-johan.ivarsson@qlucore.com
w: http://www.qlucore.com

 
