Usefulness of PCA in Agronomy
Selecting crosses and evaluating cultivars or inputs involve comparing materials on several traits at once. You may wonder: how can I compare yield, flowering date and drought tolerance at the same time? How can I draw conclusions from all these variables?
The next question is how to explore the links between traits and properly visualize the similarities between individuals or treatments. You have probably read about or heard of Principal Component Analysis (PCA), and you may wonder: can PCA help me choose which quantitative variables to analyze and make the right decisions for my agronomy research? Answering this question should clear up the grey areas you may have about PCA.
Principal Component Analysis (PCA) is a multivariate statistical method that is useful when n individuals are observed on p quantitative variables. What is the purpose of this statistical tool? What are the limits of classical PCA for multiple comparisons, and which models allow you to go further?
“It is obviously more tedious if not impossible to summarize a set of agronomic data visibly by describing agronomic characteristics under these conditions, multivariate methods should be used” Romain Lucas Glele Kakaï¹
What is the purpose of Principal Component Analysis (PCA)?
PCA is an extremely powerful statistical tool for synthesizing information, very useful when there is a large amount of quantitative data to be processed and interpreted. When the variables are quantitative, we can perform a Principal Component Analysis.
For decision-making in agronomy, PCA results are usually visualized either as a cloud of points (the individuals) or as a correlation circle (the variables).
Here is an example of PCA used to screen promising commercial varieties of sugar cane in northern Ivory Coast (Figure 1). We are interested in 5 varieties of sugar cane and a control. Five traits were recorded for each cultivar: cane yield, sugar yield, sucrose content, and susceptibility to smut and to stem borer.
The following graph represents our data, to see whether the varieties differ markedly or share broadly the same characteristics.
Figure 1: Screening of promising commercial varieties of sugar cane.
Factor 1: 54.08% of the total variance (sucrose content r = 0.77, cane yield r = 0.95 and sugar yield r = 0.99);
Factor 2: 27.15% of the total variance (smut r = 0.66 and bored internode % r = 0.78)
For this, we carried out a PCA and kept 2 principal components, so as to represent the 5 variables in only two dimensions. Horizontally, factor 1 represents cane yield, sugar yield and sucrose content; vertically, factor 2 displays the susceptibility of the varieties to smut and stem borer. These two dimensions alone summarize more than 81% of the information (54.08% + 27.15% = 81.23%).
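The r values listed for each factor appear to be correlations between an original trait and a principal component, which is also what a correlation circle plots. Here is a minimal sketch of how such values can be computed. The numbers and trait names are invented for illustration and are not the data behind Figure 1.

```python
import numpy as np

# Hypothetical data: 6 varieties x 3 traits (values invented for illustration).
X = np.array([
    [95.0, 11.8, 4.0],
    [70.0, 8.1, 9.0],
    [72.0, 8.5, 3.0],
    [68.0, 7.9, 2.5],
    [98.0, 12.5, 1.5],
    [101.0, 12.9, 1.0],
])
traits = ["cane_yield", "sugar_yield", "borer_damage"]  # hypothetical names

# Standardize each trait (mean 0, standard deviation 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix, largest eigenvalue first.
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Coordinates of the traits on the correlation circle:
# eigenvector entries scaled by the square root of the eigenvalue.
loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Cross-check: the same values computed directly as correlations
# between each standardized trait and the scores on PC1 and PC2.
scores = Z @ eigvecs
direct = np.array([
    [np.corrcoef(Z[:, j], scores[:, k])[0, 1] for k in range(2)]
    for j in range(len(traits))
])

for name, (r1, r2) in zip(traits, loadings[:, :2]):
    print(f"{name:>12}: r(PC1) = {r1:+.2f}, r(PC2) = {r2:+.2f}")
```

A trait whose point lies close to the edge of the unit circle is well represented by the two plotted components; a point near the center is not.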
In Figure 1, four groups of varieties can be distinguished. At a glance you can see the control NCo376 at the top right, showing good agro-technical qualities but high susceptibility to diseases. Two groups then give low yields: SP70-1143, which is highly susceptible, and a group of 2 varieties that are more resistant to diseases. Last but not least, the light blue group is made up of the 2 most interesting varieties, FR80674 and B47258, with a high sucrose content, high cane and sugar yields, and strong resistance to stem borer and smut!
The PCA algorithm performs various operations on the individuals/variables matrix (centering and scaling of the data, diagonalization of the correlation matrix, extraction of eigenvalues and eigenvectors, etc.) in order to combine the initial dimensions into a reduced number of new variables: the principal components (PCs). This way, you can explain as much of the variability (usually measured as variance) as possible with as few PCs as possible.
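The operations listed above can be sketched in a few lines of NumPy. This is only an illustration on simulated data; the function name `pca_correlation` and the four simulated traits are assumptions made for the example, not part of any particular software.

```python
import numpy as np

def pca_correlation(X):
    """Normalized PCA of an individuals x variables matrix (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # 1. Center and scale each variable (mean 0, standard deviation 1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Correlation matrix of the variables.
    R = (Z.T @ Z) / (n - 1)
    # 3. Diagonalization: eigenvalues and eigenvectors, largest eigenvalue first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. Coordinates of the individuals on the principal components.
    scores = Z @ eigvecs
    # 5. Share of the total variance carried by each component.
    explained = eigvals / eigvals.sum()
    return scores, eigvecs, eigvals, explained

# Simulated example: 30 lines, 4 traits driven by 2 hidden factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(30, 2))
X = np.column_stack([
    latent[:, 0] + 0.2 * rng.normal(size=30),
    latent[:, 0] + 0.2 * rng.normal(size=30),
    latent[:, 1] + 0.2 * rng.normal(size=30),
    latent[:, 1] + 0.2 * rng.normal(size=30),
])

scores, eigvecs, eigvals, explained = pca_correlation(X)
print("Share of variance per PC:", np.round(explained, 3))
print("Cumulative share of PC1 + PC2:", round(float(explained[:2].sum()), 3))
```

Because the four simulated traits depend on only two hidden factors, the first two components carry most of the variance, which is the situation in which a two-dimensional plot is a fair summary.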
“Grouping a set of data promotes accuracy by broadening their scope” Erwann Lagabrielle²
What are the limits of classical PCA for multiple comparisons?
PCA only works with quantitative variables such as yield, height, water content or amylose content. It cannot directly handle qualitative variables such as color, or categorical ratings of disease, drought or submergence response. In short, it requires numerical variables: counts, percentages or measurements.
In addition, if the data have no underlying structure, that is, if the variables are essentially uncorrelated, the dimension cannot be reduced, as shown in Figure 2.
Figure 2: three-dimensional data
The principal components capture the underlying structure in the data: they are the directions along which the data are most spread out, that is, the directions of maximum variance.
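To make the point about missing structure concrete, here is a small sketch on simulated data (not the data of Figure 2). It compares three independent variables with three variables driven by one common factor; the helper name `explained_shares` is hypothetical.

```python
import numpy as np

def explained_shares(X):
    """Share of total variance per principal component (correlation-based PCA)."""
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(R)[::-1]  # eigvalsh returns ascending order
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)
n = 2000

# Case 1: three independent variables, no underlying structure.
unstructured = rng.normal(size=(n, 3))

# Case 2: three variables that all follow one common driver plus small noise.
driver = rng.normal(size=n)
structured = np.column_stack(
    [driver + 0.1 * rng.normal(size=n) for _ in range(3)]
)

shares_unstructured = explained_shares(unstructured)
shares_structured = explained_shares(structured)

print("No structure  :", np.round(shares_unstructured, 2))
print("One structure :", np.round(shares_structured, 2))
```

In the first case each component carries roughly one third of the variance, so dropping any of them loses about a third of the information. In the second case a single component is enough.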
The main drawback of principal component analysis (PCA), especially in high-dimensional applications, is that the principal components are linear combinations of all input variables. The results may also be very sensitive to the presence of even a few atypical observations in the data. For example, when a few lines carry most of the variance, PCA can strongly bias the interpretation of analyses that assume a common distribution. Unless the data meet a set of a priori requirements regarding their structure, plotting the ordered axes is likely to produce misleading results.³
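The sensitivity to atypical observations can be illustrated with a deliberately extreme toy case: 50 points spread along one axis, then the same points plus a single outlier. The sketch uses covariance-based (unscaled) PCA so that the effect is visible with only two variables; all values are invented.

```python
import numpy as np

def first_pc(X):
    """Unit vector of the first principal component (covariance-based PCA)."""
    C = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvecs[:, np.argmax(eigvals)]

# 50 regular observations, spread mainly along the first variable.
x = np.linspace(-1.0, 1.0, 50)
clean = np.column_stack([x, 0.1 * np.cos(7 * x)])

# The same data plus one atypical observation on the second variable.
with_outlier = np.vstack([clean, [0.0, 20.0]])

pc_clean = first_pc(clean)
pc_outlier = first_pc(with_outlier)

# Angle between the two first components (the sign of a PC is arbitrary).
cosine = abs(float(pc_clean @ pc_outlier))
angle = np.degrees(np.arccos(min(cosine, 1.0)))
print(f"Angle between the two first components: {angle:.1f} degrees")
```

Here one observation out of 51 is enough to rotate the first component from the first variable to the second, which is why checking for outliers before interpreting the axes is worthwhile.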
This is the case, for example, for crop yield, which is a function of several components such as tillering, the number of grains per tiller and the 1000-grain weight. This example shows that yield is a composite variable, because it depends on the other components listed above.
Due to linkage and correlated selection, the evolutionary response of any phenotypic trait can only be properly understood in the context of other traits.⁴
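As a worked illustration of yield as a composite trait, the sketch below uses the usual multiplicative relation between yield components for a cereal. It assumes that "tillering" is expressed as fertile tillers per square metre; the function name and the numbers are hypothetical.

```python
def grain_yield_t_per_ha(tillers_per_m2, grains_per_tiller, thousand_grain_weight_g):
    """Grain yield in t/ha from three yield components (illustrative helper)."""
    # Grams of grain per square metre.
    grams_per_m2 = tillers_per_m2 * grains_per_tiller * thousand_grain_weight_g / 1000.0
    # 1 t/ha equals 100 g/m2.
    return grams_per_m2 / 100.0

# Hypothetical plot: 300 fertile tillers/m2, 40 grains per tiller, 45 g per 1000 grains.
example_yield = grain_yield_t_per_ha(300, 40, 45.0)
print(f"Estimated yield: {example_yield:.1f} t/ha")
```

Since yield is computed from the other three variables, entering all four in the same PCA means that part of the information is counted twice, which should be kept in mind when reading the axes.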
“The limitations of Principal Component Analysis come from the fact that it is a projection method, and that the loss of information induced by the projection can lead to erroneous interpretations” Magloire Oteyami⁵
Go further with Multiple Correspondence Analysis
Models that go further take the following elements into account: data consistency, exceptional individuals, links (correlations) between the p variables, and the existence of different individual profiles.
Multiple Correspondence Analysis (MCA), unlike PCA, can be used with qualitative variables. It is mainly used for the analysis of survey data, and the components of the MCA turn qualitative variables into numerical ones.
Figure 3: rows (individuals) are represented by blue dots and columns (variables) by red triangles
For instance, suppose a survey covers a group of producers using the same varieties in a given area. The associations between categories of the survey variables can then support better decisions to improve the adoption of new climate-smart varieties (Figure 3).
The distance between individuals gives a measure of their similarity (or dissimilarity). Individuals with a similar profile are close on the graph. The same goes for variables.
MCA is useful first of all with binary variables, that is, variables that can take only two values, denoted 1 and 0. Its second use is to transform qualitative variables into numerical ones for other analyses that require numeric variables.
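One common way to carry out MCA is to code every category as a 0/1 column (an indicator matrix) and to apply correspondence analysis to that matrix. The sketch below follows this route on an invented survey of six producers; the variable names and answers are hypothetical.

```python
import numpy as np

# Hypothetical survey: 6 producers described by 2 qualitative variables.
survey = [
    ("improved", "yes"),
    ("improved", "yes"),
    ("local", "no"),
    ("local", "no"),
    ("hybrid", "yes"),
    ("hybrid", "no"),
]
variables = ["variety_type", "irrigation"]

# 1. Indicator matrix: one binary (0/1) column per category of each variable.
columns = []
for q, name in enumerate(variables):
    for level in sorted({row[q] for row in survey}):
        columns.append((q, name, level))
Z = np.array([
    [1.0 if row[q] == level else 0.0 for q, _, level in columns]
    for row in survey
])

# 2. Correspondence analysis of the indicator matrix.
P = Z / Z.sum()                    # relative frequencies
r = P.sum(axis=1)                  # row masses (individuals)
c = P.sum(axis=0)                  # column masses (categories)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

# 3. Numerical coordinates of individuals (rows) and categories (columns).
row_coords = (U * sing) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]
total_inertia = float((sing ** 2).sum())

print("Categories:", [f"{name}={level}" for _, name, level in columns])
print("Producer coordinates on the first two axes:")
print(np.round(row_coords[:, :2], 2))
```

Producers who gave the same answers (the first two, for instance) receive identical coordinates, which is the numerical counterpart of "individuals with a similar profile are close on the graph".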
“Multiple correspondence analysis is the factorial method suitable for tables in which a set of individuals is described by a set of qualitative variables” (Abdi and Williams 2010)
Linear Mixed Models for Multi-Environment Trials (METs) in Plant Breeding
Linear mixed models are used to evaluate genotypes under various environmental conditions, in order to identify genotypes with superior performance across all areas or under particular conditions or sets of conditions, such as abiotic or biotic stresses. Note that the environment here designates an area that is unfavorable for the production of rice, corn, sorghum, etc. This is due to the impact of climate variability or climate change, which manifests itself in agriculture through drought, flooding, insect pests and diseases.
How do you select superior varieties from multi-environment trial data using factor analytic mixed models? Figures 4a and 4b give the answer to this question.
Figure 4a: differential response of varieties to environments
Figure 4a reads as follows: superimposed first latent regression plots for two varieties, V6 (colored blue) and V1 (colored orange). The slopes of the solid lines are given by the EBLUPs of the (rotated) variety scores for the first factor. The open circles are the overall performance measure for each variety, namely the value on the regression line at the mean value of the estimated loadings for the first factor (vertical dotted line).
How do traits correlate with each other across different environments when a linear mixed model is used? Let's look at an example from a study by Marcos Malosetti, who evaluated the performance of genotypes in different environments using a linear mixed model.
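The EBLUPs in Figure 4a come from a factor analytic mixed model, which is beyond a short example. As a much simpler illustration of what an EBLUP is, the sketch below treats genotypes as random and environments as fixed in a balanced simulated trial, estimates the variance components from the ANOVA mean squares, and shrinks each genotype mean accordingly. It is a swapped-in, simplified model on invented data, not the model used for the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_gen, n_env = 8, 6

# Simulated yields (t/ha): genotype effect + environment effect + residual.
true_g = rng.normal(0.0, 1.0, size=n_gen)
env = rng.normal(0.0, 1.5, size=n_env)
Y = 5.0 + true_g[:, None] + env[None, :] + rng.normal(0.0, 0.5, size=(n_gen, n_env))

grand = Y.mean()
gen_means = Y.mean(axis=1)
env_means = Y.mean(axis=0)

# Two-way ANOVA mean squares (one observation per genotype x environment cell).
ms_gen = n_env * ((gen_means - grand) ** 2).sum() / (n_gen - 1)
resid = Y - gen_means[:, None] - env_means[None, :] + grand
ms_res = (resid ** 2).sum() / ((n_gen - 1) * (n_env - 1))

# Method-of-moments estimates of the variance components.
var_res = ms_res
var_gen = max((ms_gen - ms_res) / n_env, 0.0)

# Shrinkage factor (heritability on a genotype-mean basis) and EBLUPs.
shrink = var_gen / (var_gen + var_res / n_env)
blup = shrink * (gen_means - grand)

print(f"Shrinkage factor: {shrink:.2f}")
for i in np.argsort(blup)[::-1]:
    print(f"Genotype {i + 1}: raw deviation {gen_means[i] - grand:+.2f}, EBLUP {blup[i]:+.2f}")
```

The shrinkage pulls each genotype deviation towards zero, more strongly when the residual variance is large relative to the genetic variance or when few environments are available.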
Figure 4b: plot of the correlation between yield (ton ha-1) and heading date (days after 1st January) in each of 10 environments
Figure 4b illustrates the correlation between yield (ton ha-1) and heading date (days after 1st January) in each of 10 environments.
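A plot of this kind rests on one correlation coefficient per environment. The sketch below computes them on simulated data; the numbers are invented and are not those of the study behind Figure 4b.

```python
import numpy as np

rng = np.random.default_rng(7)
n_env, n_gen = 10, 40

# Simulated heading dates (days after 1st January), one row per environment.
heading = rng.normal(130.0, 6.0, size=(n_env, n_gen))

# Assumption for the example: the effect of late heading on yield
# goes from negative to positive across the 10 environments.
slopes = np.linspace(-0.08, 0.08, n_env)
yield_t_ha = (
    5.0
    + slopes[:, None] * (heading - 130.0)
    + rng.normal(0.0, 0.3, size=(n_env, n_gen))
)

# Pearson correlation between yield and heading date within each environment.
correlations = np.array([
    np.corrcoef(heading[e], yield_t_ha[e])[0, 1] for e in range(n_env)
])

for e, r in enumerate(correlations, start=1):
    print(f"Environment {e:2d}: r = {r:+.2f}")
```

A correlation that changes sign from one environment to another is a typical signature of genotype by environment interaction: late genotypes are favored in some environments and penalized in others.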
This can be very useful when analyzing crop cultivar breeding and evaluation trials.
Through this example, you can see that the linear mixed model is a decision tool for managing risks or constraints linked to the environment and to agriculture, whether biotic (insect pests, diseases) or abiotic (drought, flooding, iron toxicity of the soil).
Usefulness of your agronomy software’s statistical wizard
Accurate statistics are the key to making sure your results are reliable throughout your experimentation. You know the saying “correlation is not causation”: it can be tempting to draw hasty conclusions from biased samples, an inadequate analysis method or erroneous data. The statistical tools of your agro-research software help you trust your breeding and testing results at the key steps of your plant research processes:
• Choose the most appropriate design
• Identify your best performing lines with general combining ability (GCA)
• Find the best parent combinations for future crosses with specific combining ability (SCA)
• Evaluate environment response with GxE matrix
• Compare the performance of your treatments with a valid ANOVA and a complete set of statistical calculations
• Explore your data with interactive graphs for PCA, scatter plots, box plots and histograms
• Characterize trait transmission with pedigree visualization
Figure 5: Example of dynamic PCA graph in RnDExperience® software
Combining analytical tools with an efficient data management tool and R&D resource planning is the key to the success of your agronomy campaigns. Through its products and services, Doriane helps researchers reach their goals!
1- Prof. Dr. Ir. Romain Lucas GLELE KAKAÏ is Lab Director of the Laboratory of Biomathematics and forestry estimations of University of Abomey-Calavi.
2- Erwann Lagabrielle (2007). Planning of biodiversity conservation and territorial modeling on Reunion Island. Geography. University of La Réunion, France.
3- Lande R. Quantitative genetic analysis of multivariate evolution, applied to brain-body size allometry. Evolution. 1979;33:402–416
4- Lynch M, Walsh B. Genetics and analysis of quantitative traits. Sinauer Associates; Sunderland, MA: 1998
5- Alaye H. Magloire Firmin OTEYAMI, Agronomist, Geneticist-breeder in Benin, author of this article
6- Abdi, Hervé & Williams, Lynne. (2010). Principal Component Analysis. Wiley Interdisciplinary Reviews: Computational Statistics. 2. 433 - 459. 10.1002/wics.101.
Figure 1: Kouamé, Didier & Pene, Crépin & Zouzou, Michel. (2018). Evaluating varietal resistance of Sugarcane to the Tropical African Cane borer (Eldana saccharina Walker) in Ivory Coast.
Figure 4b: Malosetti M, Voltas J, Romagosa I, Ullrich SE, van Eeuwijk FA (2004) Mixed models including environmental covariables for studying QTL by environment interaction. Euphytica 137: 139-145
Figure 5: © Doriane SAS