Models are wrong, .... but some are useful (G. Box): 2017

Monday 13 November 2017

Bioassay97: a new EXCEL VBA macro to perform statistical analyses on pesticide dose-response data

From the beginning of my activity I felt the need of developing simple software tools that would enable users with a limited background in statistics and computer programming (mainly students, practitioners and technicians) to perform sound statistical analyses directly from within the most widespread spreadsheet, with no additional cost requirements. The set of routines that is hereby presented has been progressively developed over the years, to support the intensive field research activity that is carried out at my institution and it was specifically tailored to the problems posed by routine agriculture experiments.

Download the macro

If you are a lazy reader and want to download the macro straight away, follow this link here. This is usually the latest version. You can also find an example spreadsheet, to test the macro at this link.

References

I spent some of my spare time developing this macro. If you find it useful, I would greatly appreciate if you could cite my work, by referring to one of these papers:

Onofri A., 2005. Bioassay97: a new excel® vba macro to perform statistical analyses on herbicide dose-response data. Italian Journal of Agrometeorology, 3, 40–45. (PFD HERE)
Onofri, A., Pannacci, E. (2014). Spreadsheet tools for biometry classes in crop science programmes. Communications in Biometry and Crop Science 9 (2), 43–53 (PDF HERE).

Usage

For a smooth usage experience, please note the following steps:

download the file Bioassay97xp.XLS and put it everywhere on the hard-disk, or external devices (pen-drives, removable media…). Please do not rename it! After done this, you might like to create a shortcut on your desktop to lunch the macro more easily;
make sure that the SOLVER add-in is installed and enabled in Excel;
make also sure that macros are enabled in EXCEL;
make sure that programmatic access to the Office VBA project is not denied.

If one of the above items is not ok, BIOASSAY97 will not work. In this case, take the following actions.

How to enable the SOLVER add-in

The add-in macro SOLVER.XLAM is included by default in Excel but kept disabled. In order to enable it, click the FILE Menu and choose OPTIONS. In the Excel Options dialogue box, select ADD-INS from the left sidebar and then hit the GO button, next to Manage Excel Add-ins at the bottom. Select the SOLVER ADD-IN option and hit OK to enable it. Once you’ve enabled the Solver, this will be available under the Data tab on the Data ribbon in Excel.

How to enable macros in Excel

To enable macros, you need to follow a version-dependent procedure. In Excel 2010–2016 : click the FILE MENU and select Options from the left sidebar. Select TRUST CENTER and press TRUST CENTER SETTINGS. In Trust Center Settings, select MACRO SETTINGS from the left sidebar, choose DISABLE ALL MACROS WITH NOTIFICATION and hit the OK button. Save and close Excel completely. In Excel 2007 : click the EXCEL OPTION button in the lower right. Click the TRUST CENTER button on the left. Then, at the bottom right, select TRUST CENTER SETTINGS. In the next window, select MACRO SETTINGS, then select the radio button for DISABLE ALL MACROS WITH NOTIFICATION. To close the Trust Center window, click the lower right OK button. Save and close Excel completely and reopen Excel.

How to enable programmatic access to the Office VBA project

In Excel 2010–2016 : click the FILE MENU and select Options from the left sidebar. Select TRUST CENTER and press TRUST CENTER SETTINGS. In Trust Center Settings, select TRUST ACCESS TO THE VBA PROJECT OBJECT MODEL and hit the OK button. Save and close Excel completely.

How to launch the macro

Open Bioassay97xp.XLS by double-clicking on it. Two things should happen: (1) a SECURITY WARNING pop-up notification appears, beneath the Office ribbon. Click the “Options” button and Select the radio box beside “Enable this content,” then click “OK.” You will have to do this every time, unless you selected the radio button for “Enable all macros” in the window at stage 2 (do so only if you know what you are doing!).

TROUBLE SHOOTING #1: NOTHING OF THE ABOVE HAPPENED!!!

This is because you have downloaded the macro from the Internet. Please, follow these steps: (1) from VIEW, click on UNHIDE and select BIOASSAY97 from the form that appears. (2) A service sheet will appear and, at the same time, you will se a notification pop-up appearing beneath the ribbon. Click on the appropriate button to enable the use of files downloaded from the Internet. (3) If prompted to do so, enable macros, as shown above. (4) Hide the macro (VIEW/HIDE), close Excel and save Bioassay97xp.XLS. This latter step is fundamental, to prevent this same problem from arising in the future.

Now click on the “Add-in” menu on the ribbon: you’ll notice that one new entry is added to the tools menu of Excel (Biologic Assay), which can be used to start the analyses.

TROUBLESHOOTING #2: I GET AN ERROR MESSAGE

An Excel data sheet must be opened, before clicking on any of the above buttons. Otherwise, you’ll get an error message (‘run-time error ‘91’: Object variable or With block variable not set’)!

Setting the analysis

Experimental data have to be organised in a database, with observations in rows, and variables in columns. Basically, at least a dose column and a response column are needed, but a categoric variable might be required as well if several response curves have to be simoultaneously fitted. The first row is reserved for variable names.

Look at this link to see an exemplary dataset. Read ahead if you want detail.

Introduction

Nonlinear regression analysis has been recognised as a very appropriate tool to analyse dose-response studies, involving pesticide antagonism, synergism, selectivity and resistance, as well as the effect of safeners, adjuvants and environmental side effects of pesticides. Advantages over the other techniques involving the linearisation of data (especially probit and logit analysis) are in that these latter methods have usually limited biological meaning and do not always allow for an appropriate description of data in certain response ranges, such as at extremely low and high doses.

Almost all the statistical aspects of non-linear regression analysis on bioassay data have been reviewed and reconsidered after the massive introduction of personal computers and this technique has considerably spread in the past few years, especially in weed science. A list of references has been given later on.

However, several authors insist on analysing dose-response data by linearisation or even by multiple comparison tests. Likely, a further spreading of non-linear regression techniques is restrained by the fact that calculations, though made easier by commercially available statistical packages, still require a certain degree of statistical expertise. Indeed, non-linear regression routines found in those statistical packages have not been specifically developed to analyse bioassay data: certain routine tasks, such as the calculation of ED levels with confidence limits or the comparison among different dose-response curves, are not easily performed without a certain programming effort.

The aim of this work was to develop an EXCEL VBA macro, specifically thought to deal with dose-response curve analyses, which would enable users with a limited background in statistics to automatically perform log-logistic analyses, directly inside the most common spreadsheet.

All the rationale and theory behind this macro have been taken by Streibig et al. (1993 a and b) and all the tools have been included to carry out appropriate analyses on the main part of bioassays regarding the study of factors influencing pesticide performances.

Single curve analysis

A single curve analysis is usually performed whenever it is necessary to estimate ED-levels for a single dose-response curve.

Model selection

A response model, a dose variable and a response dose can be selected from drop-down menus. Concerning models, three selections are possible: (1) Log - logistic model (sigmoidal symmetric responses on log-dose); (2) Weibull model (logistic asymmetric responses) (Streibig et al., 1993); (3) Peaked model (Brain and Cousens, 1989; logistic responses with stimulation at low doses).

See the earlier cited papers for a detailed description of each single model and of biological meaning of parameters. Indeed, a log-logistic model would fit in most of the cases, while a Weibull model should be selected in case of clearly asymmetric responses. The peaked model is suitable in case of response stimulation at low doses, but its usage should be carefully evaluated, as its mathematical properties are not as good as those of the other two sigmoidal models.

Increasing or decreasing curves

Each model can be taken to handle both increasing and decreasing curves, such as those based on growth of test-species (that is expected to decrease, as dose increases) or on growth inhibition (that is expected to increase, as dose increases). The macro itself will decide whether an increasing or a decreasing curve is required and will provide itself for the necessary adjustments on the model.

Constraints

Constraints can be put on the lower and/or on the upper asymptotes. This might be necessary to reach convergence or to improve the estimates of the other parameters. In particular, constraints might be needed: (1) for biological reasons; (2) whenever the lower asymptote is not significantly different from 0 or negative; (3) dealing with curves based on percentage pest control, which are often supposed to range from 0 to 100%; (4) asymptotes are measured or known without any experimental error.

Starting values for parameters

BIOASSAY97 does not provides itself starting values for parameters. Basically, the highest and the lowest observed response values can be used as the starting points for the higher and the lower asymptotes respectively. A starting value for the inflection point can be relatively easily estimated from observed data (just choose a dose value which gave a response approximately half way between the higher and the lower asymptote), while a value of 1 for the slope should be considered appropriate in most cases. Likewise, a value of 1 should be appropriate for the parameter describing stimulation, if needed.

If a constrained model has been chosen, the value to which the parameter has to be constrained must be entered here. For example, if the user wants to constrain the lower asymptote to 0, such a value has to be entered in the apposite input box.

Lambda values for the transformation of response

Data do not always meet the basic assumptions for regression analyses. Tipically, whenever the difference between the highest and the lowest observed response is higher than one order magnitude, variances will not be constant across all the treatments. Among the Box and Cox (1964) families of transformations, the following one has been chosen:

where W is the transformed variable, Y is the untransformed variable, l is the transformation parameter and is the geometric mean of the observations. By default l is set to 1 (no transformation); a value of 0,5 means that a square root transformation is performed, while a value of 0 means that a logarithmic transformation is performed. Often, a value of 0,25 has been found to be appropriate for dose-response analyses. A maximum likelihood value for lambda can be chosen by a direct comparison of RSS values obtained with different values of l (Box e Draper, 1987).

In order to obtain parameter estimates on the original scales, a Transform Both Sides technique has been adopted (according to Streibig et al., 1993).

ED levels and confidence limits

By default, the program itself will calculate the most useful ED levels, together with confidence limits. Indeed, in the case of decreasing curves, ED10, ED30 and ED50 are calculated, while in the case of increasing curves, ED50, ED70 and ED90 are calculated. Should an additional ED level be required together with confidence limits, the corresponding response level must be entered on the appropriate input box (for example if the ED85 level is required, enter 85 on the textbox).

Confidence limits are calculated by using the delta method. For the peaked curve of Brain and Cousens, confidence limits for ED-levels are calculated by using the inference band for the expected response, as shown by Bates and Watts (1988) and Snedecor and Cochran (1991).

Multiple curve analysis

While the previous analyses is normally performed to estimate ED-levels, simultaneous fitting of several curves is performed when dose-response curves need to be compared on a statistical basis.

To be able to simultaneously analyse several dose-response curves, at least three column have to be included on the spreadsheet: a dose variable, a response variable and a categoric variable, coding for the different curves.

Example of database selection, when different curves have to be simultaneously fitted to observed data. In this example the performances of two herbicides are gong to be compared.

In some cases, all the curves share the same untreated check (pesticide rate equals to zero); this may happen for example when comparing the performances of several pesticides. In this case, the untreated check might be assigned arbitrarily to one of the curves, but common higher asymptotes have to be chosen for the different curves later on.

The three variables (dose, response and category) have to be selected from drop-down menus on the apposite window.

Model selection

Two models can be selected: a logistic and a Weibull model. A peaked model has not been included, because a comparison of dose-response curves on the stimulation range is not normally carried out. If needed, stimulation can be masked to improve the precision of estimates.

After selecting the type of model, the user should state whether the different curves have common parameters. Following figure shows an example wherein the user has requested the macro to consider different logistic curves with common asymptotes and slopes (parallel curves).

A lambda value should also be provided here. See single curve analyses for more information on data transformation.

Starting values for parameters and constraints

After selecting the model, the user is prompted to enter, for each curve, starting values for parameters. In the case that lower or higher asymptotes have to be constrained to a specific value, the apposite option button is to be selected.

For example, the following figure shows a case in which the user has requested the macro to constrain the lower asymptote of curve 1 one to 0.

In the case of multiple curves analysis, the macro does not provide starting values for parameters, but it just memorise the lastly specified values. It has been assumed that the user has already carried out separate analyses for each curve and thus has already precise indications on possible starting values for parameters.

Iteration process and results

The iterative procedure for parameter estimation is carried out by the EXCEL Solver. If convergence cannot be reached, a warning message is displayed and the analyses is stopped. Very often this depend on overparameterisation and/or wrong starting values for parameters. A more careful selection of model and/or starting parameter values should improve the iterative process.

When convergence is reached, results are displayed on a new worksheet, that is added to the current workbook. All the statistics are displayed, together with graphs to allow for appropriate graphical analyses of residuals to be performed.

References and further reading

BATES, D.M., WATTS, D.G., 1988. Nonlinear regression analysis & its applications., John Wiley & Sons, Inc, New York. BOX, G.E.P., COX, D.R., 1964. An analysis of transformations. Journal of the Royal Statistical Society, B–26, 211–243. BRAIN, P., COUSENS, R., 1989. An equation to describe dose responses where there is stimulation of growth at low doses. Weed Research 29, 93–96. DRAPER, N.R., SMITH, H., 1981. Applied regression. John Wiley & Sons Inc. FINNEY, D.J., 1979. Bioassay and the practice of statistical inference. International Statistical Reviews, 47, 1–12. STREIBIG, J.C., RUDEMO, M., JENSEN, J.E., 1993a. Dose-response curves and statistical methods. In “Herbicide bioassays”, ed. Streibig J.C. e Kudsk P., Boca Raton, 29–55. STREIBIG, J.C., 1988. Herbicide bioassay. Weed Research 28, 479–484. STREIBIG, J.C., JENSEN, J.E., OLOFSDOTTER, M., HAAS, H., ANDREASEN, C., LAWAETZ, E., 1993. Testing hypotheses with dose-response curves. Proc. 8th Symposium “Quantitative approaches in weed and herbicide research and their practical application”, Braunschweig, 423–431.

Aknowledgements

The author wishes to thank Euro Pannacci (DSAA, University of Perugia), Ivan Sartorato (IBAF, CNR Padova), Albert Fisher (University of California) for testing the macro and providing useful comments and suggestions.

Thursday 28 September 2017

L'analisi AMMI per le interazioni genotipo-ambiente

Il risultato degli esperimenti agronomici è sempre sotto la diretta influenza delle condizioni pedo-climatiche e, di conseguenza, le prove sperimentali vengono sempre ripetute in più ambienti (anni e/o località). Se, come di frequente capita, l’interazione tra il trattamento sperimentale in studio e l’ambiente è significativa, si viene ad introdurre un elemento di difficoltà per l’interpretazione e la sintesi dei dati, poichè diviene necessario discutere i risultati ambiente per ambiente, utilizzando numerose tabelle e/o grafici dai quali è difficile trarre conclusioni generali.

Esistono alcune tecniche esplorative di analisi multivariata che possono aiutarci a risolvere questo problema, come ad esempio l’analisi AMMI (Addittive Main effect Multiplicative Interaction), che permette di descrivere l’interazione tra fattori sperimentali in modo molto parsimonioso ed efficace, garantendo nel contempo un ottimo livello di leggibilità e sinteticità dei risultati.

Cerchiamo di scoprire questa tecnica con l’aiuto di un caso-studio pubblicato in Stagnari et al., 2007.

Un caso studio

Una prova di confronto varietale con 12 varietà di favino (in realtà si tratta di sei varietà in due epoche di semina, ma, per comodità, trascureremo questo secondo trattamento sperimentale) e quattro repliche è stata ripetuta in due località per tre anni (per un totale sei ambienti). Carichiamo il dataset e vediamo le medie.

rm(list=ls())
dataset <- read.csv("FabaBean.csv", header=T)
head(dataset)
##       Varieta Blocco Localita Prod
## 1    Chiaro_A      1    bad_1 4.36
## 2    Chiaro_P      1    bad_1 2.76
## 3 Collameno_A      1    bad_1 3.01
## 4 Collameno_P      1    bad_1 2.50
## 5    Palomb_A      1    bad_1 3.85
## 6    Palomb_P      1    bad_1 2.21

#Tabella pivot
library(reshape)
GEmedie <- cast(Varieta ~ Localita, data = dataset,
                 value = "Prod", fun=mean)
GEmedie
##        Varieta  bad_1  bad_2  bad_3  pap_1  pap_2  pap_3
## 1     Chiaro_A 4.1050 2.3400 4.1250 4.6325 2.4100 3.8500
## 2     Chiaro_P 2.5075 1.3325 4.2025 3.3225 1.4050 4.3175
## 3  Collameno_A 3.2500 2.1150 4.3825 3.8475 2.2325 4.0700
## 4  Collameno_P 1.9075 0.8475 3.8650 2.5200 0.9850 4.0525
## 5     Palomb_A 3.8400 2.0750 4.2050 5.0525 2.6850 4.6675
## 6     Palomb_P 2.2500 0.9725 3.2575 3.2700 0.8825 4.0125
## 7      Scuro_A 4.3700 2.1050 4.1525 4.8625 2.1275 4.2050
## 8      Scuro_P 3.0500 1.6375 3.9300 3.7200 1.7475 4.5125
## 9    Sicania_A 3.8300 1.9450 4.5050 3.9550 2.2350 4.2350
## 10   Sicania_P 3.2700 0.9900 3.7300 4.0475 0.8225 3.8950
## 11   Vesuvio_A 4.1375 2.0175 4.0275 4.5025 2.2650 4.3225
## 12   Vesuvio_P 2.1225 1.1800 3.5250 3.0950 0.9375 3.6275

#Medie varietali
Gmedie <- with(dataset, aggregate(Prod, list(Varieta), mean))
Gmedie
##        Group.1        x
## 1     Chiaro_A 3.577083
## 2     Chiaro_P 2.847917
## 3  Collameno_A 3.316250
## 4  Collameno_P 2.362917
## 5     Palomb_A 3.754167
## 6     Palomb_P 2.440833
## 7      Scuro_A 3.637083
## 8      Scuro_P 3.099583
## 9    Sicania_A 3.450833
## 10   Sicania_P 2.792500
## 11   Vesuvio_A 3.545417
## 12   Vesuvio_P 2.414583

#Medie ambientali
Emedie <- with(dataset, aggregate(Prod, list(Localita), mean))
Emedie
##   Group.1        x
## 1   bad_1 3.220000
## 2   bad_2 1.629792
## 3   bad_3 3.992292
## 4   pap_1 3.902292
## 5   pap_2 1.727917
## 6   pap_3 4.147292

#Media generale
mu <- mean(dataset$Prod)

A questo dataset può essere adattato un modello lineare del tipo:

\[Y_{ij} = \mu + g_i + e_j + ge_{ij} + \epsilon_{ij}\]

dove le osservazioni sperimentali per ogni i-esima ‘varietà’ e per ognij-esimo ambiente possono essere descritte in modo additivo, sommando la media generale, l’effetto varietà, l’effetto ambiente e l’interazione genotipo x ambiente (GE). Il residuo $\epsilon$ costituisce una misura dell’errore sperimentale.

Per un disegno bilanciato, gli effetti principali (varietà e ambiente: g ed e) sono dati dalle differenze tra le medie dei genotipi/ambienti e la media generale. Possiamo calcolarli in questo modo:

g <- Gmedie[,2] - mu
e <- Emedie[,2] - mu

g
##  [1]  0.47381 -0.25535  0.21299 -0.74035  0.65090
##  [6] -0.66243  0.53382 -0.00369  0.34757 -0.31076
## [11]  0.44215 -0.68868

e
## [1]  0.116736 -1.47347  0.88903  0.7990 -1.3753  1.04403

L’effetto di interazione (ge) è dato da:

\[ge_{ij} = Y_{ij} - \left( \mu + g_i + e_j \right)\]

Ad esempio, per Chiaro di Torre Lama (A) l’effetto del genotipo è 0.4738, mentre per il primo ambiente l’effetto è 0.1167. La produzione riscontrata per Chiaro di Torre Lama nel primo ambiente è 4.105, quindi l’interazione genotipo-ambiente è data:

$$ ge_{1,1} = 4.1050 - (3.1033 + 0.4738 + 0.1167) = 0.4112 $$

Questo valore positivo implica che quella particolare combinazione ‘varietà x ambiente’ è risultata particolarmente valida, più di quanto ci si sarebbe potuti aspettare guardando separatamente i singoli effetti principali g ed e. Altri esempi aiuteranno meglio a comprendere il concetto di interazione; nel caso di Chiaro di Torre Lama (A) nel terzo ambiente, abbiamo un interazione pari a:

$$ ge_{1,3} = 4.125 - (3.1033 + 0.4738 + 0.8890) = –0.3411 $$

che mostra come Chiaro di Torre Lama (A), pur avendo avuto una buona produzione in quell’ambiente (4.125), non è apparso particolarmente favorito. Viceversa, nello stesso ambiente, Collameno (P) ha mostrato un interazione:

$$ ge_{4,3} = 3.8650 - (3.1033 - 0.7403 + 0.8890) = 0.6130 $$

che dimostra come una varietà mediamente poco produttiva, in questo ambiente abbia trovato condizioni particolarmente favorevoli.

L’equazione sovrastante può essere scritta:

\[ge_{ij} = Y_{ij} - \left( \mu + e_j \right) - g_i = Y_{ij} - \mu_j - g_i\]

il che mostra che possiamo eseguire una centratura della matrice iniziale per le colonne (il che equivale a sottrarre da ogni dato la media dell’ambiente). La matrice risultante può essere ancora centrata per le righe, il che equivale a sottrarre da ogni dati l’effetto del genotipo, dato che, per ogni riga, abbiamo già rimosso la media generale e l’effetto dell’ambiente. Insomma la matrice risultante è doppiamente centrata (“doubly centered”):

GEint <- as.data.frame(t(scale( t(scale(GEmedie, center=T, scale=F)),
                                center=T, scale=F)))
print(GEint, digits=3)
##               bad_1    bad_2   bad_3   pap_1    pap_2   pap_3
## Chiaro_A     0.4112  0.23639 -0.3411  0.2564  0.20826 -0.7711
## Chiaro_P    -0.4572 -0.04194  0.4656 -0.3244 -0.06757  0.4256
## Collameno_A -0.1830  0.27222  0.1772 -0.2678  0.29160 -0.2903
## Collameno_P -0.5722 -0.04194  0.6131 -0.6419 -0.00257  0.6456
## Palomb_A    -0.0309 -0.20569 -0.4382  0.4993  0.30618 -0.1307
## Palomb_P    -0.3076  0.00514 -0.0724  0.0301 -0.18299  0.5276
## Scuro_A      0.6162 -0.05861 -0.3736  0.4264 -0.13424 -0.4761
## Scuro_P     -0.1663  0.01139 -0.0586 -0.1786  0.02326  0.3689
## Sicania_A    0.2624 -0.03236  0.1651 -0.2949  0.15951 -0.2599
## Sicania_P    0.3608 -0.32903  0.0485  0.4560 -0.59465  0.0585
## Vesuvio_A    0.4753 -0.05444 -0.4069  0.1581  0.09493 -0.2669
## Vesuvio_P   -0.4088  0.23889  0.2214 -0.1186 -0.10174  0.1689

L’ANOVA (considerando gli ambienti come un fattore fisso) è:

mod.aov <- aov(Prod ~ Varieta*Localita + Error(Localita:factor(Blocco)), data=dataset)
## Warning in aov(Prod ~ Varieta * Localita +
## Error(Localita:factor(Blocco)), : Error() model is singular
summary(mod.aov)
## 
## Error: Localita:factor(Blocco)
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## Localita   5  316.6   63.31   168.6 1.83e-14 ***
## Residuals 18    6.8    0.38                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##                   Df Sum Sq Mean Sq F value Pr(>F)    
## Varieta           11  70.03   6.366  58.411 <2e-16 ***
## Varieta:Localita  55  30.97   0.563   5.167 <2e-16 ***
## Residuals        198  21.58   0.109                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gli obiettivi e il principio di fondo

La matrice GEint, per la sua dimensionalità, non è facilmente esaminabile e rappresentabile graficamente e crea quindi alcune difficoltà di studio.

L’analisi AMMI (Zobel et al., 1988; Gollob, 1968) può aiutarci a superare queste difficoltà in quanto ci fornisce un metodo per ridurre la dimensionalità della matrice GEint, senza perdere una grossa porzione delle informazioni sperimentali in essa contenute. In pratica, l’analisi AMMI non è altro che una Analisi delle Componenti principali sulla matrice doppiamente centrata ed è quindi basata sulla sua decomposizione ai valori singolari. Ricordiamo che una qualunque matrice X con j colonne e k righe può essere decomposta secondo la seguente espressione:

\[X = U D V^T\]

dove $U$ è la matrice dei primi n autovettori di $XX^T$ (di dimensione j x n), $V$ è la matrice dei primi n} autovettori di $X^T X$ (di dimensione k x n), $D$ è la matrice diagonale delle radici quadrate dei primi n autovalori di $XX^T$ (che coincidono con i primi n autovalori di $X^T X$). Il teorema di Eckart-Young (1936; si veda Gollob, 1968) stabilisce che la decomposizione è perfetta se n è uguale al rango di X, mentre la miglior approssimazione si ottiene conservando solo le prime r (con r < n) colonne delle matrici U e V e i primi r termini di D.

I calcoli

In R i calcoli si possono eseguire banalmente con la funzione rda() del package vegan. Usualmente, per l’analisi AMMI si preferisce uno scaling simmetrico, anche se questa non è una regola:

library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.4-3
ammi <- rda(GEint, scale=F)
summary(ammi, scaling=3)
## 
## Call:
## rda(X = GEint, scale = F) 
## 
## Partitioning of variance:
##               Inertia Proportion
## Total          0.7039          1
## Unconstrained  0.7039          1
## 
## Eigenvalues, and their contribution to the variance 
## 
## Importance of components:
##                          PC1    PC2     PC3     PC4     PC5
## Eigenvalue            0.4809 0.1263 0.05869 0.02382 0.01422
## Proportion Explained  0.6832 0.1794 0.08338 0.03385 0.02019
## Cumulative Proportion 0.6832 0.8626 0.94596 0.97981 1.00000
## 
## Scaling 3 for species and site scores
## * Both sites and species are scaled proportional to eigenvalues
##  on all dimensions
## * General scaling constant of scores:  1.66812 
## 
## 
## Species scores
## 
##            PC1      PC2      PC3     PC4     PC5
## bad_1 -0.83079  0.09477 -0.46711 -0.3173 -0.1507
## bad_2  0.04402 -0.41802  0.07037  0.3706 -0.4025
## bad_3  0.67043 -0.12977 -0.52477  0.1705  0.2977
## pap_1 -0.66137  0.51268  0.28858  0.3143  0.2210
## pap_2 -0.06863 -0.62703  0.41985 -0.2941  0.2079
## pap_3  0.84634  0.56736  0.21307 -0.2440 -0.1733
## 
## 
## Site scores (weighted sums of species scores)
## 
##                  PC1       PC2        PC3      PC4      PC5
## Chiaro_A    -0.60711 -0.383733  0.0008682  0.20801 -0.06349
## Chiaro_P     0.55193  0.026531 -0.0809519  0.04480  0.16410
## Collameno_A  0.08445 -0.542186 -0.0063099  0.17593  0.05688
## Collameno_P  0.80677 -0.065753 -0.1321600 -0.17195  0.07930
## Palomb_A    -0.32131  0.110117  0.5908136 -0.08279  0.38856
## Palomb_P     0.28105  0.345909  0.2816496  0.04235 -0.25302
## Scuro_A     -0.62639  0.139186 -0.1626008  0.01712 -0.08010
## Scuro_P      0.22961  0.076556  0.1817985 -0.20708 -0.24155
## Sicania_A   -0.06287 -0.323857 -0.3547204 -0.27987  0.09016
## Sicania_P   -0.21433  0.683297 -0.4016608  0.14811  0.15052
## Vesuvio_A   -0.43787 -0.007914  0.0202503 -0.29981 -0.17690
## Vesuvio_P    0.31606 -0.058153  0.0630236  0.40517 -0.11447

G <- summary(ammi, scaling=3)$specie
E <- summary(ammi, scaling=3)$sites
print(G%*%t(E), digits=3)
##       Chiaro_A Chiaro_P Colla_A Colla_P Palomb_A Palomb_P Scuro_A
## bad_1    0.411  -0.4572  -0.183  -0.572  -0.0309 -0.30757  0.6162
## bad_2    0.236  -0.0419   0.272  -0.042  -0.2057  0.00514 -0.0586
## bad_3   -0.341   0.4656   0.177   0.613  -0.4382 -0.07236 -0.3736
## pap_1    0.256  -0.3244  -0.268  -0.642   0.4993  0.03014  0.4264
## pap_2    0.208  -0.0676   0.292  -0.003   0.3062 -0.18299 -0.1342
## pap_3   -0.771   0.4256  -0.290   0.646  -0.1307  0.52764 -0.4761
##       Scuro_P Sicania_A Sicania_P Vesuvio_A Vesuvio_P
## bad_1 -0.1663    0.2624    0.3608    0.4753    -0.409
## bad_2  0.0114   -0.0324   -0.3290   -0.0544     0.239
## bad_3 -0.0586    0.1651    0.0485   -0.4069     0.221
## pap_1 -0.1786   -0.2949    0.4560    0.1581    -0.119
## pap_2  0.0233    0.1595   -0.5947    0.0949    -0.102
## pap_3  0.3689   -0.2599    0.0585   -0.2669     0.169

Le matrici G ed E (rispettivamente scores dei genotipi e degli ambienti) moltiplicate tra di loro restituiscono la matrice GEint. Ci si può a questo punto chiedere: si, ma cosa abbiamo ottenuto fin ora? In realtà nulla, abbiamo solo trovato una nuova forma algebrica per esprimere la matrice dell’interazione.

Ovviamente, noi non siamo interessati alla decomposizione completa di cui sopra, ma alla decomposizione parziale, ottenuta estraendo le prime r componenti principali. Se utilizzando la prima componente principale (cioè la prima colonna di G ed E, che chiameremo G1 ed E1, possiamo calcolare una matrice d’interazione “semplificata” (approssimata):

PC <- 1
G1 <- G[,1:PC]
E1 <- E[,1:PC]
print ( G1 %*% t(E1), digits=2)
##      Chiaro_A Chiaro_P Colla_A Colla_P Palomb_A Palomb_P Scuro_A
## [1,]    0.504   -0.459 -0.0702  -0.670    0.267   -0.233   0.520
## [2,]   -0.027    0.024  0.0037   0.036   -0.014    0.012  -0.028
## [3,]   -0.407    0.370  0.0566   0.541   -0.215    0.188  -0.420
## [4,]    0.402   -0.365 -0.0559  -0.534    0.213   -0.186   0.414
## [5,]    0.042   -0.038 -0.0058  -0.055    0.022   -0.019   0.043
## [6,]   -0.514    0.467  0.0715   0.683   -0.272    0.238  -0.530
##      Scuro_P Sicania_A Sicania_P Vesuvio_A Vesuvio_P
## [1,]  -0.191    0.0522    0.1781     0.364    -0.263
## [2,]   0.010   -0.0028   -0.0094    -0.019     0.014
## [3,]   0.154   -0.0421   -0.1437    -0.294     0.212
## [4,]  -0.152    0.0416    0.1418     0.290    -0.209
## [5,]  -0.016    0.0043    0.0147     0.030    -0.022
## [6,]   0.194   -0.0532   -0.1814    -0.371     0.267

Gli elementi contenuti in questa matrice hanno una devianza pari a 5.29, che costituisce il 68.3% della devianza degli elementi in GEint.

sum(unlist(GEint^2))
## [1] 7.742996
sum(unlist (G1 %*% t(E1))^2)
## [1] 5.290174

Allo stesso modo, se utilizziamo le prime due componenti principali possiamo osservare che utilizzando anche la PC2, la matrice risultante ha una devianza pari 5.29 + 1.388. Di conseguenza, l’inclusione della seconda PC consente di spiegare un ulteriore 17.9% della devianza dell’interazione.

Quante componenti principali includere nell’analisi AMMI?

Come nella PCA, la scelta del numero di componenti principali da considerare può essere semplicemente basata sulla percentuale di devianza spiegata. Ciò rientra perfettamente nella logica di esplorazione dei dati (Exploratory Data Analysis) che ci spinge ad eseguire l’analisi AMMI.

Tuttavia, se ci si vuole muovere in una logica più tradizionale di test d’ipotesi, si può considerare che i gradi di libertà associati ad ogni PC sono dati da (Zobel, 1988):

$$ df = j + k - 1 + 2n $$

dove \textit{j} è il numero delle varietà, \textit{k} è il numero degli ambienti ed \textit{n} è il numero delle PC considerate. Quindi in questo caso la PC1 ha 15 gradi di libertà, mentre la PC2 ne ha 13. Inoltre, le devianze che abbiamo appena calcolato vanno moltiplicate per 4, dato che abbiamo quattro repliche.

Se si dispone delle repliche (e quindi di una stima dell’errore ‘puro’), le informazioni precedenti possono essere utilizzate nell’ANOVA per verificare quante PC inserire, tenendo conto che la prova aveva quattro repliche. In questo caso, non sono necessarie altre PC oltre le prime 2.

Analisi della varianza

Secondo molti autori il test di F riportato nella tabella precedente è troppo poco conservativo e sono state proposte altre procedure più complesse (Cornelius, 1993).

Perchè l’analisi AMMI è utile? Il Biplot

A questo punto, l’utilità dell’analisi AMMI dovrebbe essere chiara: se la prima (o al limite le prime due) PC spiegano una gran parte della devianza dell’interazione, quest’ultima può essere descritta utilizzando una sola colonna (o al limite due) di G ed E. Di conseguenza G$_1$ ed E$_1$ possono essere facilmente plottati su un grafico.

In particolare, si può utilizzare un biplot comq quello in figura 1, dove possiamo rappresentare sulle ascisse la produzione media di ogni varietà e di ogni ambiente (gli effetti principali in t/ha, nel nostro caso) e sulle ordinate gli scores di ciascuna varietà ed ambiente (che a loro volta rappresentano l’interazione ‘genotipo x ambiente’. Il biplot diviene quindi uno strumento riassuntivo molto efficace, in quanto rappresenta un’altissima quota della devianza totale della prova (in questo caso il 91.6%).

La lettura del grafico è molto semplice: oltre al livello produttivo, le varietà e gli ambienti che hanno factor scores vicini a 0 non interagiscono (infatti il prodotto degli scores, cioè l’interazione nella sua forma semplificata, sarebbe vicino a 0).

Allo stesso modo, le varietà e gli ambienti che hanno factor scores dello stesso segno mostrano un interazione positiva (perchè tale è il prodotto di grandezze di ugual segno), tanto più alta quanto più alti sono i factor scores.

Viceversa per le varietà e gli ambienti che hanno factor scores di segno opposto interagiscono negativamente, in modo tanto più marcato quanto più alti sono i factor scores (in valore assoluto).

Si può ancora osservare che il gruppo di varietà indicate con A sone mediamente più produttive di quelle indicate con P. Nel primo gruppo, Palombino, Vesuvio, Scuro e Chiaro di Torre Lama sono state le più produttive nel 2001 in entrambe le località (PC score per l’ambiente vicino a quello della varietà), mentre Collameno e Sicania si sono rivelate in media meno produttive, ma anche meno influenzate dall’ambiente (scores quasi pari a 0).

Nel secondo gruppo (P), Scuro di Torre Lama, Sicania e Chiaro di Torre Lama si sono rivelate le più produttive; nel caso di Sicania, le prestazioni sono state basse nel 2003 in entrambe le località (score per l’ambiente molto lontano dallo score di varietà) mentre Chiaro di torre Lama non è andata particolarmente bene nel 2001.

Per quanto riguarda gli ambienti, il 2003 è stato l’anno più favorevole in entrambe le località per le varietà del gruppo P, mentre il 2001 è stato favorevole per le varietà del gruppo A. Al contrario, il 2002 è stato un anno poco produttivo per tutte le varietà.

Insomma, un solo grafico può darci una quantità notevole di informazioni, se solo lo si riesce a leggere in modo appropriato!

Se volessimo concentrarci sulla sola interazione, trascurando gli effetti principali, potremmo usare un biplot analogo a quelli utilizzati nel PCA

biplot(ammi, scaling = 3)

Figura 2. Biplot 2

Bibliografia essenziale

Annichiarico, P. (1997). Addittive main effects and multiplicative interaction (AMMI) analysis of genotype-location interaction in variety trials repeated over years. Theorethycal applied genetics, 94, 1072–1077.
Annichiarico, P. (1997). Joint regression analysis of genotype-environment interactions for cereals in Italy. Euphytica, 94, 53–62.
Annichiarico, P. (1997). Nuove metodologie per lo studio dell’interazione trattamento-ambiente e per la definizione della raccomandazione tecnica. Rivista di Agronomia,31, 817–823.
Annichiarico, P. (2000). Analisi della sperimentazione ripetuta in più ambienti. L’ Informatore agrario, 29–34.
Annichiarico, P., Bozzo, F., Parente, G., Gusmeroli, F., Mair, V., Marguerettaz, O., and Orlandi, D. (1995). Analysis of adaptation of grass/legume mixtures to Italian alpine and subalpine zones through an addittive main effects and multiplicative interaction model. Grass and Forage Science, 50, 405–413.
Ariyo, O. J. (1998). Use of additive main effects and multiplicative interaction model to analyse multilocation soybean varietal trials. J. Genet. and Breed, 129–134.
Cornelius, P. L. (1993). Statistical tests and retention of terms in the Addittive Main Effects and Multiplicative interaction model for cultivar trials. Crop Science, 33,1186–1193.
Crossa, J. (1990). Statistical Analyses of multilocation trials. Advances in Agronomy, 44, 55–85.
Gollob, H. F. (1968). A statistical model which combines features of factor analytic and analysis of variance techniques.Psychometrika, 33, 73–114.
Stagnari F., Onofri A., Jemison J., Monotti M. (2006). Multivariate analyses to discriminate the behaviour of faba bean (Vicia faba L. var. minor) varieties as affected by sowing time in cool, low rainfall Mediterranean environments. Agronomy For Sustainable Development, 27, 387–397.
Zobel, R. W., Wright, M.J., and Gauch, H. G. (1988). Statistical analysis of a yield trial. Agronomy Journal, 388–393.