ANOVA from scratch

Project developed for the “Computer methods” course in the Master in Methodology for behavioral sciences. Full description and examples can be found here and here.

Main function

ANOVA: Perform a one- or two-way fixed-effects ANOVA.

Arguments:
- factor: character vector of length 1 or 2 with the name of the predictors or factors. They must refer to a variable of type factor of a dataframe. In case the factor(s) are not of type factor, the function check_factor is called to transform the dataframe variables into factor type.
- vd: character vector of length 1 with the name of the outcome or dependent variable. It must refer to a numeric variable of a dataframe.
- data: dataframe containing the factor or factors and the dependent variable.
output:

Depending on the number of factors, returns the function one_way or the function two_way.

ANOVA <- function(factor,vd,data){
  
  num_fact <- length(factor)
  
  data <- check_factor(factor,data)
  
  switch(num_fact,
         
         return(one_way(num_fact,factor,vd,data)),
         return(two_way(num_fact,factor,vd,data)))
}

Auxiliary functions

check_factor: Checks that the predictors are type factor. If they are not, it transforms the variables into factors in the original dataframe itself and returns it.


check_factor <- function(vars,df){
  
  for (var in vars){
    
    if (!(is.factor(df[,var])))
      
      warning(paste0('se convertira ',var,' en tipo factor '))
      df[,var] <- as.factor(df[,var])
  }
  return(df)
}

anova_plot: Returns line graph with means for each level (one-way ANOVA) or for each combination of levels (two-way ANOVA). Allows exploration of the interaction effect by visual inspection.

anova_plot <- function(num_fact,factor,vd,data,medias=NULL,J=NULL){
  
  if(num_fact==1){
    
    labels <- levels(data[,factor])

    plot(1:J,medias,
         type='o', lwd=2,
         ylim=c(min(data[,vd]),max(data[,vd])),
         xlab=factor,ylab=vd,
         pch=19, col='#ff6b6b',
         xaxt='n')
    
    axis(1,1:J,labels=labels) # xticks
  }
  
  else{
    
    interaction.plot(x.factor = data[,factor[1]], trace.factor = data[,factor[2]], 
                 response = data[,vd], fun = mean, 
                 type = "b", legend = TRUE, lwd=2,
                 ylim=c(min(data[,vd]),max(data[,vd])),
                 xlab = factor[1], ylab=vd,trace.label=NULL,
                 pch=c(1,19), col = c('#4ecdc4', '#ff6b6b'))
  }
  
  grid(col = "lightgray", lty = "dotted",
         lwd = par("lwd"), equilogs = TRUE) # rejilla para ver mejor
}

one_way : Perform one-way fixed-effects ANOVA. The procedure is based on the chapter on one-way ANOVA in Pardo and San Martín (2015). The function provides a list with three dataframes containing the value of the F statistic, its critical level and three measures of effect size.


one_way <- function(num_fact,factor,vd,data){
  
  N <- nrow(data)                                   # Tamaño muestral
  J <- nlevels(data[,factor])                       # Número de niveles del factor A
  
 # ------------------------------------------------- Estimador variabilidad intra basado 
 #                                                   en varianzas de cada nivel j del factor

  varianzas <- aggregate(data[vd],
                         by=list(data[[factor]]),
                         FUN=var)[[vd]]
  
  n <- aggregate(data[vd],
                 by=list(data[[factor]]),
                 FUN=length)[[vd]]
  
  SCE <- sum((n-1)*varianzas)
  glE <- N-J
  MCE <- SCE/glE
  
  # ------------------------------------------------- Estimador variabilidad inter
  #                                                   basado en las medias del factor

  medias_A_df <- aggregate(data[vd],
                           by=list(data[[factor]]),
                           FUN=mean)
  medias_A <- medias_A_df[[vd]]             # el df es para la salida
  
  media_tot <- mean(data[,vd])
  SCI <- sum( n * (medias_A-media_tot)^2  )
  glI <- J-1
  MCI <- SCI/glI
  
  F_stat <- MCI/MCE
  
  p_val <- 1-pf(F_stat,glI,glE)
  
  # ------------------------------------------------- Salida
  
  anova_plot(num_fact,factor,vd,data,medias_A,J)

 # -------------------------------------------------
  
  names(medias_A_df) <- c(factor,'medias')
  
  sumas_cuad <- data.frame('SCT'=SCI+SCE,
                           'SCI'=SCI,
                           'SCE'=SCE)
  contraste <- data.frame('Factores'=factor,
                          'F'=F_stat,
                          'nivel crítico'=p_val)

  return(list('medias'=medias_A_df,
              'sumas cuadráticas'=sumas_cuad,
              'contraste'=contraste,
              'tamaño efecto'=effect_size(num_fact,N,
                                          glE,
                                          glA=glI,F_A=F_stat)))
}

two_way : Performs two-way ANOVA with fixed effects and assuming the same number of participants per combination of factor levels (balanced design). The procedure is based on the chapter on two-factor ANOVA in Pardo and San Martín (2015). The function provides a list with three dataframes containing the value of the F-statistic for each factor and for the interaction, the associated critical levels and three measures of effect size for each.


two_way <- function(num_fact,factor,vd,data){
  
  N <- nrow(data)                                 # Tamaño muestral
  J <- nlevels(data[,factor[1]])                  # Número de niveles del factor A
  K <- nlevels(data[,factor[2]])                  # Número de niveles del factor B
  n <- N/(J*K)                                    # Número de sujetos por combinación jk
                                                  # (se asume anova con diseño equilibrado)
  
  # -------------------------------------------------
  
  media_tot <- mean(data[,vd])
  
  SCT <- sum((data[vd]-media_tot)^2)
  glT <- N-1
 # ------------------------------------------------- Estimador variabilidad intra 
 #                                                   basado en varianzas de cada combinación jk

  varianzas <- aggregate(data[vd],
                         by=list(data[[factor[1]]],data[[factor[2]]]),
                         FUN=var)[[vd]]
  SCE <- sum((n-1)*varianzas)
  glE <- N-(J*K)
  MCE <- SCE/glE
  
 # ------------------------------------------------- Estimador variabilidad inter
 #                                                   basado en medias del factor A
  medias_A <- aggregate(data[vd],
                        by=list(data[[factor[1]]]),
                        FUN=mean)[[vd]]
  
  SCa <- n*K*sum( (medias_A-media_tot)^2 )
  glA <- J-1
  MCa <- SCa/glA
  F_stat_A <- MCa/MCE
  
  p_val_A <- 1-pf(F_stat_A,glA,glE)
  
 # ------------------------------------------------- Estimador variabilidad inter
 #                                                   basado en medias del factor B
         
  medias_B <- aggregate(data[vd],
                        by=list(data[[factor[2]]]),
                        FUN=mean)[[vd]]
  
  SCb <- n*J*sum( (medias_B-media_tot)^2 )
  glB <- K-1
  MCb <- SCb/glB
  F_stat_B <- MCb/MCE
  
  p_val_B <- 1-pf(F_stat_B,glB,glE)
  
  # ------------------------------------------------- Estimador variabilidad inter
  #                                                   basado en medias del factor interacción

  medias_inter_df <- aggregate(data[vd],
                        by=list(data[[factor[1]]],data[[factor[2]]]),
                        FUN=mean)
  medias_inter <- medias_inter_df[[vd]]               # el df es para la salida
  
  SCab <- SCT-(SCa+SCb+SCE)
  glAB <- (J-1)*(K-1)
  MCab <- SCab/glAB
  F_stat_AB <- MCab/MCE
  
  p_val_AB <- 1-pf(F_stat_AB,glAB,glE)

  # ------------------------------------------------- Ajuste condicional. Compara
  #                                                   modelo aditivo e interactivo
  
  aditiv_vs_inter <- ajuste_condicional(factor,vd,
                                        SCT,SCa,SCb,SCE,
                                        glT,glA,glB,glE)
  
  # ------------------------------------------------- Salida
  
  anova_plot(num_fact,factor,vd,data)
  
  # -------------------------------------------------


  names(medias_inter_df) <- c(factor,'medias') # nombrar adecuadamente df con medias
  
  sumas_cuad <- data.frame('SCT'=SCT,
                           'SCa'=SCa,
                           'SCb'=SCb,
                           'SCab'=SCab,
                           'SCE'=SCE)
  contraste <- data.frame('Factores'=c(factor,paste0(factor[1],'*',factor[2])),
                          'F'=c(F_stat_A,F_stat_B,F_stat_AB),
                          'nivel crítico'=c(p_val_A,p_val_B,p_val_AB))
  
  tamano_efecto <- data.frame(contraste[1],
                                       effect_size(num_fact,N,
                                                   glE,
                                                   glA,F_A=F_stat_A,
                                                   glB,F_B=F_stat_B,
                                                   glAB,F_AB=F_stat_AB))
  return(list('ajuste_condicional'=aditiv_vs_inter,
              'medias'=medias_inter_df,
              'sumas cuadráticas'=sumas_cuad,
              'contraste'=contraste,
              'tamaño efecto'=tamano_efecto))
}

ajuste_condicional: It compares the fit of the simple model that includes only main effects (additive model) versus that of the more complex model, which includes the interaction component (interactive model). A non-statistically significant result indicates that there is no evidence that the interactive model produces a significant reduction in prediction errors compared to the simpler model. In this case a warning is given to the user, indicating that the ANOVA model generated may not be optimal. The procedure has been copied from Ato and Vallejo (2015) (p.138).


ajuste_condicional <- function(factor,vd,
                               SCT,SCa,SCb,SCE,
                               glT,glA,glB,glE){
  
  SCE_aditivo <- SCT-(SCa+SCb)
  glE_aditivo <- glT-(glA+glB)
  
  SCE_inter <- SCE            # SCE de modelo inter = SCT-(SCa+SCb+SCab)
  glE_inter <- glE            # glE de modelo inter = glT-(glA+glB+glAB)
  
  F_stat <- ( (SCE_aditivo-SCE_inter)/(glE_aditivo-glE_inter) )/(SCE_inter/glE_inter)
  p_val <- 1-pf(F_stat,(glE_aditivo-glE_inter),(glE_inter))
  
 
 adit <-  paste0(vd,'=',factor[1],'+',factor[2])
 inter <- paste0(vd,'=',factor[1],'+',factor[2],'+',factor[1],'*',factor[2])
 
 comparacion <- data.frame('modelo aditivo'=adit,
                           'modelo interactivo'=inter,
                           'F'=F_stat,
                           'nivel crítico'=p_val)
  
  if (p_val>0.05) warning('Diferencia no significativa en grado de ajuste con 
                          alfa=0,05. El modelo interactivo podría no ser apropiado')
  
  return(data.frame('modelos'=c(adit,inter),
                    'F'=c(NA,F_stat),
                    'nivel crítico'=c(NA,p_val)))
}

effect_size : Calculates three measures of effect size for all factors involved in the ANOVA. Depending on the type of ANOVA (with 1 or 2 factors), it returns some formulas or others. All the formulas used have been copied directly from the chapters dedicated to ANOVA in Pardo and San Martín (2015). Factors are assumed to be fixed effects. Problem: when comparing results with those obtained with JASP software, it is observed that $\omega^2$ of the function is higher, the discrepancy being quite large in the case of two-factor ANOVA.

effect_size <- function(num_fact,N,
                        glE,
                        glA,F_A,
                        glB=NULL,F_B=NULL,
                        glAB=NULL,F_AB=NULL){
  etas <- numeric()
  omegas <- numeric()
  deltas <- numeric()
  
  for (gl_F in list(A=c(glA,F_A), B=c(glB,F_B), AB=c(glAB,F_AB))){
    
      eta <- (gl_F[1]*gl_F[2])/(gl_F[1]*gl_F[2]+glE)
      omega <- (gl_F[1]*(gl_F[2]-1))/(gl_F[1]*(gl_F[2]-1)+N)
      delta <- sqrt( omega/(1-omega) )
      etas <- c(etas,eta); omegas <- c(omegas,omega); deltas <- c(deltas,delta)
  }
  
  efs <- data.frame('delta Cohen' = deltas,
                    'eta cuadrado' = etas,
                    'omega cuadrado' = omegas)
  
  return(efs)
}

An example: two-way ANOVA

This data set, “Heart Rate”, (example JASP file) provides heart rates of male and female runners and generally sedentary participants following 6 minutes exercise (800 observations). Does gender affect average heart rate? Does lifestyle (runners and non-runners) have an effect on heart rate? Is there an interactive effect of lifestyle and gender on heart rate?

Variables:

Gender - Participant’s gender (Female, Male).

Group - Group of ‘Runners’ (averaging more than15 miles per week) and ‘Control’ group (generally sedentary participant).

Heart.Rate - Heart rate after six minutes of exercise.

link to data

Gender	Group	Heart.Rate
Female	Runners	119
Female	Runners	84
Female	Runners	89
Female	Runners	119
…	…	…

ANOVA(factor=c('Gender','Group'),vd='Heart.Rate',data=data2)

anova interaction plot

$ajuste_condicional
                               modelos        F nivel.crítico
1              Heart.Rate=Gender+Group       NA            NA
2 Heart.Rate=Gender+Group+Gender*Group 7.409481   0.006629953

$medias
  Gender   Group  medias
1 Female Control 148.000
2   Male Control 130.000
3 Female Runners 115.985
4   Male Runners 103.975

$sumas_cuadráticas
       SCT      SCa      SCb     SCab      SCE
1 407985.9 45030.01 168432.1 1794.005 192729.8

$contraste
      Factores          F nivel.crítico
1       Gender 185.979949   0.000000000
2        Group 695.647040   0.000000000
3 Gender*Group   7.409481   0.006629953

$tamaño_efecto
      Factores delta.Cohen eta.cuadrado omega.cuadrado
1       Gender  0.48085854  0.189392817    0.187800726
2        Group  0.93183089  0.466361694    0.464756575
3 Gender*Group  0.08950894  0.009222546    0.007948171

References

Pardo, A., & San Martín, R. (2015). Análisis de datos en ciencias sociales y de salud II. Editorial Síntesis.

Ato, M. & Vallejo, G. (2015). Diseños de investigación en Psicología. Ediciones Pirámide.