An official website of the United States government.

This is not the current EPA website. To navigate to the current EPA website, please go to www.epa.gov. This website is historical material reflecting the EPA website as it existed on January 19, 2021. This website is no longer updated and links to external websites and some internal pages may not work. More information »

CADDIS Volume 4

Using R for Parametric Regression

How to Fit Parametric Regressions

Single variable parametric regressions for presence/absence of different taxa (see Parametric Regression page, Equation 3) are computed using the generalized linear model (GLM) function in R.
 
First make sure that you have loaded the sample biological and environmental data and merged them into a single data frame called dfmerge (see Download Scripts and Sample Data in the Helpful Links box).
 
Also make sure that you have selected the taxa for which you wish to calculate environmental limits and saved them in the vector taxa.names (see the description in the R Scripts for Central Tendencies page in the Helpful Links box).
 

Each model is stored within a single list of models.

# Create storage list for models
modlist.glm <- as.list(rep(NA, times = length(taxa.names)))

for (i in 1:length(taxa.names)) {
  # Create a logical vector is true if taxon is present and false if 
  # taxon is absent.
  resp <- dfmerge[,taxa.names[i]] > 0

  modlist.glm <- glm(resp ~ poly(temp,2), data = dfmerge, 
                  family = "binomial")
  # Fit the regression model and store the results in a list.
  # Here, poly(temp,2) specifies that the
  # model is fitting using a second order polynomial of the 
  # explanatory variable.  glm calls the function that fits
  # Generalized Linear Models.  We specify in this case that
  # the response variable is distributed binomially.

  # Print out summary statistics for each model
  print(summary(modlist.glm))
}

To plot the model results (similar to those shown on the Parametric Regression page, Figure 5) run the following script.  

# Specify 3 plots per page
par(mfrow = c(1,3), pty = "s")        
for (i in 1:length(taxa.names)) {

  # Compute mean predicted probability of occurrence            
  # and standard errors about this predicted probability.
  predres <- predict(modlist.glm, type= "link", se.fit = T)
    
  # Compute approximate upper and lower 90% confidence limits
  up.bound.link <- predres$fit + 1.65*predres$se.fit
  low.bound.link <- predres$fit - 1.65*predres$se.fit
  mean.resp.link <- predres$fit

  # Convert from logit transformed values to probability.
  up.bound <- exp(up.bound.link)/(1+exp(up.bound.link))
  low.bound <- exp(low.bound.link)/(1+exp(low.bound.link))
  mean.resp <- exp(mean.resp.link)/(1+exp(mean.resp.link))

  # Sort the environmental variable.
  iord <- order(dfmerge$temp)        

  # Define bins to summarize observational data as
  # probabilities of occurrence
  nbin <- 20            
  
  # Define bin boundaries so each bin has approximately the same
  # number of observations.    
  cutp <- quantile(dfmerge$temp, 
                   probs = seq(from = 0, to = 1, length = 20))
  # Compute the midpoint of each bin                
  cutm <- 0.5*(cutp[-1] + cutp[-nbin])

  # Assign a factor to each bin                    
  cutf <- cut(dfmerge$temp, cutp, include.lowest = T)
    
  # Compute the mean of the presence/absence data within each bin.
  vals <- tapply(dfmerge[, taxa.names[i]] > 0, cutf, mean)
                        
  # Now generate the plot
  # Plot binned observational data as symbols.
  plot(cutm, vals, xlab = "Temperature", 
       ylab = "Probability of occurrence", ylim = c(0,1),
       main = taxa.names[i])        
  # Plot mean fit as a solid line.        
  lines(dfmerge$temp[iord], mean.resp[iord])
  # Plot confidence limits as dotted lines.                
  lines(dfmerge$temp[iord], up.bound[iord], lty = "dotted")
  lines(dfmerge$temp[iord], low.bound[iord], lty = "dotted")    

Top of Page