CADDIS Volume 4

Using R for Non-Parametric Regression

How to Fit Non-Parametric Regressions

Non-parametric regressions (see Non-Parametric Regression page, Equation 8) can be computed with a set of commands similar to those of parametric regressions (see the Parametric Regressions page in the Helpful Links Box). In this case, generalized additive models (GAM) are used to fit nonparametric curves to the data.

First, install the GAM library into R. Type at the R prompt:

install.packages("gam")

You will then need to select a mirror site from the provided list, and the package should install automatically.

Next, make sure that you have loaded the sample biological and environmental data and merged them into a single data frame called dfmerge (see the Download Scripts and Sample Data page in the Helpful Links Box).

Also make sure that you have selected the taxa for which you wish to calculate environmental limits and saved them in the vector taxa.names (see the description on the Central Tendencies page in the Helpful Links Box).

# Load GAM library
library(gam)
modlist.gam <- as.list(rep(NA, times = length(taxa.names)))
for (i in 1:length(taxa.names)) {
# Create a logical vector is true if taxon is
# present and false if taxon is absent.
resp <- dfmerge[, taxa.names[i]] > 0

# Fit the regression model, specifying two degrees of freedom
# to the curve fit.
modlist.gam <- gam(resp ~ s(temp, df = 2), data = dfmerge,
family = "binomial")

print(summary(modlist.gam))
}

To plot model results (similar to those shown in Figure 7 in the Non-Parametric Regression page) run the following script.

# Specify 3 plots per page
par(mfrow = c(1,3), pty = "s")
for (i in 1:length(taxa.names)) {

# Compute mean predicted probability of occurrence
# and standard errors about this predicted probability.
predres <- predict(modlist.gam, type= "link", se.fit = T)

# Compute approximate upper and lower 90% confidence limits
up.bound.link <- predres$fit + 1.65*predres$se.fit
low.bound.link <- predres$fit - 1.65*predres$se.fit
mean.resp.link <- predres$fit

# Convert from logit transformed values to probability.
up.bound <- exp(up.bound.link)/(1+exp(up.bound.link))
low.bound <- exp(low.bound.link)/(1+exp(low.bound.link))
mean.resp <- exp(mean.resp.link)/(1+exp(mean.resp.link))

# Sort the environmental variable.
iord <- order(dfmerge$temp)

# Define bins to summarize observational data as
# probabilities of occurrence
nbin <- 20

# Define bin boundaries so each bin has approximately the same
# number of observations.
cutp <- quantile(dfmerge$temp,
probs = seq(from = 0, to = 1, length = 20))

# Compute the midpoint of each bin
cutm <- 0.5*(cutp[-1] + cutp[-nbin])

# Assign a factor to each bin
cutf <- cut(dfmerge$temp, cutp, include.lowest = T)

# Compute the mean of the presence/absence data within each bin.
vals <- tapply(dfmerge[, taxa.names[i]] > 0, cutf, mean)

# Plot binned observational data as symbols.
plot(cutm, vals, xlab = "Temperature",
ylab = "Probability of occurrence", ylim = c(0,1),
main = taxa.names[i])
# Plot mean fit as a solid line
lines(dfmerge$temp[iord], mean.resp[iord])

# Plot confidence limits as dotted lines.
lines(dfmerge$temp[iord], up.bound[iord], lty = "dotted")
lines(dfmerge$temp[iord], low.bound[iord], lty = "dotted")
}

Top of Page