intSDM: R package to produce species distribution models in a reproducible framework

In support of our manuscript, we developed an R package to help construct point-process species distribution models (SDMs) from disparate in a simple and reproducible framework. This RMarkdown document presents an illustration of the package by creating an SDM for red-listed plant species obtained via the Vascular Plant Field Notes in Norway, as well as citizen-science data obtained from Global Biodiversity Information Facility (GBIF). The first step in exploring this document is to download the package using the following script:


if (requireNamespace(intSDM)) devtools::install_github('PhilipMostert/intSDM')

As well as any of the following packages required:


library(ggplot2)
library(sf)
library(sp)
library(dplyr) 
library(ggmap) 
library(maps)
library(PointedSDMs)
library(spatstat)
library(maptools)
library(INLA)
library(intSDM)
library(rgeos)
library(fields)
library(viridis)
library(ggpolypath)
library(RColorBrewer)

And finally by loading in some objects which will be required later on.


Projection <- CRS('+proj=utm +zone=32 +ellps=WGS84 +datum=WGS84 +units=m +no_defs')

norwayfill <- map("world", "norway", fill=TRUE, plot=FALSE, 
                  ylim=c(58,72), xlim=c(4,32))
IDs <- sapply(strsplit(norwayfill$names, ":"), function(x) x[1])
norway.poly <- map2SpatialPolygons(norwayfill, IDs = IDs, 
                                   proj4string = Projection)

Data wrangling

Before conducting any analysis, we explore and make some adjustments to the data which we will use later in our integrated model. The Vascular Plant Field Notes is a collection of observations provided by the Norwegian University of Science and Technology’s University Museum and the University of Oslo, containing records of standardized cross-lists of most vascular plants found in Norway.

In this analysis, we chose the three most abundant species available in our dataset. This was done mainly to make the example more computationally friendly, but also because a significant proportion of the plants in the dataset only had a handful of observations (a possible consequence of working with red-listed data), making inference on those species difficult.


data("PA_redlist")

The three species selected for this analysis from the Vascular Plant Field Notes (arnica montana, fraxinus excelsior and ulmus glabra) have records predominantly spread across the southern and eastern part of Norway. However the species ulmus glabra (which has the largest spread of the three species selected), has some of the records approaching the middle and middle-upper parts of Norway.

We then create a plot of our detection/non-detection data in each gridded sampling region.


ggplot() + 
  gg(norway.poly) + 
  gg(PA_redlist, aes(col = factor(individualCount))) +
  facet_grid(~species) +
  labs(x = 'Longitude', y = 'Latitude', col = 'Grid Observation') +
  scale_color_manual(labels = c('Absent', "Present"), values = c("#d11141", "#00aedb")) +
  ggtitle('Vascular Plant Field Notes') +
  theme_classic() +
  theme(legend.position="bottom",
        plot.title = element_text(hjust = 0.5))

We treated the Vascular Plant Field Notes as detection/non-detection data by creating a gridded map of sampling locations across our Norway polygon and noting which sampling locations contained which species of plants. That is, we treated these data as a Bernoulli random variable taking on unity if the species was detected in a given grid, otherwise taking on zero if it was not detected. To do this, we used a nearest neighbour classifier to assign each of the plant species’ location to its nearest grid point.

intSDMs

intSDMs essentially has two packages which assist in creating a reproducible SDM. The first of which is called structured_data, which aims to re-structure and and assign the relevant metadata to the structured datasets, to be used in the integrated model. The argument dataType may take on one of three values per dataset: “PA”,”PO” or “count”, which are used to define the underlying sampling process of the dataset. The rest of the arguments for the function (responsePA, trialsPA, responseCount, speciesName, coordinateNames) are merely used to specify the appropriate column names in the datasets for the relevant variables.


structured <- structured_data(PA_redlist, datasetType = 'PA',
                              speciesName = 'species',
                              responsePA = 'individualCount',
                              coordinateNames = colnames(PA_redlist@coords))

The package’s second function (species_model) is used as the template for the reproducible workflow: being able to produce a variety of different objects useful for the ecologist in their analysis. These objects may be controlled using the argument: return, which may take one of the following values: “boundary”, “species” ,“species plot”, “mesh”, “mesh plot”, “model”, “predictions” or “predictions map”.

One of the objects required for our model is an inla.mesh object, which we will use in the approximation of our spatial random fields.


mesh <- species_model(boundary = norway.poly,
                      return = 'mesh', limit = 5000, meshParameters = list(cutoff=0.08, max.edge=c(1, 3), offset=c(1,1)))

ggplot() +
  gg(mesh) +
      ggtitle('inla.mesh object') +
  theme_classic() +
  theme(legend.position="bottom",
        plot.title = element_text(hjust = 0.5))

To include PO data from GBIF, we specify the names of the species we want with the speciesNames arguments. These species will be selected around the area specified around either the spatial.polygons object specified with the argument boundary, or by selecting counties across Norway using the location argument. The limit argument is identical to the one used in spocc’s occ function: which defines the number of records to return.


species_plot <- species_model(speciesNames = unique(structured@dataPA$PA_redlist$species),
                               structuredData = structured,
                               boundary = norway.poly,
                               return = 'species plot', limit = 5000, mesh = mesh)
species_plot +
      ggtitle('Plot of the species data') +
  theme_classic() +
  theme(legend.position="bottom",
        plot.title = element_text(hjust = 0.5))

Finally, by specifying return = ‘predictions map’, we will run our SDM and subsequently produce a map of the log of the intensity function for the species across their studied map. We select the spatial covariates we want in the model using the worldclimCovariates argument, which may take on any of the nineteen names available from worldclim (see the following link). We furthermore scale these covariates using scale. In this example we only chose one covariate, related to the mean annual temperature. Due to the long time it requires to produce this map, inference is not made in this vignette. However the script is available below for the user to run the model themselves.


prediction_maps <- species_model(speciesNames = unique(structured@dataPA$PA_redlist$species),
                               scale = TRUE, structuredData = structured,
                               worldclimCovariates = 'Annual Mean Temperature', 
                               boundary = norway.poly,
                               return = 'predictions map', limit = 5000,
                               mesh = mesh)

prediction_maps