3  Spatial modelling with INLA and SPDE

Having covered the need-to-know material for INLA and SPDE, next I’m going to build a spatial model using both INLA and inlabru. Comparing the two packages is useful because, although they can be expected to give the same results in most cases, their implementations differ. Knowing when the more complicated INLA interface is justified is, I think, a useful exercise.

I found https://becarioprecario.bitbucket.io/spde-gitbook/ch-lcox.html and https://becarioprecario.bitbucket.io/inla-gitbook/ch-INLA.html good for details about INLA and its model specifications. However, this initial tutorial follows the book of Paula Moraga, https://www.paulamoraga.com/book-spatial/sec-geostatisticaldataSPDE.html, because I found her writing a bit more concise and clear.

The model fit here is a simple geostatistical one: \[ Y_i \sim N(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1\cdot\text{temp}_i + \beta_2\cdot\text{prec}_i + S(x_i). \] So it’s a typical Gaussian-distributed variable with underlying latent structure. I’m not sure of the most accurate statistical way to say this; potentially a spatial random effect “modelled as a Gaussian process”. This is what Paula Moraga calls “geostatistical data”: measurements taken at discrete locations but used to describe or estimate a spatially continuous process, such as air pollution. It is very similar to a mixed model with a random effect; in this case the random effect is Gaussian distributed and spatially indexed.

It’s a simple model, so let’s get straight to downloading and processing the data. I download air pollution data for the Netherlands for the year 2023 from https://eeadmz1-downloads-webapp.azurewebsites.net/.

library(INLA)
Loading required package: Matrix
Loading required package: sp
This is INLA_24.05.01-1 built 2024-05-01 18:49:50 UTC.
 - See www.r-inla.org/contact-us for how to get help.
 - List available models/likelihoods/etc with inla.list.models()
 - Use inla.doc(<NAME>) to access documentation
library(inlabru)
Loading required package: fmesher
library(sf)
Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(rnaturalearth)
library(ggplot2)
library(viridis)
Loading required package: viridisLite
library(terra)
terra 1.7.78
library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:terra':

    buffer
The following object is masked from 'package:utils':

    timestamp
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:terra':

    intersect, union
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stringr)
library(readr)
#data_path <- file.path("D:", "data", "air_quality_data", "belgium_eea", "E1a")
data_path <- file.path("D:", "data", "air_quality_data", "netherlands_eea", "E1a")
d <- open_dataset(data_path)
d
FileSystemDataset with 199 Parquet files
12 columns
Samplingpoint: string
Pollutant: int32
Start: timestamp[ns] not null
End: timestamp[ns] not null
Value: decimal128(38, 18)
Unit: string
AggType: string
Validity: int32 not null
Verification: int32 not null
ResultTime: timestamp[ns] not null
DataCapture: decimal128(38, 18)
FkObservationLog: string
# Pollutant codes: 8 = nitrogen dioxide, 5 = PM10, 6001 = PM2.5
df <- d |>
  group_by(Samplingpoint) |>
  filter(Pollutant == 6001, Value > -1) |>  # keep PM2.5; drop large negative sentinel values
  collect()
df
# A tibble: 442,873 × 12
# Groups:   Samplingpoint [53]
   Samplingpoint   Pollutant Start               End                 Value Unit 
   <chr>               <int> <dttm>              <dttm>              <dbl> <chr>
 1 NL/SPO-NL00007…      6001 2023-01-01 01:00:00 2023-01-01 02:00:00 131.  ug.m…
 2 NL/SPO-NL00007…      6001 2023-01-01 02:00:00 2023-01-01 03:00:00  86.2 ug.m…
 3 NL/SPO-NL00007…      6001 2023-01-01 03:00:00 2023-01-01 04:00:00  30.7 ug.m…
 4 NL/SPO-NL00007…      6001 2023-01-01 04:00:00 2023-01-01 05:00:00  11   ug.m…
 5 NL/SPO-NL00007…      6001 2023-01-01 05:00:00 2023-01-01 06:00:00   9.1 ug.m…
 6 NL/SPO-NL00007…      6001 2023-01-01 06:00:00 2023-01-01 07:00:00   5.3 ug.m…
 7 NL/SPO-NL00007…      6001 2023-01-01 07:00:00 2023-01-01 08:00:00   2.8 ug.m…
 8 NL/SPO-NL00007…      6001 2023-01-01 08:00:00 2023-01-01 09:00:00   2.3 ug.m…
 9 NL/SPO-NL00007…      6001 2023-01-01 10:00:00 2023-01-01 11:00:00  -0.6 ug.m…
10 NL/SPO-NL00007…      6001 2023-01-01 11:00:00 2023-01-01 12:00:00   1.4 ug.m…
# ℹ 442,863 more rows
# ℹ 6 more variables: AggType <chr>, Validity <int>, Verification <int>,
#   ResultTime <dttm>, DataCapture <dbl>, FkObservationLog <chr>

Add locations to the sampling stations.

# Strip the country prefix so the ids match the station metadata
df <- df %>%
   #mutate(Samplingpoint = str_remove(Samplingpoint, "BE/"))
   mutate(Samplingpoint = str_remove(Samplingpoint, "NL/"))
#station_location_path <- file.path("D:", "data", "air_quality_data", "eea_stations_2023", "belgium_stations_2023.csv")
station_location_path <- file.path("D:", "data", "air_quality_data", "eea_stations_2023", "nl_stations_2023.csv")
station_locations <- read_csv(station_location_path)
Rows: 4685 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): Country, Air Quality Network, Air Quality Network Name, Air Qualit...
dbl  (8): Year, Air Pollution Level, Data Coverage, Verification, Longitude,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
station_locations <- rename(station_locations, Samplingpoint = 'Sampling Point Id')
station_locations %>% 
  distinct(Samplingpoint, Longitude, Latitude) -> unique_locations
merged_df <- df %>% 
  left_join(unique_locations %>% select(Samplingpoint, Longitude, Latitude), by = "Samplingpoint")
merged_df
# A tibble: 442,873 × 14
# Groups:   Samplingpoint [53]
   Samplingpoint   Pollutant Start               End                 Value Unit 
   <chr>               <int> <dttm>              <dttm>              <dbl> <chr>
 1 SPO-NL00007_06…      6001 2023-01-01 01:00:00 2023-01-01 02:00:00 131.  ug.m…
 2 SPO-NL00007_06…      6001 2023-01-01 02:00:00 2023-01-01 03:00:00  86.2 ug.m…
 3 SPO-NL00007_06…      6001 2023-01-01 03:00:00 2023-01-01 04:00:00  30.7 ug.m…
 4 SPO-NL00007_06…      6001 2023-01-01 04:00:00 2023-01-01 05:00:00  11   ug.m…
 5 SPO-NL00007_06…      6001 2023-01-01 05:00:00 2023-01-01 06:00:00   9.1 ug.m…
 6 SPO-NL00007_06…      6001 2023-01-01 06:00:00 2023-01-01 07:00:00   5.3 ug.m…
 7 SPO-NL00007_06…      6001 2023-01-01 07:00:00 2023-01-01 08:00:00   2.8 ug.m…
 8 SPO-NL00007_06…      6001 2023-01-01 08:00:00 2023-01-01 09:00:00   2.3 ug.m…
 9 SPO-NL00007_06…      6001 2023-01-01 10:00:00 2023-01-01 11:00:00  -0.6 ug.m…
10 SPO-NL00007_06…      6001 2023-01-01 11:00:00 2023-01-01 12:00:00   1.4 ug.m…
# ℹ 442,863 more rows
# ℹ 8 more variables: AggType <chr>, Validity <int>, Verification <int>,
#   ResultTime <dttm>, DataCapture <dbl>, FkObservationLog <chr>,
#   Longitude <dbl>, Latitude <dbl>
station_locations_sf <- st_as_sf(station_locations, coords = c("Longitude", "Latitude"))
air_sf <- st_as_sf(merged_df, coords = c("Longitude", "Latitude"))
st_crs(air_sf) <- "EPSG:4326"
air_sf
Simple feature collection with 442873 features and 12 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3.8386 ymin: 50.888 xmax: 6.9195 ymax: 53.3304
Geodetic CRS:  WGS 84
# A tibble: 442,873 × 13
# Groups:   Samplingpoint [53]
   Samplingpoint   Pollutant Start               End                 Value Unit 
 * <chr>               <int> <dttm>              <dttm>              <dbl> <chr>
 1 SPO-NL00007_06…      6001 2023-01-01 01:00:00 2023-01-01 02:00:00 131.  ug.m…
 2 SPO-NL00007_06…      6001 2023-01-01 02:00:00 2023-01-01 03:00:00  86.2 ug.m…
 3 SPO-NL00007_06…      6001 2023-01-01 03:00:00 2023-01-01 04:00:00  30.7 ug.m…
 4 SPO-NL00007_06…      6001 2023-01-01 04:00:00 2023-01-01 05:00:00  11   ug.m…
 5 SPO-NL00007_06…      6001 2023-01-01 05:00:00 2023-01-01 06:00:00   9.1 ug.m…
 6 SPO-NL00007_06…      6001 2023-01-01 06:00:00 2023-01-01 07:00:00   5.3 ug.m…
 7 SPO-NL00007_06…      6001 2023-01-01 07:00:00 2023-01-01 08:00:00   2.8 ug.m…
 8 SPO-NL00007_06…      6001 2023-01-01 08:00:00 2023-01-01 09:00:00   2.3 ug.m…
 9 SPO-NL00007_06…      6001 2023-01-01 10:00:00 2023-01-01 11:00:00  -0.6 ug.m…
10 SPO-NL00007_06…      6001 2023-01-01 11:00:00 2023-01-01 12:00:00   1.4 ug.m…
# ℹ 442,863 more rows
# ℹ 7 more variables: AggType <chr>, Validity <int>, Verification <int>,
#   ResultTime <dttm>, DataCapture <dbl>, FkObservationLog <chr>,
#   geometry <POINT [°]>

Then let’s examine the time series for a dozen of the stations.

distinct_stations <- unique(air_sf$Samplingpoint)[1:12]
plotting_stations <- air_sf[air_sf$Samplingpoint%in%distinct_stations,]
plotting_stations
Simple feature collection with 98311 features and 12 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 4.781 ymin: 50.888 xmax: 5.9869 ymax: 52.394
Geodetic CRS:  WGS 84
# A tibble: 98,311 × 13
# Groups:   Samplingpoint [12]
   Samplingpoint   Pollutant Start               End                 Value Unit 
   <chr>               <int> <dttm>              <dttm>              <dbl> <chr>
 1 SPO-NL00007_06…      6001 2023-01-01 01:00:00 2023-01-01 02:00:00 131.  ug.m…
 2 SPO-NL00007_06…      6001 2023-01-01 02:00:00 2023-01-01 03:00:00  86.2 ug.m…
 3 SPO-NL00007_06…      6001 2023-01-01 03:00:00 2023-01-01 04:00:00  30.7 ug.m…
 4 SPO-NL00007_06…      6001 2023-01-01 04:00:00 2023-01-01 05:00:00  11   ug.m…
 5 SPO-NL00007_06…      6001 2023-01-01 05:00:00 2023-01-01 06:00:00   9.1 ug.m…
 6 SPO-NL00007_06…      6001 2023-01-01 06:00:00 2023-01-01 07:00:00   5.3 ug.m…
 7 SPO-NL00007_06…      6001 2023-01-01 07:00:00 2023-01-01 08:00:00   2.8 ug.m…
 8 SPO-NL00007_06…      6001 2023-01-01 08:00:00 2023-01-01 09:00:00   2.3 ug.m…
 9 SPO-NL00007_06…      6001 2023-01-01 10:00:00 2023-01-01 11:00:00  -0.6 ug.m…
10 SPO-NL00007_06…      6001 2023-01-01 11:00:00 2023-01-01 12:00:00   1.4 ug.m…
# ℹ 98,301 more rows
# ℹ 7 more variables: AggType <chr>, Validity <int>, Verification <int>,
#   ResultTime <dttm>, DataCapture <dbl>, FkObservationLog <chr>,
#   geometry <POINT [°]>
plotting_stations %>% 
  ggplot(aes(x=Start, y=Value)) + 
    geom_line() + 
    facet_wrap(~ Samplingpoint) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Time series of monitoring stations", x = "Time, hourly", y = "Conc (μg/m³)")

# Keep a single hourly snapshot for the spatial model
air_sf <- air_sf %>% filter(Start == as.POSIXct("2023-01-01 16:00:00"))
dim(air_sf)
[1] 52 13
air_sf
Simple feature collection with 52 features and 12 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3.8386 ymin: 50.888 xmax: 6.9195 ymax: 53.3304
Geodetic CRS:  WGS 84
# A tibble: 52 × 13
# Groups:   Samplingpoint [52]
   Samplingpoint   Pollutant Start               End                 Value Unit 
 * <chr>               <int> <dttm>              <dttm>              <dbl> <chr>
 1 SPO-NL00007_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  4.4  ug.m…
 2 SPO-NL00012_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  2.9  ug.m…
 3 SPO-NL00014_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  3.8  ug.m…
 4 SPO-NL00016_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  8    ug.m…
 5 SPO-NL00017_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  3.5  ug.m…
 6 SPO-NL00131_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  8.21 ug.m…
 7 SPO-NL00136_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  1.16 ug.m…
 8 SPO-NL00230_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  2.98 ug.m…
 9 SPO-NL00240_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  2.17 ug.m…
10 SPO-NL00241_06…      6001 2023-01-01 16:00:00 2023-01-01 17:00:00  6.33 ug.m…
# ℹ 42 more rows
# ℹ 7 more variables: AggType <chr>, Validity <int>, Verification <int>,
#   ResultTime <dttm>, DataCapture <dbl>, FkObservationLog <chr>,
#   geometry <POINT [°]>

Then let’s load a map of the Netherlands. The border is from https://service.pdok.nl/kadaster/bestuurlijkegrenzen/atom/bestuurlijke_grenzen.xml.

map <- st_read(file.path("D:", "data", "maps", "netherlands_bestuurlijkegrenzen_2021", "bestuurlijkegrenzen.gpkg"))
Multiple layers are present in data source D:\data\maps\netherlands_bestuurlijkegrenzen_2021\bestuurlijkegrenzen.gpkg, reading layer `gemeenten'.
Use `st_layers' to list all layer names and their type in a data source.
Set the `layer' argument in `st_read' to read a particular layer.
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
automatically selected the first layer in a data source containing more than
one.
Reading layer `gemeenten' from data source 
  `D:\data\maps\netherlands_bestuurlijkegrenzen_2021\bestuurlijkegrenzen.gpkg' 
  using driver `GPKG'
Simple feature collection with 352 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 10425.16 ymin: 306846.2 xmax: 278026.1 ymax: 621876.3
Projected CRS: Amersfoort / RD New
map <- st_union(map)
map <- st_as_sf(map)
map <- st_transform(map, crs = st_crs(air_sf))
# raster grid covering map
grid <- terra::rast(map, nrows = 100, ncols = 100)
grid
class       : SpatRaster 
dimensions  : 100, 100, 1  (nrow, ncol, nlyr)
resolution  : 0.03919561, 0.02826056  (x, y)
extent      : 3.307938, 7.227498, 50.75037, 53.57642  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
# coordinates of all cells
xy <- terra::xyFromCell(grid, 1:ncell(grid))
# transform points to a sf object
dp <- st_as_sf(as.data.frame(xy), coords = c("x", "y"),
                 crs = st_crs(map))

# indices points within the map
indicespointswithin <- which(st_intersects(dp, map,
                                           sparse = FALSE))
length(indicespointswithin)
[1] 4944
# points within the map
dp <- st_filter(dp, map)

# plot
ggplot() + geom_sf(data = map) +
  geom_sf(data = dp)

More data processing: download temperature and precipitation covariates with the geodata package and extract them at the observation and prediction locations.

library(geodata)
# With geodata library
save_path <- file.path("D:", "data", "air_quality_data", "aux_variables")
covtemp <- worldclim_global(var = "tavg", res = 10,
                            path = save_path)
covprec <- worldclim_global(var = "prec", res = 10,
                            path = save_path)
# Extract at observed locations
air_sf$covtemp <- extract(mean(covtemp), st_coordinates(air_sf))[, 1]
air_sf$covprec <- extract(mean(covprec), st_coordinates(air_sf))[, 1]
# Extract at prediction locations
dp$covtemp <- extract(mean(covtemp), st_coordinates(dp))[, 1]
dp$covprec <- extract(mean(covprec), st_coordinates(dp))[, 1]

ggplot() + geom_sf(data = map) +
  geom_sf(data = air_sf, aes(col = Value)) +
  scale_color_viridis()

ggplot() + geom_sf(data = map) +
  geom_sf(data = air_sf, aes(col = covtemp)) +
  scale_color_viridis()

ggplot() + geom_sf(data = map) +
  geom_sf(data = air_sf, aes(col = covprec)) +
  scale_color_viridis()

ggplot() + geom_sf(data = map) +
  geom_sf(data = dp, aes(col = covtemp)) +
  scale_color_viridis()

ggplot() + geom_sf(data = map) +
  geom_sf(data = dp, aes(col = covprec)) +
  scale_color_viridis()

Looks like there are a couple of NAs in there. Let’s impute them with the mean.

dp <- dp %>% mutate(covprec=ifelse(is.na(covprec), mean(covprec, na.rm=TRUE), covprec),
                    covtemp=ifelse(is.na(covtemp), mean(covtemp, na.rm=TRUE), covtemp))
ggplot() + geom_sf(data = map) +
  geom_sf(data = dp, aes(col = covtemp)) +
  scale_color_viridis()

ggplot() + geom_sf(data = map) +
  geom_sf(data = dp, aes(col = covprec)) +
  scale_color_viridis()

I saw in the forum thread https://groups.google.com/g/r-inla-discussion-group/c/z_v2oIh2egs that it’s good to work in units of kilometres, which is also intuitive from previous spatial data work.

st_crs("EPSG:3857")$proj4string
[1] "+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs"
projMercator<-"+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0
+x_0=0 +y_0=0 +k=1 +units=km +nadgrids=@null +wktext +no_defs"
air_sf_project <- st_transform(air_sf, crs = projMercator)
dp <- st_transform(dp, crs = projMercator)
# Observed coordinates
coo <- st_coordinates(air_sf_project)

# Predicted coordinates
coop <- st_coordinates(dp)
summary(dist(coo)) # summary of distances between locations
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.5852  70.0344 109.2560 131.9875 179.8815 472.4047 

Mesh options are from https://punama.github.io/BDI_INLA/. max.edge controls the largest allowed triangle edge length, and providing it with a vector c(inner, outer) sets the max edge separately inside and outside the boundary. The purpose of this is to avoid boundary effects in the estimation of the model, where boundary values have high variance. Lindgren and Rue (2015) suggest extending the domain by some amount, which is briefly discussed in the book by Blangiardo and Cameletti (2015).

cutoff sets the minimum allowed distance between points; points closer together than this are merged into a single vertex. For areas with dense clusters of observations, this is useful for reducing redundancy. If no boundary is set, the mesh is created on the convex hull of the observations; a sketch of supplying an explicit boundary follows below.
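
As an aside, here is a minimal sketch (my own addition, untested on this data) of how an explicit boundary could be supplied instead of relying on the convex hull, reusing the map, coo and projMercator objects from earlier in this post:

# Sketch: build a mesh bounded by the country polygon rather than the
# convex hull of the observations.
map_km <- st_transform(map, crs = projMercator)
bnd <- inla.sp2segment(as(map_km, "Spatial"))  # polygon -> boundary segments
mesh_bnd <- inla.mesh.2d(boundary = bnd, loc = coo,
                         max.edge = c(200, 500), cutoff = 1)
plot(mesh_bnd)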

This was inspired by https://punama.github.io/BDI_INLA/. Thanks!

mesh0 <- inla.mesh.2d(loc = coo, max.edge=c(200, 500))
mesh1 <- inla.mesh.2d(loc = coo, max.edge=c(200, 500), cutoff=1)
mesh2 <- inla.mesh.2d(loc = coo, max.edge=c(100, 150), cutoff=1)
mesh3 <- inla.mesh.2d(loc = coo, max.edge=c(100), cutoff=1)
mesh4 <- inla.mesh.2d(loc = coo, max.edge=c(500, 750))
mesh5 <- inla.mesh.2d(loc = coo, max.edge=c(500, 750), cutoff=1)
plot(mesh0)

plot(mesh1)

plot(mesh2)

plot(mesh3)

plot(mesh4)

plot(mesh5)

# Using mesh0
mesh <- inla.mesh.2d(loc = coo, max.edge = c(200, 500), crs = st_crs(air_sf_project))
mesh$n
[1] 280


Next we construct the projection matrix A, which “projects the GRF from the observations to the vertices of the triangulated mesh” (Moraga 2024). This A matrix has a row for each observation, and a number of columns equal to the number of vertices in the mesh, shown above with mesh$n. Below, two projection matrices are generated: one for the observation locations and one for the prediction locations. We need to set up both at once, because INLA computes predictions as part of model fitting, so the prediction locations must be pre-specified, unlike the typical fit-then-predict modelling style in R.

Additionally, the index allows us to extract fitted values from the model.

The smoothness parameter \(\nu\) is set to 1; in the spatial case \(d=2\), so \(\alpha=\nu + d/2 = 2\), which is the alpha seen in the code below.

plot(mesh)
points(coo, col = "red")
axis(1)
axis(2)

spde <- inla.spde2.matern(mesh = mesh, alpha = 2, constr = TRUE)
indexs <- inla.spde.make.index("s", spde$n.spde)
lengths(indexs)
      s s.group  s.repl 
    280     280     280 
# Make the projection matrices
A <- inla.spde.make.A(mesh = mesh, loc = coo)
Ap <- inla.spde.make.A(mesh = mesh, loc = coop)
dim(A)
[1]  52 280
dim(Ap)
[1] 4944  280
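
As a quick sanity check (my own sketch, not part of Moraga’s tutorial): each row of A holds the barycentric weights of one observation with respect to the mesh triangle containing it, so each row should have at most three non-zero entries that sum to one.

# Rows of A: at most three non-zero barycentric weights per observation,
# and each row sums to one.
table(rowSums(A != 0))
summary(rowSums(A))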

Then we have to make the INLA stack. This is done because we need to combine the data for estimation with the prediction data, as well as the projection matrices. It contains the response data, the list of covariate data (here, temperature and precipitation), the projection matrices, and the indices.

# stack for estimation stk.e
stk.e <- inla.stack(tag = "est",
                    data = list(y = air_sf_project$Value),
                    A = list(1, A),
                    effects = list(data.frame(b0 = rep(1, nrow(A)),
                                              covtemp = air_sf_project$covtemp,
                                              covprec = air_sf_project$covprec),
                                   s = indexs))
#stk.e

# stack for prediction stk.p
stk.p <- inla.stack(tag = "pred",
                    data = list(y = NA),
                    A = list(1, Ap),
                    effects = list(data.frame(b0 = rep(1, nrow(Ap)),
                                              covtemp = dp$covtemp,
                                              covprec = dp$covprec),
                                   s = indexs))
#stk.p

# stk.full has stk.e and stk.p
stk.full <- inla.stack(stk.e, stk.p)

Finally, we can specify the model in INLA. All the hard work has been done above, and at least the model specification in INLA is easier. This model has mean specified by \[ \mu_i = \beta_0 + \beta_1 \times \text{temp}_i + \beta_2 \times \text{prec}_i + S(x_i), \] so there is some contribution from fixed effects as well as an unknown latent process modelled as a zero-mean Gaussian random field with a Matérn covariance function. This puts us well within INLA territory. The model equation can be seen quite clearly in the formula variable below.

formula <- y ~ 0 + b0 + covtemp + covprec + f(s, model = spde)
res <- inla(formula, family = "gaussian",
       data = inla.stack.data(stk.full),
       control.predictor = list(compute = TRUE,
                                A = inla.stack.A(stk.full)),
       control.compute = list(return.marginals.predictor = TRUE))

Notice that I am passing a few options to the inla call. Importantly, control.compute. We have a nice description of the control options at https://becarioprecario.bitbucket.io/inla-gitbook/ch-INLA.html#sec:controlops. It controls what quantities are actually computed and returned during the INLA estimation. For example, there are a few different information criteria that it can return.

Setting compute = TRUE in control.predictor computes the posterior marginals of the linear predictor and the fitted values.
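
As an illustration (my own sketch, not part of the original tutorial), the same call could also request information criteria via control.compute; the dic, waic and cpo slots are then populated on the result object:

# Sketch: refit requesting model-assessment quantities.
res_ic <- inla(formula, family = "gaussian",
               data = inla.stack.data(stk.full),
               control.predictor = list(compute = TRUE,
                                        A = inla.stack.A(stk.full)),
               control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE))
c(DIC = res_ic$dic$dic, WAIC = res_ic$waic$waic)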

The default options:

inla.set.control.compute.default()
$openmp.strategy
[1] "default"

$hyperpar
[1] TRUE

$return.marginals
[1] TRUE

$return.marginals.predictor
[1] FALSE

$dic
[1] FALSE

$mlik
[1] TRUE

$cpo
[1] FALSE

$po
[1] FALSE

$waic
[1] FALSE

$residuals
[1] FALSE

$q
[1] FALSE

$config
[1] FALSE

$likelihood.info
[1] FALSE

$smtp
NULL

$graph
[1] FALSE

$internal.opt
NULL

$save.memory
NULL

$control.gcpo
$enable
[1] FALSE

$num.level.sets
[1] -1

$size.max
[1] 32

$strategy
[1] "posterior" "prior"    

$groups
NULL

$selection
NULL

$group.selection
NULL

$friends
NULL

$weights
NULL

$verbose
[1] FALSE

$epsilon
[1] 0.005

$prior.diagonal
[1] 1e-04

$correct.hyperpar
[1] TRUE

$keep
NULL

$remove
NULL

$remove.fixed
[1] TRUE

attr(,"class")
[1] "ctrl_gcpo"        "inla_ctrl_object"

attr(,"class")
[1] "ctrl_compute"     "inla_ctrl_object"

Once the model is fit, we can inspect the fixed parameters and estimated latent field, as well as the hyperparameters for the latent field.

res$summary.fixed
                mean         sd  0.025quant     0.5quant 0.975quant
b0      -15.29769449 21.3472624 -57.7595559 -15.17245556 26.4583704
covtemp   2.17003514  2.0949918  -1.8002534   2.09513247  6.5392273
covprec  -0.01550201  0.2034683  -0.4214077  -0.01404086  0.3818577
                mode          kld
b0      -15.17140288 4.819848e-08
covtemp   2.09359434 3.231952e-07
covprec  -0.01411427 5.717319e-08
# Latent field is here
# res$summary.random$s
# res$summary.hyperpar

index <- inla.stack.index(stack = stk.full, tag = "pred")$data
pred_mean <- res$summary.fitted.values[index, "mean"]
pred_ll <- res$summary.fitted.values[index, "0.025quant"]
pred_ul <- res$summary.fitted.values[index, "0.975quant"]
grid$mean <- NA
grid$ll <- NA
grid$ul <- NA

length(pred_mean)
[1] 4944
length(indicespointswithin)
[1] 4944
grid$mean[indicespointswithin] <- pred_mean
grid$ll[indicespointswithin] <- pred_ll
grid$ul[indicespointswithin] <- pred_ul
  
summary(grid) # negative values for the lower limit
      mean             ll               ul        
 Min.   :1.338   Min.   :-4.722   Min.   : 3.733  
 1st Qu.:3.626   1st Qu.: 0.359   1st Qu.: 6.530  
 Median :4.665   Median : 1.462   Median : 7.813  
 Mean   :4.595   Mean   : 1.368   Mean   : 7.874  
 3rd Qu.:5.553   3rd Qu.: 2.405   3rd Qu.: 9.087  
 Max.   :8.316   Max.   : 6.753   Max.   :12.860  
 NA's   :5056    NA's   :5056     NA's   :5056    
library(rasterVis)
Loading required package: lattice
levelplot(grid, layout = c(1, 3),
          names.attr = c("Mean", "2.5 percentile", "97.5 percentile"))


When compute = TRUE is set in control.predictor, we can also obtain the objects below; https://www.paulamoraga.com/book-geospatial/sec-inla.html has good info about them.

res$summary.fitted.values
res$summary.linear.predictor

as well as marginals.linear.predictor and marginals.fitted.values.
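
For instance (a sketch of mine, relying on return.marginals.predictor = TRUE being set earlier), we can pull out and plot the posterior marginal of a single fitted value:

# Sketch: posterior marginal of the fitted value for the first
# estimation-stack observation.
idx_est <- inla.stack.index(stk.full, tag = "est")$data
marg_fit1 <- res$marginals.fitted.values[[idx_est[1]]]
plot(marg_fit1, type = "l", xlab = "fitted value", ylab = "density")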

From https://becarioprecario.bitbucket.io/inla-gitbook/ch-INLA.html#model-fitting-strategies we can also see the differences between the fitting strategies, set by the control.inla argument.

We can change both the Gaussian approximation strategy for the posterior full conditional distributions and the integration strategy used to integrate out the \(\theta_{-k}\) parameters to get the marginal distribution for \(\theta_k\).

The "grid" option is the most costly, compared to the central composite design ("ccd"). In the empirical Bayes option ("eb"), the posterior mode is used as the single integration point.
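
To make the integration step concrete: the marginal posterior of one hyperparameter is obtained by integrating the joint approximate posterior over the others, \[ \tilde{\pi}(\theta_k \mid y) = \int \tilde{\pi}(\theta \mid y)\, d\theta_{-k}, \] and int.strategy controls where that integrand is evaluated: a full grid, a ccd design, or just the posterior mode for eb.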

approx_strategy <- c("gaussian", "simplified.laplace", "laplace")
int_strategy <- c("ccd", "grid", "eb")
models <- c("iid", "matern")
fits <- matrix(nrow=length(approx_strategy)*length(int_strategy)*length(models), ncol=3)
fits_marginals <- vector(mode="list", length=length(approx_strategy)*length(int_strategy)*length(models))
index_f <- 0
model_names <- c()
for (m in models) {
  for (a in approx_strategy) {
    for (i in int_strategy) {
      index_f <- index_f + 1
      if (m == "matern") {
        formula_approx <- y ~ 0 + b0 + covtemp + covprec + f(s, model = spde)
      } else {
        formula_approx <- y ~ 0 + b0 + covtemp + covprec + f(s, model = "iid")
      }
      print(paste(a, ", ", i, ", ", m))
      model_names <- c(model_names, paste(a, ", ", i, ", ", m))
      fit_approx <- inla(formula_approx, family = "gaussian",
         data = inla.stack.data(stk.full),
         control.inla = list(strategy = a, int.strategy = i),
         control.compute = list(cpo = TRUE, dic = TRUE, waic = TRUE),
         control.predictor = list(compute = TRUE, A = inla.stack.A(stk.full)))
      fits[index_f, ] <- fit_approx$summary.fixed$mean
      fits_marginals[[index_f]] <- fit_approx$marginals.fixed$b0
    }
  }
}
[1] "gaussian ,  ccd ,  iid"
[1] "gaussian ,  grid ,  iid"
[1] "gaussian ,  eb ,  iid"
[1] "simplified.laplace ,  ccd ,  iid"
[1] "simplified.laplace ,  grid ,  iid"
[1] "simplified.laplace ,  eb ,  iid"
[1] "laplace ,  ccd ,  iid"
[1] "laplace ,  grid ,  iid"
[1] "laplace ,  eb ,  iid"
[1] "gaussian ,  ccd ,  matern"
[1] "gaussian ,  grid ,  matern"
[1] "gaussian ,  eb ,  matern"
[1] "simplified.laplace ,  ccd ,  matern"
[1] "simplified.laplace ,  grid ,  matern"
[1] "simplified.laplace ,  eb ,  matern"
[1] "laplace ,  ccd ,  matern"
[1] "laplace ,  grid ,  matern"
[1] "laplace ,  eb ,  matern"

Let’s compare the means of the parameters:

fits_df <- as.data.frame(fits)
names(fits_df) <- c("b0", "covtemp", "covprec")
fits_df$models <- model_names
fits_df
          b0  covtemp     covprec                               models
1  -13.85789 2.049883 -0.01973532               gaussian ,  ccd ,  iid
2  -15.29769 2.170035 -0.01550201              gaussian ,  grid ,  iid
3  -13.11329 1.992509 -0.02262044                gaussian ,  eb ,  iid
4  -13.85789 2.049883 -0.01973532     simplified.laplace ,  ccd ,  iid
5  -15.29769 2.170035 -0.01550201    simplified.laplace ,  grid ,  iid
6  -13.11329 1.992509 -0.02262044      simplified.laplace ,  eb ,  iid
7  -13.85789 2.049883 -0.01973532                laplace ,  ccd ,  iid
8  -15.29769 2.170035 -0.01550201               laplace ,  grid ,  iid
9  -13.11329 1.992509 -0.02262044                 laplace ,  eb ,  iid
10 -13.85789 2.049883 -0.01973532            gaussian ,  ccd ,  matern
11 -15.29769 2.170035 -0.01550201           gaussian ,  grid ,  matern
12 -13.11329 1.992509 -0.02262044             gaussian ,  eb ,  matern
13 -13.85789 2.049883 -0.01973532  simplified.laplace ,  ccd ,  matern
14 -15.29769 2.170035 -0.01550201 simplified.laplace ,  grid ,  matern
15 -13.11329 1.992509 -0.02262044   simplified.laplace ,  eb ,  matern
16 -13.85789 2.049883 -0.01973532             laplace ,  ccd ,  matern
17 -15.29769 2.170035 -0.01550201            laplace ,  grid ,  matern
18 -13.11329 1.992509 -0.02262044              laplace ,  eb ,  matern

Also compare the posteriors themselves:

# Combine the marginals into one dataframe
df_list <- lapply(fits_marginals, function(m) {
  data.frame(x = m[, 1], y = m[, 2])
})
new_df <- do.call(rbind, df_list)
# Add an ID column identifying which fit each row came from
row_counts <- sapply(fits_marginals, nrow)
ids <- factor(rep(seq_along(fits_marginals), times = row_counts))
new_df$id <- ids
length(ids)
[1] 774
# Keep the first two fits for a readable plot
new_df <- new_df[1:86, ]
ggplot(new_df, aes(x = x, y = y, colour = id)) + 
  geom_line() 

Wow, so with this data set there is not really much difference between the approximation or integration strategies.

Another useful feature is the set of functions for manipulating the marginal distributions, which can be used to compute posterior quantities such as the KL divergence: https://becarioprecario.bitbucket.io/inla-gitbook/ch-INLA.html#sec:marginals

Again, https://www.paulamoraga.com/book-geospatial/sec-inla.html is helpful here.
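
For example (a quick sketch I added), a few of the inla.*marginal helpers applied to the marginal of b0:

# Sketch: summaries computed directly from the posterior marginal of b0.
marg_b0 <- res$marginals.fixed$b0
inla.emarginal(function(x) x, marg_b0)    # posterior mean
inla.qmarginal(c(0.025, 0.975), marg_b0)  # 95% credible interval endpoints
inla.zmarginal(marg_b0)                   # full summary printout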

inla.mmarginal(res$marginals.fixed$b0)
[1] -14.90955
# Should be the same?
res$summary.fixed$mode[1]
[1] -15.1714

Next, let’s write that model in inlabru and check whether the estimates are the same. Later, we’ll also look at how to fit the LGCP in INLA/inlabru.

We can use the same mesh and spde object as before, and nearly the same formula (the response and the spatial term are named differently). What is especially nice about inlabru is that we don’t have to set up the stack with the projection matrices. We just model the response variable as a function of the covariates in the dataset we set up earlier and the locations of the observations; inlabru can use the sf dataset object directly.


formula <- Value ~ Intercept(1) + covtemp + covprec + f(geometry, model = spde)
# Fit the model for inlabru
fit <- bru(formula, data = air_sf_project, family = "gaussian")
# Summarize the results
summary(fit)
inlabru version: 2.12.0
INLA version: 24.05.01-1
Components:
Intercept: main = linear(1), group = exchangeable(1L), replicate = iid(1L), NULL
covtemp: main = linear(covtemp), group = exchangeable(1L), replicate = iid(1L), NULL
covprec: main = linear(covprec), group = exchangeable(1L), replicate = iid(1L), NULL
f: main = spde(geometry), group = exchangeable(1L), replicate = iid(1L), NULL
Likelihoods:
  Family: 'gaussian'
    Tag: ''
    Data class: 'sf', 'grouped_df', 'tbl_df', 'tbl', 'data.frame'
    Response class: 'numeric'
    Predictor: Value ~ .
    Used components: effects[Intercept, covtemp, covprec, f], latent[]
Time used:
    Pre = 1.72, Running = 0.754, Post = 0.346, Total = 2.82 
Fixed effects:
             mean     sd 0.025quant 0.5quant 0.975quant    mode kld
Intercept -15.298 21.347    -57.760  -15.172     26.458 -15.171   0
covtemp     2.170  2.095     -1.800    2.095      6.539   2.094   0
covprec    -0.016  0.203     -0.421   -0.014      0.382  -0.014   0

Random effects:
  Name    Model
    f SPDE2 model

Model hyperparameters:
                                          mean    sd 0.025quant 0.5quant
Precision for the Gaussian observations  0.308 0.080      0.177    0.299
Theta1 for f                             1.928 0.531      0.908    1.920
Theta2 for f                            -4.134 0.692     -5.533   -4.122
                                        0.975quant   mode
Precision for the Gaussian observations      0.489  0.283
Theta1 for f                                 2.999  1.883
Theta2 for f                                -2.808 -4.070

Deviance Information Criterion (DIC) ...............: 231.65
Deviance Information Criterion (DIC, saturated) ....: 71.69
Effective number of parameters .....................: 16.69

Watanabe-Akaike information criterion (WAIC) ...: 232.14
Effective number of parameters .................: 14.00

Marginal log-Likelihood:  -145.00 
 is computed 
Posterior summaries for the linear predictor and the fitted values are computed
(Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')

That gives nearly the same estimates as from INLA.

map_prj <- st_transform(map, crs = projMercator)
predictions1 <- predict(fit, newdata=dp, formula = ~ Intercept + covtemp + covprec + f)
predictions2 <- predict(fit, newdata=dp, formula = ~ f)
ggplot() +
  geom_sf(data = predictions1, aes(color = mean)) +
  scale_colour_gradient(low = "blue", high = "yellow")

# Check the contribution of just the spatial field
ggplot() +
  geom_sf(data = predictions2, aes(color = mean)) +
  scale_colour_gradient(low = "blue", high = "yellow")

That concludes this section, which showed how to fit a spatial model in INLA and inlabru. There are a few more things to do: simulating data to see how well INLA recovers the parameters, and fitting a temporal model in both packages. But I’ll leave these for later.