# Chapter 5 Modeling Occupancy

## 5.1 Introduction

In this chapter, we’ll cover the basic steps for applying **occupancy models** to eBird data. In Chapter 4, we used analytical approaches that accounted for variation in detectability by including covariates known to influence detection (e.g. duration, time of day) in the model alongside the covariates that influence occurrence. In contrast, occupancy models separate the ecological process of *species occurrence* from the *observation process* of species detection. The two processes are modeled together, but estimated separately, and this framework allows us to account for detectability when estimating species occurrence. In this book, we won’t delve into the mechanics of occupancy models; however, there is a wealth of background literature on occupancy modeling, and readers wishing to learn more may want to consult the book on the topic by MacKenzie et al. (2017).

The application of these models typically requires data from repeated sampling visits (occasions) to a single site during a time frame over which there is no appearance or disappearance of birds from the site (i.e. the population is *closed*). Although eBird checklists do not naturally meet these requirements, it is possible to apply occupancy models to eBird data by extracting a subset of the data that does conform to a repeat-sampling structure. Here, we present a simple example of how to process eBird data to meet the assumptions of occupancy models. To illustrate, we apply a single-season occupancy model to estimate occupancy and detection probabilities for Wood Thrush in the month of June in BCR 27.

This chapter differs from the previous chapter on modeling encounter rate in two important ways. First, the random forest model used in Chapter 4 is an example of a machine learning approach, while the occupancy models used in this chapter take a more traditional likelihood approach. This latter class of statistical models is widely used for addressing specific questions and hypotheses, while the goal of machine learning is primarily to identify patterns and make predictions (Bzdok, Altman, and Krzywinski 2018). Second, machine learning approaches can accommodate complex non-linear effects and interactions between covariates, and are useful when modeling habitat associations that vary across large spatial and temporal scales. In contrast, occupancy models are well suited for describing linear effects and simpler interactions. In this example, we focus specifically on the mechanics of filtering and formatting the data to fit occupancy models, and less on the specifics of model selection and how to choose suitable predictor variables for estimating detection and occupancy probabilities. The predictors we do include are informed by the variable importance scores from the random forest model in Chapter 4, as well as our existing knowledge of the species being modeled.

If you worked through the previous chapters, you should have all the data necessary for this chapter. Alternatively, you can download the data package and unzip it to your project directory.

```
library(auk)
library(lubridate)
library(sf)
library(dggridR)
library(unmarked)
library(raster)
library(ebirdst)
library(MuMIn)
library(AICcmodavg)
library(fields)
library(tidyverse)
# resolve namespace conflicts
select <- dplyr::select
projection <- raster::projection
# set random number seed to ensure fully repeatable results
set.seed(1)
# set up output directory for saved results
if (!dir.exists("output")) {
  dir.create("output")
}
# ebird data
ebird <- read_csv("data/ebd_woothr_june_bcr27_zf.csv") %>% 
  mutate(year = year(observation_date),
         # occupancy modeling requires an integer response
         species_observed = as.integer(species_observed))
# modis land cover covariates
habitat <- read_csv("data/pland-elev_location-year.csv") %>% 
  mutate(year = as.integer(year))
# combine ebird and modis data
ebird_habitat <- inner_join(ebird, habitat, by = c("locality_id", "year"))
# prediction surface
pred_surface <- read_csv("data/pland-elev_prediction-surface.csv")
# latest year of land cover data
max_lc_year <- pred_surface$year[1]
r <- raster("data/prediction-surface.tif")
# load gis data for making maps
map_proj <- st_crs(102003)
ne_land <- read_sf("data/gis-data.gpkg", "ne_land") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
bcr <- read_sf("data/gis-data.gpkg", "bcr") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
ne_country_lines <- read_sf("data/gis-data.gpkg", "ne_country_lines") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
ne_state_lines <- read_sf("data/gis-data.gpkg", "ne_state_lines") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
```

## 5.2 Data preparation

First, we need to extract a subset of the eBird data that meets the assumptions of occupancy models, then we perform spatiotemporal subsampling to deal with spatial bias in the data.

Let’s start by filtering our data to include only checklists with 5 or fewer observers, to reduce sources of variation in detectability, and because there are very few checklists with more than 5 observers. In addition, we’ll subset observations to the most recent year for which we have data (2019) to fit a single-season occupancy model.

```
# filter prior to creating occupancy model data
ebird_filtered <- filter(ebird_habitat, 
                         number_observers <= 5,
                         year == max(year))
```

In some situations, you may want to further filter the data based on the results of an exploratory analysis similar to the one conducted in Section 2.5. However, we won’t further filter the observations in eBird for our occupancy example. Given the extra constraints for data suitable for occupancy modeling, it may be useful to retain more checklists at this stage.

### 5.2.1 Data formatting

Next, we need to generate detection histories for each location we define as a site. In this example, we define the month of June as the time period over which we assume that the population is closed for Wood Thrush in BCR 27, and a site is defined as a specific location (latitude/longitude) that is visited at least twice by the same observer within our defined period of closure (i.e. the month of June).

The `auk` function `filter_repeat_visits()` is designed to extract a subset of eBird data suitable for occupancy modeling. Using this function, we first filter the data to only sites that have at least 2 visits (`min_obs`). We also define the maximum number of repeat visits (`max_obs`) as 10 checklists; when a site has been visited more than 10 times, the function randomly selects 10 checklists from all the visits to that site. If we had data from more than one year, we would use `annual_closure = TRUE` to specify that populations are closed within years, but *not closed* between years (i.e. species occurrence can change between years). If we want to define periods of closure within years, we can define these in terms of the number of days using `n_days`. For example, `n_days = 10` would define contiguous sets of 10 days, starting with the earliest observation date in the data, and use these as consecutive periods of closure. Here we don’t define `n_days`, so we treat all the June 2019 checklists as a single season with a closed population. Finally, `site_vars` specifies the set of variables that defines a site. In this example, a site is defined jointly by the location and observer IDs. Any set of variables in the data can be used to define sites; for example, `site_vars = "locality_id"` would define sites by location alone, regardless of observer.

```
occ <- filter_repeat_visits(ebird_filtered, 
                            min_obs = 2, max_obs = 10,
                            annual_closure = TRUE,
                            date_var = "observation_date",
                            site_vars = c("locality_id", "observer_id"))
# entire data set
nrow(ebird_habitat)
#> [1] 48445
# reduced data set
nrow(occ)
#> [1] 3724
# number of individual sites
n_distinct(occ$site)
#> [1] 988
```

The `filter_repeat_visits()` function added three new columns to the dataset: `site` is a unique site ID (here, location and observer), `closure_id` identifies the primary period of closure (in this example, the year), and `n_observations` is the number of visits to each site. Our detection histories are now properly formatted for occupancy modeling and ready to be analyzed. Note that we’ve made a trade-off in sample size, dropping from 10,414 checklists to 3,724 checklists over 988 sites.

We’ll use our filtered observations to fit a single-season occupancy model using the `unmarked` R package. For additional details on the data format required by this package, consult the documentation for the `unmarked` function `formatWide()`. The `auk` function `format_unmarked_occu()` converts data from a vertical format, in which each row is an observation (as in the EBD), to the horizontal detection-history format required by `unmarked`, in which each row is a site.

At this stage, we need to specify which variables will be ecological process (i.e. occupancy) covariates and which will be observational process (i.e. detection) covariates. Occupancy covariates (`site_covs`) are unique at the level of the site, while detection covariates (`obs_covs`) are unique for each site and sampling occasion (i.e. checklist).
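
To see what this vertical-to-wide conversion does, here is a minimal base R sketch. The two-site, three-visit dataset is entirely made up for illustration (it is not the eBird data), and we use base `reshape()` rather than `format_unmarked_occu()` to show the idea:

```r
# toy vertical data: one row per checklist, as in the EBD
obs <- data.frame(site  = c("A", "A", "A", "B", "B"),
                  visit = c(1, 2, 3, 1, 2),
                  y     = c(0, 1, 0, 1, 1))
# reshape to one row per site; occasions with no visit become NA
det_hist <- reshape(obs, idvar = "site", timevar = "visit",
                    direction = "wide")
det_hist
```

Site B was only visited twice, so its third occasion is `NA`; `unmarked` treats these missing occasions correctly when fitting the model.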

For this example, we’ll use MODIS land cover variables as habitat covariates for modeling the occupancy probability of Wood Thrush. Based on predictor importance measures from Chapter 4, we include deciduous broadleaf forest and mixed forest as habitat types for which we expect positive relationships with occupancy, and croplands and urban, for which we expect negative relationships.

To estimate detection probability, we include five effort variables that are related to the detection process. Habitat type has been shown to affect detectability, for example, some species are harder to detect in densely forested habitats relative to more open habitat types. So we also include deciduous broadleaf forest and mixed forest as variables in the detectability model. Occupancy models allow us to tease apart the differing effects of habitat on both detection and occupancy probabilities.

```
# format for unmarked
occ_wide <- format_unmarked_occu(occ, 
                                 site_id = "site", 
                                 response = "species_observed",
                                 site_covs = c("n_observations", 
                                               "latitude", "longitude", 
                                               "pland_04_deciduous_broadleaf", 
                                               "pland_05_mixed_forest",
                                               "pland_12_cropland",
                                               "pland_13_urban"),
                                 obs_covs = c("time_observations_started", 
                                              "duration_minutes", 
                                              "effort_distance_km", 
                                              "number_observers", 
                                              "protocol_type",
                                              "pland_04_deciduous_broadleaf", 
                                              "pland_05_mixed_forest"))
```

### 5.2.2 Spatial subsampling

As discussed in Section 4.3, spatial subsampling of eBird observations reduces spatial bias. We’ll use the same hexagonal subsampling approach as in Chapter 4; however, here we’ll subsample at the level of sites rather than observations. For this example, we’ll sample one site per 5 km grid cell. Note that because we included `observer_id` in the site definition, this subsampling process will select the visits from only *one observer* to *one site* within each 5 km cell.

```
# generate hexagonal grid with ~ 5 km between cells
dggs <- dgconstruct(spacing = 5)
# get hexagonal cell id for each site
occ_wide_cell <- occ_wide %>% 
  mutate(cell = dgGEO_to_SEQNUM(dggs, longitude, latitude)$seqnum)
# sample one site per grid cell
occ_ss <- occ_wide_cell %>% 
  group_by(cell) %>% 
  sample_n(size = 1) %>% 
  ungroup() %>% 
  select(-cell)
# calculate the percent decrease in the number of sites
1 - nrow(occ_ss) / nrow(occ_wide)
```

This resulted in a 41% decrease in the number of sites.

### 5.2.3 `unmarked` object

Finally, we’ll convert this data frame of observations into an `unmarked` object so we can start fitting occupancy models.

```
occ_um <- formatWide(occ_ss, type = "unmarkedFrameOccu")
summary(occ_um)
#> unmarkedFrame Object
#>
#> 584 sites
#> Maximum number of observations per site: 10
#> Mean number of observations per site: 3.86
#> Sites with at least one detection: 64
#>
#> Tabulation of y observations:
#> 0 1 <NA>
#> 2125 128 3587
#>
#> Site-level covariates:
#> n_observations latitude longitude pland_04_deciduous_broadleaf pland_05_mixed_forest
#> Min. : 2.00 Min. :29.7 Min. :-91.4 Min. :0.000 Min. :0.000
#> 1st Qu.: 2.00 1st Qu.:31.2 1st Qu.:-85.6 1st Qu.:0.000 1st Qu.:0.000
#> Median : 2.00 Median :33.1 Median :-81.4 Median :0.000 Median :0.000
#> Mean : 3.86 Mean :33.2 Mean :-82.0 Mean :0.052 Mean :0.047
#> 3rd Qu.: 5.00 3rd Qu.:35.1 3rd Qu.:-78.3 3rd Qu.:0.000 3rd Qu.:0.000
#> Max. :10.00 Max. :37.4 Max. :-75.5 Max. :1.000 Max. :1.000
#> pland_12_cropland pland_13_urban
#> Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000
#> Median :0.000 Median :0.000
#> Mean :0.025 Mean :0.159
#> 3rd Qu.:0.000 3rd Qu.:0.203
#> Max. :1.000 Max. :1.000
#>
#> Observation-level covariates:
#> time_observations_started duration_minutes effort_distance_km number_observers protocol_type
#> Min. : 0 Min. : 1 Min. :0 Min. :1 Length:5840
#> 1st Qu.: 8 1st Qu.: 16 1st Qu.:0 1st Qu.:1 Class :character
#> Median :10 Median : 35 Median :0 Median :1 Mode :character
#> Mean :12 Mean : 53 Mean :1 Mean :1
#> 3rd Qu.:16 3rd Qu.: 70 3rd Qu.:1 3rd Qu.:1
#> Max. :24 Max. :300 Max. :5 Max. :4
#> NA's :3587 NA's :3587 NA's :3587 NA's :3587
#> pland_04_deciduous_broadleaf pland_05_mixed_forest
#> Min. :0 Min. :0
#> 1st Qu.:0 1st Qu.:0
#> Median :0 Median :0
#> Mean :0 Mean :0
#> 3rd Qu.:0 3rd Qu.:0
#> Max. :1 Max. :1
#> NA's :3587 NA's :3587
```

## 5.3 Occupancy modeling

Now that we’ve created a data frame with detection histories and covariates, we can use `unmarked` to fit a single-season occupancy model. As mentioned above, readers can learn more about occupancy models and the variety of modeling approaches in MacKenzie et al. (2017). Here, we simply fit a single-season occupancy model to our data using the `occu()` function, specifying the detection and occupancy covariates, respectively, via a double right-hand-sided formula of the form `~ detection covariates ~ occupancy covariates`.

```
# fit model
occ_model <- occu(~ time_observations_started + 
                    duration_minutes + 
                    effort_distance_km + 
                    number_observers + 
                    protocol_type +
                    pland_04_deciduous_broadleaf + 
                    pland_05_mixed_forest
                  ~ pland_04_deciduous_broadleaf + 
                    pland_05_mixed_forest + 
                    pland_12_cropland +
                    pland_13_urban, 
                  data = occ_um)
# look at the regression coefficients from the model
summary(occ_model)
#>
#> Call:
#> occu(formula = ~time_observations_started + duration_minutes +
#> effort_distance_km + number_observers + protocol_type + pland_04_deciduous_broadleaf +
#> pland_05_mixed_forest ~ pland_04_deciduous_broadleaf + pland_05_mixed_forest +
#> pland_12_cropland + pland_13_urban, data = occ_um)
#>
#> Occupancy (logit-scale):
#> Estimate SE z P(>|z|)
#> (Intercept) -2.030 0.231 -8.773 1.74e-18
#> pland_04_deciduous_broadleaf 6.742 1.958 3.442 5.77e-04
#> pland_05_mixed_forest 0.889 0.789 1.126 2.60e-01
#> pland_12_cropland -1.116 1.906 -0.586 5.58e-01
#> pland_13_urban -2.011 0.968 -2.078 3.77e-02
#>
#> Detection (logit-scale):
#> Estimate SE z P(>|z|)
#> (Intercept) -1.24620 0.56367 -2.211 0.02704
#> time_observations_started -0.00505 0.02993 -0.169 0.86592
#> duration_minutes 0.00650 0.00317 2.053 0.04005
#> effort_distance_km -0.25544 0.13388 -1.908 0.05639
#> number_observers 0.16704 0.30811 0.542 0.58773
#> protocol_typeTraveling 0.84136 0.38631 2.178 0.02941
#> pland_04_deciduous_broadleaf -0.77822 0.53981 -1.442 0.14940
#> pland_05_mixed_forest 2.87185 1.00784 2.849 0.00438
#>
#> AIC: 691
#> Number of sites: 584
#> optim convergence code: 0
#> optim iterations: 67
#> Bootstrap iterations: 0
```

### 5.3.1 Assessment

Although few goodness-of-fit tests exist for occupancy models, we demonstrate how to perform the MacKenzie and Bailey (2004) goodness-of-fit test. This approach calculates a Pearson’s chi-square fit statistic from the observed and expected frequencies of detection histories for a given model. For this example, we use the `mb.gof.test()` function in the `AICcmodavg` package, which can handle occupancy models produced by the `occu()` function in `unmarked`. Note that to produce accurate results, this process requires simulating a large number of bootstrap samples, which can take a long time to run. To keep the execution time reasonable, we set `nsim = 10` to simulate 10 samples for this example; however, when running this under normal circumstances, you should increase this to a much higher number of simulations (e.g., `nsim = 1000`).

```
occ_gof <- mb.gof.test(occ_model, nsim = 10, plot.hist = FALSE)
# hide the chisq table to give simpler output
occ_gof$chisq.table <- NULL
print(occ_gof)
```

```
#>
#> MacKenzie and Bailey goodness-of-fit for single-season occupancy model
#>
#> Chi-square statistic = 1555
#> Number of bootstrap samples = 1000
#> P-value = 0.621
#>
#> Quantiles of bootstrapped statistics:
#> 0% 25% 50% 75% 100%
#> 537 1387 1715 2254 35800
#>
#> Estimate of c-hat = 0.73
```

For this example, the probability of getting the calculated chi-square statistic under a null sampling distribution is indicated by the p-value of 0.621. Since p > 0.1, there is no reason to suspect a lack of fit. We also get an estimate of the overdispersion parameter (c-hat) for the model, which is derived by dividing the observed chi-square statistic by the mean of the statistics obtained from simulation. In this example, c-hat = 0.73; since c-hat does not exceed 1, the variance is no greater than expected and there is no evidence for overdispersion. Again, under normal circumstances we would want to run many more simulations, but based on this smaller run, the test statistics suggest no evidence of lack of fit of this model to these data. For more details on this test, see MacKenzie and Bailey (2004).
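
The c-hat calculation itself is simple arithmetic. Here is a sketch using the observed statistic from the output above together with a handful of made-up bootstrap statistics standing in for the `mb.gof.test()` simulations:

```r
# observed fit statistic for the model (from the output above)
chisq_obs <- 1555
# made-up bootstrap statistics; mb.gof.test() generates nsim of these
chisq_boot <- c(1800, 2100, 2300, 1900, 2500)
# c-hat: observed statistic divided by the mean of the simulated statistics
c_hat <- chisq_obs / mean(chisq_boot)
# p-value: proportion of simulated statistics at least as large as observed
p_val <- mean(chisq_boot >= chisq_obs)
round(c_hat, 2)  # 0.73
```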

### 5.3.2 Model selection

So far, we have a single global model that includes all of the covariates we believe will influence the occupancy and detection probabilities. In general, we suggest careful consideration when choosing the set of candidate models to compare during model selection. In this example, we’ll use the `dredge()` function to generate a set of candidate models from different combinations of the covariates in the global model. Since we know from prior experience that the effort covariates are almost always important, we’ll lock these variables in and consider a candidate set consisting of all possible combinations of the ecological covariates in the occupancy submodel.

```
# get a list of all possible terms, then subset to those we want to keep
det_terms <- getAllTerms(occ_model) %>% 
  # retain the detection submodel covariates
  discard(str_detect, pattern = "psi")
# dredge all possible combinations of the occupancy covariates
occ_dredge <- dredge(occ_model, fixed = det_terms)
# model comparison
mc <- select(occ_dredge, starts_with("psi(p"), df, AICc, delta, weight)
# shorten names for printing
names(mc) <- names(mc) %>% 
  str_extract("(?<=psi\\(pland_[0-9]{2}_)[a-z_]+") %>% 
  coalesce(names(mc))
mc %>% 
  mutate_all(~ round(., 3)) %>% 
  knitr::kable()
```

| deciduous_broadleaf | mixed_forest | cropland | urban | df | AICc | delta | weight |
|---|---|---|---|---|---|---|---|
| 7.08 | | | -2.09 | 11 | 689 | 0.000 | 0.366 |
| 6.88 | 0.935 | | -1.95 | 12 | 690 | 0.783 | 0.248 |
| 6.92 | | -1.271 | -2.15 | 12 | 690 | 1.543 | 0.169 |
| 6.74 | 0.889 | -1.116 | -2.01 | 13 | 691 | 2.459 | 0.107 |
| 7.97 | 1.262 | | | 11 | 693 | 4.335 | 0.042 |
| 8.31 | | | | 10 | 693 | 4.578 | 0.037 |
| 7.91 | 1.243 | -0.598 | | 12 | 695 | 6.301 | 0.016 |
| 8.23 | | -0.759 | | 11 | 695 | 6.463 | 0.014 |
| | | | -3.06 | 10 | 717 | 28.051 | 0.000 |
| | | -2.271 | -3.14 | 11 | 717 | 28.324 | 0.000 |
| | 0.961 | | -2.89 | 11 | 717 | 28.459 | 0.000 |
| | 0.883 | -2.098 | -2.98 | 12 | 718 | 29.001 | 0.000 |
| | 1.446 | | | 10 | 729 | 40.192 | 0.000 |
| | 1.404 | -1.446 | | 11 | 730 | 41.535 | 0.000 |
| | | | | 9 | 731 | 41.922 | 0.000 |
| | | -1.649 | | 10 | 732 | 43.041 | 0.000 |

The corrected Akaike Information Criterion (AICc) measures the likelihood of each model to have generated the data we observed, adjusted for the number of parameters in the model. Lower values indicate models with a better fit to the data, penalizing for the added number of parameters. Delta is the difference in AICc values between the given model and the model that is most likely to have generated the data (i.e. the one with the lowest AICc), and is a relative measure conditional on the candidate set of models. Finally, the AIC weight is a transformation of delta that can be interpreted as the probability that the given model is the most likely one of the candidate models to have generated the data, and is also conditional on the candidate model set. The first four columns give the model coefficients for each of the habitat covariates in the occupancy submodel; missing values indicate that that covariate was not included in the given model.
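
Delta and the Akaike weights can be recomputed directly from the AICc values. The sketch below uses the rounded AICc values of the top four models from the table, so the resulting weights only approximate the tabulated ones:

```r
# rounded AICc values for the top four candidate models
aicc <- c(689, 690, 690, 691)
# delta: difference from the best (lowest) AICc
delta <- aicc - min(aicc)
# Akaike weights: relative likelihoods normalized to sum to 1
w <- exp(-delta / 2) / sum(exp(-delta / 2))
round(w, 3)  # 0.387 0.235 0.235 0.143
```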

A quick look at the results of dredging reveals that, for the Wood Thrush example, there is not a clear single model, or even a small set of models, that is most likely to have generated our data. This is evident from the low AIC weight of the top model and the large number of models with moderate AIC weights. Given this, and the fact that all of our effects are linear and use the same family and link function, we’ll average across the models, weighted by AICc, to produce a model-averaged prediction. However, there may be scenarios in which there is a clear set of high-performing models, in which case you can use the `get.models()` function to extract just those models prior to averaging. For the sake of efficiency, we’ll only average the top models, which we define as those cumulatively comprising 95% of the weights. This has a negligible impact on the results, since the models with lower support have very small weights and therefore contribute little to the weighted-average predictions.
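
Conceptually, the model-averaged prediction is just a weighted mean of the per-model predictions. A sketch with the weights of the top four models from the comparison table and some hypothetical per-model occupancy predictions for a single location (the `preds` values are made up for illustration):

```r
# AIC weights of the retained models, renormalized to sum to 1
w <- c(0.366, 0.248, 0.169, 0.107)
w <- w / sum(w)
# hypothetical occupancy predictions from each model at one location
preds <- c(0.20, 0.22, 0.18, 0.21)
# model-averaged prediction
sum(w * preds)  # ~0.203
```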

```
# select models with the most support for model averaging
occ_dredge_95 <- get.models(occ_dredge, subset = cumsum(weight) <= 0.95)
# average models based on model weights
occ_avg <- model.avg(occ_dredge_95, fit = TRUE)
# model coefficients
t(occ_avg$coefficients)
#> full subset
#> psi(Int) -2.01638 -2.01638
#> psi(pland_04_deciduous_broadleaf) 6.99654 6.99654
#> psi(pland_13_urban) -1.95991 -2.05220
#> p(Int) -1.26469 -1.26469
#> p(duration_minutes) 0.00654 0.00654
#> p(effort_distance_km) -0.25472 -0.25472
#> p(number_observers) 0.16099 0.16099
#> p(pland_04_deciduous_broadleaf) -0.77321 -0.77321
#> p(pland_05_mixed_forest) 2.98243 2.98243
#> p(protocol_typeTraveling) 0.84620 0.84620
#> p(time_observations_started) -0.00557 -0.00557
#> psi(pland_05_mixed_forest) 0.40717 0.95693
#> psi(pland_12_cropland) -0.35917 -1.21116
```

### 5.3.3 Exploring the detection submodel

A unique feature of occupancy models is that we can investigate whether certain covariates influence detection separately from their influence on occurrence, which wasn’t possible using the machine learning approach in Chapter 4. We have already seen that the habitat covariates influence occupancy, but we can also assess how these covariates affect detection probability, conditional on a bird being present. We included deciduous broadleaf forest (`pland_04`) and mixed forest (`pland_05`) as detection covariates in the global model, and we can compare a set of models with and without these covariates to assess their importance. Here we’ll demonstrate how to manually define and compare a set of models rather than using the `dredge()` function as we did in the previous section. Let’s start by defining the candidate model set. To do so, we explicitly define a null detection model with no habitat covariates, then add in additional terms using `update.formula()`.

```
# define a null detection model
det_mod <- ~ time_observations_started + 
  duration_minutes + 
  effort_distance_km + 
  number_observers + 
  protocol_type ~ 
  pland_04_deciduous_broadleaf + 
  pland_05_mixed_forest + 
  pland_12_cropland +
  pland_13_urban
# define and fit candidate models
mods <- list(det_mod_null = det_mod,
             det_mod_dec = update.formula(det_mod, ~ . + 
                                            pland_04_deciduous_broadleaf ~ .),
             det_mod_mix = update.formula(det_mod, ~ . + 
                                            pland_05_mixed_forest ~ .),
             global = update.formula(det_mod, ~ . + 
                                       pland_04_deciduous_broadleaf + 
                                       pland_05_mixed_forest ~ .)) %>% 
  map(occu, data = occ_um)
```

Now we can perform model selection on this set to compare the different candidate models.

```
mod_sel <- fitList(fits = mods) %>% 
  modSel()
mod_sel
#> nPars AIC delta AICwt cumltvWt
#> global 13 690.60 0.00000 0.478 0.48
#> det_mod_mix 12 690.60 0.00057 0.478 0.96
#> det_mod_dec 12 696.58 5.98599 0.024 0.98
#> det_mod_null 11 696.99 6.39447 0.020 1.00
```

From these results, it’s clear that including these forest habitat covariates, especially mixed forest, leads to a marked improvement in model performance, as seen in the differences in AIC and the AIC weights. Let’s look at the coefficients from the global model for these covariates to see how they’re affecting detection and occupancy.

```
coef(occ_model) %>% 
  enframe() %>% 
  filter(str_detect(name, "pland_0"))
#> # A tibble: 4 x 2
#> name value
#> <chr> <dbl>
#> 1 psi(pland_04_deciduous_broadleaf) 6.74
#> 2 psi(pland_05_mixed_forest) 0.889
#> 3 p(pland_04_deciduous_broadleaf) -0.778
#> 4 p(pland_05_mixed_forest) 2.87
```

The `psi()` coefficients are from the occupancy submodel and the `p()` coefficients are from the detection submodel. With this in mind, we see that increasing deciduous broadleaf forest cover increases occupancy probability, as expected for a forest-associated species like Wood Thrush, but decreases detection probability, since birds are harder to see and hear in dense forest. The ability to tease apart the differing effects that covariates have on detection and occupancy is one of the strengths of occupancy modeling.
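
Because these coefficients are on the logit scale, we can back-transform them to the probability scale with `plogis()`. A sketch using the occupancy submodel estimates shown in the model summary above (other covariates held at zero for illustration):

```r
# occupancy submodel estimates (logit scale) from summary(occ_model)
psi_int <- -2.030
psi_dec <- 6.742
# predicted occupancy with 0% vs. 50% deciduous broadleaf cover,
# holding the remaining covariates at zero
plogis(psi_int)                  # ~0.12
plogis(psi_int + psi_dec * 0.5)  # ~0.79
```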

## 5.4 Prediction

In this section, we’ll estimate the distribution of Wood Thrush in BCR 27. As in Section 3.4, we’ll generate a prediction surface using the PLAND land cover covariates summarized on a regular grid of points across BCR 27. We’ll use the `predict()` function to estimate occupancy probabilities, standard errors, and confidence intervals. When `predict()` is applied to the output of `model.avg()`, it makes predictions for each of the selected models, then averages them using the AIC weights to produce the final prediction for each location.

Recall that when we predicted encounter rate, we had to include effort variables in our prediction surface. We don’t need to do that here because the occupancy submodel doesn’t depend on the effort covariates; these only appear in the detection submodel.

```
occ_pred <- predict(occ_avg, 
                    newdata = as.data.frame(pred_surface), 
                    type = "state")
# add to prediction surface
pred_occ <- bind_cols(pred_surface, 
                      occ_prob = occ_pred$fit, 
                      occ_se = occ_pred$se.fit) %>% 
  select(latitude, longitude, occ_prob, occ_se)
```

Next, we’ll convert this data frame to spatial features using `sf`, then rasterize the points using the prediction surface raster template.

```
r_pred <- pred_occ %>% 
  # convert to spatial features
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>% 
  st_transform(crs = projection(r)) %>% 
  # rasterize
  rasterize(r)
r_pred <- r_pred[[c("occ_prob", "occ_se")]]
# save the rasters
tif_dir <- "output"
if (!dir.exists(tif_dir)) {
  dir.create(tif_dir)
}
writeRaster(r_pred[["occ_prob"]], 
            filename = file.path(tif_dir, "occupancy-model_prob_woothr.tif"),
            overwrite = TRUE)
writeRaster(r_pred[["occ_se"]], 
            filename = file.path(tif_dir, "occupancy-model_se_woothr.tif"),
            overwrite = TRUE)
```

Finally, we can map these predictions!

```
# project predictions
r_pred_proj <- projectRaster(r_pred, crs = map_proj$proj4string, method = "ngb")
par(mfrow = c(2, 1))
for (nm in names(r_pred)) {
  r_plot <- r_pred_proj[[nm]]
  par(mar = c(3.5, 0.25, 0.25, 0.25))
  # set up plot area
  plot(bcr, col = NA, border = NA)
  plot(ne_land, col = "#dddddd", border = "#888888", lwd = 0.5, add = TRUE)
  # occupancy probability or standard error
  if (nm == "occ_prob") {
    title <- "Wood Thrush Occupancy Probability"
    brks <- seq(0, 1, length.out = 21)
    lbl_brks <- seq(0, 1, length.out = 11) %>% 
      round(2)
  } else {
    title <- "Wood Thrush Occupancy Uncertainty (SE)"
    mx <- ceiling(1000 * cellStats(r_plot, max)) / 1000
    brks <- seq(0, mx, length.out = 21)
    lbl_brks <- seq(0, mx, length.out = 11) %>% 
      round(2)
  }
  pal <- abundance_palette(length(brks) - 1)
  plot(r_plot, 
       col = pal, breaks = brks, 
       maxpixels = ncell(r_plot),
       legend = FALSE, add = TRUE)
  # borders
  plot(bcr, border = "#000000", col = NA, lwd = 1, add = TRUE)
  plot(ne_state_lines, col = "#ffffff", lwd = 0.75, add = TRUE)
  plot(ne_country_lines, col = "#ffffff", lwd = 1.5, add = TRUE)
  box()
  # legend
  par(new = TRUE, mar = c(0, 0, 0, 0))
  image.plot(zlim = range(brks), legend.only = TRUE, 
             breaks = brks, col = pal,
             smallplot = c(0.25, 0.75, 0.06, 0.09),
             horizontal = TRUE,
             axis.args = list(at = lbl_brks, labels = lbl_brks,
                              fg = "black", col.axis = "black",
                              cex.axis = 0.75, lwd.ticks = 0.5,
                              padj = -1.5),
             legend.args = list(text = title,
                                side = 3, col = "black",
                                cex = 1, line = 0))
}
```

### References

Bzdok, Danilo, Naomi Altman, and Martin Krzywinski. 2018. “Points of Significance: Statistics Versus Machine Learning.” *Nature Methods* 15 (April): 233–34. https://doi.org/10.1038/nmeth.4642.

MacKenzie, Darryl I., and Larissa L. Bailey. 2004. “Assessing the Fit of Site-Occupancy Models.” *Journal of Agricultural, Biological, and Environmental Statistics* 9 (3): 300–318.

MacKenzie, Darryl I., James D. Nichols, J. Andrew Royle, Kenneth H. Pollock, Larissa Bailey, and James E. Hines. 2017. *Occupancy Estimation and Modeling: Inferring Patterns and Dynamics of Species Occurrence*. Elsevier.