The study and conservation of the natural world rely on detailed information about the distributions, abundances, environmental associations, and population trends of species over time. For many taxa, this information is challenging to obtain at relevant geographic scales. The goal of the eBird Status and Trends project is to use data from eBird, the global community science bird monitoring program administered by The Cornell Lab of Ornithology, to generate a reliable, standardized source of biodiversity information for the world’s bird populations. To translate the eBird observations into robust data products, we use machine learning to fill spatiotemporal gaps, using local land cover descriptions derived from NASA MODIS and other remote sensing data, while controlling for biases inherent in species observations collected by community scientists.
This data set provides estimates of the full annual cycle
distributions, abundances, and environmental associations for
2,070 species for
the year 2021. For each species, distribution and abundance estimates
are available for all 52 weeks of the year across a regular grid of
locations that cover the globe at a resolution of 2.96 km x 2.96 km.
Variation in detectability associated with the search effort is
controlled by standardizing the estimates as the expected occurrence
rate and count of the species on a 1 hour, 1 km checklist by an expert
eBird observer at the optimal time of day for detecting the species.
To describe how each species is associated with features of its local environment, estimates of the relative importance of each remotely sensed variable (e.g., land cover and elevation) are available throughout the year at a monthly temporal and regional spatial resolution. Additionally, to assess estimate quality, we provide upper and lower confidence bounds for abundance estimates and regional-seasonal scale validation metrics for the underlying statistical models. For more information about the data products, see the FAQ and summaries. See Fink et al. (2019) for more information about the analysis used to generate these data.
After completing the Access Request Form, you will be provided a
Status and Trends access key, which you will need when downloading data.
To store the key so the package can access it when downloading data, use
set_ebirdst_access_key("XXXXX"), where "XXXXX" is the access key provided to you. Restart
R after setting the access key.
Throughout the package vignettes, a simplified example dataset is used consisting of Yellow-bellied Sapsucker in Michigan. This dataset is designed to be small for faster download and is accessible without a key. It can be downloaded to your computer with ebirdst_download(species = "example_data").
ebirdst_download() downloads data to a
sensible persistent directory on your computer. You can see what that
directory is with the function
ebirdst_data_dir(). You can
change the default download location to a directory of your choosing by
setting the environment variable EBIRDST_DATA_DIR, for
example by calling
usethis::edit_r_environ() and adding a
line such as EBIRDST_DATA_DIR=/custom/download/directory/.
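As a quick check that the setting took effect after restarting R, the variable can be read back with base R. This is only a sketch: it assumes the environment variable is named EBIRDST_DATA_DIR, and it does not call any ebirdst functions.

```r
# Sys.getenv() returns an empty string when the variable is not set
data_dir <- Sys.getenv("EBIRDST_DATA_DIR")

if (nzchar(data_dir)) {
  message("Custom download directory: ", data_dir)
  # Flag a configured directory that has not been created yet
  if (!dir.exists(data_dir)) {
    message("Note: this directory does not exist yet")
  }
} else {
  message("EBIRDST_DATA_DIR is not set; the package default will be used")
}
```

If the variable is empty, the line in the .Renviron file was probably not saved or R was not restarted.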
To download the data package for a species, provide the species code,
English common name, or scientific name to
ebirdst_download(). Note that data for all species other than the example
dataset require an access key. IMPORTANT: after downloading a
data package, do not change the file structure or rename files. Doing so
will prevent you from being able to work with the data using this package.
The full list of species with data available for download is
in the ebirdst_runs data frame, which is included in this package:

    glimpse(ebirdst_runs)
    #> Rows: 2,070
    #> Columns: 23
    #> $ species_code                         <chr> "grerhe1", "higtin1", "gretin1", …
    #> $ scientific_name                      <chr> "Rhea americana", "Nothocercus bo…
    #> $ common_name                          <chr> "Greater Rhea", "Highland Tinamou…
    #> $ resident                             <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
    #> $ breeding_quality                     <dbl> 2, 1, 2, 3, 3, 2, 2, 3, 2, 2, 2, …
    #> $ breeding_range_modeled               <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
    #> $ breeding_start                       <date> NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ breeding_end                         <date> NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ nonbreeding_quality                  <dbl> 2, 1, 2, 3, 3, 2, 2, 3, 2, 2, 2, …
    #> $ nonbreeding_range_modeled            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
    #> $ nonbreeding_start                    <date> NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ nonbreeding_end                      <date> NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ postbreeding_migration_quality       <dbl> 2, 1, 2, 3, 3, 2, 2, 3, 2, 2, 2, …
    #> $ postbreeding_migration_range_modeled <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
    #> $ postbreeding_migration_start         <date> NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ postbreeding_migration_end           <date> NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ prebreeding_migration_quality        <dbl> 2, 1, 2, 3, 3, 2, 2, 3, 2, 2, 2, …
    #> $ prebreeding_migration_range_modeled  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
    #> $ prebreeding_migration_start          <date> NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ prebreeding_migration_end            <date> NA, NA, NA, NA, NA, NA, NA, NA, …
    #> $ resident_quality                     <dbl> 2, 1, 2, 3, 3, 2, 2, 3, 2, 2, 2, …
    #> $ resident_start                       <date> 2021-01-04, 2021-01-04, 2021-01-…
    #> $ resident_end                         <date> 2021-12-28, 2021-12-28, 2021-12-…

If you’re working in RStudio, you can use
View() to interactively explore this data frame.
You can also consult the Status and Trends
landing page to see the full list of species.
All species go through a process of expert human review prior to
being released. The
ebirdst_runs data frame also contains
information from this review process. Reviewers assess each of the four
seasons: breeding, non-breeding, pre-breeding migration, and
post-breeding migration. Resident (i.e., non-migratory) species are
identified by having
TRUE in the resident column of
ebirdst_runs, and these species are assessed
across the whole year rather than seasonally. The ebirdst_runs data frame
contains two important pieces of information for each season: a
quality rating and seasonal dates.
The seasonal dates define the weeks that fall within each season. Breeding and non-breeding season dates are defined for each species as the weeks during those seasons when the species’ population does not move. For this reason, these seasons are also described as stationary periods. Migration periods are defined as the periods of movement between the stationary non-breeding and breeding seasons. Note that for many species these migratory periods include not only movement from breeding grounds to non-breeding grounds, but also post-breeding dispersal, molt migration, and other movements.
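To make the role of the seasonal dates concrete, here is a minimal base R sketch that flags which weeks fall inside a season. The helper in_season() is hypothetical (it is not part of ebirdst), the boundaries are assumed to be inclusive, and the dates are the Yellow-bellied Sapsucker breeding and non-breeding dates shown in the range output later in this vignette. Note that the non-breeding season wraps around the end of the year, so its start date is later than its end date.

```r
# Hypothetical helper: is each week inside a season defined by start/end dates?
# Assumes inclusive boundaries and a single start and end date per season.
in_season <- function(week, start, end) {
  if (start <= end) {
    week >= start & week <= end
  } else {
    # the season wraps around the end of the year (e.g. November to March)
    week >= start | week <= end
  }
}

# a handful of weekly dates taken from the 2021 weekly layers
weeks <- as.Date(c("2021-01-04", "2021-03-08", "2021-03-15",
                   "2021-06-07", "2021-11-23", "2021-12-28"))

breeding <- in_season(weeks, as.Date("2021-05-24"), as.Date("2021-08-17"))
nonbreeding <- in_season(weeks, as.Date("2021-11-23"), as.Date("2021-03-08"))

data.frame(week = weeks, breeding = breeding, nonbreeding = nonbreeding)
```

Weeks that are flagged for neither stationary season, such as 2021-03-15 here, fall within one of the migration periods.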
Reviewers also examine the model estimates for each season to assess the amount of extrapolation or omission present in the model, and assign an associated quality rating ranging from 0 (lowest quality) to 3 (highest quality). Extrapolation refers to cases where the model predicts occurrence where the species is known to be absent, while omission refers to the model failing to predict occurrence where a species is known to be present.
A rating of 0 implies this season failed review and model results should not be used at all for this period. Ratings of 1-3 correspond to a gradient of more to less extrapolation and/or omission, and we often use a traffic light analogy when referring to them, with 1 as red, 2 as yellow, and 3 as green.
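As an illustration of how the ratings might be used to screen species before an analysis, the sketch below works on a small made-up data frame rather than the real ebirdst_runs. The species codes are placeholders, only the breeding_quality column name is borrowed from ebirdst_runs, and the color labels simply assume that 1, 2, and 3 map to red, yellow, and green.

```r
# Toy stand-in for ebirdst_runs with placeholder species codes
runs <- data.frame(
  species_code = c("sp_a", "sp_b", "sp_c", "sp_d"),
  breeding_quality = c(0, 1, 2, 3),
  stringsAsFactors = FALSE
)

# seasons rated 0 failed review and should be dropped entirely
usable <- runs[runs$breeding_quality > 0, ]

# keep only the highest quality estimates for a stricter analysis
high_quality <- runs[runs$breeding_quality == 3, ]

# attach a traffic light label to each rating (0 = failed review)
labels <- c("fail", "red", "yellow", "green")
runs$breeding_label <- labels[runs$breeding_quality + 1]
runs
```

The same pattern applies to the other seasonal quality columns, or to resident_quality for resident species.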
eBird Status and Trends data packages are identified by the 6-letter
eBird species code (e.g.
yebsap for Yellow-bellied
Sapsucker) and the Status and Trends estimation year (2021 for this
version of the R package). They are stored within sub-directories that
correspond to these variables, and you can get the path to this
directory with get_species_path():

    # for non-example data use the species code or name instead of "example_data"
    path <- get_species_path("example_data")
Within this data package directory, the following files and directories will be present:
weekly/: a directory containing weekly estimates of occurrence, count, relative abundance, and percent of population on a regular grid in GeoTIFF format at three resolutions. See below for more details.
seasonal/: a directory containing seasonal estimates of occurrence, count, relative abundance, and percent of population on a regular grid in GeoTIFF format at three resolutions. These are derived from the corresponding weekly raster data. Dates defining the boundary of each season are set on a species-specific basis by an expert reviewer familiar with the species. These dates are available in the ebirdst_runs data frame. Only seasons that passed the expert review process are included. See below for more details.
ranges/: a directory containing GeoPackages storing range boundary polygons. See below for more details.
stixel_summary.db: an SQLite database containing information on habitat associations, including predictor importance (PI) and partial dependence (PD) estimates.
predictions.db: model predictions for a test dataset held out of the model fitting. These predictions are used for calculating predictive performance metrics (PPMs).
config.json: run-specific parameters, mostly for internal use, but also containing useful parameters for mapping the abundance data.
This vignette will cover the raster and range data.
The core raster data products are the weekly estimates of occurrence,
count, relative abundance, and percent of population. These are all
stored in the widely used GeoTIFF raster format, and we refer to them as
“weekly cubes” (e.g. the “weekly abundance cube”). All cubes have 52
weeks and cover the entire globe, even for species with ranges only
covering a small region. They come with areas of predicted and assumed
zeroes, such that any cells that are
NA represent areas
where we didn’t produce model estimates.
All estimates are the median expected value for a 1 km, 1 hour eBird Traveling Count by an expert eBird observer at the optimal time of day and for optimal weather conditions to observe the given species.
occurrence: the expected probability of encountering a species.
count: the expected count of a species, conditional on its occurrence at the given location.
abundance: the expected relative abundance of a species, computed as the product of the probability of occurrence and the count conditional on occurrence. In addition to the median relative abundance, upper and lower confidence intervals (CIs) are provided, defined at the 10th and 90th quantile of relative abundance, respectively.
percent-population: the percent of the total relative abundance within each cell. This is a derived product calculated by dividing each cell value in the relative abundance raster by the sum of all cell values.
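The relationship between these products can be sketched with a few made-up cell values in base R. The numbers below are invented for illustration, not taken from any data package. The last step expresses each cell as a proportion of the total, which would be multiplied by 100 to give a percentage.

```r
# Made-up values for five grid cells; NA marks a cell with no model estimate
occurrence <- c(0.00, 0.10, 0.50, 0.80, NA)  # probability of encountering the species
count      <- c(1.0, 2.0, 1.5, 4.0, NA)      # expected count, given occurrence

# relative abundance is the product of occurrence and conditional count
abundance <- occurrence * count

# proportion of the total relative abundance in each cell,
# ignoring cells without estimates when computing the total
prop_population <- abundance / sum(abundance, na.rm = TRUE)

data.frame(occurrence, count, abundance, prop_population)
```

Cells with zero occurrence contribute nothing to the total, and cells without estimates remain NA throughout.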
All predictions are made on a standard 2.96 km x 2.96 km global grid; for convenience, lower resolution GeoTIFFs are also provided, which are typically much faster to work with. Note that, to keep file sizes small, the example dataset only contains low resolution data. The three resolutions are:
High resolution (hr): the native 2.96 km resolution data
Medium resolution (mr): the hr data aggregated by a factor of 3 in each direction resulting in a resolution of 8.89 km
Low resolution (lr): the hr data aggregated by a factor of 9 in each direction resulting in a resolution of 26.7 km
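The three cell sizes are related by simple aggregation arithmetic. As a quick check, the sketch below starts from the 26665.26 m cell width printed for the low resolution example raster in this vignette and works back to the native grid. The variable names are illustrative only.

```r
# cell width (in meters) of the factor-9 grid, as printed for the example raster
lr_m <- 26665.26

# the native grid is 9 times finer; the factor-3 grid is 3 times the native size
hr_m <- lr_m / 9
mr_m <- hr_m * 3

# cell widths in km for the native, factor-3, and factor-9 grids
res_km <- c(native = hr_m, factor_3 = mr_m, factor_9 = lr_m) / 1000
print(round(res_km, 2))

# number of native cells combined into each aggregated cell
cells_per_block <- c(factor_3 = 3^2, factor_9 = 9^2)
print(cells_per_block)
```

Because aggregation acts in both directions, the coarsest grid has 81 times fewer cells than the native grid, which is why it is so much faster to work with.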
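The weekly cubes use the following naming convention:
weekly/<species_code>_<product>_<metric>_<resolution>_<year>.tif, where
metric is typically median, except
for the relative abundance CIs, which use lower or
upper. The function
load_raster() is used to
load these data into R and takes arguments for
product, metric, and resolution. For example,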
    # weekly, low res, median occurrence
    occ_lr <- load_raster(path, product = "occurrence", resolution = "lr")
    occ_lr
    #> class      : RasterStack
    #> dimensions : 626, 1502, 940252, 52  (nrow, ncol, ncell, nlayers)
    #> resolution : 26665.26, 26665.28  (x, y)
    #> extent     : -20015109, 20036111, -6684911, 10007555  (xmin, xmax, ymin, ymax)
    #> crs        : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs
    #> names      : w2021.01.04, w2021.01.11, w2021.01.18, w2021.01.25, w2021.02.01, w2021.02.08, w2021.02.15, w2021.02.22, w2021.03.01, w2021.03.08, w2021.03.15, w2021.03.22, w2021.03.29, w2021.04.05, w2021.04.12, ...
    #> min values : 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    #> max values : 0.012316062, 0.008908954, 0.005192222, 0.005261947, 0.008574143, 0.014774577, 0.018329952, 0.013035353, 0.007518290, 0.005063397, 0.004252218, 0.005209569, 0.010943777, 0.050108384, 0.208685473, ...

    # use parse_raster_dates() to get the date associated with each raster layer
    parse_raster_dates(occ_lr)
    #>  [1] "2021-01-04" "2021-01-11" "2021-01-18" "2021-01-25" "2021-02-01"
    #>  [6] "2021-02-08" "2021-02-15" "2021-02-22" "2021-03-01" "2021-03-08"
    #> [11] "2021-03-15" "2021-03-22" "2021-03-29" "2021-04-05" "2021-04-12"
    #> [16] "2021-04-19" "2021-04-26" "2021-05-03" "2021-05-10" "2021-05-17"
    #> [21] "2021-05-24" "2021-05-31" "2021-06-07" "2021-06-14" "2021-06-21"
    #> [26] "2021-06-28" "2021-07-06" "2021-07-13" "2021-07-20" "2021-07-27"
    #> [31] "2021-08-03" "2021-08-10" "2021-08-17" "2021-08-24" "2021-08-31"
    #> [36] "2021-09-07" "2021-09-14" "2021-09-21" "2021-09-28" "2021-10-05"
    #> [41] "2021-10-12" "2021-10-19" "2021-10-26" "2021-11-02" "2021-11-09"
    #> [46] "2021-11-16" "2021-11-23" "2021-11-30" "2021-12-07" "2021-12-14"
    #> [51] "2021-12-21" "2021-12-28"

    # weekly, low res, abundance confidence intervals
    abd_lower <- load_raster(path, product = "abundance", metric = "lower",
                             resolution = "lr")
    abd_upper <- load_raster(path, product = "abundance", metric = "upper",
                             resolution = "lr")
The GeoTIFFs use the same Sinusoidal projection as NASA MODIS data. This projection is ideal for analysis, as it is an equal area projection, but is not ideal for mapping since it introduces significant distortion.
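To see where the coordinate values in these rasters come from, the forward equations of the spherical Sinusoidal projection can be written out in a few lines of base R. This is a sketch for intuition only: lonlat_to_sinu() is a hypothetical helper, not an ebirdst function, and real reprojection is better left to a spatial package. The sphere radius is taken from the +R value in the CRS string of the example rasters.

```r
# radius of the sphere, from the +R parameter of the raster CRS
R_SPHERE <- 6371007.181

# Forward Sinusoidal projection with central meridian at 0 degrees:
#   x = R * lambda * cos(phi),  y = R * phi   (angles in radians)
lonlat_to_sinu <- function(lon, lat) {
  lambda <- lon * pi / 180
  phi <- lat * pi / 180
  cbind(x = R_SPHERE * lambda * cos(phi), y = R_SPHERE * phi)
}

# the western edge of the equator and the north pole
corners <- lonlat_to_sinu(c(-180, 0), c(0, 90))
print(round(corners))

# the x-distance spanned by one degree of longitude shrinks with latitude
width_equator <- diff(lonlat_to_sinu(c(0, 1), c(0, 0))[, "x"])
width_60n <- diff(lonlat_to_sinu(c(0, 1), c(60, 60))[, "x"])
print(width_60n / width_equator)
```

The projected corner values line up with the xmin and ymax of the raster extent, and the shrinking width of a degree of longitude away from the equator is the source of the shape distortion that makes this projection awkward for maps.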
The seasonal raster estimates are provided for the same set of
products and at the same three resolutions as the weekly estimates.
They’re derived from the weekly data by taking the cell-wise mean or max
across the weeks within each season. The seasonal boundary dates are
defined through a process of expert review of each species, and are
available in the data frame
ebirdst_runs. Each season is
also given a quality score from 0 (fail) to 3 (high quality), and
seasons with a score of 0 are not provided.
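The cell-wise summary across weeks can be illustrated with a toy matrix in base R, where rows are grid cells and columns are weeks. The values and the choice of weeks are invented, and how the actual processing treats missing values is not covered here.

```r
# Toy weekly values: 3 cells by 4 weeks; the last cell has no estimates
weekly <- matrix(
  c(0.0, 0.2, 0.4, 0.1,
    1.0, 0.0, 0.5, 0.3,
    NA,  NA,  NA,  NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell_", 1:3), paste0("week_", 1:4))
)

# suppose the season spans the second and third weeks
season_weeks <- c("week_2", "week_3")
season_values <- weekly[, season_weeks, drop = FALSE]

# summarize each cell across the weeks in the season
seasonal_mean <- rowMeans(season_values)
seasonal_max <- apply(season_values, 1, max)

cbind(seasonal_mean, seasonal_max)
```

The mean reflects typical abundance during the season, while the max highlights cells that are important in at least one week.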
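The seasonal GeoTIFFs use the following naming convention:
seasonal/<species_code>_<product>_seasonal_<metric>_<resolution>_<year>.tif, where
metric is either mean or
max. The function
load_raster(period = "seasonal") is used to load these data
into R and takes arguments for
product, metric, and resolution. For example,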
    # seasonal, low res, mean relative abundance
    abd_seasonal_mean <- load_raster(path, product = "abundance",
                                     period = "seasonal", metric = "mean",
                                     resolution = "lr")
    # season that each layer corresponds to
    names(abd_seasonal_mean)
    #> [1] "breeding"               "nonbreeding"            "prebreeding_migration"
    #> [4] "postbreeding_migration"
    # just the breeding season layer
    abd_seasonal_mean[["breeding"]]
    #> class      : RasterLayer
    #> band       : 1  (of  4  bands)
    #> dimensions : 626, 1502, 940252  (nrow, ncol, ncell)
    #> resolution : 26665.26, 26665.28  (x, y)
    #> extent     : -20015109, 20036111, -6684911, 10007555  (xmin, xmax, ymin, ymax)
    #> crs        : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs
    #> source     : yebsap-example_abundance_seasonal_mean_lr_2021.tif
    #> names      : breeding
    #> values     : 0, 0.8364842  (min, max)

    # seasonal, low res, max occurrence
    occ_seasonal_max <- load_raster(path, product = "occurrence",
                                    period = "seasonal", metric = "max",
                                    resolution = "lr")
Finally, as a convenience, the data products include year-round
rasters summarizing the mean or max across all weeks that fall within a
season that passed the expert review process. These can be accessed
similarly to the seasonal products, just with
period = "full-year" instead. For example, these layers can
be used in conservation planning to assess the most important sites
across the full range and full annual cycle of a species.
    # full year, low res, maximum relative abundance
    abd_fy_max <- load_raster(path, product = "abundance",
                              period = "full-year", metric = "max",
                              resolution = "lr")
Seasonal range polygons are defined as the boundaries of non-zero
seasonal relative abundance estimates, which are then (optionally)
smoothed to produce more aesthetically pleasing polygons using the
smoothr package. They are provided in the widely used
GeoPackage format. In the file names,
raw refers to the polygons derived directly from
the raster data and
smooth refers to the smoothed polygons.
Note that only low and medium resolution ranges are provided. These
range polygons can be loaded with load_ranges(). For example,
    # seasonal, low res, smoothed ranges
    ranges <- load_ranges(path, resolution = "lr")
    ranges
    #> Simple feature collection with 4 features and 8 fields
    #> Geometry type: MULTIPOLYGON
    #> Dimension:     XY
    #> Bounding box:  xmin: -146.8346 ymin: 8.264292 xmax: -57.58661 ymax: 64.5614
    #> Geodetic CRS:  WGS 84
    #> # A tibble: 4 × 9
    #>   species_code scientific_n…¹ commo…² predi…³ type  season start_date end_date
    #>   <chr>        <chr>          <chr>     <int> <chr> <chr>  <date>     <date>
    #> 1 yebsap       Sphyrapicus v… Yellow…    2021 range breed… 2021-05-24 2021-08-17
    #> 2 yebsap       Sphyrapicus v… Yellow…    2021 range nonbr… 2021-11-23 2021-03-08
    #> 3 yebsap       Sphyrapicus v… Yellow…    2021 range postb… 2021-08-24 2021-11-16
    #> 4 yebsap       Sphyrapicus v… Yellow…    2021 range prebr… 2021-03-15 2021-05-17
    #> # … with 1 more variable: geom <MULTIPOLYGON [°]>, and abbreviated variable
    #> #   names ¹scientific_name, ²common_name, ³prediction_year

    # subset to just the breeding season range using dplyr
    range_breeding <- filter(ranges, season == "breeding")
The two SQLite databases contained within the species data package
provide tabular data with information about modeled relationships
between observations and the ecological covariates, as well as data that
can be used to assess predictive performance. Note that these SQLite
databases are quite large (many GBs in size) and are therefore not
downloaded by default. To access these tabular data, you must set
tifs_only = FALSE when calling ebirdst_download().
Fink, D., T. Auer, A. Johnston, V. Ruiz‐Gutierrez, W.M. Hochachka, S. Kelling. 2019. Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications, 00(00):e02056. https://doi.org/10.1002/eap.2056