2021 Changelog
Data Version: 2021 (available November 2022)
Citation:
Fink, D., T. Auer, A. Johnston, M. Strimas-Mackey, S. Ligocki, O.
Robinson, W. Hochachka, L. Jaromczyk, A. Rodewald, C. Wood, I. Davies,
A. Spencer. 2022. eBird Status and Trends, Data Version: 2021; Released:
2022. Cornell Lab of Ornithology, Ithaca, New York. https://doi.org/10.2173/ebirdst.2021
eBird Checklists
- CHANGED: Checklists are included for January 1, 2007 through December
31, 2021, updated from January 1, 2006 through December 31, 2020.
- CHANGED: Observations reported as escapees under the new eBird
exotic species protocols are excluded from analysis.
Workflow and Code Changes
General
- ADDED: Prediction grid locations for the ocean are now available as
a choice to model a species as land or water.
Spatiotemporal Partitioning
- CHANGED: The adaptive partitioning algorithm (AdaSTEM) now grid
samples the training data before stixels are defined.
- CHANGED: The projection initialization of each stixel iteration is
now fully randomized; previously, it was constrained to keep boundaries
in the ocean.
- CHANGED: Stixels are now allowed to recurse one size smaller, to
approximately 90km on a side, and remain one size larger (3000km on a
side), except for resident-specific stixels where the maximum remains
1500km on a side, for computational reasons.
- CHANGED: There is now a separate AdaSTEM partitioning for residents
that uses the full year of data instead of a 28 day window. The training
data for these partitions are also grid sampled before definition. The
stixel parameters are set to have a maximum of 65,000 checklists per
stixel over the full year, after grid sampling, and a minimum of 6,500
checklists per stixel (i.e., stixels are not allowed to be subdivided if
they contain fewer than this number).
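The grid sampling referred to in this section can be illustrated with a minimal sketch. It assumes a hypothetical checklist record with projected coordinates in kilometers (`x_km`, `y_km`) and a `week` field, and it keeps one randomly chosen checklist per grid cell and week. The 3 km cell size, the weekly binning, and all names are assumptions for illustration, not the settings used in the eBird workflow.

```python
import random
from collections import defaultdict

def grid_sample(checklists, cell_km=3.0, seed=0):
    """Keep one randomly chosen checklist per (cell_x, cell_y, week) bin.

    `checklists` is a list of dicts with hypothetical keys
    "x_km", "y_km" (projected coordinates) and "week".
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for c in checklists:
        # Integer cell indices identify the grid cell containing the point.
        key = (int(c["x_km"] // cell_km), int(c["y_km"] // cell_km), c["week"])
        bins[key].append(c)
    # One random draw per occupied bin reduces spatial and temporal clustering.
    return [rng.choice(group) for group in bins.values()]

checklists = [
    {"x_km": 0.5, "y_km": 0.5, "week": 1},
    {"x_km": 1.0, "y_km": 2.0, "week": 1},   # same cell and week as the first
    {"x_km": 2.9, "y_km": 0.1, "week": 1},   # same cell and week as the first
    {"x_km": 10.0, "y_km": 10.0, "week": 1}, # different cell
]
sample = grid_sample(checklists)
```

In this example three of the four checklists share a bin, so the sample contains two checklists: one drawn from the crowded cell and the single checklist from the other cell.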
Model Ensemble
- CHANGED: Models are now run for 200 replicates (folds).
- CHANGED: The percent above threshold (PAT) cutoff has been replaced
with a data-driven maximization of the MCC-F1 curve (https://arxiv.org/abs/2006.11278), constrained between
0.05 and 0.25. The training data are grid sampled before optimizing
using the MCC-F1 curve and 25 realizations are done before taking the
median PAT value. For migrants, this is done weekly; for residents, it
is done across the whole year.
- CHANGED: The process for selecting the ensemble support cutoff or
threshold (the number of models required to show predictions) has been
updated to have the training data grid sampled first, then optimized for
a true positive rate of 99%, with the cutoff constrained between 0.5 and
0.9. For migrants, this is done weekly; for residents, across the whole
year. This process is repeated 25 times and the median threshold value is
selected.
- CHANGED: The site selection probability layer has been significantly
improved. In the binary classification model, prediction grid locations
that are >= 50% overlapped by a 1.5km buffer of checklist locations
have been removed. This resolves the previous, erroneously low values in
dense, urban areas and more accurately reflects the true probability of
site selection in these areas. This change only impacts species
estimates in places with a site selection probability value of less than
0.5%, where species estimates are masked.
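The MCC-F1 threshold selection described in this section can be sketched as follows. This is one reading of the MCC-F1 approach, assuming the best threshold is the one whose (unit-normalized MCC, F1) point lies closest to (1, 1), that the 0.05 to 0.25 constraint is applied by clamping, and that each realization is a pair of score and label lists. The function names and the candidate grid are hypothetical; the production implementation may differ.

```python
import math
from statistics import median

def mcc_f1_point(scores, labels, threshold):
    """Return (unit-normalized MCC, F1) for predictions score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = s >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return (mcc + 1) / 2, f1

def best_threshold(scores, labels, candidates, lo=0.05, hi=0.25):
    """Pick the candidate closest to the ideal point (1, 1), then clamp."""
    def distance(t):
        nmcc, f1 = mcc_f1_point(scores, labels, t)
        return math.hypot(1 - nmcc, 1 - f1)
    best = min(candidates, key=distance)
    return min(max(best, lo), hi)  # assumed form of the 0.05-0.25 constraint

def median_threshold(realizations, candidates):
    """Median of the per-realization thresholds (25 realizations in the text)."""
    return median(best_threshold(s, y, candidates) for s, y in realizations)

candidates = [0.01, 0.07, 0.5]
a = ([0.02, 0.03, 0.1, 0.2, 0.3], [0, 0, 1, 1, 1])  # separable at 0.07
b = ([0.1, 0.2, 0.6, 0.7], [0, 0, 1, 1])            # separable only at 0.5
pat = median_threshold([a, b, a], candidates)
```

Realization `b` would prefer 0.5 but is clamped to 0.25, and the median across the three realizations is 0.07.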
Base Model
- CHANGED: The grid sample method now retains all unique values of
factor variables (e.g., island).
- CHANGED: The grid sampler oversamples detections to achieve 25%
detection probability in the training dataset. Previously the grid
sampler would often overshoot the 25% target and excessively duplicate
detections. This has been corrected so that oversampling never yields
detection probabilities greater than 25% and detections are duplicated
at most 25 times.
- CHANGED: Mean spatial coverage of each stixel is now correctly
estimated as the proportion of 3 km pixels that contain checklists.
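The oversampling cap described in this section comes down to simple arithmetic. A minimal sketch, assuming every detection is duplicated the same whole number of times `k` and that "at most 25 times" means at most 25 copies in total: the detection rate after duplication is `k*d / (k*d + n)` for `d` detections and `n` non-detections, so staying at or below 25% requires `k <= 0.25*n / (0.75*d)`. The function name and the uniform-duplication assumption are illustrative only.

```python
def duplication_factor(n_det, n_nondet, target=0.25, max_dup=25):
    """Largest whole-number copy count per detection that keeps the
    detection rate at or below `target`, capped at `max_dup` copies."""
    if n_det == 0:
        return 0  # nothing to duplicate
    if n_det / (n_det + n_nondet) >= target:
        return 1  # already at or above the target; keep each detection once
    # k*d / (k*d + n) <= target  =>  k <= target*n / ((1 - target)*d)
    k = int(target * n_nondet / ((1 - target) * n_det))
    return max(1, min(k, max_dup))

def detection_rate(n_det, n_nondet, k):
    """Detection rate after each detection appears k times."""
    return k * n_det / (k * n_det + n_nondet)

k_common = duplication_factor(100, 900)  # reaches the 25% target exactly
k_rare = duplication_factor(10, 990)     # limited by the 25-copy cap
```

With 100 detections and 900 non-detections, three copies give exactly 25%. With 10 detections and 990 non-detections the cap applies, so the rate stays near 20% instead of overshooting.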
Fit and Predict
- CHANGED: Maximization of partial dependencies for prediction (e.g.,
CCI) no longer allows selection of the highest and lowest extreme
quantile values, to prevent extrapolation.
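A minimal sketch of this rule, assuming the partial dependence curve is available as two equal-length lists (the predictor values at successive quantiles and the estimated effect at each): the prediction value is the one with the largest effect among interior points only. The function name and the input layout are hypothetical.

```python
def pick_prediction_value(quantile_values, effects):
    """Return the predictor value with the highest partial-dependence
    effect, ignoring the lowest and highest quantile points."""
    if len(quantile_values) != len(effects) or len(quantile_values) < 3:
        raise ValueError("need matching lists with at least three points")
    interior = range(1, len(quantile_values) - 1)  # drop both extremes
    best = max(interior, key=lambda i: effects[i])
    return quantile_values[best]

# The largest effect sits at the highest quantile, which is excluded.
chosen = pick_prediction_value([0.0, 1.0, 2.0, 3.0, 4.0],
                               [0.1, 0.2, 0.5, 0.4, 0.9])
```

Here the curve peaks at the last point, but the selected value is 2.0, the best interior point.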
Residents
- CHANGED: Along with a resident-specific AdaSTEM partitioning,
resident models now predict all weeks of the year in a single stixel.
Previously, resident models used data from the whole year for training,
but only predicted the four weeks in a stixel, similar to the way
migrants are modeled.
Data Products
- CHANGED: The occurrence model prediction values for effort variables
are now set at 1 hour and 1 kilometer. Previously, the effort values
used for occurrence model prediction were chosen to maximize detection,
by optimizing the distance and duration effort variables to capture as
much signal as possible, up to 12 hours (6 hours in this version) and 10
kilometers. These detection-maximizing prediction values are retained
for the presence/absence estimation.
- CHANGED: The prediction value for Checklist Calibration Index (CCI)
is now maximized within each stixel using the partial dependencies.
Previously, the value was fixed at 1.85 for all species and stixels.
- CHANGED: Partial dependencies are now only generated for the first
50 folds, to reduce computational cost.
- CHANGED: Showing “year-round” on a seasonal map now requires only
0.1% overlap between breeding and non-breeding seasons. Previously, all
four seasons and an overlap of greater than 5% were required.
- REMOVED: Habitat plots and numerical summaries have been removed
from the website.
2020 Changelog
Data Version: 2020 (available June 2022)
Citation:
Fink, D., T. Auer, A. Johnston, M. Strimas-Mackey, O. Robinson, S.
Ligocki, W. Hochachka, L. Jaromczyk, C. Wood, I. Davies, M. Iliff, L.
Seitz. 2021. eBird Status and Trends, Data Version: 2020; Released:
2021. Cornell Lab of Ornithology, Ithaca, New York. https://doi.org/10.2173/ebirdst.2020
eBird Checklists
- CHANGED: Checklists are included for January 1, 2006 through December
31, 2020, updated from January 1, 2005 through April 15, 2020.
- CHANGED: All species now use all data globally and are not run for
spatial subsets. Previously, primarily Western Hemisphere species were
run only for that spatial extent.
- CHANGED: Checklists using the Stationary protocol now include tracks
and are used as long as the distance of the track for this protocol type
is less than 700 meters.
- CHANGED: The spatial location for checklists at eBird Hotspots has
been changed from the user-reported location to the centroid of all
tracks associated with the hotspot.
- FIXED: Previously, some historical checklists that lacked complete
effort information had been included. These have now been excluded.
Environmental Covariates
- CHANGED: SRTM15+ ~250m elevation and
bathymetry replaces the ~1 kilometer SRTM30+ elevation and
bathymetry product.
- CHANGED: The single year of Nighttime Lights has been replaced with
by-year assignment for 2014-2020 using the EOG Annual VNL v2
product.
- CHANGED: The Global Intertidal Change dataset has been updated to
version 1.2 which includes a new three-year time step covering 2017
through 2019.
- CHANGED: Continents now have unique identifiers in the island
categorization. Previously, all continents were treated as the same
“mainland” value.
- ADDED: Hourly weather variables have been assigned at 30 kilometer
spatial resolution using the Copernicus
ERA5 reanalysis product.
- ADDED: 90m eastness and northness (combined slope and aspect)
topographic variables from Amatulli et
al. 2020 are included in addition to 1 kilometer eastness and
northness.
- FIXED: Source data updated for 2017-2019 for MCD12Q1 which
had reported classification errors.
Workflow and Code Changes
Spatiotemporal Partitioning
- CHANGED: The adaptive partitioning algorithm (AdaSTEM) now uses an
Icosahedral Gnomonic projection that generates partitions with largely
conformal stixel boundaries across the globe.
- CHANGED: The temporal width of AdaSTEM partitions has been changed
from 30.5 days to 28 days.
Model Ensemble
- CHANGED: The percent above threshold (PAT) cutoff for 3km grid cells
to be reported as present has changed from 0.1 to 0.143, to accommodate
increased occurrence rates as a result of including hourly weather to
account for variation in detection rates.
Resident Methodology
- CHANGED: Residents now have a suite of independent settings designed
for species with strong spatiotemporal stationarity. These include the
following:
- Each stixel loads the full year of training and test data, not just
the 28-day window associated with the given stixel.
- The DAY predictor is encoded cyclically using sine and cosine
transformations to allow the model to wrap the year.
- The spatiotemporal grid sampling now seeks a maximum sample size of
65,000 checklists in a given stixel (for migrants this value is
5,000).
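The cyclic encoding of DAY mentioned in this list can be sketched as below. This is a minimal illustration, assuming DAY is a day-of-year number and a 365-day period; the actual scaling of the DAY predictor in the models is not stated here, and the function name is hypothetical.

```python
import math

def encode_day(day_of_year, period=365.0):
    """Map a day of year to a point on the unit circle so that the end
    of December and the start of January are neighbors."""
    angle = 2.0 * math.pi * day_of_year / period
    return math.sin(angle), math.cos(angle)

def encoded_distance(day_a, day_b):
    """Straight-line distance between two encoded days."""
    sa, ca = encode_day(day_a)
    sb, cb = encode_day(day_b)
    return math.hypot(sa - sb, ca - cb)

wrap = encoded_distance(365, 1)   # December 31 vs January 1
week = encoded_distance(1, 10)    # January 1 vs January 10
```

With a raw day number, December 31 and January 1 are 364 units apart. After encoding they are closer together than January 1 and January 10, which is what lets a model treat the year as continuous.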
Data Products
- CHANGED: The count model prediction values for effort variables are
now set at 1 hour and 1 kilometer. Previously, the effort variables used
for the count model prediction were the same as those used for the
occurrence model, which sought to maximize detection by optimizing the
distance and duration effort variables to capture as much signal as
possible, up to 12 hours (6 hours in this version) and 10
kilometers.
- CHANGED: Zeroes in data products that are outside of the prediction
area for species (also known as assumed zeroes) now require, on average,
across the up to 100 models in the ensemble, 0.5% of 3km grid cells
filled with at least 1 checklist for a given week to be reported as
zero. Previously, this was 0.1% of 3km grid cells. This has been
adjusted to offer a more appropriately conservative representation of
where absence can be assumed based on overall data volume.
- ADDED: Locations (3km grid cells) with less than a 0.5% mean site
selection probability are now masked out of the final data products and
reported as NA. Mean site selection probability is calculated weekly in
a species-agnostic AdaSTEM workflow that estimates the probability that
a location of a given habitat configuration will be visited in a given
region and season.
- ADDED: Spatial representations of predictive performance metrics and
other individual model-level summaries are being generated as 27km
GeoTIFFs for each week of the year. The spatialization is done by
assigning the stixel-level values to every 27km grid cell within the
stixel and then averaging across stixels to determine regional
metrics.
- FIXED: The Caspian Sea is now masked out of all data products.
- CHANGED: Raw test data that does not receive model predictions has
been removed from the calculation of predictive performance metrics.
Previously, this type of test data was used as a form of assumed absence
in the calculation of binary predictive performance metrics.
- ADDED: Predictions to 3km grid cells now include a standardization
of hourly weather within each individual model. The hourly weather
values set for prediction are based on a maximization of occurrence
estimates between the 80th and 90th percentiles.
- CHANGED: Calculation of individual model partial dependencies now
uses train out of bag data. Previously, train in bag data was used.
- ADDED: Predictor Importance and Partial Dependency products are now
included for both occurrence rate and count models. Previously, these
products were only available for the occurrence rate model.
- CHANGED: The time covariate used in the models, calculated as the
difference between the local checklist time and solar noon at the
checklist location, has been changed to use the temporal midpoint of the
checklist for the calculation. Previously, the time at the start of the
checklist had been used for this calculation.
- FIXED: The temporal centroid of individual models, used with
predictor importance and partial dependencies, has been changed to
represent the mean date of train in bag data. Previously, this was a
mean of all train, test, and all four weeks of 3km grid cell location
data.
- CHANGED: Regional habitat association charts are based on a weighted
summary of stixel-level predictor importance and partial dependence
estimates, with the weighting determined by the proportion of the region
covered by each stixel. Previously, stixel centroids were used to
determine the set of stixels contributing to a given region, with crude
approximations of the stixels as rectangles in lat-lon coordinates being
used to determine the overlap-based weighting. Now, the exact stixel
shape is used when calculating regional habitat associations, by
considering the exact set of 27km grid cells falling within each stixel,
to determine both the set of stixels used in habitat summarization and
the overlap-based weighting for a given region.
- CHANGED: Habitat and regional abundance and range statistical
summaries are now computed for all species, globally, using the Natural
Earth Data Admin 1 data for summarization.
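The spatialization of stixel-level metrics described in this list (assign each stixel's value to every 27km grid cell it contains, then average across stixels) can be sketched as follows. The representation of a stixel as a set of grid cell identifiers paired with one value is an assumption for illustration, as are all names.

```python
from collections import defaultdict

def spatialize(stixels):
    """Average stixel-level values per grid cell.

    `stixels` is a list of (cells, value) pairs, where `cells` is a set
    of hypothetical grid cell identifiers falling within the stixel.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cells, value in stixels:
        for cell in cells:
            totals[cell] += value  # assign the stixel value to the cell
            counts[cell] += 1
    # Cells covered by several stixels receive the mean of their values.
    return {cell: totals[cell] / counts[cell] for cell in totals}

stixels = [
    ({(0, 0), (0, 1)}, 0.5),
    ({(0, 1), (1, 1)}, 0.75),
]
cell_means = spatialize(stixels)
```

Cell `(0, 1)` is covered by both stixels and receives the mean, 0.625, while the other two cells keep the value of the single stixel covering them.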
Expert Review
- CHANGED: Animations are no longer being reviewed for resident
species.
2019 Changelog
Data Version: 2019 (currently available)
Citation:
Fink, D., T. Auer, A. Johnston, M. Strimas-Mackey, O. Robinson, S.
Ligocki, W. Hochachka, C. Wood, I. Davies, M. Iliff, L. Seitz. 2020.
eBird Status and Trends, Data Version: 2019; Released: 2020. Cornell Lab
of Ornithology, Ithaca, New York. https://doi.org/10.2173/ebirdst.2019
eBird Checklists
- CHANGED: Checklists are included for January 1, 2005 through April
15, 2020, updated from January 1, 2014 through December 31, 2018.
- ADDED: Include checklists from the International
Shorebird Survey (ISS) as complete for shorebird species.
- CHANGED: Checklists where “slashes” (representing two similar
species) are non-zero now have child species set to “X” (present-only,
no count info).
- FIXED: Subspecies did not always roll up to species-level
correctly.
Workflow and Code Changes
Spatiotemporal Partitioning
- CHANGED: The adaptive partitioning algorithm (AdaSTEM) now uses
projected coordinates (sinusoidal) and meters instead of unprojected
coordinates and degrees.
- CHANGED: AdaSTEM partitions are now 1500 kilometers on a side at
their largest and 187 kilometers on a side at their smallest.
- CHANGED: AdaSTEM rules now split partitions if they contain more
than 16,000 checklists or are larger than 1500 kilometers on a
side.
- CHANGED: AdaSTEM now reverts individual partitions back to the next
largest size if any of the partition children contain fewer than 500
checklists and are not mostly open water. Partitions are never allowed
to revert back to partitions that are 1500 kilometers or more on a
side.
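The split and revert rules in this section can be sketched as a recursive quadtree. This is an illustrative reading, not the AdaSTEM implementation: it assumes square partitions split into four equal children, checklists given as `(x, y)` points in kilometers, and a hypothetical `mostly_water` callback that defaults to treating every partition as land.

```python
MAX_CHECKLISTS = 16_000       # split when a partition holds more than this
MIN_CHILD_CHECKLISTS = 500    # revert when any land child holds fewer
MAX_SIDE_KM = 1500.0          # partitions larger than this must split
MIN_SIDE_KM = 187.0           # children smaller than this are not created

def split_stixel(points, x0, y0, side, mostly_water=lambda x, y, s: False):
    """Return leaf partitions as (x0, y0, side_km, n_checklists) tuples."""
    n = len(points)
    half = side / 2.0
    needs_split = side > MAX_SIDE_KM or n > MAX_CHECKLISTS
    if not needs_split or half < MIN_SIDE_KM:
        return [(x0, y0, side, n)]
    children = []
    for dx in (0, 1):
        for dy in (0, 1):
            cx, cy = x0 + dx * half, y0 + dy * half
            pts = [(x, y) for x, y in points
                   if cx <= x < cx + half and cy <= y < cy + half]
            children.append((cx, cy, pts))
    sparse = any(len(pts) < MIN_CHILD_CHECKLISTS
                 and not mostly_water(cx, cy, half)
                 for cx, cy, pts in children)
    # Revert to the parent, unless the parent is 1500 km or more on a side.
    if sparse and side < MAX_SIDE_KM:
        return [(x0, y0, side, n)]
    leaves = []
    for cx, cy, pts in children:
        leaves.extend(split_stixel(pts, cx, cy, half, mostly_water))
    return leaves

# 20,000 checklists clustered at one location inside a 3000 km square.
points = [(100.0, 100.0)] * 20_000
leaves = split_stixel(points, 0.0, 0.0, 3000.0)
```

In this example the 3000 km and 1500 km partitions must split even though their children are sparse, because reverting to 1500 km or larger is not allowed. The 750 km partition holding the cluster would split on count alone, but its children would be sparse, so it reverts and remains a single 750 km leaf.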
Model Ensemble
- ADDED: Individual models now report 0 for predictions if the
training data set contains fewer than 10 positive observations of a
species and the mean spatial coverage within the model is greater than
or equal to 5%.
- CHANGED: Range boundaries are now set weekly to have the highest
level of ensemble support, between 50% and 95% of models, while
including at least 99.5% of positive observations, changed from being
fixed at 75% of models in previous versions.
- CHANGED: Zeroes in data products that are outside of the prediction
area for species (also known as assumed zeroes) are now based on the
mean spatial coverage of checklists within those areas. For locations
where species-specific models did not report zero or non-zero
predictions, locations need to have, on average, across the up to 100
models in the ensemble, 0.1% of 3km grid cells filled with at least 1
checklist for a given week to be reported as zero. Previously, these
locations required 95% of models at a given location to have had at
least 50 complete checklists for the given week.
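The weekly range boundary rule in this section (the highest ensemble support between 50% and 95% that still includes at least 99.5% of positive observations) can be sketched as a simple search. The input layout, the 1% search step, the fallback to the lower bound when no level qualifies, and the function name are all assumptions for illustration.

```python
def select_support_cutoff(support_counts, n_models,
                          lo_pct=50, hi_pct=95, min_included=0.995):
    """Return the highest support level (in percent of models) that keeps
    at least `min_included` of the positive observations in range.

    `support_counts[i]` is the number of models, out of `n_models`, that
    predict presence at the location of positive observation i.
    """
    n = len(support_counts)
    if n == 0:
        return lo_pct
    for pct in range(hi_pct, lo_pct - 1, -1):  # search from strictest down
        # Integer comparison avoids floating-point edge cases at the cutoff.
        included = sum(k * 100 >= pct * n_models for k in support_counts)
        if included >= min_included * n:
            return pct
    return lo_pct  # assumed fallback when no level in range qualifies

# 996 positives supported by 80 of 100 models, 4 supported by only 10.
counts = [80] * 996 + [10] * 4
cutoff = select_support_cutoff(counts, n_models=100)
```

Here 99.6% of positive observations have 80% support, so the cutoff is 80. Any higher level would exclude them, and the four poorly supported observations fall within the 0.5% that may be left out.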
Seasonal Products
- ADDED: When averaging weekly estimates to represent resident
species, reviewers now select a subset of weeks. Previously, the entire
year was averaged.
Data Products
- ADDED: There are now 184 species modeled at a fully global extent.
The overall species total is now 807.
Expert Review
- ADDED: Expert reviewers now assign quality scores for the full-year,
animations, and all seasons.