National Park Visitation Forecasting

Introduction

I was reading an article about forecasting visitation to National Parks in which it was claimed that the Park Service uses a simple rolling 5 year trend for visitation forecasting 1. This surprised me. Surely they would use a more sophisticated forecasting method. Soon after, I discovered a dataset of reservations at national parks from recreation.gov which I suspected could be used to forecast visitation. This proved to be true using a “pickup” method, described below. The results are promising and could be strengthened further with more data of higher quality, which almost certainly exists.

Data Sources

Code
import datetime
import sqlite3
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from statsmodels.tsa.ar_model import AutoReg, ar_select_order

##################
# Setup parameters
##################
LEADTIME = datetime.timedelta(180)
PARK_CODE = "JOTR"
PARK_NAME = "Joshua Tree National Park"

########################
# Import visitation data
########################
visits = pd.read_csv(
    "../data/nps_visits.csv").astype({"date": "datetime64[s]"})
visits_jt = visits.loc[visits["park_code"] == PARK_CODE].set_index("date")[
    "visits"]

#########################
# Import reservation data
#########################
conn = sqlite3.connect("../data/reservations/reservations.db")

SQL_QUERY = (
    """
      SELECT parentlocation, park, facilityid, startdate, enddate, orderdate,
        numberofpeople
      FROM reservations
      WHERE parentlocation = ?
      AND orderdate NOT LIKE '0%'
      AND startdate NOT LIKE '0%'
      AND enddate NOT LIKE '0%';
    """
)
df = pd.read_sql_query(SQL_QUERY, conn, params=(PARK_NAME,))
conn.close()

############
# Clean data
############
# For numberofpeople, replace nulls with mode
mode_value = df["numberofpeople"].mode().iloc[0]
df["numberofpeople"] = df["numberofpeople"].fillna(mode_value).astype(int)

# Convert strings to datetime
date_cols = ["startdate", "enddate", "orderdate"]
df[date_cols] = df[date_cols].apply(pd.to_datetime)

# Drop reservations where start date is before order date
df = df.loc[~(df["startdate"] < df["orderdate"])]
# Drop reservations where start date is after end date
df = df.loc[~(df["startdate"] > df["enddate"])]

# Cast facilityid as category
df = df.astype({"facilityid": "category"})

Data for visitation was scraped from the NPS site. It includes monthly recreation visitation counts for each national park unit since 1979 or the park's opening, whichever is later.

The reservation data was downloaded from recreation.gov2. There are many fields in the reservation datasets, but for the purposes of this report only the following were used: park unit, reservation start date, end date, order date, and number of people per reservation.

Scripts for the data download, cleaning, and database creation can be found in the GitHub repo.

Model Explanation

Warning

Lots of equations in this section.

There isn’t much information about pickup forecasting online. This is the paper I relied on most for the details. The pickup model hinges on using reservations to inform the forecast of usage: the idea is that the past relationship between reservations and usage, together with current reservations, can be used to predict future usage.

As an example, suppose that on July 3, 2025 a hotel had 25 rooms booked for the night of July 17, 2025 (14 days later) and is looking to forecast the number of rooms ultimately booked for that night. The previous year, on July 3, 2024, there were 23 rooms booked for the night of July 17, 2024, and 100 rooms were ultimately booked. Here, the “pickup”3 is \[ \frac{100 \mbox{ rooms}}{23 \mbox{ rooms}} \approx 4.35 \]

which is the ratio of rooms that are “picked up”. One could then estimate the ultimate bookings for July 17, 2025 by multiplying this pickup by the number of reservations on hand 14 days prior, i.e., the 25 rooms booked as of July 3, 2025: \[ \hat{f} = 4.35 \cdot 25 \mbox{ rooms} \approx 109 \mbox{ rooms} \]
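For concreteness, here is that arithmetic as a quick Python sketch (the numbers are the hypothetical hotel figures above):

Code
# Hypothetical hotel example: bookings observed 14 days out last year,
# the final count last year, and this year's bookings 14 days out.
bookings_14d_last_year = 23
final_last_year = 100
bookings_14d_this_year = 25

pickup_ratio = final_last_year / bookings_14d_last_year   # ~4.35
forecast = pickup_ratio * bookings_14d_this_year          # ~109 rooms
print(f"pickup = {pickup_ratio:.2f}, forecast = {forecast:.0f} rooms")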

More formally, let \(f_p(t+h)\) represent the forecast of visitors at national park \(p\) during month \(t+h\) where the forecast is made at time \(t\) with a lead time of \(h\). The pickup model used here is defined by \[ f_p(t+h) = \pi_p(t+h) \cdot \hat{r}_p(t+h) \]

where \(\pi_p(t+h)\) is the pickup used for month \(t+h\) with lead-time \(h\) and \(\hat{r}_p(t+h)\) is the forecasted number of reservations for month \(t+h\) with lead-time \(h\) at national park \(p\).

Pickup Definition (\(\pi_p\))

There are numerous methods for calculating the pickup \(\pi_p(t+h)\) but for the model utilized below, it is defined as follows. Let \[ R = [r_p(t+h), r_p(t-1\mbox{ year }+h), ... ] \]

represent the sequence of reservations on the books, with lead time \(h\), for month \(t+h\) and for the same month in each preceding year of the data, and let \[V = [v_p(t+h), v_p(t-1\mbox{ year }+h), ... ]\]

be the sequence of visitation counts for the months covered by \(R\). We define the pickup for month \(t+h\) as \[ \pi_p(t+h) = SimpExpSm_\alpha\bigg(\bigg[\frac{V_i}{R_i}\bigg]_i\bigg) \] where \(SimpExpSm_\alpha\) is the one-step-ahead forecast of a simple exponential smoothing model with smoothing factor \(\alpha\).

It’s not necessary to use a fancy forecasting model like exponential smoothing for \(\pi_p\). Initially, I was just using the historical average, i.e., \(\pi_p(t+h) = \mbox{ mean}\big(\big[\frac{V_i}{R_i}\big]\big)\). But assigning more weight to more recent pickups produced better forecasts and exponential smoothing does just that. Almost any other time series model could be used here as \(\big[\frac{V_i}{R_i}\big]\) is just a time series itself.
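As a minimal sketch of this step (with made-up pickup ratios rather than values from the actual data), the one-step-ahead exponential smoothing forecast can be compared against the plain historical mean; the same SimpleExpSmoothing call is reused below for the reservation forecast \(\hat{r}_p\):

Code
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Made-up yearly pickup ratios V_i / R_i for one month and one lead time
ratios = pd.Series(
    [4.1, 4.4, 5.0, 5.6],
    index=pd.period_range("2019", periods=4, freq="Y"),
)

# The historical average weights every year equally
print("mean pickup:", ratios.mean())

# Simple exponential smoothing weights recent years more heavily; its
# one-step-ahead forecast is what gets used as the pickup pi_p(t+h)
ses = SimpleExpSmoothing(ratios).fit(smoothing_level=0.8, optimized=False)
print("smoothed pickup:", ses.forecast(1).iloc[0])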

Reservation Forecast Definition (\(\hat{r}_p\))

Traditionally (does that word apply to this forecasting method? lol) \(\hat{r}_p(t+h)\) is defined simply as the number of reservations already on the books at time \(t\) for month \(t+h\): \[ \hat{r}_p(t+h) = r_p(t+h) \]

but this proved to be problematic, as the sequence contained a fair bit of noise, especially during slow seasons when reservations were low. However, using the historical average was also not ideal, as it gave too much weight to older values of \(R_i\). Using a simple exponential smoothing forecast on the sequence \([R_i]\) proved useful, reducing noise without erasing trends. Thus, we define \[ \hat{r}_p(t+h) = SimpExpSm_\alpha\big(\big[R_i\big]\big) \]

where \(\alpha\) is the smoothing factor.

EDA/Pre-visualizations

For the sake of exposition, in this document I’ll limit the analysis to only Joshua Tree National Park. I’ve run the model on other parks and found similar results.

To get a broad idea of how correlated reservations are with visitation, we can aggregate reservations by the year of their start date and plot them against visitation for that year.

Code
#######################################
# Plot visits and reservations together
#######################################

# Reservations by year
orders = (
    df.groupby(df["startdate"].dt.year,
               observed=True)["numberofpeople"].sum().reset_index())
orders = orders.loc[(orders["startdate"] > 2006) &
                    (orders["startdate"] < 2024)]

# Visits by year
visits_yearly = visits.loc[
    (visits["park_code"] == PARK_CODE)
    & (visits["date"].dt.year > 2006)
    & (visits["date"].dt.year < 2024)
]
visits_yearly = visits_yearly.groupby(
    visits_yearly["date"].dt.year, observed=True)["visits"].sum()

# Create a new figure and add traces from both
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=visits_yearly.index,
        y=visits_yearly,
        name="Visits",
        yaxis="y",
    )
)

fig.add_trace(
    go.Scatter(
        x=orders["startdate"],
        y=orders["numberofpeople"],
        name="Reservations",
        yaxis="y2"))

# Optional layout tweaks
fig.update_layout(
    title="Joshua Tree National Park Visitors and Reservations",
    xaxis=dict(showgrid=False),
    yaxis=dict(title="Visits", showgrid=False),
    yaxis2=dict(
        title="Reservations", overlaying="y", side="right", rangemode="tozero",
        showgrid=False),
    barmode="stack", template="plotly_white",
    legend=dict(x=1.1, y=1),)

fig.show()
Figure 1: Yearly Visitation and Reservations

And let’s look at seasonality. The bars show total visitation for each month of the year, summed over 2007-2023, and the lines are reservation counts by start month for each year:

Code
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Reservations by month
df['monthyear'] = df['startdate'].dt.to_period('M').dt.to_timestamp()
start_res = df.groupby('monthyear')['numberofpeople'].sum().reset_index()

# Visits by month
visits_monthly = visits.loc[
    (visits["park_code"] == PARK_CODE)
    & (visits["date"].dt.year > 2006)
    & (visits["date"].dt.year < 2024)
]
visits_monthly = (
    visits_monthly.groupby(
        visits_monthly["date"].dt.strftime("%b"),
        observed=True)["visits"].sum().reindex(month_order))

fig = go.Figure()

# Visitation numbers as a bar graph
fig.add_trace(
    go.Bar(
        x=visits_monthly.index,
        y=visits_monthly,
        name="Visits",
    )
)

start_years = start_res["monthyear"].dt.year.unique()
for year in start_years:
    year_data = start_res[start_res["monthyear"].dt.year == year]
    fig.add_trace(
        go.Scatter(
            x=year_data["monthyear"].dt.strftime("%b"),
            y=year_data["numberofpeople"],
            yaxis="y2",
            name=str(year),
            mode="lines",
        )
    )

# Optional layout tweaks
fig.update_layout(
    title="Joshua Tree National Park Total Visitors and Reservations by Month",
    yaxis=dict(
        title="Visits",
        showgrid=False,
    ),
    yaxis2=dict(
        title="Reservations", overlaying="y", side="right", rangemode="tozero",
        showgrid=False
    ),
    template="plotly_white",
    legend=dict(x=1.1, y=1),
)

fig.show()
Figure 2: Visits and Reservations by Month

Results

And now we fit the model. The two smoothing parameters were chosen by experimenting to see which values yielded the best results.

Code
######################
# Prep for forecasting
######################

beginning = df["startdate"].min()
min_year = beginning.year
end = df["startdate"].max()
max_year = end.year

# Visits for just whatever park we're interested in
visits_jt = visits.loc[visits["park_code"] == PARK_CODE].set_index("date")[
    "visits"]

####################
# Reservation matrix
####################
df["startmonthyear"] = df["startdate"].dt.to_period("M").dt.to_timestamp()

# This is not the exact lead time to the reservation but the lead time to the
# beginning of the month of the reservation which is what we need for forecasting
# monthly visitation before that month occurs.
df["monthleadtime"] = df["startmonthyear"] - df["orderdate"]

# Sum of reservations for each startdate that were booked at least LEADTIME
# before the start of that month.
res_matrix = (
    df.loc[df["monthleadtime"] >= LEADTIME]
    .groupby("startdate")["numberofpeople"]
    .sum()
    .reset_index()
)

# Aggregate to monthyears
res_matrix["startmonthyear"] = (
    res_matrix["startdate"].dt.to_period("M").dt.to_timestamp()
)
res_matrix = res_matrix.groupby("startmonthyear")["numberofpeople"].sum()
# Fill out the entire daterange index
full_index = pd.date_range(
    start=f"{min_year}-01-01", end=f"{max_year}-12-01", freq="MS"
)
res_matrix = res_matrix.reindex(full_index, fill_value=0)


###############
# Pickup matrix
###############
def pickup(y: int, month_of_year: int) -> float:
    """
    Calculate pickup for a given year and month. The lead time is already
    baked into res_matrix, which only counts reservations placed at least
    LEADTIME before the start of the month.
    """
    reservation_count = res_matrix[datetime.datetime(y, month_of_year, 1)]

    # Months with very few reservations produce unstable ratios; skip them.
    if reservation_count < 100:
        return np.nan

    return visits_jt[datetime.datetime(y, month_of_year, 1)] / reservation_count


pickup_matrix = {}
for year in range(min_year, max_year):
    for month in range(1, 13):
        pickup_matrix[(year, month, LEADTIME)] = pickup(year, month)

############################################################
# Forecast model for month=t with lead time=l
############################################################
PICKUP_SMOOTHING = 0.8  # .94 was great when tsa function didn't have +1
RES_SMOOTHING = 0.6  # .96


def v_forecast(
        monthyear: datetime.datetime, leadtime: datetime.timedelta) -> float:
    """Multiplicative, (classical/advanced?), historical average pickup model.
    leadtime is timedelta in days.

    Will only generate a forecast if the month of interest is at least one year after
    the beginning of the relevant data.
    Will also only generate a forecast if the month of interest is at least the
    leadtime after the beginning.
    """
    if (monthyear > beginning + datetime.timedelta(366)) and (
        monthyear - leadtime > beginning
    ):
        return pickup_est(monthyear, leadtime) * tsa_res_count(
            monthyear.year, monthyear.month
        )
    return 0


def pickup_est(
        monthyear: datetime.datetime, leadtime: datetime.timedelta) -> float:
    """Estimator of pickup. Currently using a simple exponential smoothing model
    to generate estimator. This weights recent pickups more heavily
    """
    yearly_pickups = pd.Series(
        {
            datetime.datetime(y, monthyear.month, 1): pickup_matrix[
                (y, monthyear.month, leadtime)
            ]
            for y in range(min_year, monthyear.year)
        }
    )

    # If there are never any valid previous pickup values, then return nan
    if yearly_pickups.empty or yearly_pickups.isna().all():
        return np.nan

    # If only one usable value, return it
    non_na_values = yearly_pickups.dropna()
    if len(non_na_values) == 1 or non_na_values.nunique() == 1:
        return float(non_na_values.iloc[0])

    # Fill missing values with the mean (or consider ffill/bfill if trend exists)
    yearly_pickups = yearly_pickups.fillna(non_na_values.mean())

    yearly_pickups.index = pd.DatetimeIndex(
        pd.to_datetime(yearly_pickups.index)
    ).to_period("Y")
    yearly_pickups = yearly_pickups.sort_index()

    model = SimpleExpSmoothing(yearly_pickups).fit(
        smoothing_level=PICKUP_SMOOTHING, optimized=False)
    return model.forecast(1).iloc[0]


def tsa_res_count(y: int, month_of_year: int) -> float:
    """
    Use an exponential smoothing model to forecast what "should" be the next reservation count.
    """
    res_trend = pd.Series(
        {
            datetime.datetime(year, month_of_year, 1): res_matrix[
                datetime.datetime(year, month_of_year, 1)
            ]
            for year in range(min_year, y + 1)
        }
    )

    # If there is only one year of history, then return that one count
    if len(res_trend) == 1:
        return float(res_trend.iloc[0])
    # If there are no valid reservation counts at all, then return nan
    if res_trend.isna().all():
        return np.nan
    # If no trend or level, then return 0 to stop the model fit below from complaining.
    if (res_trend == 0).all():
        return 0

    res_trend.index = pd.DatetimeIndex(
        pd.to_datetime(res_trend.index)).to_period("Y")
    res_trend = res_trend.sort_index()
    model = SimpleExpSmoothing(res_trend).fit(
        smoothing_level=RES_SMOOTHING, optimized=False)
    return model.forecast(1).iloc[0]


#######################
# Model fit and display
#######################
visits_df = visits_jt.loc[(pd.to_datetime(
    visits_jt.index).year >= 2008)].reset_index()
visits_df["pickup_model"] = visits_df["date"].apply(
    lambda date: v_forecast(date, LEADTIME))

visits_df = visits_df.melt(id_vars="date", value_vars=[
                           "pickup_model", "visits"])

px.line(
    visits_df.sort_values("date"),
    x="date",
    y="value",
    color="variable",
    title=f"Actual Visits to Joshua Tree NP and {LEADTIME.days} Day Lead-time Forecast",
    template="plotly_white",
)
Figure 3: Actual Visitation vs Forecast

We can see from this graph that the forecast grossly overestimates visitation in a few months, e.g., Dec 2021 and March 2023. This effect can be attenuated by increasing the smoothing factors in the model, at the expense of the model’s ability to react to quick changes in visitation (such as the fast growth beginning at the end of 2014 and continuing through 2019).

Code
# Unmelt table
model_df = visits_df.pivot(index="date", columns="variable", values="value")


def mape(actual: pd.Series, forecast: pd.Series) -> float:
    """
    Mean Absolute Percentage Error
    """
    # Treat months with zero recorded visits as missing to avoid dividing by zero
    actual = actual.replace(0, np.nan)

    return ((actual - forecast) / actual).abs().mean() * 100


def mse(actual: pd.Series, forecast: pd.Series) -> float:
    """
    Mean Squared Error
    """
    actual = actual.replace(0, np.nan)

    return pow(actual - forecast, 2).mean()


def rsquared(actual: pd.Series, forecast: pd.Series) -> float:
    """
    R squared
    """
    ss_res = pow(actual - forecast, 2).sum()
    ss_tot = pow(actual - actual.mean(), 2).sum()

    return 1 - ss_res / ss_tot

print("Pickup model:")
print(f"MAPE = {mape(model_df['visits'], model_df['pickup_model']):.2f}")
print(f"MSE = {mse(model_df['visits'], model_df['pickup_model']):.2f}")
print(f"R^2 = {rsquared(model_df['visits'], model_df['pickup_model']):.2f}")
Pickup model:
MAPE = 30.86
MSE = 7654758303.19
R^2 = 0.11

Comparison to Other Models

AR(5) Model

The National Park Service publication referenced above claims to use “trend line extensions based on actual visitation data from the previous five years”, but I don’t know where the numbers in the chart (pages 51-61) come from. A hierarchical autoregressive model was fit in Clark et al. and serves as a reference point for establishing a baseline comparison. Here, to capture seasonality while avoiding the complexity of more advanced models (e.g., SARIMA), I’ve created an AR(5) model for each month, i.e., visitation forecasts for month \(M\) and year \(Y\) are created by fitting an AR(5) model on the time series of visitation in month \(M\) over the years preceding \(Y\).

Code
####################
# Five year AR model
####################

visits_jt = visits.loc[visits["park_code"] == PARK_CODE].set_index("date")[
    "visits"]

visits_jt.index = pd.to_datetime(visits_jt.index, errors="coerce")
visits_jt = visits_jt.asfreq("MS")
visits_jt = visits_jt.sort_index()

predictions = pd.Series(index=visits_jt.index, dtype=float)
sarima_preds = pd.Series(index=visits_jt.index, dtype=float)

# Iterate over all dates where we want to make a forecast
for date in visits_jt.index:
    target_month = date.month
    target_year = date.year

    # Subset of past years' same month
    mask = (visits_jt.index.month == target_month) & (
        visits_jt.index.year < target_year
    )
    subset = visits_jt.loc[mask].dropna()

    if len(subset) >= 12:
        lag_order = ar_select_order(subset, maxlag=5, ic="aic")
        model = AutoReg(subset, lags=lag_order.ar_lags, old_names=False)
        result = model.fit()
        predictions.loc[date] = result.forecast(1).iloc[0]
    else:
        # Not enough data to fit AR(5); optionally set prediction to NaN
        predictions.loc[date] = np.nan

predictions.name = "ar_model"


models_wide = visits_df.pivot(index="date", columns="variable", values="value")
models_wide = models_wide.merge(predictions, on="date")

plot_df = models_wide.reset_index().melt(
    id_vars="date", value_vars=["visits", 'ar_model'])

px.line(
    plot_df.sort_values("date"),
    x="date",
    y="value",
    color="variable",
    title=f"Actual Visits to Joshua Tree NP and {LEADTIME.days} Day Lead-time AR(5) Forecast",
    template="plotly_white",
).show()

print("AR(5) model:")
print(f"MAPE = {mape(models_wide['visits'], models_wide['ar_model']):.2f}")
print(f"MSE = {mse(models_wide['visits'], models_wide['ar_model']):.2f}")
print(f"R^2 = {rsquared(models_wide['visits'], models_wide['ar_model']):.2f}")
AR(5) model:
MAPE = 16.86
MSE = 3472426698.64
R^2 = 0.53

SARIMA

To be continued…

Direct exponential smoothing

To be continued…

Potential Improvements

Taking park facility into consideration

Each reservation is associated with a facility within each park unit. I initially thought it would be useful to weight reservations at newer facilities less heavily, but the results did not seem to improve. However, it’s possible that this was an oversight, as the pickup model was not fully developed at the time.
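For reference, a rough sketch of one way this could look, downweighting reservations at facilities that appear only recently in the data (the first-seen proxy for an opening date and the one-year linear ramp are arbitrary illustrative choices, not something evaluated above):

Code
# Approximate each facility's "opening" by the first start date seen for it
first_seen = df.groupby("facilityid", observed=True)["startdate"].transform("min")
facility_age_days = (df["startdate"] - first_seen).dt.days

# Weight ramps linearly from 0 to 1 over the facility's first year of data
df["weight"] = (facility_age_days / 365).clip(upper=1.0)
df["weightedpeople"] = df["numberofpeople"] * df["weight"]

# The reservation matrix could then be built from "weightedpeople"
# instead of "numberofpeople"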

More advanced forecasting models for pickup and reservation trends.

Currently, a simple exponential smoothing model is used for both. A Holt-Winters model (see here) or another model that allows for a trend could be tried, as could a Holt-Winters variant that includes a seasonality component. A sketch follows below.
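This is what that could look like with statsmodels, using a made-up yearly reservation series in place of the one built inside tsa_res_count() and letting the fit optimize the smoothing parameters rather than fixing them:

Code
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Made-up yearly reservation counts for a single month, standing in for the
# [R_i] series built inside tsa_res_count()
res_trend = pd.Series(
    [1200, 1500, 1900, 2400, 2800, 3300, 3600, 4100],
    index=pd.period_range("2014", periods=8, freq="Y"),
)

# Holt's linear trend: a level plus an additive trend component; here the
# smoothing parameters are optimized rather than fixed by hand
holt = ExponentialSmoothing(
    res_trend, trend="add", initialization_method="estimated").fit()
print(holt.forecast(1).iloc[0])

# A seasonal Holt-Winters variant would instead be fit on the full monthly
# series rather than one yearly series per month, e.g.
# ExponentialSmoothing(monthly_series, trend="add", seasonal="add",
#                      seasonal_periods=12)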

Pickup method vs (S)ARIMA(X) model

Forecasting with an ARIMA model could be tried, using reservations as an exogenous variable and taking the obvious seasonality into account. The current autoregressive model actually has a lead time of one year, which puts it at quite a disadvantage relative to the pickup model. This lead time could be reduced by expanding the AR model into a more complex SARIMA model.
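A sketch of that direction with statsmodels' SARIMAX, reusing the df and visits_jt objects from above and treating monthly reservation totals as the exogenous regressor. The order terms are placeholders rather than tuned values, and forecasting a future month would require supplying the reservations already on the books for that month as the future exogenous input:

Code
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Monthly reservation totals aggregated by start month
res_monthly = (
    df.groupby(df["startdate"].dt.to_period("M"))["numberofpeople"]
    .sum()
    .to_timestamp()
)

# Restrict both series to the months where reservation data exists
common = visits_jt.index.intersection(res_monthly.index)
endog = visits_jt.loc[common]
exog = res_monthly.loc[common]

# Placeholder (p,d,q)(P,D,Q,s) orders; the 12-month seasonal terms capture
# the obvious annual seasonality
sarimax = SARIMAX(
    endog, exog=exog, order=(1, 0, 0), seasonal_order=(1, 1, 0, 12)
).fit(disp=False)
print(sarimax.summary())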

Hierarchical Modeling

See here

Conclusion

The autoregressive model is surprisingly effective given its simplicity. Currently it outperforms the pickup model as presented, but it should be noted that this is not at all an apples-to-apples comparison. For one, the autoregressive model here has an inherent lead time of one year, as opposed to the variable lead time of the pickup method. Additionally, the pickup model is not “fit” on the data. The smoothing parameters can be tuned, but this has been done only by ad hoc experimentation. This could be automated, as sketched below, and there are certainly many ways in which the current pickup model could be optimized (e.g., more elegant noise reduction, MLE fitting of pickup estimators, trying different time series models instead of exponential smoothing, ...).
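For instance, a crude grid search over the two smoothing factors could replace the ad hoc experimentation. The sketch below reuses v_forecast(), mape(), model_df, and LEADTIME from above, and simply rebinds the module-level smoothing constants that pickup_est() and tsa_res_count() read; the grid and the choice of MAPE as the objective are arbitrary.

Code
import itertools

# Score every (pickup, reservation) smoothing pair by MAPE against actuals.
# This only works at the top level of the script/notebook, since it rebinds
# the globals that pickup_est() and tsa_res_count() read.
grid = [0.2, 0.4, 0.6, 0.8, 0.95]
scores = {}
for p_alpha, r_alpha in itertools.product(grid, grid):
    PICKUP_SMOOTHING = p_alpha
    RES_SMOOTHING = r_alpha
    forecast = pd.Series(
        {d: v_forecast(d, LEADTIME) for d in model_df.index})
    scores[(p_alpha, r_alpha)] = mape(model_df["visits"], forecast)

best = min(scores, key=scores.get)
print(f"best (pickup, reservation) smoothing = {best}, "
      f"MAPE = {scores[best]:.2f}")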

Footnotes

  1. The evidence provided for this was this document↩︎

  2. There’s an API but, at least as of July 2025, it does not contain as much data as is contained in the published csv files↩︎

  3. This is a multiplicative pickup as opposed to an additive pickup. This distinction and the distinction between classical vs advanced pickup models is left as an exercise to the reader of the aforementioned paper.↩︎