Natural gas is one of the largest sources of energy production in the United States. Like other fossil fuels, natural gas forms from plant and animal remains that were buried under sand, silt, and rock millions of years ago and transformed by heat and pressure into oil and natural gas. Gas from these deposits is processed and sold to consumers for residential, commercial, and industrial use; it is also burned to generate electricity, heat buildings, and power gas stoves and other appliances.

The dataset of natural gas spot prices comes from the US Energy Information Administration (EIA) and details the weekly Henry Hub spot price since 1997. The Henry Hub spot price is the closing spot price at the Henry distribution hub in Erath, Louisiana. The data is reported weekly, with each observation dated to the Friday of its week.

In the past few years, the price of natural gas has risen dramatically, with much of the increase concentrated around 2022. Because natural gas is such a widely used fuel source, price forecasting helps anticipate future trends and plan mitigation strategies, if we assume demand is correlated with the spot price.

This project will try to predict future natural gas spot prices using supervised machine learning algorithms. There are three main concerns with this prediction:

  1. This is time series data, so trend and seasonality should be accounted for when modeling future prices.
  2. The prices are not adjusted for inflation. We will need to adjust them to ensure accurate predictions.
  3. How do you predict values beyond one week?
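On the third concern, one common approach (a sketch only, not necessarily the method used later in this project) is recursive forecasting: predict one week ahead, feed that prediction back in as the newest lagged input, and repeat. The `history` values and the one-lag linear model below are purely illustrative:

```python
# Recursive multi-step forecasting sketch. The toy 'history' series and the
# single-lag LinearRegression are illustrative assumptions, not this
# project's final feature set or model.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy history: last observed weekly prices
history = [2.9, 3.1, 3.0, 3.2]

# Fit a one-lag autoregressive linear model on the toy history
X = np.array(history[:-1]).reshape(-1, 1)  # lagged price
y = np.array(history[1:])                  # next week's price
model = LinearRegression().fit(X, y)

# Predict 4 weeks ahead by feeding each prediction back in as the next lag
forecast = []
last = history[-1]
for _ in range(4):
    nxt = float(model.predict(np.array([[last]]))[0])
    forecast.append(nxt)
    last = nxt  # the prediction becomes the input for the following week

print(len(forecast))  # four future weeks
```

The same idea extends to richer lag features (e.g. rolling averages recomputed from the growing history) at the cost of compounding prediction error week over week.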

We will first deal with inflation by computing, for each month, the ratio of its Consumer Price Index (CPI) for Energy to the most recent CPI value. Multiplying the spot price by this ratio yields inflation-adjusted prices.
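As a concrete example, using the January 1997 energy CPI and the October 2023 value that appear later in this notebook, the adjustment for the first spot price works out as:

```python
# Deflating a historical spot price into October 2023 terms.
# CPI values come from the energy CPI table loaded later in the notebook.
hist_cpi = 113.3      # energy CPI, January 1997
curr_cpi = 286.754    # energy CPI, October 2023 (the most recent value)

inflation = hist_cpi / curr_cpi          # ratio used as the 'inflation' column
spot_price = 3.79                        # 1997-01-10 spot price
spot_price_inf = spot_price * inflation  # adjusted price

print(round(inflation, 6))       # 0.395112
print(round(spot_price_inf, 6))  # 1.497475
```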

We will then split the data into a training and test set by withholding the last N weeks of data from the modeling process. We will then predict the withheld ‘future’ prices and calculate modeling metrics. We will repeat this process for several models and several hyperparameter combinations.

We will then re-train the model on the full dataset and compare predictions to the actual values from true future weeks. As the data was downloaded on November 17th, the future data will include the closing date of 2023-11-17 and any future weeks until this project is completed.

Resources:

Data Citation: Data from the EIA, a public government website.

U.S. Energy Information Administration. (n.d.). Natural Gas Spot and Futures Prices (NYMEX). https://www.eia.gov/dnav/ng/ng_pri_fut_s1_d.htm

Import Libraries

# Libraries
import pandas as pd
import numpy as np

## Dates and Times
import pytz
from dateutil import parser
from datetime import timezone,datetime,timedelta
from dateutil.relativedelta import relativedelta
from datetime import date 

## String manipulation
import string
import re #regex
import json #json parsing
## Data Generation
import random

## Vector Math
from numpy.linalg import norm
## Math
import math

# Data Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt

# Sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Models

# Metrics
from sklearn.metrics import explained_variance_score
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error

Loading and Cleaning Data

natural_gas = pd.read_excel('natural_gas_us_price.xls', sheet_name='Data 1')
natural_gas = natural_gas.drop(index=[0, 1])  # drop the header rows from the sheet
natural_gas = natural_gas.rename(columns={'Back to Contents': 'date',
    'Data 1: Weekly Henry Hub Natural Gas Spot Price (Dollars per Million Btu)': 'spot_price'})
natural_gas['date'] = pd.to_datetime(natural_gas['date'])
natural_gas = natural_gas.reset_index(drop=True)
natural_gas
date spot_price
0 1997-01-10 3.79
1 1997-01-17 4.19
2 1997-01-24 2.98
3 1997-01-31 2.91
4 1997-02-07 2.53
... ... ...
1395 2023-10-13 3.2
1396 2023-10-20 2.86
1397 2023-10-27 2.89
1398 2023-11-03 3.17
1399 2023-11-10 2.46

1400 rows × 2 columns

We can see from the preview that this dataset contains 1,400 rows, each corresponding to the Friday week date with one single spot price listed.

We can also import a secondary data source: the energy-specific Consumer Price Index from the Bureau of Labor Statistics.

References:

Data Citation: Bureau of Labor Statistics.

U.S. Bureau of Labor Statistics. (n.d.). CPI Home. Consumer Price Index. https://www.bls.gov/cpi/
energy_cpi = pd.read_excel('energy_cpi.xlsx')
energy_cpi = energy_cpi.drop(index=range(11), columns=['Unnamed: 13', 'Unnamed: 14'])  # header rows and empty columns
energy_cpi = energy_cpi.rename(columns={'CPI for All Urban Consumers (CPI-U)': 'year', 'Unnamed: 1': '01', 'Unnamed: 2': '02',
       'Unnamed: 3': '03', 'Unnamed: 4': '04', 'Unnamed: 5': '05', 'Unnamed: 6': '06', 'Unnamed: 7': '07',
       'Unnamed: 8': '08', 'Unnamed: 9': '09', 'Unnamed: 10': '10', 'Unnamed: 11': '11', 'Unnamed: 12': '12'})
energy_cpi = energy_cpi.reset_index(drop=True)
energy_cpi
year 01 02 03 04 05 06 07 08 09 10 11 12
0 1997 113.3 113.1 111.2 110 109.9 112.3 111.4 112.5 113.9 111.5 110.7 108.4
1 1998 105.9 103.2 101.6 101.9 103.8 105.7 105.2 103.8 102.7 101.3 100.5 98.9
2 1999 98.1 97.3 98.4 105 105.6 106.8 108.7 111.3 113.2 111.6 111.2 112.2
3 2000 112.5 116.7 122.2 120.7 121 129.6 129.7 125.9 130.6 129.3 129 128.1
4 2001 132.5 132 129.5 133.1 140.1 140.5 132.4 129.4 132.5 122.1 116 111.4
5 2002 111.7 111 115.6 122.2 122.9 124.9 125.5 125.8 126.1 125.8 125.3 123.3
6 2003 127.5 135.4 142.6 138.1 134 136.5 136.8 140.6 144.6 136.9 133.1 131.8
7 2004 137.4 140.6 143.1 145.9 154.1 159.7 156.3 155.3 154.3 157.7 158.6 153.7
8 2005 151.9 155.2 160.8 170.9 169.4 171.4 178.5 186.6 208 204.3 187.6 180
9 2006 189.5 186.4 188.6 201.4 209.3 211.3 215.1 214.7 199.1 181.3 180.4 185.2
10 2007 183.567 184.451 196.929 207.265 219.071 221.088 217.274 209.294 209.637 207.588 219.009 217.506
11 2008 219.465 219.311 230.505 240.194 257.106 275.621 280.833 266.283 258.02 231.561 189.938 171.158
12 2009 174.622 178.741 177.454 179.704 186.909 205.408 201.938 204.971 202.243 199.198 204.026 202.301
13 2010 208.026 204.455 209.999 212.977 214.363 211.66 212.372 212.663 210.003 210.947 211.97 217.953
14 2011 223.266 226.86 242.516 253.495 260.376 254.17 252.661 251.706 250.48 240.902 238.177 232.3
15 2012 236.942 242.663 253.599 255.736 250.306 244.167 239.972 250.306 256.332 250.523 238.946 233.473
16 2013 234.624 248.146 249.565 244.757 247.805 251.921 251.37 250.011 248.513 238.524 233.136 234.542
17 2014 239.551 242.041 250.543 252.717 255.982 259.858 257.907 250.951 247.077 234.745 221.844 209.785
18 2015 192.619 196.597 204.731 203.715 214.33 220.861 219.852 213.248 201.641 194.501 189.267 183.378
19 2016 180.171 172.061 179.017 185.652 192.673 200.035 195.94 193.524 195.852 194.786 191.402 193.306
20 2017 199.608 198.195 198.597 202.869 203.132 204.646 202.554 205.894 215.711 207.29 209.383 206.598
21 2018 210.663 213.519 212.554 218.83 226.81 229.137 227.107 226.939 226.165 225.757 215.91 205.905
22 2019 200.563 202.74 211.724 222.499 225.773 221.373 222.492 216.978 215.418 216.351 214.636 212.982
23 2020 213.043 208.354 199.573 183.081 183.076 193.379 197.665 197.362 198.858 196.458 194.388 198.155
24 2021 205.273 213.277 225.861 229.116 235.339 240.72 244.8 246.639 248.228 255.338 259.1 256.207
25 2022 260.653 267.771 298.246 298.469 316.761 340.917 325.407 305.372 297.343 300.359 292.953 274.937
26 2023 283.33 281.673 279.084 283.352 279.816 283.854 284.828 294.328 296.004 286.754 NaN NaN

There are 27 rows in this data set, with columns for each month and one denoting the year.

Since the CPI is available monthly, we will link it to the spot-price table by year and month. Below, we reshape the CPI table into a more usable form to join back to our original dataset.

We can calculate the inflation adjustment factor by dividing each historical CPI by the most current CPI value (October 2023).

# Calculating the inflation adjustment factor
cols = list(energy_cpi.columns)[1:]  # month columns only

curr_cpi = energy_cpi.iloc[26, 10]
print('Current Energy Consumer Price Index:', curr_cpi)

energy_cpi.iloc[26, 11] = curr_cpi  # carry the October 2023 CPI forward to November 2023

energy_inf = energy_cpi.copy()

for i in cols:
    energy_inf[i] = energy_inf[i]/curr_cpi
    
energy_inf = energy_inf.melt(id_vars=['year'], var_name='month', value_name='inflation')
energy_inf['month'] = energy_inf['month'].apply(lambda x: x.rjust(2,'0'))
energy_inf['year'] = energy_inf['year'].astype('str')
energy_inf['date'] = pd.to_datetime((energy_inf['year'])+'-'+(energy_inf['month'])+'-01').dt.date
energy_inf
Current Energy Consumer Price Index: 286.754
year month inflation date
0 1997 01 0.395112 1997-01-01
1 1998 01 0.369306 1998-01-01
2 1999 01 0.342105 1999-01-01
3 2000 01 0.392322 2000-01-01
4 2001 01 0.462069 2001-01-01
... ... ... ... ...
319 2019 12 0.742734 2019-12-01
320 2020 12 0.691028 2020-12-01
321 2021 12 0.893473 2021-12-01
322 2022 12 0.95879 2022-12-01
323 2023 12 NaN 2023-12-01

324 rows × 4 columns

We can now join the two dataframes and express prices in October 2023 terms by multiplying the spot price by the inflation column.

# Join
natural_gas_inf = natural_gas.copy()
natural_gas_inf['month'] = natural_gas_inf['date'].apply(lambda x: str(x.month).rjust(2,'0'))
natural_gas_inf['year'] = natural_gas_inf['date'].apply(lambda x: str(x.year))
natural_gas_inf = natural_gas_inf.merge(energy_inf, how='left', on =['month','year'])
natural_gas_inf = natural_gas_inf.drop(columns={'date_y', 'month','year'})
natural_gas_inf = natural_gas_inf.rename(columns={'date_x':'date'})
natural_gas_inf['spot_price_inf'] = natural_gas_inf['spot_price'] * natural_gas_inf['inflation']
natural_gas_inf
date spot_price inflation spot_price_inf
0 1997-01-10 3.79 0.395112 1.497475
1 1997-01-17 4.19 0.395112 1.65552
2 1997-01-24 2.98 0.395112 1.177434
3 1997-01-31 2.91 0.395112 1.149776
4 1997-02-07 2.53 0.394415 0.997869
... ... ... ... ...
1395 2023-10-13 3.2 1.0 3.2
1396 2023-10-20 2.86 1.0 2.86
1397 2023-10-27 2.89 1.0 2.89
1398 2023-11-03 3.17 1.0 3.17
1399 2023-11-10 2.46 1.0 2.46

1400 rows × 4 columns

Correcting for inflation is important, especially since the value of a US dollar has changed substantially over time. As this data goes all the way back to 1997, historical spot prices would otherwise be overstated relative to the adjusted series: under this adjustment, a \$5 spot price from 1997 scales down to roughly \$2 in October 2023 terms.

Data Exploration

print('Data Size: ', len(natural_gas_inf))
Data Size:  1400

In this dataset, there are 1,400 rows of data. With the addition of 'spot_price_inf', the inflation-adjusted spot price, there are two price columns (nominal and adjusted). The only feature of the data is the date of the spot price.

fig, axs = plt.subplots()
axs.plot(natural_gas_inf['date'], natural_gas_inf['spot_price'], label='Spot Price')
axs.plot(natural_gas_inf['date'], natural_gas_inf['spot_price_inf'], label='Inflation Adjusted Spot Price', color='green')

axs.set_xlabel('Date')
axs.set_ylabel('Spot Price')
axs.set_title('Spot Price Over Time')
axs.legend()
<matplotlib.legend.Legend at 0x1a4709cd2d0>

png

We can see that adjusting for inflation mostly decreases prices, especially for older data. Using the adjusted series should keep the model from over-predicting future spot prices based on nominally higher historical prices.

mean_sp = natural_gas_inf['spot_price'].mean()
q1_sp = natural_gas_inf['spot_price'].quantile(0.25)
median_sp = natural_gas_inf['spot_price'].quantile(0.5)
q3_sp = natural_gas_inf['spot_price'].quantile(0.75)

print('Spot Price')
print('Mean Price: $%.2f' %(mean_sp))
print('Q1 Price: $%.2f' %(q1_sp))
print('Median Price: $%.2f' %(median_sp))
print('Q3 Price: $%.2f' %(q3_sp))
print('')
print('')

mean_sp = natural_gas_inf['spot_price_inf'].mean()
q1_sp = natural_gas_inf['spot_price_inf'].quantile(0.25)
median_sp = natural_gas_inf['spot_price_inf'].quantile(0.5)
q3_sp = natural_gas_inf['spot_price_inf'].quantile(0.75)

print('Inflation Adjusted Spot Price')
print('Mean Price: $%.2f' %(mean_sp))
print('Q1 Price: $%.2f' %(q1_sp))
print('Median Price: $%.2f' %(median_sp))
print('Q3 Price: $%.2f' %(q3_sp))
Spot Price
Mean Price: $4.20
Q1 Price: $2.64
Median Price: $3.54
Q3 Price: $5.27

Inflation Adjusted Spot Price
Mean Price: $2.91
Q1 Price: $1.76
Median Price: $2.42
Q3 Price: $3.53

Looking at the mean and various quantiles of prices, we also can see that the inflation adjusted spot prices are overall less than the non-inflation adjusted prices.

mean_sp = natural_gas_inf['spot_price_inf'].mean()
q1_sp = natural_gas_inf['spot_price_inf'].quantile(0.25)
median_sp = natural_gas_inf['spot_price_inf'].quantile(0.5)
q3_sp = natural_gas_inf['spot_price_inf'].quantile(0.75)

max_date = natural_gas_inf['date'].max()
min_date = natural_gas_inf['date'].min()


fig, axs = plt.subplots()
axs.plot(natural_gas_inf['date'], natural_gas_inf['spot_price_inf'], color='green')
axs.hlines(mean_sp, min_date, max_date, color='black', label='mean')
axs.hlines(median_sp, min_date, max_date, color='grey', label='median')
axs.hlines(q1_sp, min_date, max_date, color='grey', linestyles='dashed')
axs.hlines(q3_sp, min_date, max_date, color='grey', linestyles='dashed')


axs.set_xlabel('Date')
axs.set_ylabel('Spot Price')
axs.set_title('Inflation Adjusted Spot Price Over Time')
axs.legend()
<matplotlib.legend.Legend at 0x1a470d3dc90>

png

We can plot the inflation-adjusted spot price along with its mean, median, and quartile values. As the plot shows, the spot price fluctuates but tends to revert toward a relatively stable range.

Feature Engineering

As this data has only one feature, we can transform the data to add additional dimensions to aid in prediction. We can first add rolling mean prices in various increments.

It is important to ensure the rolling window does not take into account the current spot price, or else the model would be snooping on ‘future’ data.

# Rolling means over trailing windows; shift(1) excludes the current week's price
natural_gas_inf = natural_gas_inf.merge(natural_gas_inf[['date', 'spot_price_inf']].rolling('14D', on='date', min_periods=1).mean(),
                      how='left', on='date')
natural_gas_inf = natural_gas_inf.rename(columns={'spot_price_inf_x': 'spot_price_inf', 'spot_price_inf_y': '2_week_rolling_avg'})
natural_gas_inf['2_week_rolling_avg'] = natural_gas_inf['2_week_rolling_avg'].shift(1)


natural_gas_inf = natural_gas_inf.merge(natural_gas_inf[['date', 'spot_price_inf']].rolling('28D', on='date', min_periods=1).mean(),
                      how='left', on='date')
natural_gas_inf = natural_gas_inf.rename(columns={'spot_price_inf_x': 'spot_price_inf', 'spot_price_inf_y': '4_week_rolling_avg'})
natural_gas_inf['4_week_rolling_avg'] = natural_gas_inf['4_week_rolling_avg'].shift(1)


natural_gas_inf = natural_gas_inf.merge(natural_gas_inf[['date', 'spot_price_inf']].rolling('56D', on='date', min_periods=1).mean(),
                      how='left', on='date')
natural_gas_inf = natural_gas_inf.rename(columns={'spot_price_inf_x': 'spot_price_inf', 'spot_price_inf_y': '8_week_rolling_avg'})
natural_gas_inf['8_week_rolling_avg'] = natural_gas_inf['8_week_rolling_avg'].shift(1)

natural_gas_inf['prev_price'] = natural_gas_inf['spot_price_inf'].shift(1)
natural_gas_inf
date spot_price inflation spot_price_inf 2_week_rolling_avg 4_week_rolling_avg 8_week_rolling_avg prev_price
0 1997-01-10 3.79 0.395112 1.497475 NaN NaN NaN None
1 1997-01-17 4.19 0.395112 1.65552 1.497475 1.497475 1.497475 1.497475
2 1997-01-24 2.98 0.395112 1.177434 1.576498 1.576498 1.576498 1.65552
3 1997-01-31 2.91 0.395112 1.149776 1.416477 1.443477 1.443477 1.177434
4 1997-02-07 2.53 0.394415 0.997869 1.163605 1.370052 1.370052 1.149776
... ... ... ... ... ... ... ... ...
1395 2023-10-13 3.2 1.0 3.2 2.827903 2.789435 2.708395 2.91
1396 2023-10-20 2.86 1.0 2.86 3.055000 2.890080 2.776094 3.2
1397 2023-10-27 2.89 1.0 2.89 3.030000 2.928951 2.808991 2.86
1398 2023-11-03 3.17 1.0 3.17 2.875000 2.965000 2.838628 2.89
1399 2023-11-10 2.46 1.0 2.46 3.030000 3.030000 2.909717 3.17

1400 rows × 8 columns

We can see from the resulting dataframe that only the first row has NaN values. This is because there is no ‘previous’ value to average. We will exclude this row from the training set later on.

We can also add additional columns for various date dimensions. Adding in date parts, like month, quarter, and year, can help the model determine seasonal trends and yearly trends.

# Adding Date Parts as Columns
natural_gas_inf['month'] = natural_gas_inf['date'].apply(lambda x: x.month)
natural_gas_inf['quarter'] = natural_gas_inf['date'].apply(lambda x: x.quarter)
natural_gas_inf['year'] = natural_gas_inf['date'].apply(lambda x: x.year)

We can now calculate the correlation plot between these new dimensions.

natural_gas_inf.corr()
date spot_price inflation spot_price_inf 2_week_rolling_avg 4_week_rolling_avg 8_week_rolling_avg prev_price month quarter year
date 1.000000 -0.137307 0.826097 0.189714 0.191673 0.193627 0.197814 0.190223 0.025185 0.024067 0.999312
spot_price -0.137307 1.000000 0.195646 0.896610 0.863616 0.850321 0.826259 0.869444 0.031304 0.029119 -0.138504
inflation 0.826097 0.195646 1.000000 0.554204 0.558781 0.562984 0.566945 0.555138 0.031359 0.028085 0.825165
spot_price_inf 0.189714 0.896610 0.554204 1.000000 0.973856 0.963913 0.942210 0.977401 0.038471 0.035941 0.188354
2_week_rolling_avg 0.191673 0.863616 0.558781 0.973856 1.000000 0.992704 0.971842 0.994348 0.044525 0.041332 0.190068
4_week_rolling_avg 0.193627 0.850321 0.562984 0.963913 0.992704 1.000000 0.987607 0.982706 0.043695 0.041666 0.192059
8_week_rolling_avg 0.197814 0.826259 0.566945 0.942210 0.971842 0.987607 1.000000 0.959889 0.045978 0.044945 0.196164
prev_price 0.190223 0.869444 0.555138 0.977401 0.994348 0.982706 0.959889 1.000000 0.044575 0.041264 0.188600
month 0.025185 0.031304 0.031359 0.038471 0.044525 0.043695 0.045978 0.044575 1.000000 0.971172 -0.011785
quarter 0.024067 0.029119 0.028085 0.035941 0.041332 0.041666 0.044945 0.041264 0.971172 1.000000 -0.011849
year 0.999312 -0.138504 0.825165 0.188354 0.190068 0.192059 0.196164 0.188600 -0.011785 -0.011849 1.000000

As we can see, the correlations between the various rolling averages are very high, which is unsurprising. The rolling averages are much less correlated with the date attributes, and among those they correlate most with the date and year columns. This makes sense, as most of the significant price changes play out over years rather than within them.

Overall, the model likely only needs one rolling average value, if it’s needed at all. We can start by using the 4 week rolling average and adjusting if necessary.

Build Pipeline

We can build a pipeline to scale our date-part variables (month, quarter, and year) and the numeric variables (the four week rolling average and the date).

We first must convert the date column to an ordinal number so that the model does not treat the date as a categorical value. We use Python's ordinal day number (days since January 1 of year 1), which preserves the weekly spacing between observations.

# Ordinal Date
natural_gas_inf['date_ordinal'] = natural_gas_inf['date'].apply(lambda x: date.toordinal(x))
# Drop NANs
natural_gas_inf = natural_gas_inf.dropna(axis=0)

We can now build the full pipeline.

# Data Preprocessing
numeric_features = ['4_week_rolling_avg', 'date_ordinal']
numeric_transformer = Pipeline( # No missing values, so no need for an imputer
    steps=[('scaler', StandardScaler())]
)

categorical_features = ['month','quarter','year']
categorical_transformers = Pipeline(
    steps=[('scaler_2', StandardScaler())]
)

preprocessor = ColumnTransformer(
    transformers = [
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformers, categorical_features)
    ]
)

pipeline = Pipeline(
    steps=[('preprocessor', preprocessor)]
)

Train Test Split

Instead of randomly splitting the train and testing sets, we will instead choose to hold the last 20% of spot prices for testing. This mirrors the real world where the model needs to predict only future values.

natural_gas_inf = natural_gas_inf.sort_values(by='date').reset_index(drop=True)
n_test = int(len(natural_gas_inf) * 0.2)  # hold out the last 20% of weeks
train = natural_gas_inf.iloc[:len(natural_gas_inf) - n_test]
test = natural_gas_inf.iloc[len(natural_gas_inf) - n_test:]

print('Training Set: ', len(train))
print('Testing Set: ', len(test))
print('Split Percent: ', len(test)/(len(train)+len(test))*100,'%')
Training Set:  1119
Testing Set:  279
Split Percent:  19.95708154506438 %
train_y = train['spot_price_inf']
test_y = test['spot_price_inf']

train_x = pipeline.fit_transform(train[['4_week_rolling_avg', 'month','quarter','year','date_ordinal']])
test_x = pipeline.transform(test[['4_week_rolling_avg', 'month','quarter','year','date_ordinal']])

Model Evaluation

We can try various regressors from sklearn to fit our training data and predict the testing values. We will use the R2 and Explained Variance for fit metrics and Mean Squared Error (MSE) and Mean Absolute Percentage Error (MAPE) as prediction metrics. We are able to use MAPE because there are no actual values (spot prices) that are zero. We can always expect values greater than zero for both predictions and actuals.
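One caveat worth noting: sklearn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so a reported value of 0.117 means roughly 11.7% average error. A quick sanity check with made-up values:

```python
# sklearn's MAPE is the mean of |y_true - y_pred| / |y_true|, as a fraction.
from sklearn.metrics import mean_absolute_percentage_error

y_true = [2.0, 4.0]
y_pred = [1.0, 5.0]

# errors: |2-1|/2 = 0.5 and |4-5|/4 = 0.25, mean = 0.375 (i.e. 37.5%)
mape = mean_absolute_percentage_error(y_true, y_pred)
print(mape)  # 0.375
```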

from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor # non-negative fits via positive=True
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor

# Note: sklearn's metric functions take (y_true, y_pred) in that order
lr_model = LinearRegression(positive=True)
lr_model.fit(train_x, train_y)
lr_predict = lr_model.predict(test_x)
print('Linear Model')
print('R2 Score: ', r2_score(test_y, lr_predict))
print('Explained Variance: ', explained_variance_score(test_y, lr_predict))
print('MSE: ', mean_squared_error(test_y, lr_predict))
print('MAPE: ', mean_absolute_percentage_error(test_y, lr_predict))
print('')

r_model = Ridge()
r_model.fit(train_x, train_y)
r_predict = r_model.predict(test_x)
print('Ridge Model')
print('R2 Score: ', r2_score(test_y, r_predict))
print('Explained Variance: ', explained_variance_score(test_y, r_predict))
print('MSE: ', mean_squared_error(test_y, r_predict))
print('MAPE: ', mean_absolute_percentage_error(test_y, r_predict))
print('')

sgd_model = SGDRegressor()
sgd_model.fit(train_x, train_y)
sgd_predict = sgd_model.predict(test_x)
print('SGD Model')
print('R2 Score: ', r2_score(test_y, sgd_predict))
print('Explained Variance: ', explained_variance_score(test_y, sgd_predict))
print('MSE: ', mean_squared_error(test_y, sgd_predict))
print('MAPE: ', mean_absolute_percentage_error(test_y, sgd_predict))
print('')

svr_model = SVR(kernel='linear')
svr_model.fit(train_x, train_y)
svr_predict = svr_model.predict(test_x)
print('SVR Model')
print('R2 Score: ', r2_score(test_y, svr_predict))
print('Explained Variance: ', explained_variance_score(test_y, svr_predict))
print('MSE: ', mean_squared_error(test_y, svr_predict))
print('MAPE: ', mean_absolute_percentage_error(test_y, svr_predict))
print('')

dt_model = DecisionTreeRegressor()
dt_model.fit(train_x, train_y)
dt_predict = dt_model.predict(test_x)
print('DT Model')
print('R2 Score: ', r2_score(test_y, dt_predict))
print('Explained Variance: ', explained_variance_score(test_y, dt_predict))
print('MSE: ', mean_squared_error(test_y, dt_predict))
print('MAPE: ', mean_absolute_percentage_error(test_y, dt_predict))
print('')

gbr_model = GradientBoostingRegressor()
gbr_model.fit(train_x, train_y)
gbr_predict = gbr_model.predict(test_x)
print('GBR Model')
print('R2 Score: ', r2_score(test_y, gbr_predict))
print('Explained Variance: ', explained_variance_score(test_y, gbr_predict))
print('MSE: ', mean_squared_error(test_y, gbr_predict))
print('MAPE: ', mean_absolute_percentage_error(test_y, gbr_predict))
print('')

abr_model = AdaBoostRegressor()
abr_model.fit(train_x, train_y)
abr_predict = abr_model.predict(test_x)
print('ABR Model')
print('R2 Score: ', r2_score(test_y, abr_predict))
print('Explained Variance: ', explained_variance_score(test_y, abr_predict))
print('MSE: ', mean_squared_error(test_y, abr_predict))
print('MAPE: ', mean_absolute_percentage_error(test_y, abr_predict))
print('')
Linear Model
R2 Score:  0.8659888839421176
Explained Variance:  0.8659962175215565
MSE:  0.5539276910348315
MAPE:  0.1166230431517775

Ridge Model
R2 Score:  0.8653213114260554
Explained Variance:  0.8653322119911879
MSE:  0.555900354339198
MAPE:  0.11804571330243194

SGD Model
R2 Score:  0.8653333083555871
Explained Variance:  0.8653846962653181
MSE:  0.5551462735997426
MAPE:  0.11684135858066252

SVR Model
R2 Score:  0.8684036743031867
Explained Variance:  0.8686272502812704
MSE:  0.5592030489472042
MAPE:  0.11961196883869504

DT Model
R2 Score:  0.7996463636341851
Explained Variance:  0.8003998955611449
MSE:  0.7359502963035613
MAPE:  0.164906156229634

GBR Model
R2 Score:  0.7829375633679315
Explained Variance:  0.7881995102711575
MSE:  0.6599324656153797
MAPE:  0.13270916177164374

ABR Model
R2 Score:  0.822822387279977
Explained Variance:  0.8344717894122089
MSE:  0.6338089192395995
MAPE:  0.16736115879824623

Overall, the models performed fairly well. The Linear model, specified only as a non-negative linear model, had an R2 score of 0.866 and a MAPE of 0.117 (about 11.7%). The worst performing model was the Gradient Boosted Regression model with an R2 of 0.783 and a MAPE of 0.133, which is still fairly good. The best performing model by R2 was the SVR model with a linear kernel.

We can evaluate the Linear model and the SVR model more in depth.

xy_x = range(0,11,1)
xy_y = range(0,11,1)
fig, ax = plt.subplots()
ax.scatter(lr_predict, test_y)
ax.plot(xy_x, xy_y, color='black')
ax.set_xlim(0)
ax.set_ylim(0)
ax.set_title('Linear Model: Actual vs. Predictions')
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
Text(0, 0.5, 'Actual')

png

xy_x = range(0,11,1)
xy_y = range(0,11,1)
fig, ax = plt.subplots()
ax.scatter(svr_predict, test_y)
ax.plot(xy_x, xy_y, color='black')
ax.set_xlim(0)
ax.set_ylim(0)
ax.set_title('Support Vector Regression Model: Actual vs. Predictions')
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
Text(0, 0.5, 'Actual')

png

When we plot predictions vs. actuals, we see a similar pattern for both models: at lower prices, predictions and actuals are very close, while at higher prices they diverge, with the models split between over-predicting and under-predicting.

Around a predicted value of 4 there appear to be two points very far from the line y=x (where predictions equal actuals), one of which has an actual value much higher than 4.

dates = test['date']

fig, ax = plt.subplots()
ax.plot(dates, test_y, label='Actuals')
ax.plot(dates, lr_predict, label='Predicted')
ax.legend()
ax.set_title('Linear Model: Actuals vs. Predicted')
ax.set_xlabel('Date')
ax.set_ylabel('Price')
Text(0, 0.5, 'Price')

png

fig, ax = plt.subplots()
ax.plot(dates, test_y, label='Actuals')
ax.plot(dates, svr_predict, label='Predicted')
ax.legend()
ax.set_title('SVR Model: Actuals vs. Predicted')
ax.set_xlabel('Date')
ax.set_ylabel('Price')
Text(0, 0.5, 'Price')

png

lr_model.coef_
array([1.79160711e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.34136661e-03])
svr_model.coef_
array([[ 1.81786274, -0.02265806,  0.05588361, -0.06351241,  0.01449046]])

Plotting the predictions and actuals over time shows exactly what this model is doing. The coefficients show the model relying almost solely on the 4 week rolling average to make its predictions, meaning the predictions are a time-shifted, smoothed version of the 4 week moving average. Overall this produces a fairly good fit, but the predictions can be wildly off when there is a large spike in price, like the spike around early 2021.
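To illustrate why such a model lags behind sudden moves, consider a synthetic flat series with a single spike (made-up values, not the project's data): the shifted 4-week rolling mean used as the feature cannot see the spike until the following week.

```python
# Why a rolling-average feature lags spikes: a synthetic illustration.
import pandas as pd

price = pd.Series([3.0] * 8 + [9.0] + [3.0] * 7)  # spike at index 8

# 4-week rolling mean excluding the current week (as in the feature engineering)
feat = price.rolling(4).mean().shift(1)

# At the spike itself the feature still reflects the flat history...
print(feat[8])   # 3.0
# ...and the spike only enters the feature afterwards.
print(feat[9])   # 4.5
```

A model leaning on this feature therefore reacts to a price spike one week late, and the spike's effect lingers in the feature for the next four weeks.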

ARMA and ARIMA

Instead of making a prediction based on the 4 week moving average (or any other moving average), we can instead use Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA) modeling techniques to predict this data.

References:

  • https://www.machinelearningplus.com/time-series/time-series-analysis-python/
  • https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict
from statsmodels.tsa.seasonal import seasonal_decompose
train_dates = pd.DatetimeIndex(train['date'].values, freq='infer')
train_values = train['spot_price_inf'].values.astype(float)
ts = pd.DataFrame(train_values, index=train_dates, columns=['spot_price_inf'])
result_mul = seasonal_decompose(ts.asfreq('W-FRI').dropna(), model='multiplicative', period=4)
result_add = seasonal_decompose(ts.asfreq('W-FRI').dropna(), model='additive', period=4)

result_mul.plot().suptitle('Multiplicative', y=1)
result_add.plot().suptitle('Additive', y=1)
Text(0.5, 1, 'Additive')

png

png

We can see from the trend decomposition graphs that there do not seem to be any obvious seasonal trends, but we can still fit the ARIMA models and plot the predictions.

We can fit the model with a (1,0,1) order, i.e. a first order autoregressive term and a first order moving average term with no differencing.

train_dates = pd.DatetimeIndex(train['date'].values).to_period('W-FRI')
train_values = train['spot_price_inf'].values.astype(float)
ts = pd.Series(train_values, index=train_dates)
arma_model = ARIMA(ts, order=(1,0,1))
arma_res = arma_model.fit()
print(arma_res.summary())

fig, ax = plt.subplots()
ax.plot(test['date'], test['spot_price_inf'], label='actual')
fig = plot_predict(arma_res, start='2018-07-13', end='2023-11-10', ax=ax)
legend = ax.legend()
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                 1119
Model:                 ARIMA(1, 0, 1)   Log Likelihood                -220.281
Date:                Sun, 24 Dec 2023   AIC                            448.563
Time:                        11:24:46   BIC                            468.644
Sample:                    01-17-1997   HQIC                           456.153
                         - 06-29-2018                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.7732      0.917      3.025      0.002       0.977       4.570
ar.L1          0.9846      0.004    265.261      0.000       0.977       0.992
ma.L1          0.0773      0.011      6.905      0.000       0.055       0.099
sigma2         0.0865      0.001     75.415      0.000       0.084       0.089
===================================================================================
Ljung-Box (L1) (Q):                   0.02   Jarque-Bera (JB):             25231.90
Prob(Q):                              0.90   Prob(JB):                         0.00
Heteroskedasticity (H):               1.09   Skew:                             0.27
Prob(H) (two-sided):                  0.42   Kurtosis:                        26.26
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

(figure: ARMA forecast with confidence interval vs. actual prices, 2018–2023)

Unfortunately, since the ARIMA model could not capture the trend in the data, the forecast is very basic and the confidence interval fails to capture the price spikes around 2022–2023.
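One standard remedy is to let the model difference the series, i.e. use ARIMA(p, 1, q) instead of d = 0, so that a drifting level becomes a stationary series of changes. A minimal sketch of what first-differencing does, on a toy pandas Series (the values are illustrative only, not from the dataset):

```python
import pandas as pd

# Toy price series with an upward trend (illustrative values only).
prices = pd.Series([2.0, 2.5, 3.25, 4.25, 5.5])

# First differences: the week-over-week changes. Fitting an ARMA model to
# this differenced series is equivalent to fitting ARIMA(p, 1, q) to the
# original series, letting the model track a drifting level.
diffs = prices.diff().dropna()
print(diffs.tolist())  # [0.5, 0.75, 1.0, 1.25] -- the trend shows as growing steps
```

In statsmodels this corresponds to passing `order=(1, 1, 1)` rather than `order=(1, 0, 1)` when constructing the ARIMA model.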

XGBoost

We can try one more modeling technique, XGBoost. XGBoost is another ensemble method and can be used for time series forecasting.

References:

  • https://machinelearningmastery.com/xgboost-for-time-series-forecasting/
  • https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
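As the second reference describes, a time series can be reframed as a supervised-learning problem by creating lag features with `shift`. A minimal, hypothetical sketch (the column names here are illustrative, not the ones used in this project):

```python
import pandas as pd

# Toy weekly series (illustrative values only).
df = pd.DataFrame({'spot_price': [2.0, 2.1, 2.4, 2.2, 2.6]})

# Each row's features are the previous weeks' prices; the target is the
# current week's price.
df['lag_1'] = df['spot_price'].shift(1)
df['lag_2'] = df['spot_price'].shift(2)

# The first rows have no history, so drop them before training.
supervised = df.dropna()
print(supervised)
```

Any regressor, XGBoost included, can then be fit on the lag columns as features and `spot_price` as the target.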
import xgboost
xgb_model = xgboost.XGBRegressor(objective='reg:squarederror', n_estimators=1000)
xgb_model.fit(train_x, train_y)
xgb_predict = xgb_model.predict(test_x)

print('XGB Model')
print('R2 Score: ', r2_score(xgb_predict, test_y))
print('Explained Variance: ', explained_variance_score(xgb_predict, test_y))
print('MSE: ', mean_squared_error(xgb_predict, test_y))
print('MAPE: ', mean_absolute_percentage_error(xgb_predict, test_y))
print('')
XGB Model
R2 Score:  0.3291877891073315
Explained Variance:  0.3750278195181632
MSE:  1.1987879121383782
MAPE:  0.20415018788363182
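Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so the XGBoost MAPE above corresponds to roughly 20%. A quick sketch of the computation on toy numbers:

```python
# MAPE as scikit-learn computes it: mean(|y_true - y_pred| / |y_true|),
# returned as a fraction (multiply by 100 for a percentage).
def mape(y_true, y_pred):
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mape([2.0, 4.0], [1.0, 5.0]))  # 0.375, i.e. 37.5%
```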
xy_x = range(0,11,1)
xy_y = range(0,11,1)
fig, ax = plt.subplots()
ax.scatter(xgb_predict, test_y)
ax.plot(xy_x, xy_y, color='black')
ax.set_xlim(0)
ax.set_ylim(0)
ax.set_title('XGBoost Model: Actual vs. Predictions')
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

(figure: XGBoost Model: Actual vs. Predictions scatter plot)

dates = test['date']

fig, ax = plt.subplots()
ax.plot(dates, test_y, label='Actuals')
ax.plot(dates, xgb_predict, label='Predicted')
ax.legend()
ax.set_title('XGBoost Model: Actuals vs. Predicted')
ax.set_xlabel('Date')
ax.set_ylabel('Price')

(figure: XGBoost Model: Actuals vs. Predicted over time)

We can see that the XGBoost model predicts values that differ substantially from the actuals. The overall R2 and MAPE metrics are much worse than those of the previous models, which is unsurprising given the Predicted vs. Actuals plots.

The model tends to under-predict when the spot price is relatively high and over-predict when it is low.

Actual Future Spot Price Evaluation

With no clear consensus on which model would perform best, we can examine the Linear, SVR, and XGBoost predictions for the future values.

A special prediction function must be created to forecast multiple periods into the future while taking previous-period predictions into account. This is done by iteratively predicting each time step and treating the model's prediction as the 'actual' value in subsequent steps.

# Prediction Function (4 Week Rolling Average)
def predict_prices(period, model, pipeline, org_data):
    """Iteratively predict `period` future weeks, feeding each prediction
    back in so later rolling averages can use it."""
    copy_data = org_data.copy()
    copy_data = copy_data[['date', 'spot_price', '4_week_rolling_avg', 'month', 'quarter', 'year', 'date_ordinal']]
    copy_data['prediction/actual'] = 'actual'
    # Build the future weekly dates and their ordinal encodings.
    max_date = copy_data['date'].max()
    max_date_ordinal = copy_data['date_ordinal'].max()
    date_dict = {}
    for i in range(period):
        add = (i + 1) * 7
        date_dict[max_date_ordinal + add] = max_date + timedelta(days=add)
    for key in date_dict:
        # Calendar features for the future week.
        c_month = date_dict[key].month
        c_quarter = date_dict[key].quarter
        c_year = date_dict[key].year
        # Rolling average over the preceding 4 weeks, which may now include
        # earlier predictions appended below.
        c_4_week_avg = copy_data[(copy_data['date_ordinal'] < key) &
                                 (copy_data['date_ordinal'] >= (key - 28))]['spot_price'].mean()
        d = {'date': date_dict[key], '4_week_rolling_avg': c_4_week_avg,
             'month': c_month, 'quarter': c_quarter, 'year': c_year, 'date_ordinal': key}
        temp_df = pd.DataFrame(d, index=[0])
        test_x = pipeline.transform(temp_df)
        predict = model.predict(test_x)
        # Append the prediction as if it were an actual for subsequent steps.
        d2 = {**d, 'spot_price': round(predict[0], 2), 'prediction/actual': 'prediction'}
        add_df = pd.DataFrame(d2, index=[0])
        copy_data = pd.concat([copy_data, add_df], ignore_index=True)
    return copy_data
predict_prices(2, xgb_model, pipeline, natural_gas_inf)
|      | date       | spot_price | 4_week_rolling_avg | month | quarter | year | date_ordinal | prediction/actual |
|------|------------|------------|--------------------|-------|---------|------|--------------|-------------------|
| 0    | 1997-01-17 | 4.19       | 1.497475           | 1     | 1       | 1997 | 729041       | actual            |
| 1    | 1997-01-24 | 2.98       | 1.576498           | 1     | 1       | 1997 | 729048       | actual            |
| 2    | 1997-01-31 | 2.91       | 1.443477           | 1     | 1       | 1997 | 729055       | actual            |
| 3    | 1997-02-07 | 2.53       | 1.370052           | 2     | 1       | 1997 | 729062       | actual            |
| 4    | 1997-02-14 | 2.3        | 1.245150           | 2     | 1       | 1997 | 729069       | actual            |
| ...  | ...        | ...        | ...                | ...   | ...     | ...  | ...          | ...               |
| 1396 | 2023-10-27 | 2.89       | 2.928951           | 10    | 4       | 2023 | 738820       | actual            |
| 1397 | 2023-11-03 | 3.17       | 2.965000           | 11    | 4       | 2023 | 738827       | actual            |
| 1398 | 2023-11-10 | 2.46       | 3.030000           | 11    | 4       | 2023 | 738834       | actual            |
| 1399 | 2023-11-17 | 2.43       | 2.845000           | 11    | 4       | 2023 | 738841       | prediction        |
| 1400 | 2023-11-24 | 2.46       | 2.737500           | 11    | 4       | 2023 | 738848       | prediction        |

1401 rows × 8 columns

We can see the output is a dataframe containing both actual and predicted values. While only the predicted rows are strictly needed, we can return the full dataframe, as it is fairly small.

We can now assess the predictions against the actual week-ending spot prices. We could re-download the data, but because we are only looking at a few periods, we can copy the values directly from the spot price website previously cited.

y_actual = {'date':[date(2023,11,17),date(2023,11,24),date(2023,12,1),date(2023,12,8),date(2023,12,15)],
            'spot_price':[2.74,2.64,2.72,2.62,2.38], 'prediction/actual':['actual','actual','actual','actual','actual']}
y_actual_df = pd.DataFrame(y_actual)
y_actual_df['date'] = y_actual_df['date'].apply(lambda x: pd.to_datetime(x))
y_actual_df
|   | date       | spot_price | prediction/actual |
|---|------------|------------|-------------------|
| 0 | 2023-11-17 | 2.74       | actual            |
| 1 | 2023-11-24 | 2.64       | actual            |
| 2 | 2023-12-01 | 2.72       | actual            |
| 3 | 2023-12-08 | 2.62       | actual            |
| 4 | 2023-12-15 | 2.38       | actual            |

We can now predict the next five weeks of data with the Linear Regression, SVR, and XGBoost models.

lr_predict_df = predict_prices(5, lr_model, pipeline, natural_gas_inf)
svr_predict_df = predict_prices(5, svr_model, pipeline, natural_gas_inf)
xgb_predict_df = predict_prices(5, xgb_model, pipeline, natural_gas_inf)

lr_predict_df_full = pd.concat([lr_predict_df,y_actual_df])
svr_predict_df_full = pd.concat([svr_predict_df,y_actual_df])
xgb_predict_df_full = pd.concat([xgb_predict_df,y_actual_df])
fig, ax = plt.subplots()
df = lr_predict_df_full[lr_predict_df_full['date'] >= pd.to_datetime(date(2023,1,1))]
ax.plot(df[df['prediction/actual']=='actual']['date'], df[df['prediction/actual']=='actual']['spot_price'], label='actual')
ax.plot(df[df['prediction/actual']=='prediction']['date'], df[df['prediction/actual']=='prediction']['spot_price'], label='prediction')
ax.legend()
ax.set_title('Linear Model: Actuals vs. Predicted')
ax.set_xlabel('Date')
ax.set_ylabel('Price')

y_actuals = y_actual_df['spot_price'].values
y_predict = lr_predict_df[lr_predict_df['prediction/actual']=='prediction']['spot_price'].values

print('Mean Squared Error: %.2f'%(mean_squared_error(y_actuals, y_predict)))
print('Mean Absolute Percentage Error: %.2f'%(mean_absolute_percentage_error(y_actuals, y_predict)))
Mean Squared Error: 0.06
Mean Absolute Percentage Error: 0.08

(figure: Linear Model: Actuals vs. Predicted)

fig, ax = plt.subplots()
df = svr_predict_df_full[svr_predict_df_full['date'] >= pd.to_datetime(date(2023,1,1))]
ax.plot(df[df['prediction/actual']=='actual']['date'], df[df['prediction/actual']=='actual']['spot_price'], label='actual')
ax.plot(df[df['prediction/actual']=='prediction']['date'], df[df['prediction/actual']=='prediction']['spot_price'], label='prediction')
ax.legend()
ax.set_title('SVR Model: Actuals vs. Predicted')
ax.set_xlabel('Date')
ax.set_ylabel('Price')

y_actuals = y_actual_df['spot_price'].values
y_predict = svr_predict_df[svr_predict_df['prediction/actual']=='prediction']['spot_price'].values

print('Mean Squared Error: %.2f'%(mean_squared_error(y_actuals, y_predict)))
print('Mean Absolute Percentage Error: %.2f'%(mean_absolute_percentage_error(y_actuals, y_predict)))
Mean Squared Error: 0.03
Mean Absolute Percentage Error: 0.05

(figure: SVR Model: Actuals vs. Predicted)

fig, ax = plt.subplots()
df = xgb_predict_df_full[xgb_predict_df_full['date'] >= pd.to_datetime(date(2023,1,1))]
ax.plot(df[df['prediction/actual']=='actual']['date'], df[df['prediction/actual']=='actual']['spot_price'], label='actual')
ax.plot(df[df['prediction/actual']=='prediction']['date'], df[df['prediction/actual']=='prediction']['spot_price'], label='prediction')
ax.legend()
ax.set_title('XGBoost Model: Actuals vs. Predicted')
ax.set_xlabel('Date')
ax.set_ylabel('Price')

y_actuals = y_actual_df['spot_price'].values
y_predict = xgb_predict_df[xgb_predict_df['prediction/actual']=='prediction']['spot_price'].values

print('Mean Squared Error: %.2f'%(mean_squared_error(y_actuals, y_predict)))
print('Mean Absolute Percentage Error: %.2f'%(mean_absolute_percentage_error(y_actuals, y_predict)))
Mean Squared Error: 0.11
Mean Absolute Percentage Error: 0.11

(figure: XGBoost Model: Actuals vs. Predicted)
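The three evaluation blocks above repeat the same metric computation. A small helper could reduce the duplication; this is a hypothetical sketch using plain Python (the project itself uses scikit-learn's metric functions):

```python
def evaluate_predictions(y_actuals, y_predict):
    """Return (MSE, MAPE-as-a-fraction) for paired actual/predicted values."""
    n = len(y_actuals)
    mse = sum((a - p) ** 2 for a, p in zip(y_actuals, y_predict)) / n
    mape = sum(abs(a - p) / abs(a) for a, p in zip(y_actuals, y_predict)) / n
    return mse, mape

# Toy usage (illustrative numbers, not the project's results):
mse, mape = evaluate_predictions([2.0, 4.0], [2.5, 3.0])
print('Mean Squared Error: %.2f' % mse)
print('Mean Absolute Percentage Error: %.2f' % mape)
```

Each model's dataframe could then be passed through the same helper, keeping the three comparisons consistent by construction.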

Conclusion and Reflections

Overall, the SVR model did the best at predicting prices over multiple future periods. This is not surprising, given that it also performed best against the testing data. While the first two models over-predicted the spot price, the XGBoost model under-predicted it. However, XGBoost came closest during the final period, when the spot price dropped sharply.

This project was very interesting to complete. There is a lot to account for when predicting time series data, including how past data should be incorporated and how to predict multiple periods into the future. I am happy with the skills I gained from this project, even if the overall model results could be better.

Future improvements could include testing various rolling-average window lengths or applying more advanced time series techniques, as the first two models are too dependent on the single rolling-average spot price column. Additionally, finding a way to output confidence intervals, similar to the ARMA model, would be helpful, as it is unlikely these models will predict the exact spot price.
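For the first suggested improvement, several rolling-average window lengths can be generated at once with pandas; a minimal sketch (the window lengths here are arbitrary examples):

```python
import pandas as pd

# Toy weekly price series (illustrative values only).
df = pd.DataFrame({'spot_price': [2.0, 2.5, 3.0, 2.5, 2.0, 3.0]})

# Add one rolling-mean feature per candidate window length.
for window in (2, 4):
    df[f'{window}_week_rolling_avg'] = df['spot_price'].rolling(window).mean()

print(df)
```

Note that `rolling(n).mean()` includes the current week; a feature built only from the preceding weeks, as in the prediction function above, would correspond to appending `.shift(1)` to each rolling mean.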