Analyzing 24 years of wildfire data in the US

Characterizing the differences over the years and between states

11 min read · Feb 11, 2021

*Mike McMillan — U.S. Forest Service/U.S. Department of Agriculture*

Given the most recent California wildfire season, many people assume that wildfires in the US are getting much worse. Are wildfires more frequent? Larger? Predictable? Using data collected from almost 2 million wildfires in the United States from 1992 to 2015, I will attempt to answer the following 5 questions:

Was there an increase in the number of wildfires between 1992 and 2015?
Was there an increase in the number of acres burned between 1992 and 2015?
How does the wildfire season vary between states?
Can states be clustered based on their annual wildfire distribution?
Can the duration of wildfires be predicted?

Wildfire dataset

The data used in this study contains 24 years of geo-referenced wildfire data and was downloaded from Kaggle (https://www.kaggle.com/rtatman/188-million-us-wildfires). It includes information on 1.88 million wildfires from 1992 to 2015, representing over 140 million acres burned. Each wildfire has about 40 recorded fields, including state and county information, positional data, date and time of fire discovery and containment, and fire size. Basic data cleaning had been performed to ensure the data conformed to standards and to remove any duplicate records. However, many of the records are incomplete, with some missing fields. The data fields, along with the number of entries in each, are summarized below. The fields in blue are the ones I used for my analysis. Note that the fields DISCOVERY_TIME (time of day the fire was discovered), CONT_DATE (date of fire containment), CONT_DOY (day of year of containment), and CONT_TIME (time of day of containment) are populated for only about half of the records. These fields were only used for the final question (predicting fire duration). To answer this question, I selected the subset of records that contained entries for all of these fields.

After an initial visualization of the data, I decided to focus my analysis on the lower 48 states (plus DC). I thought that the outlying locations of Alaska, Hawaii, and Puerto Rico (with their differing climates and potentially different fire patterns) would compromise my ability to distill useful information from the data. I was left with 1.83 million records from the contiguous United States. The fires are visualized in the map below, with the colour showing the day of discovery and size representing the acres burned.
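A minimal sketch of this filtering step in pandas. It assumes the fires are loaded into a DataFrame with a two-letter `STATE` column, as in the Kaggle table; the helper name and the tiny sample are made up for illustration.

```python
import pandas as pd

# Regions left out of the analysis: Alaska, Hawaii and Puerto Rico.
EXCLUDED = {"AK", "HI", "PR"}

def contiguous_only(fires: pd.DataFrame) -> pd.DataFrame:
    """Keep only fires from the lower 48 states plus DC."""
    return fires[~fires["STATE"].isin(EXCLUDED)].copy()

# Made-up sample rows, not real records.
sample = pd.DataFrame({
    "STATE": ["CA", "AK", "TX", "HI", "PR", "DC"],
    "FIRE_SIZE": [10.0, 500.0, 2.5, 1.0, 0.3, 0.1],
})
lower48 = contiguous_only(sample)
print(lower48["STATE"].tolist())
```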

I find this map fascinating. You can immediately see some differences in wildfire season: the early-year season of the east in blue contrasts with the summer season of the west in orange. You can also see the western states have more large fires than the east. In addition, you can see many geographical features of the US: the thick forests of the east and south, the Mississippi River, the agriculture-rich mid-west, and the deserts of California and Nevada, surrounded by forested mountainous areas. Major transportation routes are also visible, particularly in the southwest. Unfortunately, you can also see what appear to be some data quality issues at state boundaries. For example, Pennsylvania seems to be under-reporting fires compared to its neighbours.

Methods and Tools

The analysis and visualization were done using a combination of Tableau and Python. I used Tableau for the simple aggregations (summing wildfires/year, wildfires/state/year) and corresponding visualizations. I also used Tableau for all of the map displays, which often involved exporting results from Python. The more involved aggregations, radial plots and clustering/classification were done in Python using JupyterLab, taking advantage of the pandas, matplotlib, numpy, scipy, and sklearn libraries.

Was there an increase in the number of wildfires between 1992 and 2015?

Based on the recent severe wildfire season, I was curious whether there has been a significant increase in wildfires over the 24 years of data. I first summed the number of wildfires for each year. The graph below shows the results.

On the left side of the plot, I am displaying the number of wildfires for each year, along with the mean and its 95% confidence interval. Note the large variance between fire years, with nearly twice as many fires in some years compared to others. This variation is due mainly to differing weather, with hotter and drier years tending to have far more wildfires than wetter and cooler years. The line graph shows the number of wildfires plotted by year in blue along with a trendline in grey. The trendline has a p-value of 0.47, meaning there was no significant change in the number of wildfires between 1992 and 2015. I computed the trendlines for each state and have summarized the results in the map below.
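For readers who want to reproduce a trend test like this in Python, here is a minimal sketch. It assumes the trendline is an ordinary least-squares line through the yearly totals (the article does not state the exact model) and that the year is stored in a `FIRE_YEAR` column. The function name and the sample counts are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

def yearly_trend(fires, value=None):
    """Fit a straight line to yearly fire counts (or yearly sums of `value`).

    Returns the slope and the two-sided p-value for "slope is zero".
    """
    grouped = fires.groupby("FIRE_YEAR")
    yearly = grouped.size() if value is None else grouped[value].sum()
    fit = linregress(yearly.index.to_numpy(dtype=float),
                     yearly.to_numpy(dtype=float))
    return fit.slope, fit.pvalue

# Made-up counts: a steady rise of 5 fires per year plus an alternating wiggle.
years = np.arange(1992, 2016)
step = np.arange(24)
counts = 100 + 5 * step + np.where(step % 2 == 0, 10, -10)
sample = pd.DataFrame({"FIRE_YEAR": np.repeat(years, counts)})

slope, p_value = yearly_trend(sample)
print(f"slope = {slope:.2f} fires/year, p = {p_value:.3g}")
```

A p-value below the chosen threshold (0.05 is the usual convention) would be read as a significant trend; a value like the 0.47 reported above would not.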

There are 14 states that showed a significant increase in the number of wildfires and 10 states that showed a significant decrease. Pennsylvania was one of the states that showed a significant increase in wildfires. However, based on the chart below, I think this increase is suspicious. There are very few fires recorded before 2002. Combined with my observation on the overall fire map, I think it is likely that Pennsylvania under-reported fires before 2002. When the fire data from 1992 to 2001 are excluded, Pennsylvania shows no significant change in the number of wildfires.

Was there an increase in the number of acres burned between 1992 and 2015?

For this question I ran a similar analysis to the previous one, summing the number of acres burned instead of counting the number of wildfires. The graph of the mean and confidence interval along with the acres burned by year is shown below.

The distribution of acres burned shows more variability than the number of wildfires. The peaks and troughs are roughly correlated as you would expect, given that there are likely to be more acres burned in years with more wildfires. However, there is a much larger increase over time, resulting in a trendline with a p-value of 0.045. There was a significant increase in acres burned due to wildfires between the years of 1992 and 2015. I also computed the trendlines for all states with the results shown on the map below.

There were 9 states with significant increases in the number of acres burned and only 3 states with a significant decrease. Interestingly, there were fewer states with an increase in acres burned than states with an increase in number of wildfires. I think this can be attributed to the higher variance in acres burned, resulting in higher p-values and less chance of the trendline being significant.
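The per-state version of the trend test could be sketched as follows. This assumes an ordinary least-squares fit per state, a 0.05 significance threshold, and the `STATE`, `FIRE_YEAR` and `FIRE_SIZE` columns of the Kaggle table. None of these choices is confirmed by the article, and the sample values are invented.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

def classify_state_trends(fires, alpha=0.05):
    """Label each state's yearly acres-burned trend."""
    # One row per year, one column per state; years with no fires count as 0 acres.
    acres = fires.pivot_table(index="FIRE_YEAR", columns="STATE",
                              values="FIRE_SIZE", aggfunc="sum", fill_value=0.0)
    years = acres.index.to_numpy(dtype=float)
    labels = {}
    for state in acres.columns:
        fit = linregress(years, acres[state].to_numpy(dtype=float))
        if fit.pvalue >= alpha:
            labels[state] = "no significant change"
        else:
            labels[state] = "increase" if fit.slope > 0 else "decrease"
    return pd.Series(labels, name="trend")

# Made-up yearly totals: one rising, one falling and one flat-but-noisy state.
years = np.arange(2000, 2010)
step = np.arange(10)
wiggle = np.where(step % 2 == 0, 10.0, -10.0)
sample = pd.concat([
    pd.DataFrame({"STATE": "ID", "FIRE_YEAR": years, "FIRE_SIZE": 100 + 20 * step + wiggle}),
    pd.DataFrame({"STATE": "FL", "FIRE_YEAR": years, "FIRE_SIZE": 300 - 20 * step + wiggle}),
    pd.DataFrame({"STATE": "OH", "FIRE_YEAR": years, "FIRE_SIZE": 50 + wiggle}),
], ignore_index=True)

trends = classify_state_trends(sample)
print(trends)
```

The noisy flat state illustrates the point made above: the more a yearly series bounces around, the harder it is for a trend to reach significance.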

How does the wildfire season vary between states?

Looking at the map of all the wildfires, it quickly becomes apparent that the wildfire season is variable across the country. To understand these differences, I summed the number of wildfires by day of discovery for each state. The map below shows each state coloured by the month with the most wildfires. For six representative states (Idaho, California, Texas, Minnesota, Maine and Florida) I have displayed the daily wildfire distribution on a radial plot, with the radius showing the percent of annual wildfires on each day of the year.

This map clearly shows the east-west divide in terms of wildfire season. The eastern states mainly have the highest number of wildfires between March and May. The western states have the highest number of wildfires in either July or August. Not a single state has June as its dominant month. The radial plots also reveal some interesting information. The northern states tend to have shorter wildfire seasons, with the majority of fires occurring over 1 or 2 months. The southern states have longer wildfire seasons. California exhibits a long single season from June to October, while Texas and Florida have two distinct seasons, one in the early spring and one in the summer. Texas has two spikes in its distribution, one around New Year’s Day and the other around July 4th. These spikes are presumably related to fireworks, a well-known cause of wildfires. Texas is also the only state with the most wildfires in January. From the radial plot, it looks like Texas would have the most wildfires in either February or March if the effect of the fireworks were removed, which would be more consistent with its neighbouring states.
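The two aggregations behind this section (percent of a state's fires on each day of the year, and the month with the most fires) could be sketched in pandas as below. The column names `STATE`, `FIRE_YEAR` and `DISCOVERY_DOY` are assumed from the Kaggle table, the function names are hypothetical, and the sample rows are invented.

```python
import pandas as pd

def daily_distribution(fires):
    """Rows: states; columns: day of year 1..366; values: % of the state's fires."""
    counts = (fires.groupby(["STATE", "DISCOVERY_DOY"]).size()
                   .unstack(fill_value=0)
                   .reindex(columns=range(1, 367), fill_value=0))
    return counts.div(counts.sum(axis=1), axis=0) * 100.0

def peak_month(fires):
    """Month (1-12) with the most discovered fires for each state."""
    # Rebuild a calendar date from the year and the day of year.
    dates = (pd.to_datetime(fires["FIRE_YEAR"].astype(str), format="%Y")
             + pd.to_timedelta(fires["DISCOVERY_DOY"] - 1, unit="D"))
    months = dates.dt.month
    return months.groupby(fires["STATE"]).agg(lambda m: m.value_counts().idxmax())

# Made-up rows: a summer-season state and a spring-season state.
sample = pd.DataFrame({
    "STATE": ["ID", "ID", "ID", "FL", "FL", "FL", "FL"],
    "FIRE_YEAR": [2010] * 7,
    "DISCOVERY_DOY": [200, 210, 215, 70, 80, 85, 150],
})
dist = daily_distribution(sample)
peaks = peak_month(sample)
print(peaks)
```

Each row of `dist` sums to 100, so states with very different fire counts can be compared on the same radial axis.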

Can states be clustered based on their wildfire season?

The daily wildfire distribution between states revealed interesting differences in wildfire season. I used these data as input to attempt to cluster the states. There appear to be two obvious clusters (east and west), but I was hoping to divide the states into 3 or 4 useful clusters without too many outliers.

The input for clustering was the percentage of wildfires for each day of the year for each state. I used two methods: k-means and hierarchical clustering. Both methods produced the same 3 stable clusters, as shown below.

When looking at the map of most wildfires by month in the previous section, the break between the north and south clusters is not so obvious. However, when looking at the dendrogram below, the three clusters are very clearly defined. The average daily wildfire distributions also show clear differences between clusters. The south, with its earlier spring, shows peak wildfires in March with a primary season between February and April. The north, with its later spring, has the majority of wildfires in April. And the west shows a strong summer wildfire season in July and August.

I used a distance of 80 to define my hierarchical clusters. The only outlier is the District of Columbia. Given its urban setting, small area, and small number of wildfires, I expected DC (along with other small northeastern states) to be an outlier. DC should be part of the south cluster as it is separated from the north by Maryland (~100 km). Excluding DC, the composition of the clusters is very well defined. I tried to tease out a few more clusters by using selection distances of 30 to 40 and by running k-means with 4, 5, or 6 clusters. The south started to break into 2 sensible clusters but the north and west did not separate so nicely. I decided that these 3 clusters provided the most reasonable separation.
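To illustrate the comparison between the two clustering methods, here is a small sketch on synthetic data. Everything in it is an assumption. The seasonal profiles are made-up bell curves, not the real state distributions. Ward linkage is used because the article does not name a linkage method. The tree is cut by asking for 3 clusters, because a distance threshold such as 80 only has meaning for the original data and linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

days = np.arange(1, 367)

def seasonal_profile(peak_day, width=20.0):
    """Made-up bell-shaped fire season, normalized to sum to 100 percent."""
    shape = np.exp(-0.5 * ((days - peak_day) / width) ** 2)
    return 100.0 * shape / shape.sum()

# Three invented groups of five "states" peaking near days 70, 110 and 210.
offsets = [-4, -2, 0, 2, 4]
true_groups = np.repeat([0, 1, 2], len(offsets))
profiles = np.array([seasonal_profile(base + off)
                     for base in (70, 110, 210) for off in offsets])

# k-means on the 366-dimensional daily profiles.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profiles)

# Hierarchical clustering: build the tree, then cut it into 3 clusters.
tree = linkage(profiles, method="ward")
hier_labels = fcluster(tree, t=3, criterion="maxclust")

# An adjusted Rand index of 1.0 means the two partitions are identical.
agreement = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"agreement between methods: {agreement:.2f}")
```

The adjusted Rand index is one convenient way to check that two methods produce the same grouping, since the cluster numbers themselves are arbitrary.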

Can the duration of wildfires be predicted?

I decided to try to predict the duration of the wildfires from other features in the dataset. The first step was to create a duration for each wildfire. Note that by duration I am referring to the time from the discovery of the fire to when it was contained (not extinguished). This dataset contains information on the day and time of discovery and containment of the wildfires. From these fields, I calculated the duration of each wildfire. Not every fire contained all the necessary fields, with some fires not having a discovery or containment time and others having both set to the same time. I removed these fires and was left with around 770 thousand fires. These fires were then labeled into 4 similarly sized groups as follows:

I split the data into training (80%) and testing datasets (20%). I selected features that I thought could be predictive. After running several tests, I settled on the following features to predict the duration:

Fire Size: This was the most obvious feature to use. Initially I was worried this would be too predictive. However, when used in isolation it was not a very strong predictor of duration. Also, Fire Size was strongly right skewed so I applied a log transformation to produce a more normal distribution.
Discovery Day of Year: Based on the above analysis of day of year, I thought this feature would be useful. It should help explain differences between fires that occur in wildfire season and those that do not, as the resources to contain fires may be better prepared in season or may be more available out of season. Also, fires out of season may be easier to contain given less than ideal burning conditions.
Latitude and Longitude: These features are useful in several ways. They account for spatial variation in the conditions that cause the wildfires. But they also help explain differences in wildfire response between jurisdictions at the state and county levels.
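The preparation steps in this section (computing a duration, dropping unusable records, labeling 4 similarly sized groups, log-transforming fire size) could be sketched as below. Several things are assumptions rather than the article's actual code. `DISCOVERY_DATE` and `CONT_DATE` are taken to be day numbers and the `*_TIME` fields to be `HHMM` strings. The groups are formed with quartiles. The natural log is used for the transform. The derived column names are hypothetical and the sample rows are invented.

```python
import numpy as np
import pandas as pd

def hhmm_to_hours(hhmm):
    """Convert 'HHMM' values such as '1330' to fractional hours (13.5)."""
    t = pd.to_numeric(hhmm, errors="coerce")
    return (t // 100) + (t % 100) / 60.0

def add_duration_features(fires, n_groups=4):
    needed = ["DISCOVERY_DATE", "DISCOVERY_TIME", "CONT_DATE", "CONT_TIME"]
    df = fires.dropna(subset=needed).copy()
    df["DURATION_HOURS"] = ((df["CONT_DATE"] - df["DISCOVERY_DATE"]) * 24.0
                            + hhmm_to_hours(df["CONT_TIME"])
                            - hhmm_to_hours(df["DISCOVERY_TIME"]))
    # Drop fires whose discovery and containment times coincide (or are invalid).
    df = df[df["DURATION_HOURS"] > 0].copy()
    # Quantile bins give similarly sized groups, labeled 1..n_groups.
    df["DURATION_LABEL"] = pd.qcut(df["DURATION_HOURS"], q=n_groups,
                                   labels=False, duplicates="drop") + 1
    # Log transform to reduce the right skew (assumes FIRE_SIZE > 0).
    df["LOG_FIRE_SIZE"] = np.log(df["FIRE_SIZE"])
    return df

# Made-up records: 8 usable fires, one missing a time, one with zero duration.
day = 2455000.5
sample = pd.DataFrame({
    "DISCOVERY_DATE": [day] * 10,
    "DISCOVERY_TIME": ["1200"] * 8 + [None, "1200"],
    "CONT_DATE": [day] * 5 + [day + 1, day + 1, day + 2, day, day],
    "CONT_TIME": ["1300", "1400", "1500", "1700", "2000",
                  "0800", "1800", "1400", "1300", "1200"],
    "FIRE_SIZE": [0.1, 0.5, 1.0, 2.0, 5.0, 40.0, 300.0, 5000.0, 1.0, 1.0],
})
prepared = add_duration_features(sample)
print(prepared[["DURATION_HOURS", "DURATION_LABEL"]])
```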

With the training and testing datasets defined and features selected, I tested a variety of classifiers including k-nearest neighbour, naive Bayes, and neural networks. K-nearest neighbour gave by far the best predictions, with an accuracy of 52.3%. While not great given there were only 4 categories, the results show that the duration of wildfires is somewhat predictable. The confusion matrix is displayed below and shows that for each of the 4 categories the correct label was applied far more often than any of the incorrect labels.
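A minimal sketch of the classification workflow with sklearn, run on synthetic data. The number of neighbours and the feature standardization step are my own assumptions (k-nearest neighbour is sensitive to feature scale, but the article does not say whether scaling was applied). The data is random, so the accuracy printed here says nothing about real wildfires.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000
# Synthetic stand-ins for the four features used in the article.
X = np.column_stack([
    rng.normal(0.0, 2.0, n),         # log of fire size
    rng.integers(1, 367, n),         # discovery day of year
    rng.uniform(25.0, 49.0, n),      # latitude
    rng.uniform(-125.0, -67.0, n),   # longitude
])
# Invented duration score loosely tied to size and longitude, cut into 4 groups.
score = X[:, 0] - 0.05 * (X[:, 3] + 96.0) + rng.normal(0.0, 1.0, n)
y = np.digitize(score, np.quantile(score, [0.25, 0.5, 0.75])) + 1

# 80/20 split, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
model.fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = accuracy_score(y_test, pred)
matrix = confusion_matrix(y_test, pred, labels=[1, 2, 3, 4])
print(f"accuracy: {accuracy:.3f}")
print(matrix)  # rows: true label, columns: predicted label
```

With 4 similarly sized classes, random guessing scores about 25%, which is the baseline any reported accuracy should be compared against.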

I was interested in the results at the state level. Decomposing the test data into their states, I produced the map below showing the accuracy for each state. The colormap is centred on overall accuracy of 52.3%, showing states with above average accuracy in orange and states with below average accuracy in blue.

I found this map difficult to interpret at first. Overall, the western states (with the notable exception of Idaho) show below average predictions. The smaller states in the northeast have quite variable results, owing I think to the small number of wildfires in those states. The most notable outlier of the major wildfire states is Wisconsin, which with over 13,000 fires has an accuracy of over 75%. Looking at the distribution of fires in Wisconsin over my 4 label categories, I noted that the duration is skewed strongly towards shorter durations, as shown in the plot below.

Most of Wisconsin’s wildfires are in a single category, which I think explains why Wisconsin’s prediction accuracy was so high. Another state with high prediction accuracy, Idaho, is skewed in the other direction as shown below.

I verified these results with several other outliers. States with a dominant class of fires (either label 1 or 4) all exhibited higher than average accuracy. This is not surprising and highlights the importance of understanding the distribution of your data before drawing conclusions!
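One way to make this check routine is to put each state's accuracy next to the share of its most common true label, which is the accuracy a classifier would get by always guessing that label. The sketch below assumes a results table with hypothetical `STATE`, `label` and `predicted` columns. The rows are invented, with the state codes used only as examples.

```python
import pandas as pd

def accuracy_vs_majority(results):
    """Per state: model accuracy next to the share of the dominant true label."""
    correct = results["label"] == results["predicted"]
    per_state = results.assign(correct=correct).groupby("STATE")
    return pd.DataFrame({
        "n_fires": per_state.size(),
        "accuracy": per_state["correct"].mean(),
        # value_counts sorts descending, so the first entry is the dominant class.
        "majority_share": per_state["label"].agg(
            lambda s: s.value_counts(normalize=True).iloc[0]),
    })

# Made-up predictions: one state dominated by label 1, one evenly spread.
results = pd.DataFrame({
    "STATE":     ["WI", "WI", "WI", "WI", "CA", "CA", "CA", "CA"],
    "label":     [1, 1, 1, 2, 1, 2, 3, 4],
    "predicted": [1, 1, 1, 1, 1, 2, 4, 3],
})
summary = accuracy_vs_majority(results)
print(summary)
```

When `accuracy` is close to `majority_share`, the model is doing little more than predicting the dominant class for that state.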

Conclusion

I started this article with a series of questions. Based on my analysis, here are the key findings:

There was not an increase in the number of wildfires in the contiguous USA between 1992 and 2015.
However, there was a significant increase in the acres burned over this timeframe.
The daily wildfire distribution shows marked differences between the wildfire season across the country (and Texans like their fireworks!)
The daily wildfire distribution can be used to cluster the states into 3 distinct groups with almost no outliers (DC!): west, north and south.
The duration of a wildfire is somewhat predictable, but care needs to be taken when interpreting the results.

Thank you for taking the time to read my article. Hopefully you have learned something along the way. I know I have!