1  Wildfire Spread Prediction

In this use case, we will predict how wildfires spread within 24 hours, using a U-Net neural network architecture.

A casual flick of a burning cigarette butt; a dry forest; strong winds. A wildfire starts, hot enough to be detected by thermal satellites.

Where should the firefighters focus their attention? Which places have to be evacuated? To decide these things, we have to predict where the fire will spread. Wildfire spread depends on many factors, like the extent of the initial fire, the dryness of the surrounding area, the elevation, and much more. A perfect use case for machine learning.

Designing the prediction task

How can we model wildfire spread with machine learning? When using satellites, we get a “fire” map, a binary map that says fire or no fire for each pixel. This requires some pre-processing, since thermal satellites measure heat on a continuous scale, so we have to apply a threshold and remove false positives like burns from industrial plants (e.g., controlled burning of methane). For this use case, however, the pre-processing of the raw satellite images into binary fire maps has already been done.

Since we are interested in modeling wildfire spread over time, we need binary fire maps for two points in time, for example, on day t and on day t+1. The fire map on t+1 is our outcome. So the prediction task looks like this:

However, we don’t only want to predict fire at t+1 from the fires at t, but use additional information, like the elevation profile, the wind strengths, the dryness, and so on. We can spatially represent this information as images of the same size as the fire map at time t.

This makes the input an “image”, or rather a 3-dimensional tensor with the dimensions longitude, latitude, and features. A regular image typically has three feature channels (red, green, and blue), but we will work with many more. The prediction task itself can be seen as a segmentation task, because we segment our tensor into fire vs. no fire at time t+1:
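To make the shapes concrete, here is a minimal sketch with illustrative numbers (the actual image size and channel count follow from the dataset and pre-processing described below):

import numpy as np

# Illustrative shapes only: a 64 x 64 pixel region with 12 input feature channels.
x = np.zeros((64, 64, 12), dtype=np.float32)  # input: fire mask at t plus auxiliary features
y = np.zeros((64, 64, 1), dtype=np.float32)   # output: binary fire mask at t+1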

Next up: Get some data.

The “Next Day Wildfire Spread” data

Fortunately, there is already data available to train such a segmentation model. Huot et al. (2022) collected and published the “Next Day Wildfire Spread” dataset. The dataset is published in multiple places, one being Kaggle. It’s released under a CC-BY 4.0 license, meaning you are allowed to remix and adapt it, even use it commercially, as long as you attribute the authors.

This is what’s inside the Next Day Wildfire Spread dataset:

  • 18,545 fire events across the US.
  • Each fire event has two snapshots, one at time t, one at t+1.
  • The fire events are split into training, validation and testing with a ratio of 8:1:1.
  • Each snapshot covers a 64 x 64 km region at 1 km resolution (64 x 64 pixels).
  • Inputs: previous fire mask, elevation, wind direction, wind speed, min temp, max temp, humidity, precipitation, drought, vegetation, population density, energy release component.
  • Input features were generated with Google Earth Engine.
  • Output: Wildfire on the next day.

Picking a model architecture

I’m using a U-Net architecture, which is a solid choice for segmentation tasks. U-Nets consist of an encoder and a decoder part. The job of the encoder is to distill information from the features; the job of the decoder is to scale it back up to an image. In addition, U-Net architectures have skip connections, meaning earlier, more raw layer information “survives” the encoding and is concatenated in the decoding part with the information pushed through the bottleneck. I think these skip connections are the right inductive bias for wildfire spread prediction, especially for the input wildfire mask at time t: the fire at time t is highly predictive of the fire at time t+1. In fact, simply predicting that the wildfire at t+1 is the same as at t is a good baseline (we will come back to this baseline after evaluating the model).

Thankfully, Huot et al. (2022) have published their code along with the paper and data, so I could build on that. They released their code under the Apache License 2.0, which allows commercial use as well. They used a different model architecture: a residual convolutional neural network with dropout.

Downloading the data

The data is hosted on Kaggle, but can be downloaded without an account. Here is how to download and unzip it. Just change the folder locations as needed.

curl -L -o ~/Downloads/next-day-wildfire-spread.zip \
  https://www.kaggle.com/api/v1/datasets/download/fantineh/next-day-wildfire-spread
unzip ~/Downloads/next-day-wildfire-spread.zip -d ~/Downloads/next-day-wildfire-spread

The files look like this:

The file format .tfrecord is specific to the deep learning library TensorFlow. The files are already split into train, validation, and test. In theory, you could split the data differently, but it’s best to keep this split, since it means you can compare your solution directly with the original paper.
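If you want to peek at what is stored in a single record, a small sketch like the following can help (it assumes the files were unzipped into a local data/next-day-wildfire-spread folder, the same location the loading code below uses):

import tensorflow as tf

# List one training shard and read a single serialized example from it.
files = tf.io.gfile.glob("data/next-day-wildfire-spread/next_day_wildfire_spread_train*")
raw_record = next(iter(tf.data.TFRecordDataset(files[:1])))

# Parse it as a tf.train.Example and print the stored feature keys.
example = tf.train.Example()
example.ParseFromString(raw_record.numpy())
print(sorted(example.features.feature.keys()))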

Load and process the data

This code is adapted from the code accompanying the Next Day Wildfire Spread paper, which is released under the Apache License 2.0 in their GitHub repository.

import re
from typing import Dict, List, Optional, Text, Tuple
import matplotlib.pyplot as plt
from matplotlib import colors
import tensorflow as tf
import numpy as np

tf.random.set_seed(42)
np.random.seed(42)

Before loading the data, we define the feature names and feature scaling parameters.

# Input feature keys (in order): previous fire mask, elevation, wind direction (th),
# wind speed (vs), min temperature (tmmn), max temperature (tmmx), humidity (sph),
# precipitation (pr), drought index (pdsi), vegetation (NDVI), population density,
# energy release component (erc).
INPUT_FEATURES = [
    'PrevFireMask', 'elevation', 'th', 'vs', 'tmmn', 'tmmx', 'sph',
    'pr', 'pdsi', 'NDVI', 'population', 'erc'
]
OUTPUT_FEATURES = ['FireMask']
PREV_FIRE_MASK_INDEX = INPUT_FEATURES.index('PrevFireMask')
SIDE_LENGTH = 32  # side length (in pixels) of the random crops used for training

# Format: (min_clip, max_clip, mean, std)
DATA_STATS = {
    'PrevFireMask': (-1.0, 1.0, 0.0, 1.0),
    'elevation': (0.0, 3141.0, 657.3, 649.0),
    'th': (0.0, 360.0, 190.3, 72.6),
    'vs': (0.0, 10.02, 3.85, 1.41),
    'tmmn': (253.15, 298.95, 281.1, 8.98),
    'tmmx': (253.15, 315.09, 295.2, 9.82),
    'sph': (0.0, 1.0, 0.0072, 0.0043),
    'pr': (0.0, 44.53, 1.74, 4.48),
    'pdsi': (-6.13, 7.88, -0.005, 2.68),
    'NDVI': (-9821.0, 9996.0, 5157.6, 2466.7),
    'population': (0.0, 2534.06, 25.53, 154.72),
    'erc': (0.0, 106.25, 37.33, 20.85),
    'FireMask': (-1.0, 1.0, 0.0, 1.0),
}

We also need to define some functions for loading the data. For the data preparation, we do the following:

  • _clip_and_rescale:
    • Clip the data to make sure that the data does not take on extreme values.
    • Rescale the data to the range of 0 to 1.
  • _augment: Augment the data. This serves multiple purposes: First, it brings more variety to the data. Roughly 15,000 training fires are not a bad start, but not plenty either. In addition, we want the model to be robust to certain invariances in the data:
    • Random cropping: this helps the model to learn that fires can be in different parts of the images. When I didn’t use cropping, most fires started in the middle of the image and consequently the model learned to always predict a fire for t+1 in the middle, since fires tend to persist where they are. The cropping step means that our model works with 32x32 images.
    • Random flipping: Flip the images randomly horizontally and vertically. Helps the model learn invariance to orientation. Imagine we had only data from a region where wind mostly blows from the West. The model wouldn’t need to learn a strong connection to wind, but could simply learn that fires grow eastwards.
  • Filtering: Random cropping can lead to inputs without fire at time t. These are filtered out, because there is nothing useful to learn from them.

Here are the functions:

def _clip_and_rescale(x: tf.Tensor, key: Text) -> tf.Tensor:
    # Clip a feature to its valid range and rescale it to [0, 1]; the previous fire mask is only clipped.
    min_val, max_val, _, _ = DATA_STATS[key]
    x = tf.clip_by_value(x, min_val, max_val)
    return x if key == "PrevFireMask" else tf.math.divide_no_nan(x - min_val, max_val - min_val)

def _parse_fn(proto: tf.train.Example, size: int) -> Tuple[tf.Tensor, tf.Tensor]:
    # Parse one serialized example into input features (size, size, 12) and a target fire mask (size, size, 1).
    keys = INPUT_FEATURES + OUTPUT_FEATURES
    parsed = tf.io.parse_single_example(proto, {
        k: tf.io.FixedLenFeature([size, size], tf.float32) for k in keys
    })
    x = tf.stack([_clip_and_rescale(parsed[k], k) for k in INPUT_FEATURES], axis=-1)
    y = tf.stack([parsed[k] for k in OUTPUT_FEATURES], axis=-1)
    return x, y

def _augment(x, y):
    # Crop input and target together so they stay spatially aligned, then flip randomly.
    combined = tf.concat([x, y], axis=-1)
    combined = tf.image.random_crop(combined, [SIDE_LENGTH, SIDE_LENGTH, tf.shape(combined)[-1]])
    combined = tf.image.random_flip_left_right(combined)
    combined = tf.image.random_flip_up_down(combined)
    in_ch = len(INPUT_FEATURES)
    return combined[..., :in_ch], combined[..., in_ch:]

def _has_fire(x, y):
    # Keep only crops whose previous fire mask contains fire (or uncertain, i.e. non-zero) pixels.
    return tf.reduce_any(tf.not_equal(x[..., PREV_FIRE_MASK_INDEX], 0.0))

def get_dataset(split: Text, batch_size=64, size=64) -> tf.data.Dataset:
    # Build the tf.data pipeline for one split: parse, shuffle, augment, filter, batch, prefetch.
    files = tf.data.Dataset.list_files(f"data/next-day-wildfire-spread/next_day_wildfire_spread_{split}*")
    return (files
        .interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
        .map(lambda x: _parse_fn(x, size), num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(16000)
        .map(_augment, num_parallel_calls=tf.data.AUTOTUNE)
        .filter(_has_fire)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE))

Finally, we can use the function get_dataset to generate the training, validation, and test datasets:

train_dataset = get_dataset('train')
validation_dataset = get_dataset('eval')
test_dataset = get_dataset('test')
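A quick sanity check on one batch (the expected shapes assume the default batch size of 64 and the 32 x 32 random crops defined above):

x_batch, y_batch = next(iter(train_dataset))
print(x_batch.shape, y_batch.shape)  # expected: (64, 32, 32, 12) and (64, 32, 32, 1)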

Training the Model

Let’s define the architecture of the U-Net model:

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses, metrics, optimizers

def conv_block(x, filters):
    # Two 3x3 convolutions, each followed by batch normalization and a leaky ReLU.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.1)(x)
    return x

def build_unet(input_shape=(32, 32, 12)):
    inputs = tf.keras.Input(shape=input_shape)
    # Encoder: each stage extracts features and halves the spatial resolution.
    c1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D()(c2)
    c3 = conv_block(p2, 128)
    p3 = layers.MaxPooling2D()(c3)
    # Bottleneck
    b = conv_block(p3, 256)
    # Decoder: upsample and concatenate the matching encoder block (skip connections).
    u3 = conv_block(layers.Concatenate()([
        layers.Conv2DTranspose(128, 3, strides=2, padding='same')(b), c3]), 128)
    u2 = conv_block(layers.Concatenate()([
        layers.Conv2DTranspose(64, 3, strides=2, padding='same')(u3), c2]), 64)
    u1 = conv_block(layers.Concatenate()([
        layers.Conv2DTranspose(32, 3, strides=2, padding='same')(u2), c1]), 32)
    # Per-pixel fire probability for time t+1.
    outputs = layers.Conv2D(1, 1, activation='sigmoid')(u1)
    return models.Model(inputs, outputs)
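A quick optional check that the decoder restores the full 32 x 32 resolution and that the model outputs one probability per pixel:

print(build_unet().output_shape)  # expected: (None, 32, 32, 1)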

Next, we define the loss function that we optimize for. I picked the focal loss, which is defined as:

\[ \text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) \]

where \( p_t \) is the model’s estimated probability for the true class, \( \alpha_t \) is a weighting factor, and \( \gamma \) is a focusing parameter that down-weights the loss contribution of easy examples and emphasizes hard ones. For example, with \( \gamma = 2 \), a confidently correct pixel with \( p_t = 0.9 \) has its cross-entropy term scaled by \( (1 - 0.9)^2 = 0.01 \), while a hard pixel with \( p_t = 0.3 \) is scaled by \( 0.49 \).

I didn’t pick TensorFlow’s built-in implementation of the focal loss (tf.keras.losses.BinaryFocalCrossentropy), but a custom one. The reason is that there can be missing values in the output fire mask (with value = -1), which should be ignored when computing the loss.

@tf.keras.utils.register_keras_serializable()
def masked_focal_loss(y_true, y_pred):
    # Pixels with value -1 in the target are missing and must not contribute to the loss.
    mask = tf.cast(tf.not_equal(y_true, -1), tf.float32)
    y_true_clipped = tf.clip_by_value(y_true, 0.0, 1.0)
    y_pred_clipped = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)

    bce = tf.keras.losses.BinaryFocalCrossentropy(from_logits=False, gamma=2.0,
                                                  reduction=tf.keras.losses.Reduction.NONE)
    fl = bce(y_true_clipped, y_pred_clipped)
    # The loss drops the channel axis; restore it so the shape matches the mask before averaging.
    fl = tf.expand_dims(fl, -1) if fl.shape.rank == 3 else fl
    return tf.reduce_sum(mask * fl) / tf.reduce_sum(mask)
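A small toy check of the masking behavior (made-up values, not real data): only the pixel that is not marked as missing should contribute to the loss.

y_true = tf.constant([[[[1.0]], [[-1.0]]]])  # one fire pixel, one missing (-1) pixel
y_pred = tf.constant([[[[0.9]], [[0.9]]]])   # same prediction for both
print(masked_focal_loss(y_true, y_pred).numpy())  # loss comes only from the first pixel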

While we optimize for the focal loss, we want to evaluate the model on a different metric. For segmentation, one such metric is the area under the precision-recall curve (PR AUC).

It’s defined as the area under the curve that plots precision (positive predictive value) against recall (true positive rate) for different classification thresholds.
Formally, it’s the integral:

\[ \text{PR AUC} = \int_0^1 \text{Precision}(r) \, dr \]

where \( r \) is the recall.

There are many more non-fire pixels than fire pixels in the output, so we need a metric that handles this class imbalance well, which PR AUC does.
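To get a feel for how skewed the labels are, a rough sketch like the following counts the fraction of fire pixels in a single training batch (exact numbers vary from batch to batch):

_, y_batch = next(iter(train_dataset))
valid = tf.cast(tf.not_equal(y_batch, -1.0), tf.float32)  # ignore missing (-1) pixels
fire = tf.cast(tf.equal(y_batch, 1.0), tf.float32)
print("fraction of fire pixels:", (tf.reduce_sum(fire) / tf.reduce_sum(valid)).numpy())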

@tf.keras.utils.register_keras_serializable()
class MaskedPRAUC(tf.keras.metrics.AUC):
    """PR AUC that, like the loss, ignores pixels marked as missing (-1) in the target."""
    def __init__(self, **kwargs):
        kwargs.setdefault('name', 'masked_pr_auc')
        kwargs.setdefault('curve', 'PR')
        super().__init__(**kwargs)

    def update_state(self, y_true, y_pred, sample_weight=None):
        mask = tf.not_equal(y_true, -1)
        y_true_masked = tf.boolean_mask(y_true, mask)
        y_pred_masked = tf.boolean_mask(y_pred, mask)
        return super().update_state(y_true_masked, y_pred_masked)

model = build_unet()
model.compile(
    optimizer=optimizers.Adam(1e-3),
    loss=masked_focal_loss,
    metrics=[MaskedPRAUC()]
)

In addition, I use early stopping in combination with reducing the learning rate when the validation PR AUC plateaus.

early_stop = callbacks.EarlyStopping(monitor='val_masked_pr_auc', patience=5,
                                     restore_best_weights=True, mode='max')
reduce_lr = callbacks.ReduceLROnPlateau(monitor='val_masked_pr_auc', factor=0.5,
                                        patience=3, mode='max', verbose=1)

Now we can finally train the model:

model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=100,
    callbacks=[early_stop, reduce_lr]
)

Training stops after 11 epochs due to the early stopping criterion.

Evaluating the Model

We evaluate the model on the held-out test data. For the evaluation, I look at the PR AUC again:

# Evaluate on test set using MaskedPRAUC
def evaluate_on_test(model, test_dataset):
    pr_auc_metric = MaskedPRAUC()
    for x_batch, y_batch in test_dataset:
        y_pred = model.predict(x_batch, verbose=0)
        pr_auc_metric.update_state(y_batch, y_pred)

    print("Final Test PR AUC:", pr_auc_metric.result().numpy())

evaluate_on_test(model, test_dataset)

Final Test PR AUC: 0.30207607

The performance on the test set is on par with what was reported in the paper (a PR AUC of 28.4% for their residual CNN). This is quite neat, since we are using a simpler architecture without hyperparameter optimization, apart from early stopping.
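To close the loop on the persistence baseline mentioned in the architecture section (predict that the fire at t+1 is identical to the fire at t), here is a sketch that evaluates it with the same masked PR AUC metric; it treats the previous fire mask channel as the prediction and maps uncertain (-1) input pixels to 0:

# Persistence baseline: use the previous fire mask channel directly as the prediction.
baseline_metric = MaskedPRAUC()
for x_batch, y_batch in test_dataset:
    prev_mask = x_batch[..., PREV_FIRE_MASK_INDEX:PREV_FIRE_MASK_INDEX + 1]
    y_pred = tf.clip_by_value(prev_mask, 0.0, 1.0)  # treat uncertain (-1) pixels as "no fire"
    baseline_metric.update_state(y_batch, y_pred)
print("Persistence baseline PR AUC:", baseline_metric.result().numpy())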