EOCubeSceneCollection - custom functions

EOCubeSceneCollection and EOCubeSceneCollectionChunk are similar to EOCube and EOCubeChunk. They offer basically all the functionality of those classes, plus some more specific functionality and behavior.

While EOCube and EOCubeChunk do not make any assumptions about the input data, EOCubeSceneCollection and EOCubeSceneCollectionChunk assume that the input is a collection (usually a time series) of scenes, where each scene consists of the same bands (including a quality assessment layer).

Let's load a typical dataset to make it more tangible what we are talking about.

But first we load the packages required in this tutorial and define a function for loading a sample dataset.

[1]:
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import rasterio
import seaborn as sns

from eobox import sampledata
from eobox.raster import cube
from eobox.raster import gdalutils

from eobox.raster.utils import cleanup_df_values_for_given_dtype
from eobox.raster.utils import dtype_checker_df

print(cube.__file__)
print(sampledata.__file__)
/home/ben/Devel/Packages/eo-box/eobox/raster/cube.py
/home/ben/Devel/Packages/eo-box/eobox/sampledata/__init__.py

The sample dataset

[2]:
def get_sampledata(year):
    dataset = sampledata.get_dataset("lsts")
    layers_paths = [Path(p) for p in dataset["raster_files"]]
    # Derive the scene ID and band from the file names, e.g. LT50350322008110PAC01_b3.
    layers_df = pd.Series([p.stem for p in layers_paths]) \
        .str.split("_", expand=True) \
        .rename({0: "sceneid", 1: "band"}, axis=1)

    # Parse the acquisition date (year and day of year) from the scene ID.
    layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
    layers_df["uname"] = layers_df.sceneid.str[:3] + "_" \
        + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band
    layers_df["path"] = layers_paths

    layers_df = layers_df.sort_values(["date", "band"])
    layers_df = layers_df.reset_index(drop=True)

    # Keep only the layers of the requested year.
    layers_df_year = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year + 1))]
    layers_df_year = layers_df_year.reset_index(drop=True)
    return layers_df_year

The sample data we are loading here contains 23 scenes, each of which consists of three bands (b3, b4, b5) and a QA (quality assessment) band (here fmask). This is a typical starting point for today's using-all-available-pixels EO analysis tasks.

[3]:
df_layers = get_sampledata(2008)
display(df_layers.head())
display(df_layers.tail(1))
df_layers.band.value_counts()
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
sceneid band date uname path
91 LE70350322008342EDC00 fmask 2008-12-07 LE7_2008-12-07_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
[3]:
b5       23
b4       23
b3       23
fmask    23
Name: band, dtype: int64

Invalid pixels are typically those outside the sensed area, saturated data, snow (depending on the application), clouds, cloud shadows, etc. Usually, these categories are mapped in a QA layer. For example, this dataset contains fmask, a widely used Landsat QA layer (or pre-classification layer). It has the following categories:

  0 - clear land
  1 - clear water
  2 - cloud
  3 - snow
  4 - shadow
255 - NoData
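
A quick way to see how these categories are distributed is to count the fmask values of a single scene. A minimal sketch (assuming the sample rasters open directly with rasterio):

fmask_path = df_layers.loc[df_layers.band == "fmask", "path"].iloc[0]
with rasterio.open(fmask_path) as src:
    qa_band = src.read(1)
for value, count in zip(*np.unique(qa_band, return_counts=True)):
    print(f"category {value}: {count} pixels")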

Processing data with a custom function

Goal: Temporal statistical metrics

In this notebook we show how to calculate a set of temporal statistical metrics / layers from all valid observations of 2008. Particularly, we will calculate the mean, minimum, maximum, standard deviation, the 10th, 25th, 50th (median), 75th and 90th percentiles, and the inter-quartile range from all valid observations. Such temporal statistical metrics can be used to derive spatially contiguous features, which in turn can be used to build and predict models in large-area mapping applications.

Of course, for a small dataset such as the one used here, the calculation could be performed entirely in memory and would be relatively easy. However, if there are many layers and the single layers are large (in terms of pixels), it requires a machine with a lot of memory or code that processes the data in chunks.

With EOCubeSceneCollection you can

  • easily get a chunk of data of a dataset with many large layers, e.g. a time series of Sentinel-2 scenes in the distributed tile format,

  • load a chunk of data that fits in memory,

  • develop a custom function to process the chunk of data,

  • process the whole dataset with the custom function.

Let’s get started.

Read a chunk of data

We first initialize the EOCubeSceneCollection. We already provide the class with important information, i.e.

  • the variables it should process, and

  • the band to be used for masking out invalid pixels and the IDs in that band to be considered as valid pixels.

Furthermore, the chunksize is set. Here it is a very small number, since the whole dataset is small and we want to show that we can develop on a small chunk and then process the whole data chunk-wise. With real-world data you should make sure that the chunks are not too small, since that increases the computational overhead and processing time, but of course also not too big for your memory.
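
A rough way to choose the chunk size is to estimate the memory footprint of one chunk beforehand. A back-of-the-envelope sketch (assuming the worst case of float64 values after masking):

n_layers = len(df_layers)        # 92 layers in this example
chunksize = 2**5                 # chunk edge length in pixels
bytes_per_value = 8              # worst case: float64 after masking
mem_mb = chunksize**2 * n_layers * bytes_per_value / 1024**2
print(f"~{mem_mb:.2f} MB per chunk")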

[4]:
chunksize = 2**5
variables = ["b3", "b4", "b5"]
qa = "fmask"
qa_valid = [0, 1]

scoll = cube.EOCubeSceneCollection(df_layers=df_layers,
                                   chunksize=chunksize,
                                   variables=variables,
                                   qa=qa,
                                   qa_valid=qa_valid
                                  )
scoll.df_layers.head()
[4]:
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...

For the given chunk size we can derive the number of chunks required to cover the whole dataset. Here we have only 4 chunks.

[5]:
print("Chunksize: ", scoll.chunksize)
print("Number of Chunks: ", scoll.n_chunks)
Chunksize:  32
Number of Chunks:  4
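
The number of chunks follows from the raster extent and the chunk size. A sketch of the arithmetic (assuming, as the chunk shape below suggests, square chunksize x chunksize windows):

import math

with rasterio.open(df_layers.path[0]) as src:
    n_chunks = math.ceil(src.height / chunksize) * math.ceil(src.width / chunksize)
print(n_chunks)  # should equal scoll.n_chunks, i.e. 4 here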

We can read the data as with EOCube, i.e. as an unmasked 3-D numpy array that still includes the QA layers. Here the third dimension of sc_chunk.data corresponds to the rows in df_layers. We can also easily convert the data to a dataframe where all pixels are in the rows and all layers are in the columns.

[6]:
sc_chunk = scoll.get_chunk(0).read_data()
print(sc_chunk.data.__class__)
print(sc_chunk.data.shape)
assert sc_chunk.data.shape[2] == df_layers.shape[0]

sc_chunk = sc_chunk.convert_data_to_dataframe()
print(sc_chunk.data.__class__)
print(sc_chunk.data.shape)
assert sc_chunk.data.shape[1] == df_layers.shape[0]
<class 'numpy.ndarray'>
(32, 32, 92)
<class 'pandas.core.frame.DataFrame'>
(1024, 92)
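
Since the dataframe is just the chunk flattened over its spatial dimensions (32 x 32 = 1024 pixels in the rows), a single column can be reshaped back into a 2-D layer. A sketch, assuming row-major (C-order) flattening:

layer_2d = sc_chunk.data.iloc[:, 0].values.reshape(32, 32)
print(layer_2d.shape)  # (32, 32)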

However, in the case of a scene collection we often want to apply the same function to the masked observations of each variable / band, as in our case, where we want to calculate temporal statistical metrics. If this is what we want, then we can use read_data_by_variable to get a dictionary of dataframes for

  • all variables and the QA-layer with mask=False, or

  • all masked variables without the QA-layer with mask=True.

Note that it is not yet possible to get 3-D arrays, but if needed it can be implemented - just open an issue.
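
To make transparent what mask=True does, the following is a conceptual sketch (not the library internals) that masks one variable by hand:

data = scoll.get_chunk(0).read_data().data             # (32, 32, 92) ndarray
idx_b3 = np.where(scoll.df_layers.band == "b3")[0]     # the 23 b3 layers
idx_qa = np.where(scoll.df_layers.band == "fmask")[0]  # the 23 fmask layers
b3_masked = data[:, :, idx_b3].astype(float)
b3_masked[~np.isin(data[:, :, idx_qa], qa_valid)] = np.nan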

Here we want the masked variables.

[7]:
sc_chunk = sc_chunk.read_data_by_variable(mask=True)
for var in sc_chunk.data.keys():
    print(f"*****************\n{var}\n......")
    display(sc_chunk.data[var].head(2))
*****************
b3
......
date 2008-04-19 00:00:00 2008-04-27 00:00:00 2008-05-05 00:00:00 2008-05-21 00:00:00 2008-05-29 00:00:00 2008-06-06 00:00:00 2008-06-14 00:00:00 2008-06-22 00:00:00 2008-06-30 00:00:00 2008-07-08 00:00:00 ... 2008-08-09 00:00:00 2008-08-17 00:00:00 2008-08-25 00:00:00 2008-09-02 00:00:00 2008-09-18 00:00:00 2008-09-26 00:00:00 2008-10-12 00:00:00 2008-10-28 00:00:00 2008-11-21 00:00:00 2008-12-07 00:00:00
0 NaN NaN NaN 1274 NaN 651.0 NaN 426 240.0 313 ... 919.0 NaN 340 283.0 480.0 NaN NaN 894 NaN NaN
1 NaN NaN 3680.0 1354 NaN 596.0 NaN 426 222.0 313 ... 799.0 NaN 340 283.0 460.0 NaN NaN 767 NaN NaN

2 rows × 23 columns

*****************
b4
......
date 2008-04-19 00:00:00 2008-04-27 00:00:00 2008-05-05 00:00:00 2008-05-21 00:00:00 2008-05-29 00:00:00 2008-06-06 00:00:00 2008-06-14 00:00:00 2008-06-22 00:00:00 2008-06-30 00:00:00 2008-07-08 00:00:00 ... 2008-08-09 00:00:00 2008-08-17 00:00:00 2008-08-25 00:00:00 2008-09-02 00:00:00 2008-09-18 00:00:00 2008-09-26 00:00:00 2008-10-12 00:00:00 2008-10-28 00:00:00 2008-11-21 00:00:00 2008-12-07 00:00:00
0 NaN NaN NaN 2215 NaN 2891.0 NaN 3128 3571.0 3460 ... 2389.0 NaN 2970 3140.0 3075.0 NaN NaN 2250 NaN NaN
1 NaN NaN 4297.0 2280 NaN 2857.0 NaN 3062 3498.0 3294 ... 2353.0 NaN 2932 2979.0 2856.0 NaN NaN 1995 NaN NaN

2 rows × 23 columns

*****************
b5
......
date 2008-04-19 00:00:00 2008-04-27 00:00:00 2008-05-05 00:00:00 2008-05-21 00:00:00 2008-05-29 00:00:00 2008-06-06 00:00:00 2008-06-14 00:00:00 2008-06-22 00:00:00 2008-06-30 00:00:00 2008-07-08 00:00:00 ... 2008-08-09 00:00:00 2008-08-17 00:00:00 2008-08-25 00:00:00 2008-09-02 00:00:00 2008-09-18 00:00:00 2008-09-26 00:00:00 2008-10-12 00:00:00 2008-10-28 00:00:00 2008-11-21 00:00:00 2008-12-07 00:00:00
0 NaN NaN NaN 1961 NaN 1720.0 NaN 1492 1325.0 1339 ... 1155.0 NaN 1253 1146.0 1472.0 NaN NaN 1788 NaN NaN
1 NaN NaN 506.0 1916 NaN 1608.0 NaN 1448 1369.0 1294 ... 1057.0 NaN 1303 1121.0 1420.0 NaN NaN 1650 NaN NaN

2 rows × 23 columns

Create the custom function

Now with this data we can develop a custom function. This function must take the dataframe of one variable as input and must return a dataframe as output, where again the pixels are in the rows and the output features / layers are in the columns.
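
Any function with this shape qualifies. A minimal sketch (my_custom_function and the n_valid feature are just illustrative names, not part of the library):

def my_custom_function(var_df, **kwargs):
    out = pd.DataFrame(index=var_df.index)       # pixels stay in the rows
    out["n_valid"] = var_df.notna().sum(axis=1)  # one column per output feature
    return out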

As mentioned above, we want to calculate the mean, minimum, maximum, standard deviation, the 10th, 25th, 50th (median), 75th and 90th percentiles, and the inter-quartile range from all valid observations. For these statistics we can make use of pd.DataFrame.describe() and extend it by the inter-quartile range.

Let's try it out:

[8]:
var = "b3"
# sc_chunk.data[var]

def calc_statistical_metrics(var_df, percentiles=None, iqr=True):
    """Calculate per-pixel statistical metrics over all valid observations."""
    metrics_df = var_df.transpose().describe(percentiles=percentiles).transpose()
    if iqr and all(np.isin(["25%", "75%"], metrics_df.columns)):
        # Add the inter-quartile range and drop the count of valid observations.
        metrics_df["iqr"] = metrics_df["75%"] - metrics_df["25%"]
        metrics_df = metrics_df.drop(labels=["count"], axis=1)
    return metrics_df
metrics_df = calc_statistical_metrics(sc_chunk.data[var],
                                      percentiles=[.1, .25, .50, .75, .9], iqr=True)
metrics_df.head()
[8]:
mean std min 10% 25% 50% 75% 90% max iqr
0 530.833333 334.208654 225.0 244.3 305.5 383.0 711.75 916.5 1274.0 406.25
1 760.615385 930.246790 222.0 285.4 313.0 426.0 767.00 1243.0 3680.0 454.00
2 775.142857 906.464973 240.0 297.1 316.0 452.5 652.00 1570.3 3574.0 336.00
3 691.733333 774.670664 192.0 247.8 303.5 346.0 617.50 1685.6 2875.0 314.00
4 550.200000 553.828648 173.0 219.2 247.5 300.0 490.00 1362.4 2011.0 242.50

This is what we want, e.g. for the first pixel iqr = 711.75 - 305.5 = 406.25. Now let's calculate the metrics as raster layers for the whole dataset.

Process the whole dataset

To process the whole dataset we also need the destination paths for the output layers. For our dataset with the three variables and the output dataframe above, the destination paths should be a dictionary structured as follows:

[9]:
metrics = ['mean', 'std', 'min', 'p10', 'p25', 'p50', 'p75', 'p90', 'max', 'iqr']
dst_path_pattern = "./xxx_uncontrolled/statistical_metrics/ls_2008_metric_{metric}_{var}.vrt"
dst_paths = {}
for var in variables:
    dst_paths[var] = [dst_path_pattern.format(metric=m, var=var) for m in metrics]
dst_paths
[9]:
{'b3': ['./xxx_uncontrolled/statistical_metrics/ls_2008_metric_mean_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_std_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_min_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p10_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p25_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p50_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p75_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p90_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_max_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_iqr_b3.vrt'],
 'b4': ['./xxx_uncontrolled/statistical_metrics/ls_2008_metric_mean_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_std_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_min_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p10_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p25_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p50_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p75_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p90_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_max_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_iqr_b4.vrt'],
 'b5': ['./xxx_uncontrolled/statistical_metrics/ls_2008_metric_mean_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_std_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_min_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p10_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p25_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p50_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p75_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p90_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_max_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_iqr_b5.vrt']}
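
A quick consistency check (a sketch): the number of destination paths per variable should match the number of columns the custom function returns, since the writer has to map one column to one file.

n_out = calc_statistical_metrics(sc_chunk.data["b3"],
                                 percentiles=[.1, .25, .5, .75, .9], iqr=True).shape[1]
assert all(len(paths) == n_out for paths in dst_paths.values())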

We can (for optional parameters) or have to (for required ones) define

  • the output data type (one for all variables, or different ones per variable by passing dtypes as a dictionary with the variables as keys),

  • the value to use where there are NaNs in the output data (see the sketch on nodata below),

  • a compression type for the raster,

  • all the keyword arguments that our custom function requires (here percentiles and iqr).
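
Why the nodata value matters here: the metrics come out as floats with NaNs wherever a pixel has no valid observation at all, but we write int16 rasters, so the NaNs need an int16-representable placeholder. A minimal sketch of the conversion:

arr = np.array([530.8, np.nan])
arr_int16 = np.where(np.isnan(arr), -32768, arr).astype("int16")
print(arr_int16)  # [  530 -32768]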

Now we can run the calculations for the whole raster dataset.

[10]:
scoll.apply_and_write_by_variable(fun=calc_statistical_metrics,
                                  dst_paths=dst_paths,
                                  dtypes="int16",
                                  compress="lzw",
                                  nodata=-32768,
                                  percentiles=[.1, .25, .5, .75, .9], iqr=True)
4it [00:20,  5.11s/it]
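
We can quickly verify one of the written rasters, e.g. that dtype and nodata match what we passed (a sketch):

with rasterio.open(dst_paths["b3"][0]) as src:
    print(src.dtypes[0], src.nodata)  # expected: int16 -32768.0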

Visualize the results

Let us visualize what we calculated:

[11]:
df_feature_layers = pd.Series(np.array([dst_paths[var] for var in variables]).flatten()).to_frame()
df_feature_layers.columns = ["path"]
df_feature_layers = pd.concat(
    [
        df_feature_layers,
        df_feature_layers["path"] \
            .apply(lambda x: Path(x).stem).str.split("_", expand=True)[[3, 4]] \
            .rename({3:"metric", 4:"band"}, axis=1)
    ],
    axis=1
)
df_feature_layers = df_feature_layers.sort_values(["band", "metric"]).reset_index()
print(df_feature_layers.shape)
df_feature_layers.head()
(30, 4)
[11]:
index path metric band
0 9 ./xxx_uncontrolled/statistical_metrics/ls_2008... iqr b3
1 8 ./xxx_uncontrolled/statistical_metrics/ls_2008... max b3
2 0 ./xxx_uncontrolled/statistical_metrics/ls_2008... mean b3
3 2 ./xxx_uncontrolled/statistical_metrics/ls_2008... min b3
4 3 ./xxx_uncontrolled/statistical_metrics/ls_2008... p10 b3
[12]:
eoc_metrics = cube.EOCube(df_feature_layers)
chunk_metrics = eoc_metrics.get_chunk(0)
[13]:
fig, axes = plt.subplots(nrows=3, ncols=10, figsize=(30, 10))

for i, ax in enumerate(axes.flatten()):
    aximg = chunk_metrics.plot_raster(i, spatial_bounds=False, vmin=0, vmax=4000, ax=ax)
    ax.set_title(chunk_metrics.df_layers.band[i] + "-" + chunk_metrics.df_layers.metric[i])
plt.tight_layout()
[Figure: a 3 x 10 grid of the calculated temporal statistical metric layers, one row per band (b3, b4, b5) and one column per metric.]