EOCubeSceneCollection - from level-2 to level-3 products

EOCubeSceneCollection and EOCubeSceneCollectionChunk are similar to EOCube and EOCubeChunk. They provide basically all the functionality of these classes, plus some more specific functionality and behaviour.

While EOCube and EOCubeChunk do not make any assumptions about the input data, EOCubeSceneCollection and EOCubeSceneCollectionChunk assume that the input is a collection (usually a time series) of scenes, where each scene consists of the same bands (including a quality assessment layer).

In the Sentinel-2 / Landsat context, the atmospherically corrected single-scene data is often referred to as Level-2. Level-3 data products are usually derived from time series of lower-level products with the goal of obtaining spatially contiguous, gap-free products. Such products are:

  • temporal statistical metrics, e.g. median

  • virtual time series derived from interpolation

  • best available pixel composites

This tutorial shows how to derive virtual time series and statistical metrics with EOCubeSceneCollection.

First we load the packages required in this tutorial.

[1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from pathlib import Path
import rasterio
import seaborn as sns

# from eobox.raster import MultiRasterIO
from eobox import sampledata
from eobox.raster import cube
from eobox.raster import gdalutils

from eobox.raster.utils import cleanup_df_values_for_given_dtype
from eobox.raster.utils import dtype_checker_df

Sample dataset

Let's load a typical dataset to make more tangible what we are talking about.

[2]:
def get_sampledata(year):
    dataset = sampledata.get_dataset("lsts")
    layers_paths = [Path(p) for p in dataset["raster_files"]]
    # derive the scene ID and band name from the file names
    layers_df = pd.Series([p.stem for p in layers_paths]) \
        .str.split("_", expand=True) \
        .rename({0: "sceneid", 1: "band"}, axis=1)

    # parse the acquisition date (year and day of year) from the Landsat scene ID
    # and build a unique, human-readable layer name
    layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
    layers_df["uname"] = layers_df.sceneid.str[:3] + "_" + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band
    layers_df["path"] = layers_paths

    layers_df = layers_df.sort_values(["date", "band"])
    layers_df = layers_df.reset_index(drop=True)

    # keep only the layers of the requested year
    layers_df_year = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year + 1))]
    layers_df_year = layers_df_year.reset_index(drop=True)
    return layers_df_year

The sample data loaded here contains 23 scenes, each of which consists of three bands (b3, b4, b5) and a QA (quality assessment) band (here fmask). This is a typical starting point for today's use-all-available-pixels EO analysis tasks.

[3]:
df_layers = get_sampledata(2008)
display(df_layers.head())
display(df_layers.tail(1))
df_layers.band.value_counts()
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
sceneid band date uname path
91 LE70350322008342EDC00 fmask 2008-12-07 LE7_2008-12-07_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
[3]:
b3       23
b5       23
fmask    23
b4       23
Name: band, dtype: int64

EOCubeSceneCollection

Initialization

Typically we want to derive temporal features from this kind of dataset, which involves masking out invalid pixels. Invalid pixels are typically pixels covered by clouds, cloud shadows or snow, saturated pixels, pixels outside the sensed area, etc. Usually, these categories are included in the QA layer.

For example, fmask is a typical Landsat QA-like (or pre-classification) layer with the following categories:

  0 - clear land
  1 - clear water
  2 - cloud
  3 - snow
  4 - shadow
255 - NoData
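
Together with qa (the name of the QA band) and qa_valid (the categories to keep), this defines which pixels enter the level-3 computations. The following is a minimal sketch (not the library's internal code) of what this masking means for a toy fmask chunk:

fmask_chunk = np.array([[0, 1,   2],
                        [3, 4, 255]])
qa_valid = [0, 1]                          # keep clear land and clear water
valid = np.isin(fmask_chunk, qa_valid)     # True where the pixel is usable
valid
# array([[ True,  True, False],
#        [False, False, False]])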

As a result we initialize the EOCubeSceneCollection as follows:

[4]:
df_layers = df_layers
chunksize = 2**5
variables = ["b3", "b4", "b5"]
qa = "fmask"
qa_valid = [0, 1]

scoll = cube.EOCubeSceneCollection(df_layers=df_layers,
                                   chunksize=chunksize,
                                   variables=variables,
                                   qa=qa,
                                   qa_valid=qa_valid)
scoll.df_layers.head()
[4]:
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...

Calculation of level-3 features

Virtual time series

[5]:
dst_dir = "./xxx_uncontrolled/ls2008_vts4w"
dst_pattern = "./xxx_uncontrolled/ls2008_vts4w/ls2008_vts4w_{date}_{var}.vrt"

freq = '4W'
idx_virtual = pd.date_range(start='2008-01-01', end="2008-12-31", freq=freq)
idx_virtual

scoll.create_virtual_time_series(idx_virtual=idx_virtual,
                                 dst_pattern=dst_pattern,
                                 dtypes="int16",
                                 compress="lzw",
                                 nodata=None,
                                 num_workers=8)
4it [00:04,  1.25s/it]
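
Conceptually, the virtual time series interpolates the valid (unmasked) observations of each pixel and variable to the target dates in idx_virtual. A minimal per-pixel sketch of this idea (this is not the library code, which works chunk-wise on the raster data, and the dates and values here are made up):

# valid observations of one hypothetical pixel; masked observations are NaN
dates_observed = pd.to_datetime(["2008-04-19", "2008-05-21", "2008-07-08"])
values = pd.Series([1200.0, np.nan, 1800.0], index=dates_observed)

# interpolate over time and sample at the virtual dates;
# dates outside the observed range stay NaN (no extrapolation)
vts = (values.reindex(values.index.union(idx_virtual))
             .interpolate(method="time")
             .reindex(idx_virtual))
vts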

Create a time series layer stack (VRT) for each variable.

[6]:
for var in scoll.variables:
    input_file_list = sorted(Path(dst_dir).glob(f"*{var}*"))
    output_file = Path(dst_dir) / "time_series_stacks" / f"ls_2008_vts__2008__{var}__vts4w.vrt"
    print(output_file)
    gdalutils.buildvrt(input_file_list, output_file, relative=True, separate=True)
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b3__vts4w.vrt
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b4__vts4w.vrt
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b5__vts4w.vrt
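
As a quick plausibility check (hypothetical code, assuming the stacks were written as above and using a pixel position well inside the image) we can open one of the stacks and plot the interpolated time series of a single pixel:

vrt_b3 = Path(dst_dir) / "time_series_stacks" / "ls_2008_vts__2008__b3__vts4w.vrt"
with rasterio.open(vrt_b3) as src:
    print(src.count, src.dtypes[0])     # expected: one band per virtual date, int16
    ts_pixel = src.read()[:, 10, 10]    # interpolated b3 values of the pixel at row 10, col 10
sns.lineplot(x=idx_virtual, y=ts_pixel)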

Statistical features

Such as mean, std, min, max, percentiles, iqr, …

TODO: Metrics such as the regression slope over time, … What else could be interesting?

[7]:
dst_dir = "./xxx_uncontrolled/ls2008_stats1yr"
dst_pattern = "./xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_{metric}_{var}.vrt"

percentiles = [.05, .1, .25, .50, .75, .9, .95]
iqr = True
diffs = True

scoll.create_statistical_metrics(percentiles=percentiles,
                                 iqr=iqr,
                                 diffs=diffs,
                                 dst_pattern=dst_pattern,
                                 dtypes="int16",
                                 compress="lzw",
                                 nodata=None,
                                 num_workers=8)

4it [00:10,  2.58s/it]
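
For reference, here is a minimal per-pixel sketch of the statistics behind the layers created here (this is not the library code; judging from the layer names listed below, iqr presumably adds the p75-p25 layer and diffs the other percentile differences):

obs = np.array([1200.0, np.nan, 1800.0, 1500.0, np.nan, 1350.0])  # one pixel, NaN = masked
p05, p10, p25, p50, p75, p90, p95 = np.nanpercentile(obs, [5, 10, 25, 50, 75, 90, 95])
metrics = {
    "min": np.nanmin(obs), "max": np.nanmax(obs),
    "mean": np.nanmean(obs), "std": np.nanstd(obs),
    "p50": p50,
    "p75-p25": p75 - p25,   # presumably from iqr=True
    "p90-p10": p90 - p10,   # presumably from diffs=True
    "p95-p05": p95 - p05,   # presumably from diffs=True
}
metrics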

Create a statistical metrics layer stack (VRT) for each variable.

[26]:
np.sort(list(Path(dst_dir).glob("*_p??_*b5*")))
[26]:
array([PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p05_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p10_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p25_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p50_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p75_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p90_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p95_b5.vrt')],
      dtype=object)
[37]:
for var in scoll.variables:
    print("*" * 80)
    print(var)
    # collect the metric layers in a meaningful order:
    # min, percentiles, max, percentile differences, mean, std
    input_file_list = list(Path(dst_dir).glob(f"*_min_*{var}*"))
    input_file_list += sorted(Path(dst_dir).glob(f"*_p??_*{var}*"))
    input_file_list += list(Path(dst_dir).glob(f"*_max_*{var}*"))
    input_file_list += sorted(Path(dst_dir).glob(f"*_p??-p??_*{var}*"))
    input_file_list += list(Path(dst_dir).glob(f"*_mean_*{var}*"))
    input_file_list += list(Path(dst_dir).glob(f"*_std_*{var}*"))
    output_file = Path(dst_dir) / "time_series_stacks" / f"ls_2008_stats__2008__{var}__stats1yr.vrt"
    print(output_file)
    print("Layers:")
    print("\n".join([p.stem for p in input_file_list]))
    gdalutils.buildvrt(input_file_list, output_file, relative=True, separate=True)
********************************************************************************
b3
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b3__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b3
ls2008_stats1yr_p05_b3
ls2008_stats1yr_p10_b3
ls2008_stats1yr_p25_b3
ls2008_stats1yr_p50_b3
ls2008_stats1yr_p75_b3
ls2008_stats1yr_p90_b3
ls2008_stats1yr_p95_b3
ls2008_stats1yr_max_b3
ls2008_stats1yr_p75-p25_b3
ls2008_stats1yr_p90-p10_b3
ls2008_stats1yr_p95-p05_b3
ls2008_stats1yr_mean_b3
ls2008_stats1yr_std_b3
********************************************************************************
b4
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b4__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b4
ls2008_stats1yr_p05_b4
ls2008_stats1yr_p10_b4
ls2008_stats1yr_p25_b4
ls2008_stats1yr_p50_b4
ls2008_stats1yr_p75_b4
ls2008_stats1yr_p90_b4
ls2008_stats1yr_p95_b4
ls2008_stats1yr_max_b4
ls2008_stats1yr_p75-p25_b4
ls2008_stats1yr_p90-p10_b4
ls2008_stats1yr_p95-p05_b4
ls2008_stats1yr_mean_b4
ls2008_stats1yr_std_b4
********************************************************************************
b5
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b5__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b5
ls2008_stats1yr_p05_b5
ls2008_stats1yr_p10_b5
ls2008_stats1yr_p25_b5
ls2008_stats1yr_p50_b5
ls2008_stats1yr_p75_b5
ls2008_stats1yr_p90_b5
ls2008_stats1yr_p95_b5
ls2008_stats1yr_max_b5
ls2008_stats1yr_p75-p25_b5
ls2008_stats1yr_p90-p10_b5
ls2008_stats1yr_p95-p05_b5
ls2008_stats1yr_mean_b5
ls2008_stats1yr_std_b5