EOCubeSceneCollection - from level-2 to level-3 products¶
EOCubeSceneCollection and EOCubeSceneCollectionChunk are similar to EOCube and EOCubeChunk and provide basically all the functionality of those classes, plus some more specific functionality and behavior. While EOCube and EOCubeChunk do not make any assumptions about the input data, EOCubeSceneCollection and EOCubeSceneCollectionChunk assume that the input is a collection (usually a time series) of scenes, where each scene consists of the same bands (including a quality assessment layer).
In the Sentinel-2 / Landsat context, the atmospherically corrected single-scene data is often referred to as level-2. Level-3 data products are usually derived from time series of lower-level products with the goal of obtaining spatially contiguous, gap-free products. Such products are:
temporal statistical metrics, e.g. median
virtual time series derived from interpolation
best available pixel composites
…
This tutorial shows how to derive virtual time series with EOCubeSceneCollection.
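The idea behind such gap-free level-3 products can be sketched with plain numpy (synthetic data, not the eobox API): invalid observations are masked, and a temporal statistic such as the median is computed per pixel over the remaining valid observations.

```python
import numpy as np

# synthetic stack of 5 scenes for one band, shape (time, rows, cols)
stack = np.arange(5 * 2 * 2, dtype=float).reshape(5, 2, 2)

# flag some observations as invalid (e.g. cloud-covered), here with NaN
stack[0, 0, 0] = np.nan
stack[3, 1, 1] = np.nan

# nanmedian ignores the invalid observations per pixel,
# yielding a gap-free 2 x 2 composite
median = np.nanmedian(stack, axis=0)
```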
First we load the packages required in this tutorial.
[1]:
import numpy as np
import pandas as pd
from pathlib import Path

from eobox import sampledata
from eobox.raster import cube
from eobox.raster import gdalutils
Sample dataset¶
Let's load a typical dataset to make the following discussion more tangible.
[2]:
def get_sampledata(year):
    dataset = sampledata.get_dataset("lsts")
    layers_paths = [Path(p) for p in dataset["raster_files"]]
    layers_df = pd.Series([p.stem for p in layers_paths]).str.split("_", expand=True) \
        .rename({0: "sceneid", 1: "band"}, axis=1)
    layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
    layers_df["uname"] = layers_df.sceneid.str[:3] + "_" + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band
    layers_df["path"] = layers_paths
    layers_df = layers_df.sort_values(["date", "band"])
    layers_df = layers_df.reset_index(drop=True)
    layers_df_year = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year + 1))]
    layers_df_year = layers_df_year.reset_index(drop=True)
    return layers_df_year
The sample data loaded here contains 23 scenes, each of which consists of three bands (b3, b4, b5) and a QA (quality assessment) band (here fmask). This is a typical starting point for today's use-all-available-pixels EO analysis tasks.
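The date column derived above relies on the Landsat scene ID encoding the acquisition year and day-of-year at a fixed position, which is what the sceneid.str[9:16] slice extracts. A minimal illustration with one of the IDs from this dataset:

```python
import pandas as pd

sceneid = "LT50350322008110PAC01"

# characters 9-15 encode year and day-of-year: "2008110" -> day 110 of 2008
date = pd.to_datetime(sceneid[9:16], format="%Y%j")
```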
[3]:
df_layers = get_sampledata(2008)
display(df_layers.head())
display(df_layers.tail(1))
df_layers.band.value_counts()
| sceneid | band | date | uname | path
---|---|---|---|---|---|
0 | LT50350322008110PAC01 | b3 | 2008-04-19 | LT5_2008-04-19_b3 | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
1 | LT50350322008110PAC01 | b4 | 2008-04-19 | LT5_2008-04-19_b4 | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
2 | LT50350322008110PAC01 | b5 | 2008-04-19 | LT5_2008-04-19_b5 | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
3 | LT50350322008110PAC01 | fmask | 2008-04-19 | LT5_2008-04-19_fmask | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
4 | LE70350322008118EDC00 | b3 | 2008-04-27 | LE7_2008-04-27_b3 | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
| sceneid | band | date | uname | path
---|---|---|---|---|---|
91 | LE70350322008342EDC00 | fmask | 2008-12-07 | LE7_2008-12-07_fmask | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
[3]:
b3 23
b5 23
fmask 23
b4 23
Name: band, dtype: int64
EOCubeSceneCollection¶
Initialization¶
Typically we want to derive temporal features from this kind of dataset, which involves masking out invalid pixels. Invalid pixels are typically those covered by clouds, cloud shadows or snow, saturated pixels, pixels outside the sensed area, etc. These categories are usually encoded in the QA layer.
For example, fmask is a typical Landsat QA-like (or pre-classification) layer with the following categories:
0 - clear land
1 - clear water
2 - cloud
3 - snow
4 - shadow
255 - NoData
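A sketch of how such a QA layer translates into a boolean valid-pixel mask, here on a small hypothetical fmask array with qa_valid=[0, 1] (clear land and clear water), as used in the initialization below:

```python
import numpy as np

# hypothetical 2 x 3 fmask chunk; values follow the category list above
fmask = np.array([[0, 1, 2],
                  [3, 4, 255]], dtype=np.uint8)
qa_valid = [0, 1]  # clear land and clear water are considered valid

# boolean mask: True where the observation can be used
valid = np.isin(fmask, qa_valid)
```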
As a result, we initialize the EOCubeSceneCollection as follows:
[4]:
df_layers = df_layers
chunksize = 2**5
variables = ["b3", "b4", "b5"]
qa = "fmask"
qa_valid = [0, 1]

scoll = cube.EOCubeSceneCollection(df_layers=df_layers,
                                   chunksize=chunksize,
                                   variables=variables,
                                   qa=qa,
                                   qa_valid=qa_valid)
scoll.df_layers.head()
scoll.df_layers.head()
[4]:
| sceneid | band | date | uname | path
---|---|---|---|---|---|
0 | LT50350322008110PAC01 | b3 | 2008-04-19 | LT5_2008-04-19_b3 | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
1 | LT50350322008110PAC01 | b4 | 2008-04-19 | LT5_2008-04-19_b4 | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
2 | LT50350322008110PAC01 | b5 | 2008-04-19 | LT5_2008-04-19_b5 | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
3 | LT50350322008110PAC01 | fmask | 2008-04-19 | LT5_2008-04-19_fmask | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
4 | LE70350322008118EDC00 | b3 | 2008-04-27 | LE7_2008-04-27_b3 | /home/ben/Devel/Packages/eo-box/eobox/sampleda... |
Calculation of level-3 features¶
Virtual time series¶
[5]:
dst_dir = "./xxx_uncontrolled/ls2008_vts4w"
dst_pattern = "./xxx_uncontrolled/ls2008_vts4w/ls2008_vts4w_{date}_{var}.vrt"
freq = "4W"
idx_virtual = pd.date_range(start="2008-01-01", end="2008-12-31", freq=freq)
idx_virtual

scoll.create_virtual_time_series(idx_virtual=idx_virtual,
                                 dst_pattern=dst_pattern,
                                 dtypes="int16",
                                 compress="lzw",
                                 nodata=None,
                                 num_workers=8)
4it [00:04, 1.25s/it]
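Conceptually, a virtual time series interpolates the observations at irregular acquisition dates onto the regular 4-weekly index. A plain pandas sketch of that idea for a single pixel (synthetic values, not the eobox internals):

```python
import pandas as pd

# one pixel's valid observations at irregular acquisition dates
obs = pd.Series([100.0, 140.0, 120.0],
                index=pd.to_datetime(["2008-04-19", "2008-06-06", "2008-12-07"]))

# the regular target dates, as in idx_virtual above (13 dates for 2008)
idx_virtual = pd.date_range(start="2008-01-01", end="2008-12-31", freq="4W")

# union of both indices -> interpolate linearly in time -> keep virtual dates;
# dates before the first observation remain NaN
vts = (obs.reindex(obs.index.union(idx_virtual))
          .interpolate(method="time")
          .reindex(idx_virtual))
```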
Create a time series layer stack (VRT) for each variable.
[6]:
for var in scoll.variables:
    input_file_list = np.sort(list(Path(dst_dir).glob(f"*{var}*")))
    output_file = Path(dst_dir) / "time_series_stacks" / f"ls_2008_vts__2008__{var}__vts4w.vrt"
    print(output_file)
    gdalutils.buildvrt(input_file_list, output_file, relative=True, separate=True)
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b3__vts4w.vrt
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b4__vts4w.vrt
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b5__vts4w.vrt
Statistical features¶
Such as mean, std, min, max, percentiles, iqr, …
TODO: Metrics such as regression slope (ordered by time), … - What else can be interesting?
[7]:
dst_dir = "./xxx_uncontrolled/ls2008_stats1yr"
dst_pattern = "./xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_{metric}_{var}.vrt"
percentiles = [.05, .1, .25, .50, .75, .9, .95]
iqr = True
diffs = True

scoll.create_statistical_metrics(percentiles=percentiles,
                                 iqr=iqr,
                                 diffs=diffs,
                                 dst_pattern=dst_pattern,
                                 dtypes="int16",
                                 compress="lzw",
                                 nodata=None,
                                 num_workers=8)
4it [00:10, 2.58s/it]
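Conceptually, these metrics are computed per pixel over the valid observations of a variable. A plain numpy sketch for a single pixel's time series (synthetic values, not the eobox internals), covering the percentile, IQR-style difference, mean and std layers produced above:

```python
import numpy as np

# one pixel's time series for one band; NaN marks a masked observation
obs = np.array([8.0, 2.0, 4.0, np.nan, 6.0])

# percentiles over the valid observations only
p25, p50, p75 = np.nanpercentile(obs, [25, 50, 75])
iqr = p75 - p25                          # cf. the p75-p25 layer
mean, std = np.nanmean(obs), np.nanstd(obs)
```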
Create a statistical metrics layer stack (VRT) for each variable.
[26]:
np.sort(list(Path(dst_dir).glob("*_p??_*b5*")))
[26]:
array([PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p05_b5.vrt'),
PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p10_b5.vrt'),
PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p25_b5.vrt'),
PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p50_b5.vrt'),
PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p75_b5.vrt'),
PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p90_b5.vrt'),
PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p95_b5.vrt')],
dtype=object)
[37]:
for var in scoll.variables:
    print("*" * 80)
    print(var)
    input_file_list = list(Path(dst_dir).glob(f"*_min_*{var}*"))
    input_file_list += list(np.sort(list(Path(dst_dir).glob(f"*_p??_*{var}*"))))
    input_file_list += list(Path(dst_dir).glob(f"*_max_*{var}*"))
    input_file_list += list(np.sort(list(Path(dst_dir).glob(f"*_p??-p??_*{var}*"))))
    input_file_list += list(Path(dst_dir).glob(f"*_mean_*{var}*"))
    input_file_list += list(Path(dst_dir).glob(f"*_std_*{var}*"))
    output_file = Path(dst_dir) / "time_series_stacks" / f"ls_2008_stats__2008__{var}__stats1yr.vrt"
    print(output_file)
    print("Layers:")
    print("\n".join([p.stem for p in input_file_list]))
    gdalutils.buildvrt(input_file_list, output_file, relative=True, separate=True)
********************************************************************************
b3
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b3__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b3
ls2008_stats1yr_p05_b3
ls2008_stats1yr_p10_b3
ls2008_stats1yr_p25_b3
ls2008_stats1yr_p50_b3
ls2008_stats1yr_p75_b3
ls2008_stats1yr_p90_b3
ls2008_stats1yr_p95_b3
ls2008_stats1yr_max_b3
ls2008_stats1yr_p75-p25_b3
ls2008_stats1yr_p90-p10_b3
ls2008_stats1yr_p95-p05_b3
ls2008_stats1yr_mean_b3
ls2008_stats1yr_std_b3
********************************************************************************
b4
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b4__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b4
ls2008_stats1yr_p05_b4
ls2008_stats1yr_p10_b4
ls2008_stats1yr_p25_b4
ls2008_stats1yr_p50_b4
ls2008_stats1yr_p75_b4
ls2008_stats1yr_p90_b4
ls2008_stats1yr_p95_b4
ls2008_stats1yr_max_b4
ls2008_stats1yr_p75-p25_b4
ls2008_stats1yr_p90-p10_b4
ls2008_stats1yr_p95-p05_b4
ls2008_stats1yr_mean_b4
ls2008_stats1yr_std_b4
********************************************************************************
b5
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b5__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b5
ls2008_stats1yr_p05_b5
ls2008_stats1yr_p10_b5
ls2008_stats1yr_p25_b5
ls2008_stats1yr_p50_b5
ls2008_stats1yr_p75_b5
ls2008_stats1yr_p90_b5
ls2008_stats1yr_p95_b5
ls2008_stats1yr_max_b5
ls2008_stats1yr_p75-p25_b5
ls2008_stats1yr_p90-p10_b5
ls2008_stats1yr_p95-p05_b5
ls2008_stats1yr_mean_b5
ls2008_stats1yr_std_b5