eo-box

eobox is a Python package with a small collection of tools for working with Remote Sensing / Earth Observation data.

Package Overview

So far, the following subpackages are available:

  • eobox.sampledata contains small sample data that can be used for playing around and testing.

  • eobox.raster contains raster processing tools for

    • extracting raster values at given (by vector data) locations,

    • window- / chunk-wise processing of multiple single layer raster files as a stack.

  • eobox.vector contains vector processing tools for

    • clean conversion of polygons to lines and

    • distance-to-polygon border calculation.

Installation

The package requires Python 3. It can be installed with the following command:

pip3 install eobox
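
To verify the installation, the subpackages listed in the overview above should be importable, e.g.:

# quick import check (a sketch; the subpackages are the ones documented in the overview above)
import eobox
from eobox import raster, sampledata, vector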

Docker

Note

This is work in progress. I am happy about any suggestions on how to improve the Dockerfile and its handling.

Built upon jupyter/scipy-notebook

Based on Jupyter Docker Stacks’ jupyter/scipy-notebook and inspired by SCiO-systems/cgspatial-notebook.

With the following changes with respect to cgspatial-notebook:

  • Built upon jupyter/scipy-notebook not jupyter/datascience-notebook.

  • ppa:ubuntugis/ubuntugis-unstable instead of ppa:ubuntugis/ppa

  • No R/MaxEnt stuff.

  • Installed dependencies in requirements-dev-examples.txt.

  • Added folders eobox, eo-box-examples and installed eobox in editable mode.

Build:

docker image build -t benmack/eobox-notebook:2020-08-11 -f docker/eobox-notebook.dockerfile .

Run - Jupyter Notebook (this is the default):

docker run benmack/eobox-notebook:latest

Run - Bash:

docker run --rm benmack/eobox-notebook:latest bash -c ls

Run - Python, e.g. test if eobox can be imported:

docker run --rm benmack/eobox-notebook:latest python -c 'import eobox'

Run - (interactive) IPython:

docker run -it --rm benmack/eobox-notebook:latest ipython

Run - Jupyter Lab and mount the sample data as a volume (assuming you are in the root dir of the repository):

docker run -p 8080:8888 -v ${PWD}/eobox/sampledata/data:/home/jovyan/data benmack/eobox-notebook:latest jupyter lab

If you have problems with the default 8888:8888 port mapping, you can change it as in the example above. In this case you need to open the respective port in your browser (http://127.0.0.1:8080/tree) and enter the token you see in the logs.

See other Docker Options that might (or might not) work with this image.

Built upon osgeo/gdal:ubuntu-small-latest

Build & run:

docker image build -t benmack/eobox:latest -f docker/eobox.dockerfile .

docker run  -v ${PWD}:/home/eoboxer/host_pwd -p 8888:8888 benmack/eobox:latest

Push a new Docker Image

Currently:

  • Make changes in the image, e.g. docker/eobox-notebook.dockerfile.

  • Change the respective version file, e.g. version_eobox-notebook

  • Push to Docker Hub:

    docker push benmack/eobox-notebook:latest
    docker push benmack/eobox-notebook:0.0.1
    

TODO:

In the future, install eobox from a release so that it is clear which exact version of the code is in the image.

Improve versioning and automate releases. Possible starting point: How to Version Your Docker Images.

sampledata

Create S2 L1C Sample Data

Developing in the domain of earth observation, we sometimes need a small dataset to play around with, to add as sample data to code or packages we share, or to use for testing our code.
This notebook shows how to clip a small window from a Sentinel-2 (S2) Level-1C (i.e. top-of-atmosphere reflectance) scene.

An S2 L1C scene (S2A_MSIL1C_20170216T102101_N0204_R065_T33UUU_20170216T102204.SAFE) downloaded from the Copernicus Open Access Hub comes as several individual JPEG2000 raster files with different resolutions. In this notebook we clip and save a small window from these layers while keeping the resolutions intact. The data created here is included as the sample and test dataset s2l1c in the eo-box package.
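
Once the package is installed, the resulting dataset can be accessed through the sampledata module (a short sketch; this is also how the extraction tutorials below load it):

from eobox import sampledata

dataset = sampledata.get_dataset("s2l1c")
print(dataset["vector_file"])        # reference polygons (GeoPackage)
print(len(dataset["raster_files"]))  # paths to the clipped .jp2 band files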

Let’s first import the packages we need:

[2]:
from affine import Affine
import glob
import matplotlib.pyplot as plt
import os
import pandas as pd
import rasterio
from rasterio.windows import Window
import seaborn as sns

%matplotlib inline

Now we need the layer files and some information from the full-tile S2 L1C scene.

[5]:
scene_dir = r'xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0204_R065_T33UUU_20170216T102204.SAFE'
layer_files = glob.glob(f'{scene_dir}/**/*_[B]??.jp2', recursive=True)
parts = [os.path.basename(file).split(".")[0].split("_") for file in layer_files]
layers_df = pd.DataFrame(parts, columns=["tile", "datetime", "band"])
layers_df["file"] = layer_files
layers_df = layers_df.set_index("band").sort_index()
layers_df
[5]:
tile datetime file
band
B01 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B02 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B03 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B04 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B05 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B06 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B07 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B08 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B09 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B10 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B11 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B12 T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...
B8A T33UUU 20170216T102101 xxx_uncontrolled/S2A_MSIL1C_20170216T102101_N0...

We want to clip the same small spatial area from all layers. Let’s get the window from one of the 60m layers.

[6]:
win_60m = Window(col_off=500, row_off=1300, width=2**8, height=2**7)
with rasterio.open(layers_df.at["B01", "file"]) as src:
    meta_60m = src.meta
    arr_60m = src.read(1, window=win_60m)

fig, ax = plt.subplots(1, 1, figsize=(16, 8))
sns.heatmap(arr_60m, robust=True, cbar=False)
[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4985dc9748>
_images/examples_sampledata_create_s2_l1c_sample_data_5_1.png

Since the layers have different resolutions / cell sizes, we need to derive the corresponding windows for the other layers. This is implemented in the following function.

[7]:
def transform_window(window_src, transform_src, transform_dst):
    spatial_bounds = rasterio.windows.bounds(window=window_src, transform=transform_src,
                                             height=0, width=0)  # defaults
    window_dst = rasterio.windows.from_bounds(spatial_bounds[0],
                                              spatial_bounds[1],
                                              spatial_bounds[2],
                                              spatial_bounds[3],
                                              transform=transform_dst,
                                              height=None, width=None, precision=None)  # defaults
    return window_dst

Visually and with respect to the window relations the function seems to do what it should.

[8]:
with rasterio.open(layers_df.at["B02", "file"]) as src:
    meta_10m = src.meta
    win_10m = transform_window(window_src=win_60m,
                               transform_src=meta_60m["transform"],
                               transform_dst=meta_10m["transform"])
    arr_10m = src.read(1, window=win_10m)

assert(win_10m.height == win_60m.height * 6)
assert(win_10m.width == win_60m.width * 6)
assert(win_10m.row_off == win_60m.row_off * 6)
assert(win_10m.col_off == win_60m.col_off * 6)

fig, ax = plt.subplots(1, 1, figsize=(16, 8))
sns.heatmap(arr_10m, robust=True, cbar=False)
[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f498370b0f0>
_images/examples_sampledata_create_s2_l1c_sample_data_9_1.png

Now we can loop through all the layers and save the subsets. The destination paths of the subset files are printed below.

[13]:
dst_dir = r'xxx_uncontrolled/s2l1c'
# from the 60 m
win = Window(col_off=500, row_off=1300, width=2**8, height=2**7)
transform = Affine(60.0, 0.0, 300000.0, 0.0, -60.0, 5900040.0)
blockxsize = blockysize = 2**6

for filename_src in layers_df["file"].values:
    filename_dst = os.path.join(dst_dir, *filename_src.split("/")[-2:])
    print(filename_dst)
    if not os.path.exists(os.path.dirname(filename_dst)):
        os.makedirs(os.path.dirname(filename_dst))
    if not os.path.exists(filename_dst):
        with rasterio.open(filename_src) as src:
            win_this = transform_window(window_src=win,
                                        transform_src=transform,
                                        transform_dst=src.meta["transform"])
            arr = src.read(window=win_this)
            kwargs = src.meta.copy()
            kwargs.update({
                'height': win_this.height,
                'width': win_this.width,
                'transform': rasterio.windows.transform(win_this, src.transform),
                'compress': "JPEG2000"})
            with rasterio.open(filename_dst, 'w', **kwargs,
                               tiled=True, blockxsize=blockxsize, blockysize=blockysize) as dst:
                dst.write(arr)
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B01.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B02.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B03.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B04.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B05.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B06.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B07.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B08.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B09.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B10.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B11.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B12.jp2
xxx_uncontrolled/s2l1c/IMG_DATA/T33UUU_20170216T102101_B8A.jp2

The End.

Get Landsat Time Series (lsts) Sample Data

In this notebook we download the Forest in Colorado, USA (P035-R032) dataset from Chris Holden's landsat_stack repository and save part of the data as single layer TIFF files.

Specifically, we run through the following steps:

  • We download and extract the data.

  • We save the RED, NIR and SWIR1 bands and the FMASK band from the years 2008 to 2013 as single layer TIFF files.

  • We delete the stack.

  • We create a dataframe with all .tif files.

The following is a list of all Landsat / Fmask bands contained in the download. The ones we keep are Band 3 (RED), Band 4 (NIR), Band 5 (SWIR1) and Fmask.

  • Band 1 SR (SR * 10000)

  • Band 2 SR (SR * 10000)

  • Band 3 SR (SR * 10000)

  • Band 4 SR (SR * 10000)

  • Band 5 SR (SR * 10000)

  • Band 7 SR (SR * 10000)

  • Band 6 Thermal Brightness (C * 100)

  • Fmask

    • 0 - clear land

    • 1 - clear water

    • 2 - cloud

    • 3 - snow

    • 4 - shadow

    • 255 - NoData

The dataset is included as the sample dataset lsts (Landsat time series) in the eo-box package.
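
Once created, the lsts dataset can be loaded through eobox's sampledata module (a short sketch; this is also how the later tutorials access it):

from eobox import sampledata

dataset = sampledata.get_dataset("lsts")
print(len(dataset["raster_files"]))  # paths to the single-layer .tif files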

First, download and extract the data.

[1]:
! wget http://ftp-earth.bu.edu/public/ceholden/landsat_stacks/p035r032.tar.bz2
!tar xf p035r032.tar.bz2
--2019-02-03 13:51:17--  http://ftp-earth.bu.edu/public/ceholden/landsat_stacks/p035r032.tar.bz2
Resolving ftp-earth.bu.edu (ftp-earth.bu.edu)... 128.197.229.67
Connecting to ftp-earth.bu.edu (ftp-earth.bu.edu)|128.197.229.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11963073 (11M) [application/x-bzip2]
Saving to: ‘p035r032.tar.bz2’

p035r032.tar.bz2    100%[===================>]  11,41M   535KB/s    in 28s

2019-02-03 13:51:46 (418 KB/s) - ‘p035r032.tar.bz2’ saved [11963073/11963073]

Imports and the helper functions.

[3]:
import os
import pandas as pd
from pathlib import Path
import shutil
import subprocess

def save_as_single_layer_file(src_dir, overwrite=False, remove_stack=True):
    keep = [2, 3, 4, 7]  # 0-based indices into band_names: b3, b4, b5, fmask
    band_names = ["b1", "b2", "b3", "b4", "b5", "b7", "b6", "fmask"]
    src = list(src_dir.glob("*gtif"))[0]
    for bindex, bname in enumerate(band_names):
        if bindex not in keep:
            continue

        dst_dir = src_dir.parent.parent / src_dir.stem
        dst_dir.mkdir(exist_ok=True)
        dst = dst_dir / f"{src_dir.stem.split('_')[0]}_{band_names[bindex]}.tif"
        if (not dst.exists() or overwrite):
            ot = "Byte" if bname == "fmask" else "Int16"
            exit_code = subprocess.check_call(
                f"gdal_translate -ot {ot} -b {bindex+1} -co COMPRESS=DEFLATE {str(src)} {str(dst)}",
                shell=True)

Save the selected bands and fmask of the selected years as single layer files.

[4]:
bdir_scenes_single = Path("./p035r032")
bdir_scenes = Path("./p035r032/images")
scene_dirs = list(bdir_scenes.glob("L*"))

counter = 0
for i, sdir in enumerate(scene_dirs):
    if int(sdir.stem[9:13]) < 2008:
        continue
    counter += 1
    print(f"{counter} / {len(scene_dirs)} - {sdir}")
    save_as_single_layer_file(src_dir=sdir, overwrite=False)
1 / 447 - p035r032/images/LE70350322009312EDC00
2 / 447 - p035r032/images/LE70350322010283EDC00
3 / 447 - p035r032/images/LE70350322011110EDC00
4 / 447 - p035r032/images/LT50350322011278PAC01
5 / 447 - p035r032/images/LE70350322008262EDC00
6 / 447 - p035r032/images/LT50350322009256PAC01
7 / 447 - p035r032/images/LT50350322009128PAC01
8 / 447 - p035r032/images/LT50350322009208PAC01
9 / 447 - p035r032/images/LT50350322011246PAC01
10 / 447 - p035r032/images/LE70350322008214EDC00
11 / 447 - p035r032/images/LE70350322009072EDC00
12 / 447 - p035r032/images/LE70350322012257EDC01
13 / 447 - p035r032/images/LE70350322012241EDC00
14 / 447 - p035r032/images/LE70350322009248EDC00
15 / 447 - p035r032/images/LE70350322008198EDC00
16 / 447 - p035r032/images/LT50350322010147PAC01
17 / 447 - p035r032/images/LT50350322008190PAC01
18 / 447 - p035r032/images/LE70350322012113EDC00
19 / 447 - p035r032/images/LE70350322011238EDC00
20 / 447 - p035r032/images/LE70350322011190EDC00
21 / 447 - p035r032/images/LE70350322012081EDC00
22 / 447 - p035r032/images/LE70350322009120EDC00
23 / 447 - p035r032/images/LE70350322011174EDC00
24 / 447 - p035r032/images/LE70350322009088EDC00
25 / 447 - p035r032/images/LT50350322010259PAC02
26 / 447 - p035r032/images/LT50350322008174PAC01
27 / 447 - p035r032/images/LE70350322012209EDC00
28 / 447 - p035r032/images/LE70350322012337EDC00
29 / 447 - p035r032/images/LT50350322011262PAC01
30 / 447 - p035r032/images/LT50350322009240PAC02
31 / 447 - p035r032/images/LE70350322008118EDC00
32 / 447 - p035r032/images/LT50350322011294PAC01
33 / 447 - p035r032/images/LT50350322008126PAC01
34 / 447 - p035r032/images/LT50350322008270PAC01
35 / 447 - p035r032/images/LE70350322010267EDC00
36 / 447 - p035r032/images/LE70350322012273EDC00
37 / 447 - p035r032/images/LE70350322010187EDC01
38 / 447 - p035r032/images/LE70350322008326EDC00
39 / 447 - p035r032/images/LT50350322010179EDC00
40 / 447 - p035r032/images/LT50350322008222PAC01
41 / 447 - p035r032/images/LT50350322009288PAC01
42 / 447 - p035r032/images/LE70350322011206EDC00
43 / 447 - p035r032/images/LE70350322010155EDC00
44 / 447 - p035r032/images/LT50350322009272PAC01
45 / 447 - p035r032/images/LE70350322012129EDC00
46 / 447 - p035r032/images/LE70350322010219EDC03
47 / 447 - p035r032/images/LE70350322013131EDC00
48 / 447 - p035r032/images/LE70350322012193EDC00
49 / 447 - p035r032/images/LE70350322012289EDC00
50 / 447 - p035r032/images/LE70350322012161EDC00
51 / 447 - p035r032/images/LT50350322011118PAC01
52 / 447 - p035r032/images/LT50350322009224PAC01
53 / 447 - p035r032/images/LT50350322008110PAC01
54 / 447 - p035r032/images/LE70350322009216EDC00
55 / 447 - p035r032/images/LE70350322008342EDC00
56 / 447 - p035r032/images/LT50350322010275PAC01
57 / 447 - p035r032/images/LE70350322010203EDC00
58 / 447 - p035r032/images/LE70350322009232EDC00
59 / 447 - p035r032/images/LT50350322010211EDC00
60 / 447 - p035r032/images/LE70350322011270EDC00
61 / 447 - p035r032/images/LT50350322010195EDC00
62 / 447 - p035r032/images/LE70350322008246EDC00
63 / 447 - p035r032/images/LT50350322011134PAC01
64 / 447 - p035r032/images/LT50350322011166PAC01
65 / 447 - p035r032/images/LE70350322012145EDC00
66 / 447 - p035r032/images/LT50350322009112PAC01
67 / 447 - p035r032/images/LT50350322009176PAC01
68 / 447 - p035r032/images/LE70350322012225EDC00
69 / 447 - p035r032/images/LT50350322008302PAC01
70 / 447 - p035r032/images/LT50350322011182PAC01
71 / 447 - p035r032/images/LT50350322011230PAC01
72 / 447 - p035r032/images/LT50350322009192PAC01
73 / 447 - p035r032/images/LT50350322008206PAC01
74 / 447 - p035r032/images/LE70350322009296EDC00
75 / 447 - p035r032/images/LE70350322013147EDC00
76 / 447 - p035r032/images/LT50350322008158PAC01
77 / 447 - p035r032/images/LT50350322008238PAC01
78 / 447 - p035r032/images/LE70350322009136EDC00
79 / 447 - p035r032/images/LT50350322010099PAC01
80 / 447 - p035r032/images/LT50350322008142PAC01
81 / 447 - p035r032/images/LE70350322012305EDC00
82 / 447 - p035r032/images/LT50350322009160PAC01
83 / 447 - p035r032/images/LE70350322010235EDC00
84 / 447 - p035r032/images/LE70350322012177EDC00
85 / 447 - p035r032/images/LE70350322010171EDC00
86 / 447 - p035r032/images/LE70350322011142EDC00
87 / 447 - p035r032/images/LE70350322011286EDC00
88 / 447 - p035r032/images/LE70350322008166EDC00
89 / 447 - p035r032/images/LT50350322008286PAC01
90 / 447 - p035r032/images/LE70350322008230EDC00
91 / 447 - p035r032/images/LE70350322012321EDC00
92 / 447 - p035r032/images/LE70350322011254EDC00
93 / 447 - p035r032/images/LE70350322013115EDC00
94 / 447 - p035r032/images/LE70350322011158EDC00
95 / 447 - p035r032/images/LE70350322009184EDC00
96 / 447 - p035r032/images/LT50350322011214PAC01
97 / 447 - p035r032/images/LT50350322011198PAC01
98 / 447 - p035r032/images/LE70350322012097EDC00
99 / 447 - p035r032/images/LT50350322010243EDC00
100 / 447 - p035r032/images/LE70350322009200EDC00
101 / 447 - p035r032/images/LT50350322010307EDC00
102 / 447 - p035r032/images/LT50350322010227EDC00
103 / 447 - p035r032/images/LE70350322011222EDC00
104 / 447 - p035r032/images/LE70350322008150EDC00
105 / 447 - p035r032/images/LE70350322008182EDC00

Delete the image directory which contained the complete downloaded data.

[6]:
shutil.rmtree(bdir_scenes)

Let's derive the paths and some metadata derived from them, and put the info in a dataframe.

[8]:
layers_paths = list(Path(bdir_scenes_single).rglob("*.tif"))
layers_df = pd.Series([p.stem for p in layers_paths]).str.split("_", expand=True) \
    .rename({0: "sceneid", 1:"band"}, axis=1)
layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
layers_df["path"] = layers_paths
layers_df = layers_df.sort_values(["date", "band"])
layers_df = layers_df.reset_index(drop=True)
layers_df.head(10)
[8]:
sceneid band date path
0 LT50350322008110PAC01 b3 2008-04-19 p035r032/LT50350322008110PAC01/LT5035032200811...
1 LT50350322008110PAC01 b4 2008-04-19 p035r032/LT50350322008110PAC01/LT5035032200811...
2 LT50350322008110PAC01 b5 2008-04-19 p035r032/LT50350322008110PAC01/LT5035032200811...
3 LT50350322008110PAC01 fmask 2008-04-19 p035r032/LT50350322008110PAC01/LT5035032200811...
4 LE70350322008118EDC00 b3 2008-04-27 p035r032/LE70350322008118EDC00/LE7035032200811...
5 LE70350322008118EDC00 b4 2008-04-27 p035r032/LE70350322008118EDC00/LE7035032200811...
6 LE70350322008118EDC00 b5 2008-04-27 p035r032/LE70350322008118EDC00/LE7035032200811...
7 LE70350322008118EDC00 fmask 2008-04-27 p035r032/LE70350322008118EDC00/LE7035032200811...
8 LT50350322008126PAC01 b3 2008-05-05 p035r032/LT50350322008126PAC01/LT5035032200812...
9 LT50350322008126PAC01 b4 2008-05-05 p035r032/LT50350322008126PAC01/LT5035032200812...

Reformat the data such that we can check if some of the scenes have missing bands.

[9]:
counts_bands_per_sceneid = layers_df[["sceneid", "band", "path"]] \
    .pivot_table(index="sceneid", columns="band", aggfunc="count")
display(counts_bands_per_sceneid.head(2))
display(counts_bands_per_sceneid.tail(2))
counts_bands_per_sceneid.apply("sum", axis=0)
path
band b3 b4 b5 fmask
sceneid
LE70350322008118EDC00 1 1 1 1
LE70350322008150EDC00 1 1 1 1
path
band b3 b4 b5 fmask
sceneid
LT50350322011278PAC01 1 1 1 1
LT50350322011294PAC01 1 1 1 1
[9]:
      band
path  b3       105
      b4       105
      b5       105
      fmask    105
dtype: int64

Which is not the case (;

The End

raster

Extract Raster Values with Vector Data

Extracting raster values of pixels overlapping with vector data is one of the most common tasks for developers of earth observation applications. This notebook shows on the basis of a small toy dataset how to perform this task in Python with the eobox package.

In general there are two main approaches to solve this task in Python. (I am eager to know if you see another approach or additional advantages or disadvantages that should be added to this list.)

  1. The rasterstats-way:

    • For each feature (point, line, polygon) in the vector dataset

      • get the geometry of the feature,

      • get the bounding box for that geometry,

      • get the raster data corresponding to the bounding box as array (extraction core),

      • rasterize the geometry such that it corresponds to the array,

      • return only the raster values overlapping with the rasterized geometry.

A way of doing this in Python can be found in the gen_zonal_stats function implemented in the rasterstats package.

  2. The rasterize-way:

    • Rasterize the whole vector dataset such that the rasterized raster matches the raster data from which to extract the values (e.g. with the gdal_rasterize function).

    • load the rasterized data as array and create a mask,

    • load the raster from which to extract the data as array and select the values to be kept by the mask.

This tutorial shows how to use the extract function implemented in the eobox package to go this way.
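
For illustration, the following is a minimal sketch of the rasterize-way with plain rasterio (not the eobox implementation; the file names and the pid column are placeholder assumptions):

import geopandas as gpd
import rasterio
from rasterio import features

# hypothetical inputs: a single-band raster and polygons with a unique, non-zero 'pid' column
with rasterio.open("raster.tif") as src:
    arr = src.read(1)
    gdf = gpd.read_file("polygons.gpkg")
    burned = features.rasterize(
        zip(gdf.geometry, gdf["pid"]),  # burn the polygon id into each covered pixel
        out_shape=src.shape,
        transform=src.transform,
        fill=0,
        dtype="int32",
    )

mask = burned != 0
values = arr[mask]   # raster values of all pixels covered by a polygon
pids = burned[mask]  # the polygon id each extracted value belongs to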

There is a major difference between the two approaches which might help you to find the best one for your case: the first approach is memory-sparing since the whole raster does not need to be loaded into memory at once, which is necessary in the second approach. However, if your vector dataset contains many features, it might be much faster to go for the second approach since it is not necessary to loop through all the features in Python.

Let's get started and import the required modules.

[1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import fnmatch
import geopandas as gpd
import os
import pandas as pd
from pathlib import Path

from eobox.raster import extraction
from eobox.raster.extraction import convert_df_to_geodf
from eobox import sampledata

We now load a small sample dataset from which we can derive all required arguments for running the extract function.

Note that the sample dataset is S2 L1C data that comes in different resolutions. Extraction from multiple resolutions is not supported, and handling it is (so far) the responsibility of the user.

Here we only select the 10 m bands for extraction.

[2]:
dataset = sampledata.get_dataset("s2l1c")

src_vector = dataset["vector_file"]
burn_attribute = "pid"  # should be unique for the polygons and not contain zero
src_raster = fnmatch.filter(dataset["raster_files"], "*B0[2,3,4,8]*")  # 10 m bands
dst_names = ["_".join(Path(src).stem.split("_")[1::]) for src in src_raster]
extraction_dir = Path("s2l1c_ref__s2l1c/s2_l1c/10m")

print("***ARGUMENTS***")
print("\nsrc_vector:\n-----------\n -", src_vector)
print("\ndst_dir:\n--------\n -", extraction_dir)
print("\nsrc_raster:\n-----------\n -", "\n - ".join(src_raster))
print("\ndst_names:\n----------\n -", "\n - ".join(dst_names))
print("\n***ARGUMENTS - END***")
***ARGUMENTS***

src_vector:
-----------
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/s2l1c_ref.gpkg

dst_dir:
--------
 - s2l1c_ref__s2l1c/s2_l1c/10m

src_raster:
-----------
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/IMG_DATA/T33UUU_20170216T102101_B08.jp2
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/IMG_DATA/T33UUU_20170216T102101_B02.jp2
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/IMG_DATA/T33UUU_20170216T102101_B03.jp2
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/IMG_DATA/T33UUU_20170216T102101_B04.jp2

dst_names:
----------
 - 20170216T102101_B08
 - 20170216T102101_B02
 - 20170216T102101_B03
 - 20170216T102101_B04

***ARGUMENTS - END***

The vector dataset looks as follows:

[3]:
gdf = gpd.read_file(src_vector)
gdf.plot(column="cid", figsize=(18, 8))
[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f20f4505c50>
_images/examples_raster_extract_raster_values_with_vector_data_5_1.png

Note that the cid column represents the classes water (cid: 1) and forest (cid: 2). Even though we are usually interested in such an attribute, we use the pid column, which contains unique IDs for the polygons, as the burn attribute. Using a unique ID for each feature is recommended since it assures that for each extracted pixel the polygon it belongs to can easily be identified. We will see later that the cid information, or any other information from the vector data attribute table, can easily be merged into the extracted data.

But now let’s do the extraction.

[4]:
%%time
extraction.extract(src_vector=src_vector,
                   burn_attribute=burn_attribute,
                   src_raster=src_raster,
                   dst_names=dst_names,
                   dst_dir=extraction_dir)
100%|██████████| 4/4 [00:00<00:00, 11.67it/s]
CPU times: user 2.18 s, sys: 343 ms, total: 2.53 s
Wall time: 397 ms

[4]:
0

So what just happened? Roughly, the function did what was described above as the rasterize-way and stored the extracted data and some auxiliary information in .npy files:

[5]:
list(extraction_dir.glob("*.npy"))
[5]:
[PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/aux_coord_x.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/20170216T102101_B04.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/20170216T102101_B02.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/aux_coord_y.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/20170216T102101_B08.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/aux_vector_pid.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/20170216T102101_B03.npy')]

Note that the aux(iliary) variables are

  • aux_coord_x: the x coordinates of the pixels,

  • aux_coord_y: the y coordinates of the pixels,

  • aux_vector_pid: the polygon ID (here), or whatever attribute has been passed as burn_attribute before.

The rest of the variables contain the data extracted from the raster layers.

  • 20170216T102101_B02.npy

  • 20170216T102101_B03.npy

  • 20170216T102101_B04.npy

  • 20170216T102101_B08.npy
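
If you want to inspect these files directly, you can load them with NumPy (a minimal sketch; load_extracted, used below, is the more convenient way):

import numpy as np

pid = np.load(extraction_dir / "aux_vector_pid.npy")
b04 = np.load(extraction_dir / "20170216T102101_B04.npy")
print(pid.shape, b04.shape)  # one value per extracted pixel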

It is worth noting that the function checks what has already been extracted by using the dst_names. Thus, only the dst_names that already exist are skipped. So far, it is the responsibility of the user to delete existing bands in case they should be re-extracted, as sketched below.
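
For example, to force re-extraction of a single band you could remove the corresponding .npy file first (a sketch; adapt the file name to the bands you want to re-extract):

for f in extraction_dir.glob("20170216T102101_B02.npy"):
    f.unlink()  # the band will be extracted again on the next extract() call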

With the function load_extracted it is easy to load the data. It has a couple of options which return different dataframes:

  • The defaults return all variables in columns.

  • patterns can be used to load specific variables matching a list of patterns.

[6]:
extraction.load_extracted(extraction_dir, patterns=["aux_*.npy"]).head()
[6]:
aux_coord_x aux_coord_y aux_vector_pid
0 344545.0 5822035.0 17
1 344555.0 5822035.0 17
2 344615.0 5821885.0 6
3 344625.0 5821885.0 6
4 344615.0 5821875.0 6
[7]:
extraction.load_extracted(extraction_dir, patterns=["aux_vector_pid.npy", "*_B??.npy"]).head()
[7]:
aux_vector_pid 20170216T102101_B04 20170216T102101_B02 20170216T102101_B08 20170216T102101_B03
0 17 928 1456 672 1072
1 17 928 1392 672 1040
2 6 928 1456 608 1072
3 6 864 1424 608 1072
4 6 928 1424 608 1072
  • vars_in_cols can be used to control whether the variables are loaded into the columns (the default) or into the rows.

    This does not make much sense here, but if you have, e.g., a long NDVI time series and want to do time series analysis, you might want to have the dates in the rows.

[8]:
extraction.load_extracted(extraction_dir, patterns=["*B??.npy"], vars_in_cols=False).head()
[8]:
0 1 2 3 4 5 6 7 8 9 ... 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499
20170216T102101_B04 928 928 928 864 928 864 864 864 864 928 ... 880 912 880 848 848 848 816 816 816 816
20170216T102101_B02 1456 1392 1456 1424 1424 1424 1424 1392 1424 1392 ... 1448 1464 1448 1432 1448 1480 1432 1448 1400 1400
20170216T102101_B08 672 672 608 608 608 608 608 608 608 608 ... 624 656 592 560 592 592 592 560 560 560
20170216T102101_B03 1072 1040 1072 1072 1072 1072 1040 1040 1040 1072 ... 1056 1120 1056 1056 1056 1056 1056 1056 1056 1056

4 rows × 3500 columns

You can also get the head of the data loaded by load_extracted, i.e. the first five rows.

[9]:
extraction.load_extracted(extraction_dir, vars_in_cols=False, head=True)
[9]:
0 1 2 3 4 5
aux_coord_x 344545.0 344555.0 344615.0 344625.0 344615.0 344625.0
20170216T102101_B04 928.0 928.0 928.0 864.0 928.0 864.0
20170216T102101_B02 1456.0 1392.0 1456.0 1424.0 1424.0 1424.0
aux_coord_y 5822035.0 5822035.0 5821885.0 5821885.0 5821875.0 5821875.0
20170216T102101_B08 672.0 672.0 608.0 608.0 608.0 608.0
aux_vector_pid 17.0 17.0 6.0 6.0 6.0 6.0
20170216T102101_B03 1072.0 1040.0 1072.0 1072.0 1072.0 1072.0

And finally you can convert the data to a geodataframe.

[11]:
df_extracted = extraction.load_extracted(extraction_dir, patterns=["aux_*.npy", "*_B??.npy"])
gdf_extracted = convert_df_to_geodf(df_extracted, crs=extraction_dir)
gdf_extracted.plot(column='aux_vector_pid', marker='x', figsize=(20, 12))
[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f20bc8b81d0>
_images/examples_raster_extract_raster_values_with_vector_data_19_1.png

It has been said above that it is easy to join the extracted values to the values stored in the attribute table of the vector file.

[12]:
df_extracted = df_extracted.merge(gdf[["pid", "cid", "comment"]], how="left", left_on="aux_vector_pid", right_on="pid")
df_extracted.head()
[12]:
aux_coord_x aux_coord_y aux_vector_pid 20170216T102101_B04 20170216T102101_B02 20170216T102101_B08 20170216T102101_B03 geometry pid cid comment
0 344545.0 5822035.0 17 928 1456 672 1072 POINT (344545 5822035) 17 1 partially out; expecting: 2 - with default ras...
1 344555.0 5822035.0 17 928 1392 672 1040 POINT (344555 5822035) 17 1 partially out; expecting: 2 - with default ras...
2 344615.0 5821885.0 6 928 1456 608 1072 POINT (344615 5821885) 6 1 None
3 344625.0 5821885.0 6 864 1424 608 1072 POINT (344625 5821885) 6 1 None
4 344615.0 5821875.0 6 928 1424 608 1072 POINT (344615 5821875) 6 1 None

And now you can start analyzing the data.

[13]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 4, figsize=(20, 6))
for ax, var in zip(axes.flatten(), ["20170216T102101_B02", "20170216T102101_B03", "20170216T102101_B04", "20170216T102101_B08"]):
    df_extracted.boxplot(column=var, by="cid", ax=ax)
_images/examples_raster_extract_raster_values_with_vector_data_23_0.png

Load Sample Subset from Extracted Data

[11]:
import fnmatch
import geopandas as gpd
import os
import pandas as pd
from pathlib import Path

from eobox.raster import extraction
from eobox import sampledata

%matplotlib inline
[12]:
dataset = sampledata.get_dataset("s2l1c")

src_vector = dataset["vector_file"]
burn_attribute = "pid"  # should be unique for the polygons and not contain zero
src_raster = fnmatch.filter(dataset["raster_files"], "*B0[2,3,4,8]*")  # 10 m bands
dst_names = ["_".join(Path(src).stem.split("_")[1::]) for src in src_raster]
extraction_dir = Path("s2l1c_ref__s2l1c/s2_l1c/10m")
extraction.extract(src_vector=src_vector,
                   burn_attribute=burn_attribute,
                   src_raster=src_raster,
                   dst_names=dst_names,
                   dst_dir=extraction_dir)
df_extracted = extraction.load_extracted(extraction_dir, "*pid.npy")
display(df_extracted.head())
index_29 = (df_extracted["aux_vector_pid"] == 29)
index_29.sum()
aux_vector_pid
0 17
1 17
2 6
3 6
4 6
[12]:
109
[13]:
display(df_extracted[index_29].head(2))
display(df_extracted[index_29].tail(2))
aux_vector_pid
839 29
840 29
aux_vector_pid
946 29
947 29
[14]:
df_extracted_29 = extraction.load_extracted(extraction_dir, index=index_29)
df_extracted_29.head()
[14]:
aux_coord_y 20170216T102101_B02 20170216T102101_B03 aux_coord_x aux_vector_pid 20170216T102101_B04 20170216T102101_B08
839 5820745.0 1456 1136 341675.0 29 1056 1696
840 5820745.0 1456 1136 341685.0 29 1056 1632
841 5820745.0 1456 1136 341695.0 29 1056 1632
842 5820745.0 1424 1136 341705.0 29 1056 1632
843 5820745.0 1488 1136 341715.0 29 1056 1632
[15]:
df_extracted_29 = extraction.load_extracted(extraction_dir, vars_in_cols=False, index=index_29)
df_extracted_29.head()
[15]:
839 840 841 842 843 844 845 846 847 848 ... 938 939 940 941 942 943 944 945 946 947
aux_coord_y 5820745.0 5820745.0 5820745.0 5820745.0 5820745.0 5820745.0 5820745.0 5820745.0 5820745.0 5820745.0 ... 5820665.0 5820665.0 5820665.0 5820665.0 5820665.0 5820665.0 5820665.0 5820665.0 5820665.0 5820665.0
20170216T102101_B02 1456.0 1456.0 1456.0 1424.0 1488.0 1456.0 1488.0 1488.0 1488.0 1488.0 ... 1424.0 1456.0 1424.0 1424.0 1424.0 1456.0 1456.0 1456.0 1424.0 1392.0
20170216T102101_B03 1136.0 1136.0 1136.0 1136.0 1136.0 1168.0 1168.0 1168.0 1136.0 1136.0 ... 1104.0 1104.0 1104.0 1072.0 1072.0 1104.0 1136.0 1136.0 1104.0 1104.0
aux_coord_x 341675.0 341685.0 341695.0 341705.0 341715.0 341725.0 341735.0 341745.0 341755.0 341765.0 ... 341705.0 341715.0 341725.0 341735.0 341745.0 341755.0 341765.0 341775.0 341785.0 341795.0
aux_vector_pid 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 ... 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0

5 rows × 109 columns

Calculate NDVI with MultiRasterIO

This tutorial shows how to use the MultiRasterIO class to process multiple single raster files with a custom (user defined) function in chunks or windows. This can be useful if the input or generated data does not fit in memory.

Here we define a custom function that calculates two different NDVI versions from the Sentinel 2 bands B04, B08 and B8A.

In this tutorial we need the following imports:

[33]:
import fnmatch
import matplotlib.pyplot as plt
from pathlib import Path
import rasterio

from eobox.raster import extract
from eobox.raster import MultiRasterIO
from eobox import sampledata

Furthermore we need some data. Here we derive the paths to the three S2 bands required by the custom function to be developed.

[4]:
dataset = sampledata.get_dataset("s2l1c")
layer_files = fnmatch.filter(dataset["raster_files"], "*B04*")  # 10 m bands
layer_files += fnmatch.filter(dataset["raster_files"], "*B08*")  # 10 m bands
layer_files += fnmatch.filter(dataset["raster_files"], "*B8A*")  # 10 m bands
layer_files
[4]:
['/home/ben/anaconda/envs/eobox/lib/python3.6/site-packages/eobox/sampledata/data/s2l1c/S2A_MSIL1C_20170216T102101_N0204_R065_T33UUU_20170216T102204.SAFE/S2A_MSIL1C_20170216T102101_N0204_R065_T33UUU_20170216T102204.SAFE/GRANULE/L1C_T33UUU_A008642_20170216T102204/IMG_DATA/T33UUU_20170216T102101_B04.jp2',
 '/home/ben/anaconda/envs/eobox/lib/python3.6/site-packages/eobox/sampledata/data/s2l1c/S2A_MSIL1C_20170216T102101_N0204_R065_T33UUU_20170216T102204.SAFE/S2A_MSIL1C_20170216T102101_N0204_R065_T33UUU_20170216T102204.SAFE/GRANULE/L1C_T33UUU_A008642_20170216T102204/IMG_DATA/T33UUU_20170216T102101_B08.jp2',
 '/home/ben/anaconda/envs/eobox/lib/python3.6/site-packages/eobox/sampledata/data/s2l1c/S2A_MSIL1C_20170216T102101_N0204_R065_T33UUU_20170216T102204.SAFE/S2A_MSIL1C_20170216T102101_N0204_R065_T33UUU_20170216T102204.SAFE/GRANULE/L1C_T33UUU_A008642_20170216T102204/IMG_DATA/T33UUU_20170216T102101_B8A.jp2']
Develop a Custom Function

Assume we want to process these files with our own function, e.g. calculate_ndvi, but we need to do this chunk-wise because the whole input and/or output data does not fit in memory. With MultiRasterIO we can focus on the development of our function because the class facilitates the chunk-wise processing.

Before we start developing our custom function, let us get a chunk of the data. This is usually useful for developing the function.

In the following cell we

  • initialize the MultiRasterIO class,

  • derive the windows with block_windows() (these are stored internally),

  • and load the first (index 0) chunk of data with get_arrays().

[7]:
mr = MultiRasterIO(layer_files=layer_files, dst_res=10)
mr.block_windows()
arr = mr.get_arrays(0)
arr.shape
[7]:
(128, 128, 3)

Based on that data we can develop our custom function.

To make the custom function work with MultiRasterIO it is important that it meets the following two requirements:

  • The first input argument must be a 3-dimensional array with the dimensions being (X, Y, Bands), just as loaded in the last cell. Note that the X, Y dimensions might vary in size with the chunk or window being processed. Additionally, any keyword arguments can be defined.

  • The output must be a list of arrays where the number of elements in the list corresponds to the output layers and the arrays are 2-dimensional with the dimensions being (X, Y).

The following function calculates two NDVIs from the S2 data, one with the NIR band B08 and the other with B8A.

[13]:
def calculate_ndvi(arr, idx_b04, idx_b08, idx_b8a):
    """A custom function for calculating an NDVI based on the two S2 NIR bands."""
    import numpy as np
    def normalized_difference_index(arr, idx_minuend, idx_subtrahend):
        di = (arr[:,:,idx_minuend] - arr[:,:,idx_subtrahend]) / (arr[:,:,idx_minuend] + arr[:,:,idx_subtrahend])
        di[di > 1.0] = 1.0001
        di[di < -1.0] = -1.0001
        di = (di*10000).astype(np.int16)
        return di
    ndvi_b08 = normalized_difference_index(arr, idx_b08, idx_b04)
    ndvi_b8a = normalized_difference_index(arr, idx_b8a, idx_b04)
    return [ndvi_b08, ndvi_b8a]

ndvis = calculate_ndvi(arr, idx_b04=0, idx_b08=1, idx_b8a=2)
print(len(ndvis))
print(ndvis[0].shape)
print(ndvis[1].shape)

2
(128, 128)
(128, 128)

As expected, it returns a list of two 2D arrays.

[14]:
fig, axes = plt.subplots(1,2, figsize=(20, 10))

axes.flatten()[0].imshow(ndvis[0], vmin=-1000, vmax=6000)
axes.flatten()[1].imshow(ndvis[1], vmin=-1000, vmax=6000)
fig.tight_layout()
_images/examples_raster_multirasterio_ndvi_9_0.png

If we are happy with our function and it meets the two requirements mentioned above, we are ready to process the whole image. To do so, all we need to do is define a destination file per output layer.

[30]:
dst_dir = Path(".").absolute() / "NDVI"
ndvi_files = [str(dst_dir / "ndvib08.jp2"),
              str(dst_dir / "ndvib8a.jp2")]

Then we can just pass the destination filenames, the custom function and its required keyword arguments to the apply_and_save method and wait for the result.

[31]:
exit_code = mr.apply_and_save(dst_files=ndvi_files,
                              func=calculate_ndvi,
                              idx_b04=0, idx_b08=1, idx_b8a=2)
assert exit_code == 0

Let's load and plot the arrays of the created NDVI layers:

[38]:
fig, axes = plt.subplots(2,1, figsize=(20, 20))
with rasterio.open(ndvi_files[0]) as ndvib08:
    arr = ndvib08.read()
    axes.flatten()[0].imshow(arr[0,:,:], vmin=-1000, vmax=6000)
with rasterio.open(ndvi_files[1]) as ndvib8a:
    arr = ndvib8a.read()
    axes.flatten()[1].imshow(arr[0,:,:], vmin=-1000, vmax=6000)
_images/examples_raster_multirasterio_ndvi_15_0.png

Image Classification with MultiRasterIO

In this tutorial it is shown how to use the class MultiRasterIO to perform image classification. We will

  • extract a training dataset (see tutorial Extract Raster Values of Pixels Overlapping with Vector Data),

  • build a model,

  • classify the raster dataset with the model.

We will run through the first two steps and focus on the last here.

[3]:
import fnmatch
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier

from eobox import raster
from eobox import sampledata
Extract Reference Data
[4]:
dataset = sampledata.get_dataset("s2l1c")

src_vector = dataset["vector_file"]
burn_attribute = "pid"  # should be unique for the polygons and not contain zero
src_raster = sorted(fnmatch.filter(dataset["raster_files"], "*B0[2,3,4,8]*"))  # 10 m bands
feature_names = ["_".join(Path(src).stem.split("_")[1::]) for src in src_raster]
extraction_dir = Path("s2l1c_ref__s2l1c/s2_l1c/10m")

raster.extract(src_vector=src_vector,
                    burn_attribute=burn_attribute,
                    src_raster=src_raster,
                    dst_names=feature_names,
                    dst_dir=extraction_dir)

df_extracted = raster.load_extracted(extraction_dir)
gdf = gpd.read_file(src_vector)
df_extracted = df_extracted.merge(gdf[["pid", "cid", "comment"]], how="left", left_on="aux_vector_pid", right_on="pid")
df_extracted.head(2)
[4]:
aux_coord_y 20170216T102101_B02 20170216T102101_B03 aux_coord_x aux_vector_pid 20170216T102101_B04 20170216T102101_B08 pid cid comment
0 5822035.0 1456 1072 344545.0 17 928 672 17 1 partially out; expecting: 2 - with default ras...
1 5822035.0 1392 1040 344555.0 17 928 672 17 1 partially out; expecting: 2 - with default ras...
Build a Model

Let's train a Random Forest model with the whole reference data on the four Sentinel-2 bands.

With the extracted data that can be done as follows:

[5]:
y = df_extracted["cid"].values
X = df_extracted[feature_names].values
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
[5]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Prediction of Probabilities

This section is the focus of this notebook. We will show how to predict the image by using the model trained in the last section and the MultiRasterIO class. As in the tutorial on NDVI calculation with MultiRasterIO, we first set up the class instance and get a window of data for developing the prediction.

[6]:
mr = raster.MultiRasterIO(layer_files=src_raster)
mr.block_windows()
arr = mr.get_arrays(0)
arr.shape
[6]:
(64, 64, 4)

We get the data as a three-dimensional array where the dimensions are (X, Y, Layers). However, the predict* methods of a sklearn estimator require the features in a two-dimensional array where the dimensions are (Samples = Pixels, Features = Layers).
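
A minimal sketch of that reshaping with plain NumPy (assuming arr is the (X, Y, Layers) chunk loaded above and clf the fitted model):

import numpy as np

n_rows, n_cols, n_layers = arr.shape
X_2d = arr.reshape(n_rows * n_cols, n_layers)  # (pixels, features) as expected by predict / predict_proba
pred_2d = clf.predict(X_2d).reshape(n_rows, n_cols)  # back to the chunk's spatial layout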

We can use the PixelClassifier from the s2cloudless package or the ImagePixelClassifier from the eolearn.ml_tools package, both developed by Sinergise.

If you do not have these packages installed, you can simply copy and paste the PixelClassifier from the PixelClassifier.py module in the sentinel2-cloud-detector project, as has been done to fill the following cell.

[7]:
class PixelClassifier(object):
    """
    Pixel classifier extends a receptive field of a classifier over an entire image.
    The classifier's receptive field is in case of PixelClassifier a pixel (i.e, it
    has dimension of (1,1))

    Pixel classifier divides the image into individual pixels, runs classifier over
    them, and finally produces a classification mask of the same size as image.

    The classifier can be of any type as long as it has the following two methods
    implemented:
        - predict(X)
        - predict_proba(X)

    This is true for all classifiers that follow scikit-learn's API.
    The APIs of scikit-learn's objects is described
    at: http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects.

    :param classifier: trained classifier that will be executed over an entire image
    :type classifier: any classifier with predict(X) and predict_proba(X) methods
    """

    def __init__(self, classifier):
        self.receptive_field = (1, 1)
        self._check_classifier(classifier)
        self.classifier = classifier

    def _check_classifier(self, classifier):
        """
        Check if the classifier implements predict and predict_proba methods.
        """
        predict = getattr(classifier, "predict", None)
        if not callable(predict):
            raise ValueError('Classifier does not have a predict method!')

        predict_proba = getattr(classifier, "predict_proba", None)
        if not callable(predict_proba):
            raise ValueError('Classifier does not have a predict_proba method!')

    @staticmethod
    def extract_pixels(X):
        """ Extract pixels from array X

        :param X: Array of images to be classified.
        :type X: numpy array, shape = [n_images, n_pixels_y, n_pixels_x, n_bands]
        :return: Reshaped 2D array
        :rtype: numpy array, [n_samples*n_pixels_y*n_pixels_x,n_bands]
        :raises: ValueError if input array has wrong dimensions
        """
        if len(X.shape) != 4:
            raise ValueError('Array of input images has to be a 4-dimensional array of shape'
                             '[n_images, n_pixels_y, n_pixels_x, n_bands]')

        new_shape = (X.shape[0] * X.shape[1] * X.shape[2], X.shape[3],)
        pixels = X.reshape(new_shape)
        return pixels

    def image_predict(self, X):
        """
        Predicts class label for the entire image.

        :param X: Array of images to be classified.
        :type X: numpy array, shape = [n_images, n_pixels_y, n_pixels_x, n_bands]

        :return: raster classification map
        :rtype: numpy array, [n_samples, n_pixels_y, n_pixels_x]
        """

        pixels = self.extract_pixels(X)

        predictions = self.classifier.predict(pixels)

        return predictions.reshape(X.shape[0], X.shape[1], X.shape[2])

    def image_predict_proba(self, X):
        """
        Predicts class probabilities for the entire image.

        :param X: Array of images to be classified.
        :type X: numpy array, shape = [n_images, n_pixels_y, n_pixels_x, n_bands]

        :return: classification probability map
        :rtype: numpy array, [n_samples, n_pixels_y, n_pixels_x]
        """

        pixels = self.extract_pixels(X)

        probabilities = self.classifier.predict_proba(pixels)

        return probabilities.reshape(X.shape[0], X.shape[1], X.shape[2], probabilities.shape[1])

In the docstring we can read that the prediction methods want the data in the format [n_images, n_pixels_y, n_pixels_x, n_bands]. So let us expand the dimensions to fulfil the requirement and predict the array.

[8]:
pclf = PixelClassifier(clf)
probs = pclf.image_predict_proba(np.expand_dims(arr, 0))
probs = (probs * 100).astype(np.uint8)  # lets store the probabilities as integer later
print("Shape of input array:", np.expand_dims(arr, 0).shape)
print("Shape of output arrray:", probs.shape)
Shape of input array: (1, 64, 64, 4)
Shape of output array: (1, 64, 64, 2)

Now we can plot the result and check if it makes sense. Since there is a small lake in the upper left corner of the image (for which we loaded the data) it seems to make sense.

[9]:
fig, axes = plt.subplots(1,2, figsize=(20, 10))

axes.flatten()[0].imshow(probs[0,:,:,0], vmin=0, vmax=100)
axes.flatten()[1].imshow(probs[0,:,:,1], vmin=0, vmax=100)
fig.tight_layout()
_images/examples_raster_multirasterio_classification_13_0.png

We can now wrap the code in a function, apply it to the whole raster dataset, and store the probability layers to disk as follows:

[10]:
def image_predict_proba(arr, pclf):
    """A custom function for running predict_proba of a sklearn estimator on an image."""
    import numpy as np
    probs = (pclf.image_predict_proba(np.expand_dims(arr, 0)) * 100).astype(np.uint8)
    probs = [probs[0,:,:,i] for i in range(probs.shape[3])]
    return probs

probs = image_predict_proba(arr, pclf=pclf)
print("Length of probs: ", len(probs))
print("Shape of probs[0]: ", probs[0].shape)
print("Shape of probs[1]: ", probs[1].shape)
dst_dir = Path(".").absolute() / "pclf_water_predict_proba"
prob_files = [str(dst_dir / "prob_cid1.jp2"),
              str(dst_dir / "prob_cid2.jp2")]
prob_files
Length of probs:  2
Shape of probs[0]:  (64, 64)
Shape of probs[1]:  (64, 64)
[10]:
['/home/ben/Devel/Repos/eo-box/examples/raster/pclf_water_predict_proba/prob_cid1.jp2',
 '/home/ben/Devel/Repos/eo-box/examples/raster/pclf_water_predict_proba/prob_cid2.jp2']
[11]:
exit_code = mr.apply_and_save(dst_files=prob_files,
                              func=image_predict_proba,
                              pclf=pclf)
Class, Probability and Confidence Layers

Often we want the class or prediction layer, the probability layers and confidence layers as an outcome. We can extend image_predict_proba from above as follows:

[12]:
def image_predict(arr, pclf):
    probs = (pclf.image_predict_proba(np.expand_dims(arr, 0)) * 100).astype(np.uint8)[0,:,:,:]
    # get the prediction image (class ids)
    pred_idx = probs.argmax(axis=2)
    pred = np.zeros_like(pred_idx).astype(np.uint8)
    for i in range(probs.shape[2]):
        pred[pred_idx==i] = pclf.classifier.classes_[i]
    # get reliability layers (maximum probability and margin, i.e. maximum probability minus second highest probability)
    probs_sorted = np.sort(probs, axis=2)
    max_prob = probs_sorted[:,:,probs_sorted.shape[2]-1]
    margin = probs_sorted[:,:,probs_sorted.shape[2]-1] - probs_sorted[:,:,probs_sorted.shape[2]-2]
    out_layers = [probs[:,:,i] for i in range(probs.shape[2])]
    out_layers.append(pred)
    out_layers.append(max_prob)
    out_layers.append(margin)
    return out_layers

out_layers = image_predict(arr, pclf=pclf)
[13]:
fig, axes = plt.subplots(2,3, figsize=(20, 10))

axes.flatten()[0].imshow(out_layers[0])
axes.flatten()[0].set_title("Probability CID 1")

axes.flatten()[1].imshow(out_layers[1])
axes.flatten()[1].set_title("Probability CID 2")

axes.flatten()[2].imshow(out_layers[2])
axes.flatten()[2].set_title("Prediction Layer: CID")

axes.flatten()[3].imshow(out_layers[3])
axes.flatten()[3].set_title("Maximum Probability")

axes.flatten()[4].imshow(out_layers[4])
axes.flatten()[4].set_title("Margin")

axes.flatten()[5].axis("off")  # hide the unused sixth subplot

fig.tight_layout()
_images/examples_raster_multirasterio_classification_19_0.png

And for the whole image:

[14]:
out_layer_filenames = [str(dst_dir / "prob_cid1.jp2"),
                       str(dst_dir / "prob_cid2.jp2"),
                       str(dst_dir / "predictions.jp2"),
                       str(dst_dir / "maximum_probability.jp2"),
                       str(dst_dir / "margin.jp2"),
                       ]

exit_code = mr.apply_and_save(dst_files=out_layer_filenames,
                              func=image_predict,
                              pclf=pclf)
[18]:
import rasterio
fig, ax = plt.subplots(1,1, figsize=(20, 20))
with rasterio.open(str(dst_dir / "prob_cid1.jp2")) as prob_cid1:
    arr = prob_cid1.read()
    ax.imshow(arr[0,:,:], vmin=0, vmax=100)
_images/examples_raster_multirasterio_classification_22_0.png

Write chunk array as GeoTIFF

Or: How to save a single layer chunk array as a georeferenced raster with MultiRasterIO?

[4]:
import pandas as pd
from pathlib import Path
import rasterio

from eobox.raster import MultiRasterIO
from eobox import sampledata


year = 2008
dataset = sampledata.get_dataset("lsts")
layers_paths = [Path(p) for p in dataset["raster_files"]]
layers_df = pd.Series([p.stem for p in layers_paths]).str.split("_", expand=True) \
.rename({0: "sceneid", 1:"band"}, axis=1)

layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
layers_df["uname"] = layers_df.sceneid.str[:3] + "_" + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band.str[::]
layers_df["path"] = layers_paths

layers_df = layers_df.sort_values(["date", "band"])
layers_df = layers_df.reset_index(drop=True)

df_layers = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year+1))]
df_layers = df_layers.reset_index(drop=True)
df_layers.head()
[4]:
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/anaconda/envs/eocube/lib/python3.6/s...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
[11]:
mrio = MultiRasterIO(df_layers.path.values) \
    .windows_from_blocksize(2**5)
n_chunks = len(mrio.windows)
print("Number of chunks : ", n_chunks)

write_to_disk = False

with rasterio.open(mrio._get_template_for_given_resolution(mrio.dst_res, "path")) as src_layer:
    pass # later we need src_layer for src_layer.window_transform(win)

for ji in range(n_chunks):
    dst_path = f"./xxx_uncontrolled_99_chunk_ji{ji:02}.tif"
    chunk_arrays_ji = mrio.get_arrays(ji)
    chunk_layer_ji = chunk_arrays_ji[:,:,[0]]
    print("shape of chunk_layer_ji : ", chunk_layer_ji.shape)
    win = mrio.windows[ji]
    kwargs = mrio._get_template_for_given_resolution(
        res=mrio.dst_res, return_="meta").copy()
    kwargs.update({"height": win.height,
                   "width": win.width,
                   "transform": src_layer.window_transform(win)})
    kwargs["dtype"] = chunk_layer_ji.dtype
    with rasterio.open(dst_path, "w", **kwargs) as dst:
        dst.write(chunk_layer_ji[:,:,0], 1)
        print(f"Chunk written to path: {dst_path}")
Number of chunks :  4
shape of chunk_layer_ji :  (32, 32, 1)
Chunk written to path: ./xxx_uncontrolled_99_chunk_ji00.tif
shape of chunk_layer_ji :  (32, 29, 1)
Chunk written to path: ./xxx_uncontrolled_99_chunk_ji01.tif
shape of chunk_layer_ji :  (29, 32, 1)
Chunk written to path: ./xxx_uncontrolled_99_chunk_ji02.tif
shape of chunk_layer_ji :  (29, 29, 1)
Chunk written to path: ./xxx_uncontrolled_99_chunk_ji03.tif

An intro to EOCube

[1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from pathlib import Path
import rasterio
import seaborn as sns

# from eobox.raster import MultiRasterIO
from eobox import sampledata
from eobox.raster import cube



def get_sampledata(year):
    dataset = sampledata.get_dataset("lsts")
    layers_paths = [Path(p) for p in dataset["raster_files"]]
    layers_df = pd.Series([p.stem for p in layers_paths]).str.split("_", expand=True) \
    .rename({0: "sceneid", 1:"band"}, axis=1)

    layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
    layers_df["uname"] = layers_df.sceneid.str[:3] + "_" + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band.str[::]
    layers_df["path"] = layers_paths

    layers_df = layers_df.sort_values(["date", "band"])
    layers_df = layers_df.reset_index(drop=True)

    layers_df_year = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year+1))]
    layers_df_year = layers_df_year.reset_index(drop=True)
    return layers_df_year

Input data constraints

With EOCube we can for example calculate temporal features from stacks of data that do not fit in memory.

We assume the following data constraints:

  • All layers, i.e. the single bands of an acquisition, are available as single-band, GDAL-readable raster files.

  • All files represent rasters that are exactly spatially aligned, with the same resolution, extent, projection, number of rows and columns, etc.

  • (If not all of the above holds true, you might band-subset or cut the data as VRTs to meet the spatial extent and single-layer-file constraints.)

  • The data needs to come in a pandas dataframe containing at least the column path with the respective file paths.

For calculating temporal features, date, band and sceneid - as in the following sample dataset - are useful additional columns that you can then access later in your custom code.

[2]:
df_layers = get_sampledata(2008)
display(df_layers.head())
df_layers.band.value_counts()
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/anaconda/envs/eocube/lib/python3.6/s...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
[2]:
b5       23
b4       23
fmask    23
b3       23
Name: band, dtype: int64
[3]:
df_layers.head(5)
[3]:
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/anaconda/envs/eocube/lib/python3.6/s...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
EOCube and EOCubeChunk
EOCube
[6]:
eoc = cube.EOCube(df_layers, chunksize=2**5)

Attributes you can get.

[7]:
print("The data frame with your data.")
display(eoc.df_layers.head(2))
print("Chunksize: ", eoc.chunksize)
print("Number of Chunks: ", eoc.n_chunks)
The data frame with your data.
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
Chunksize:  32
Number of Chunks:  4

Attributes you can set.

[8]:
eoc.chunksize=2**4
print("Chunksize: ", eoc.chunksize)
print("Number of Chunks: ", eoc.n_chunks)

eoc.chunksize=2**5
print("Chunksize: ", eoc.chunksize)
print("Number of Chunks: ", eoc.n_chunks)
Chunksize:  16
Number of Chunks:  16
Chunksize:  32
Number of Chunks:  4
EOCubeChunk

This is a child class of EOCube and you get an EOCubeChunk object from an EOCube object.

[9]:
ji = 1
eoc_chunk = eoc.get_chunk(ji)
print("Chunk ID: ", eoc_chunk.ji)
Chunk ID:  1

Load data and convert it between ndarray and DataFrame.

[10]:
ji = 1
eoc_chunk = eoc.get_chunk(ji)
print("****************")
print(eoc_chunk.__class__)
print("eochunk.data is : ", eoc_chunk.data)
print("Chunk ID: ", eoc_chunk.ji)

print("****************")
eoc_chunk = eoc_chunk.read_data()
print(eoc_chunk.data.__class__)
print("eochunk.data is ndarray with shape : ", eoc_chunk.data.shape)
data_ndarray = eoc_chunk.data.copy()

print("****************")
eoc_chunk = eoc_chunk.convert_data_to_dataframe()
print("eochunk.data is DataFrame with shape : ", eoc_chunk.data.shape)
print("(column names come from eoc_chunk.df_layers['uname']")
display(eoc_chunk.data.head())

print("****************")
eoc_chunk = eoc_chunk.convert_data_to_ndarray()
print("eochunk.data is again a ndarray with shape : ", eoc_chunk.data.shape)

****************
<class 'eobox.raster.cube.EOCubeChunk'>
eochunk.data is :  None
Chunk ID:  1
****************
<class 'numpy.ndarray'>
eochunk.data is ndarray with shape :  (32, 29, 92)
****************
eochunk.data is DataFrame with shape :  (928, 92)
(column names come from eoc_chunk.df_layers['uname'])
uname LT5_2008-04-19_b3 LT5_2008-04-19_b4 LT5_2008-04-19_b5 LT5_2008-04-19_fmask LE7_2008-04-27_b3 LE7_2008-04-27_b4 LE7_2008-04-27_b5 LE7_2008-04-27_fmask LT5_2008-05-05_b3 LT5_2008-05-05_b4 ... LT5_2008-10-28_b5 LT5_2008-10-28_fmask LE7_2008-11-21_b3 LE7_2008-11-21_b4 LE7_2008-11-21_b5 LE7_2008-11-21_fmask LE7_2008-12-07_b3 LE7_2008-12-07_b4 LE7_2008-12-07_b5 LE7_2008-12-07_fmask
0 16000 6889 292 3 16000 7468 540 3 16000 6448 ... 2410 0 5609 6227 3191 4 5040 6126 575 3
1 6868 6584 361 3 16000 6865 584 3 6149 6025 ... 2134 0 5609 6140 3231 4 5248 6126 710 3
2 5590 5705 478 3 16000 6045 606 3 4933 5342 ... 2099 0 5550 6271 3272 4 4624 5976 802 3
3 6008 5772 501 3 16000 5795 584 3 5065 5276 ... 2514 0 5550 6227 3311 4 6176 7224 892 3
4 4917 4957 548 3 3874 4683 606 3 4134 4362 ... 2099 0 5639 6271 3391 4 5627 6126 846 3

5 rows × 92 columns

****************
eochunk.data is again a ndarray with shape :  (32, 29, 92)

Write results with ``EOCubeChunk.write_ndarray()``.

Usually we want to process all the data chunk-wise but save each of the resulting layers as a single file, i.e. with all chunks merged.

However, when we want to process the chunks in parallel it is easier to first read, process and write the single-chunk data and to put the chunks together later, once all of them are computed.

Thus, for saving data you need

  • your processing logic starting from an ndarray (if you want to start from a DataFrame see below), and

  • the destination files.

Let us get the data which will be the input to our custom process:

[11]:
ji = 1
eoc_chunk = eoc.get_chunk(ji)
eoc_chunk = eoc_chunk.read_data()

Apply our logic and get the result as a ndarray.

Here as an example we calculate a boolean mask with the valid pixels derived from the fmask layers.

[12]:
ilocs_fmask = np.where((eoc_chunk.df_layers["band"] == "fmask").values)[0]
result = eoc_chunk.data[:,:,ilocs_fmask]
print("shape of data   : ", eoc_chunk.data.shape)
result = np.isin(result, [0, 1]).astype("uint8")
print("shape of result : ", result.shape)
shape of data   :  (32, 29, 92)
shape of result :  (32, 29, 23)

Note that the first two numbers of the result array shape have to be identical to the ones of the input array.

Instead, the number of layers (third number) has to match the number of your destination paths. Note that we only need filenames for the whole layers, not the chunks; the filenames for the chunk files are derived internally. Here we assume that we will later merge the chunks virtually in the form of a VRT, thus the .vrt extension. So let us get the destination paths:

[13]:
dst_paths = [Path("./results") / (sid + "_valid.vrt") for sid in eoc_chunk.df_layers.iloc[ilocs_fmask]["sceneid"]]
dst_paths[:3]
[13]:
[PosixPath('results/LT50350322008110PAC01_valid.vrt'),
 PosixPath('results/LE70350322008118EDC00_valid.vrt'),
 PosixPath('results/LT50350322008126PAC01_valid.vrt')]

And finally use the EOCubeChunk.write_ndarray to save the results.

[14]:
eoc_chunk.write_ndarray(result, dst_paths)

TODO: Show how to execute this for all chunks with EOCube. For now, see how this is done in the next eocube tutorial: Calculate virtual time series with EOCube.
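
As an interim illustration, the following minimal sketch runs the steps above for all chunks, using only the methods demonstrated in this notebook; the per-chunk files can afterwards be merged, e.g. virtually as VRTs:

for ji in range(eoc.n_chunks):
    chunk = eoc.get_chunk(ji).read_data()
    # the same custom logic as above: a uint8 validity mask derived from the fmask layers
    ilocs_fmask = np.where((chunk.df_layers["band"] == "fmask").values)[0]
    result = np.isin(chunk.data[:, :, ilocs_fmask], [0, 1]).astype("uint8")
    chunk.write_ndarray(result, dst_paths)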

Visualization with EOCube

[1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from pathlib import Path
import rasterio
import seaborn as sns
import matplotlib.pyplot as plt

# from eobox.raster import MultiRasterIO
from eobox import sampledata
from eobox.raster import cube
from eobox.raster import gdalutils

from eobox.raster.utils import cleanup_df_values_for_given_dtype
from eobox.raster.utils import dtype_checker_df

print(cube.__file__)
print(sampledata.__file__)
/home/ben/Devel/Packages/eo-box/raster/eobox/raster/cube.py
/home/ben/Devel/Packages/eo-box/sampledata/eobox/sampledata/__init__.py
Sample dataset
[2]:
def get_sampledata(year):
    dataset = sampledata.get_dataset("lsts")
    layers_paths = [Path(p) for p in dataset["raster_files"]]
    layers_df = pd.Series([p.stem for p in layers_paths]).str.split("_", expand=True) \
    .rename({0: "sceneid", 1:"band"}, axis=1)

    layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
    layers_df["uname"] = layers_df.sceneid.str[:3] + "_" + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band.str[::]
    layers_df["path"] = layers_paths

    layers_df = layers_df.sort_values(["date", "band"])
    layers_df = layers_df.reset_index(drop=True)

    layers_df_year = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year+1))]
    layers_df_year = layers_df_year.reset_index(drop=True)
    return layers_df_year

The sample data we are loading here contains 23 scenes each of which consists of three bands (b3, b4, b5) and a QA (quality assessment) band (here fmask).

fmask has the following categories:

  0 - clear land
  1 - clear water
  2 - cloud
  3 - snow
  4 - shadow
255 - NoData
[3]:
df_layers = get_sampledata(2008)
display(df_layers.head(4))
print(df_layers.band.value_counts())
print(df_layers.date.dt.strftime("%Y-%m-%d").unique())
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/sampledata/eob...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/sampledata/eob...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/sampledata/eob...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/sampledata/eob...
fmask    23
b4       23
b5       23
b3       23
Name: band, dtype: int64
['2008-04-19' '2008-04-27' '2008-05-05' '2008-05-21' '2008-05-29'
 '2008-06-06' '2008-06-14' '2008-06-22' '2008-06-30' '2008-07-08'
 '2008-07-16' '2008-07-24' '2008-08-01' '2008-08-09' '2008-08-17'
 '2008-08-25' '2008-09-02' '2008-09-18' '2008-09-26' '2008-10-12'
 '2008-10-28' '2008-11-21' '2008-12-07']

Initialize an ``EOCubeSceneCollection``

[4]:
df_layers=df_layers
chunksize=2**5
variables=["b3", "b4", "b5"]
qa="fmask"
qa_valid=[0, 1]

Get a chunk and read the data

[5]:
scoll = cube.EOCubeSceneCollection(df_layers=df_layers,
                                   chunksize=chunksize,
                                   variables=variables,
                                   qa=qa,
                                   qa_valid=qa_valid
                                  )
display(scoll.df_layers.head())

scoll_chunk = scoll.get_chunk(1)
scoll_chunk.read_data()
scoll_chunk.data.shape
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/sampledata/eob...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/sampledata/eob...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/sampledata/eob...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/sampledata/eob...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/sampledata/eob...
[5]:
(32, 29, 92)
Plotting raster
Single layer
Continuous
[6]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 6))

aximg = scoll_chunk.plot_raster(0, spatial_bounds=False, ax=axes[0])
axes[0].set_title(scoll_chunk.df_layers.uname[0])

aximg = scoll_chunk.plot_raster(1, spatial_bounds=False, ax=axes[1])
axes[1].set_title(scoll_chunk.df_layers.uname[1])

aximg = scoll_chunk.plot_raster(2, spatial_bounds=False, ax=axes[2])
axes[2].set_title(scoll_chunk.df_layers.uname[2])

plt.tight_layout()
_images/examples_raster_cube_viz_12_0.png
Categorical

ToDo

RGBs

There is a helper for getting the indices of the layers commonly required for RGB plots:

[7]:
# three bands, one date
ilocs = scoll_chunk.get_df_ilocs(band=["b5", "b4", "b3"],
                                 date="2008-06-30")
print(ilocs)
display(scoll_chunk.df_layers.iloc[ilocs])
# one band, three dates
ilocs = scoll_chunk.get_df_ilocs(band="b5",
                                 date=['2008-04-19', '2008-07-08', '2008-09-26'])
print(ilocs)
display(scoll_chunk.df_layers.iloc[ilocs])

[34, 33, 32]
sceneid band date uname path
34 LE70350322008182EDC00 b5 2008-06-30 LE7_2008-06-30_b5 /home/ben/Devel/Packages/eo-box/sampledata/eob...
33 LE70350322008182EDC00 b4 2008-06-30 LE7_2008-06-30_b4 /home/ben/Devel/Packages/eo-box/sampledata/eob...
32 LE70350322008182EDC00 b3 2008-06-30 LE7_2008-06-30_b3 /home/ben/Devel/Packages/eo-box/sampledata/eob...
[2, 38, 74]
sceneid band date uname path
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/sampledata/eob...
38 LT50350322008190PAC01 b5 2008-07-08 LT5_2008-07-08_b5 /home/ben/Devel/Packages/eo-box/sampledata/eob...
74 LT50350322008270PAC01 b5 2008-09-26 LT5_2008-09-26_b5 /home/ben/Devel/Packages/eo-box/sampledata/eob...

Plot a false color RGB (SWIR-1, NIR, RED) of three days, each with its own robust color stretch.

[8]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 6))

ilocs = scoll_chunk.get_df_ilocs(band=["b5", "b4", "b3"],
                                 date="2008-04-19")
print(ilocs)
scoll_chunk.plot_raster_rgb(ilocs,
                            spatial_bounds=False,
                            robust=True,
                            ax=axes[0])
axes[0].set_title("2008-04-19")


ilocs = scoll_chunk.get_df_ilocs(band=["b5", "b4", "b3"],
                                 date="2008-07-08")
print(ilocs)
scoll_chunk.plot_raster_rgb(ilocs,
                            spatial_bounds=False,
                            robust=True,
                            ax=axes[1])
axes[1].set_title("2008-07-08")

ilocs = scoll_chunk.get_df_ilocs(band=["b5", "b4", "b3"],
                                 date="2008-09-26")
print(ilocs)
scoll_chunk.plot_raster_rgb(ilocs,
                            spatial_bounds=False,
                            robust=True,
                            ax=axes[2])
axes[2].set_title("2008-09-26")

plt.tight_layout()
[2, 1, 0]
[38, 37, 36]
[74, 73, 72]
_images/examples_raster_cube_viz_16_1.png

Plot a false color RGB (SWIR-1, NIR, RED) of three days, all with the same color stretch derived from the first image.

[9]:
master_ilocs = scoll_chunk.get_df_ilocs(band=["b5", "b4", "b3"],
                                        date="2008-04-19")
master_array = scoll_chunk.data[:, :, master_ilocs]

vmin, vmax = scoll_chunk.robust_data_range(master_array, robust=True)
print(vmin)
print(vmax)
[339.0, 2016.0, 1092.0]
[734.0, 6972.180000000001, 16000.0]
[10]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 6))

ilocs = scoll_chunk.get_df_ilocs(band=["b5", "b4", "b3"],
                                 date="2008-04-19")
scoll_chunk.plot_raster_rgb(ilocs,
                            spatial_bounds=False,
                            vmin=vmin, vmax=vmax,
                            ax=axes[0])
axes[0].set_title("2008-04-19")


ilocs = scoll_chunk.get_df_ilocs(band=["b5", "b4", "b3"],
                                 date="2008-07-08")
scoll_chunk.plot_raster_rgb(ilocs,
                            spatial_bounds=False,
                            vmin=vmin, vmax=vmax,
                            ax=axes[1])
axes[1].set_title("2008-07-08")

ilocs = scoll_chunk.get_df_ilocs(band=["b5", "b4", "b3"],
                                 date="2008-09-26")
scoll_chunk.plot_raster_rgb(ilocs,
                            spatial_bounds=False,
                            vmin=vmin, vmax=vmax,
                            ax=axes[2])
axes[2].set_title("2008-09-26")

plt.tight_layout()
_images/examples_raster_cube_viz_19_0.png

IS SOMETHING WRONG HERE WITH THE CODE OR THE DATA?

EOCubeSceneCollection - from level-2 to level-3 products

EOCubeSceneCollection and EOCubeSceneCollectionChunk are similar to EOCube and EOCubeChunk and have basically all the functionality of these classes, plus some more specific functionality and behaviour.

While EOCube and EOCubeChunk do not make any assumption about the input data, EOCubeSceneCollection and EOCubeSceneCollectionChunk assume that the input is a collection (usually a time series) of scenes where each scene consists of the same bands (including a quality assessment layer).

In the Sentinel-2 / Landsat context, the atmospherically corrected single-scene data is often referred to as Level-2. Level-3 data products are usually derived from time series of lower-level products with the goal of getting spatially contiguous and gap-free products. Such products are:

  • temporal statistical metrics, e.g. median

  • virtual time series derived from interpolation

  • best available pixel composites

This tutorial shows how to derive virtual time series with EOCubeSceneCollection.

First we load the packages required in this tutorial.

[1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from pathlib import Path
import rasterio
import seaborn as sns

# from eobox.raster import MultiRasterIO
from eobox import sampledata
from eobox.raster import cube
from eobox.raster import gdalutils

from eobox.raster.utils import cleanup_df_values_for_given_dtype
from eobox.raster.utils import dtype_checker_df
Sample dataset

Let's get a typical dataset to make what we are talking about more tangible.

[2]:
def get_sampledata(year):
    dataset = sampledata.get_dataset("lsts")
    layers_paths = [Path(p) for p in dataset["raster_files"]]
    layers_df = pd.Series([p.stem for p in layers_paths]).str.split("_", expand=True) \
    .rename({0: "sceneid", 1:"band"}, axis=1)

    layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
    layers_df["uname"] = layers_df.sceneid.str[:3] + "_" + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band.str[::]
    layers_df["path"] = layers_paths

    layers_df = layers_df.sort_values(["date", "band"])
    layers_df = layers_df.reset_index(drop=True)

    layers_df_year = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year+1))]
    layers_df_year = layers_df_year.reset_index(drop=True)
    return layers_df_year

The sample data we are loading here contains 23 scenes, each of which consists of three bands (b3, b4, b5) and a QA (quality assessment) band (here fmask). This is a typical starting point for today's using-all-available-pixels EO analysis tasks.

[3]:
df_layers = get_sampledata(2008)
display(df_layers.head())
display(df_layers.tail(1))
df_layers.band.value_counts()
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
sceneid band date uname path
91 LE70350322008342EDC00 fmask 2008-12-07 LE7_2008-12-07_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
[3]:
b3       23
b5       23
fmask    23
b4       23
Name: band, dtype: int64
EOCubeSceneCollection
Initialization

Typically we want to derive temporal features from this kind of dataset, which involves masking out invalid pixels. Invalid pixels are typically covered by clouds, cloud shadows, saturated data, pixels which are outside the sensed area, snow, etc. Usually, these categories are included in the QA layer.

For example, fmask is a typical Landsat QA-like (or pre-classification) layer which has the following categories:

  0 - clear land
  1 - clear water
  2 - cloud
  3 - snow
  4 - shadow
255 - NoData

As a result we initialize the EOCubeSceneCollection as follows:

[4]:
df_layers=df_layers
chunksize=2**5
variables=["b3", "b4", "b5"]
qa="fmask"
qa_valid=[0, 1]

scoll = cube.EOCubeSceneCollection(df_layers=df_layers,
                                   chunksize=chunksize,
                                   variables=variables,
                                   qa=qa,
                                   qa_valid=qa_valid
                                  )
scoll.df_layers.head()
[4]:
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
Calculation of level-3 features
Virtual time series
[5]:
dst_dir = "./xxx_uncontrolled/ls2008_vts4w"
dst_pattern = "./xxx_uncontrolled/ls2008_vts4w/ls2008_vts4w_{date}_{var}.vrt"

freq = '4W'
idx_virtual = pd.date_range(start='2008-01-01', end="2008-12-31", freq=freq)
idx_virtual

scoll.create_virtual_time_series(idx_virtual=idx_virtual,
                                 dst_pattern=dst_pattern,
                                 dtypes="int16",
                                 compress="lzw",
                                 nodata=None,
                                 num_workers=8)
4it [00:04,  1.25s/it]

Create a time series layer stack (VRT) for each variable.

[6]:
for var in scoll.variables:
    input_file_list = list(list(Path(dst_dir).glob(f"*{var}*")))
    input_file_list = np.sort(input_file_list)
    output_file = Path(dst_dir) / "time_series_stacks" / f"ls_2008_vts__2008__{var}__vts4w.vrt"
    print(output_file)
    gdalutils.buildvrt(input_file_list, output_file, relative=True, separate=True)
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b3__vts4w.vrt
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b4__vts4w.vrt
xxx_uncontrolled/ls2008_vts4w/time_series_stacks/ls_2008_vts__2008__b5__vts4w.vrt
Statistical features

Such as mean, std, min, max, percentiles, iqr, …

TODO: Metrics such as regression slope (ordered by time), … - What else can be interesting?
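
The following is only a sketch of such a slope metric and not part of the current API. Assuming a DataFrame with pixels in rows, acquisition dates in columns and NaN for masked observations (like the per-variable dataframes used elsewhere in these tutorials), the per-pixel OLS slope over time could be computed like this:

import numpy as np
import pandas as pd

def calc_temporal_slope(var_df):
    """Per-pixel OLS slope of the observations over time (change per day)."""
    t = pd.to_datetime(var_df.columns)
    t_days = (t - t.min()).days.to_numpy(dtype=float)   # time axis in days
    y = var_df.to_numpy(dtype=float)                     # shape: (n_pixels, n_dates)
    valid = ~np.isnan(y)
    t_mat = np.where(valid, t_days, np.nan)
    t_mean = np.nanmean(t_mat, axis=1, keepdims=True)
    y_mean = np.nanmean(y, axis=1, keepdims=True)
    cov = np.nansum((t_mat - t_mean) * (y - y_mean), axis=1)
    var = np.nansum((t_mat - t_mean) ** 2, axis=1)
    slope = np.full(y.shape[0], np.nan)
    ok = (valid.sum(axis=1) >= 2) & (var > 0)
    slope[ok] = cov[ok] / var[ok]
    return pd.DataFrame({"slope": slope}, index=var_df.index)

Such a function could then be applied chunk-wise via the custom-function mechanism described in the EOCubeSceneCollection - custom functions tutorial.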

[7]:
dst_dir = "./xxx_uncontrolled/ls2008_stats1yr"
dst_pattern = "./xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_{metric}_{var}.vrt"

percentiles = [.05, .1, .25, .50, .75, .9, .95]
iqr = True
diffs = True

scoll.create_statistical_metrics(percentiles=percentiles,
                                 iqr=iqr,
                                 diffs=diffs,
                                 dst_pattern=dst_pattern,
                                 dtypes="int16",
                                 compress="lzw",
                                 nodata=None,
                                 num_workers=8)

4it [00:10,  2.58s/it]

Create a statistical metrics layer stack (VRT) for each variable.

[26]:
# the percentile layers of one variable (here b5), sorted
np.sort(list(Path(dst_dir).glob("*_p??_*b5*")))
[26]:
array([PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p05_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p10_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p25_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p50_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p75_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p90_b5.vrt'),
       PosixPath('xxx_uncontrolled/ls2008_stats1yr/ls2008_stats1yr_p95_b5.vrt')],
      dtype=object)
[37]:
for var in scoll.variables:
    print("*" * 80)
    print(var)
    input_file_list = list(list(Path(dst_dir).glob(f"*{var}*")))
    input_file_list = list(list(Path(dst_dir).glob(f"*_min_*{var}*")))
    input_file_list += list(np.sort(list(Path(dst_dir).glob(f"*_p??_*{var}*"))))
    input_file_list += list(list(Path(dst_dir).glob(f"*_max_*{var}*")))
    input_file_list += list(np.sort(list(Path(dst_dir).glob(f"*_p??-p??_*{var}*"))))
    input_file_list += list(list(Path(dst_dir).glob(f"*_mean_*{var}*")))
    input_file_list += list(list(Path(dst_dir).glob(f"*_std_*{var}*")))
    output_file = Path(dst_dir) / "time_series_stacks" / f"ls_2008_stats__2008__{var}__stats1yr.vrt"
    print(output_file)
    print("Layers:")
    print("\n".join([p.stem for p in input_file_list]))
    gdalutils.buildvrt(input_file_list, output_file, relative=True, separate=True)
********************************************************************************
b3
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b3__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b3
ls2008_stats1yr_p05_b3
ls2008_stats1yr_p10_b3
ls2008_stats1yr_p25_b3
ls2008_stats1yr_p50_b3
ls2008_stats1yr_p75_b3
ls2008_stats1yr_p90_b3
ls2008_stats1yr_p95_b3
ls2008_stats1yr_max_b3
ls2008_stats1yr_p75-p25_b3
ls2008_stats1yr_p90-p10_b3
ls2008_stats1yr_p95-p05_b3
ls2008_stats1yr_mean_b3
ls2008_stats1yr_std_b3
********************************************************************************
b4
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b4__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b4
ls2008_stats1yr_p05_b4
ls2008_stats1yr_p10_b4
ls2008_stats1yr_p25_b4
ls2008_stats1yr_p50_b4
ls2008_stats1yr_p75_b4
ls2008_stats1yr_p90_b4
ls2008_stats1yr_p95_b4
ls2008_stats1yr_max_b4
ls2008_stats1yr_p75-p25_b4
ls2008_stats1yr_p90-p10_b4
ls2008_stats1yr_p95-p05_b4
ls2008_stats1yr_mean_b4
ls2008_stats1yr_std_b4
********************************************************************************
b5
xxx_uncontrolled/ls2008_stats1yr/time_series_stacks/ls_2008_stats__2008__b5__stats1yr.vrt
Layers:
ls2008_stats1yr_min_b5
ls2008_stats1yr_p05_b5
ls2008_stats1yr_p10_b5
ls2008_stats1yr_p25_b5
ls2008_stats1yr_p50_b5
ls2008_stats1yr_p75_b5
ls2008_stats1yr_p90_b5
ls2008_stats1yr_p95_b5
ls2008_stats1yr_max_b5
ls2008_stats1yr_p75-p25_b5
ls2008_stats1yr_p90-p10_b5
ls2008_stats1yr_p95-p05_b5
ls2008_stats1yr_mean_b5
ls2008_stats1yr_std_b5

EOCubeSceneCollection - custom functions

EOCubeSceneCollection and EOCubeSceneCollectionChunk are similar to EOCube and EOCubeChunk and have basically all the functionality of these classes, plus some more specific functionality and behaviour.

While EOCube and EOCubeChunk do not make any assumption about the input data, EOCubeSceneCollection and EOCubeSceneCollectionChunk assume that the input is a collection (usually a time series) of scenes where each scene consists of the same bands (including a quality assessment layer).

Let's get a typical dataset to make what we are talking about more tangible.

But first we load the packages required in this tutorial and the function for loading a sample dataset.

[1]:
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import rasterio
import seaborn as sns

from eobox import sampledata
from eobox.raster import cube
from eobox.raster import gdalutils

from eobox.raster.utils import cleanup_df_values_for_given_dtype
from eobox.raster.utils import dtype_checker_df

print(cube.__file__)
print(sampledata.__file__)
/home/ben/Devel/Packages/eo-box/eobox/raster/cube.py
/home/ben/Devel/Packages/eo-box/eobox/sampledata/__init__.py
The sample dataset
[2]:
def get_sampledata(year):
    dataset = sampledata.get_dataset("lsts")
    layers_paths = [Path(p) for p in dataset["raster_files"]]
    layers_df = pd.Series([p.stem for p in layers_paths]).str.split("_", expand=True) \
    .rename({0: "sceneid", 1:"band"}, axis=1)

    layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
    layers_df["uname"] = layers_df.sceneid.str[:3] + "_" + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band.str[::]
    layers_df["path"] = layers_paths

    layers_df = layers_df.sort_values(["date", "band"])
    layers_df = layers_df.reset_index(drop=True)

    layers_df_year = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year+1))]
    layers_df_year = layers_df_year.reset_index(drop=True)
    return layers_df_year

The sample data we are loading here contains 23 scenes, each of which consists of three bands (b3, b4, b5) and a QA (quality assessment) band (here fmask). This is a typical starting point for today's using-all-available-pixels EO analysis tasks.

[3]:
df_layers = get_sampledata(2008)
display(df_layers.head())
display(df_layers.tail(1))
df_layers.band.value_counts()
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
sceneid band date uname path
91 LE70350322008342EDC00 fmask 2008-12-07 LE7_2008-12-07_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
[3]:
b5       23
b4       23
b3       23
fmask    23
Name: band, dtype: int64

Invalid pixels are typically pixels which are outside the sensed area, saturated data, snow (depending on the application), clouds, cloud shadows, etc. Usually, these categories are mapped in a QA layer. For example, this dataset contains fmask, a widely used Landsat QA (or pre-classification) layer. It has the following categories:

  0 - clear land
  1 - clear water
  2 - cloud
  3 - snow
  4 - shadow
255 - NoData
Processing data with a custom function
Goal: Temporal statistical metrics

In this notebook we show how to calculate a set of temporal statistical metrics / layers from all valid observations of 2008. Specifically, we will calculate the mean, minimum, maximum, standard deviation, the 10th, 25th, 50th (median), 75th and 90th percentiles, and the inter-quartile range from all valid observations. Such temporal statistical metrics can be used as spatially contiguous features, which in turn can be used to train and apply models in large-area mapping applications.

Of course, for a small dataset such as the one used here the calculation could be performed entirely in memory and would be relatively easy. However, if there are many layers and the single layers are large (in terms of pixels), it requires a machine with a lot of memory or code that processes the data in chunks.

With EOCubeSceneCollection you can

  • easily get a chunk of data of a dataset with many large layers, e.g. a time series of Sentinel-2 scenes in the distributed tile format,

  • load a chunk of data that fits in memory,

  • develop a custom function to process the chunk of data,

  • process the whole dataset with the custom function.

Let’s get started.

Read a chunk of data

We first initialize the EOCubeSceneCollection. We already provide the class with important information, i.e.

  • with the variables it should process, and

  • with the band to be used for masking out invalid pixels and the ids in that band to be considered as valid pixels.

Furthermore, the chunksize is set. Here it is a very small number since the whole dataset is small and we want to show that we can develop on a small chunk and process the whole data chunk-wise. With real-world data you should make sure that the chunks are not too small, since that increases the computational overhead and time, but of course not too big for your memory.

[4]:
df_layers=df_layers
chunksize=2**5
variables=["b3", "b4", "b5"]
qa="fmask"
qa_valid=[0, 1]

scoll = cube.EOCubeSceneCollection(df_layers=df_layers,
                                   chunksize=chunksize,
                                   variables=variables,
                                   qa=qa,
                                   qa_valid=qa_valid
                                  )
scoll.df_layers.head()
[4]:
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/Devel/Packages/eo-box/eobox/sampleda...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/Devel/Packages/eo-box/eobox/sampleda...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/Devel/Packages/eo-box/eobox/sampleda...

For the given chunk size we can derive the number of chunks required to cover the whole dataset. Here we have only 4 chunks.

[5]:
print("Chunksize: ", scoll.chunksize)
print("Number of Chunks: ", scoll.n_chunks)
Chunksize:  32
Number of Chunks:  4

We can read the data as with EOCube, i.e. as an unmasked 3-D numpy array that still includes the QA layers. Here the third dimension of sc_chunk.data corresponds to the rows in df_layers. We can also easily convert the data to a dataframe where all pixels are in the rows and all layers are in the columns.

[6]:
sc_chunk = scoll.get_chunk(0).read_data()
print(sc_chunk.data.__class__)
print(sc_chunk.data.shape)
assert sc_chunk.data.shape[2] == df_layers.shape[0]

sc_chunk = sc_chunk.convert_data_to_dataframe()
print(sc_chunk.data.__class__)
print(sc_chunk.data.shape)
assert sc_chunk.data.shape[1] == df_layers.shape[0]
<class 'numpy.ndarray'>
(32, 32, 92)
<class 'pandas.core.frame.DataFrame'>
(1024, 92)

However, in the case of a scene collection we often want to apply the same logic to the masked observations of each variable / band, such as in our case where we want to calculate temporal-statistical metrics. If this is what we want, we can use read_data_by_variable to get a dictionary of dataframes for

  • all variables and the QA-layer with mask=False, or

  • all masked variables without the QA-layer with mask=True.

Note that it is not possible so far to get 3-D arrays, but if needed it can be implemented - just open an issue.

Here we want the masked variables.

[7]:
sc_chunk = sc_chunk.read_data_by_variable(mask=True)
for var in sc_chunk.data.keys():
    print(f"*****************\n{var}\n......")
    display(sc_chunk.data[var].head(2))
*****************
b3
......
date 2008-04-19 00:00:00 2008-04-27 00:00:00 2008-05-05 00:00:00 2008-05-21 00:00:00 2008-05-29 00:00:00 2008-06-06 00:00:00 2008-06-14 00:00:00 2008-06-22 00:00:00 2008-06-30 00:00:00 2008-07-08 00:00:00 ... 2008-08-09 00:00:00 2008-08-17 00:00:00 2008-08-25 00:00:00 2008-09-02 00:00:00 2008-09-18 00:00:00 2008-09-26 00:00:00 2008-10-12 00:00:00 2008-10-28 00:00:00 2008-11-21 00:00:00 2008-12-07 00:00:00
0 NaN NaN NaN 1274 NaN 651.0 NaN 426 240.0 313 ... 919.0 NaN 340 283.0 480.0 NaN NaN 894 NaN NaN
1 NaN NaN 3680.0 1354 NaN 596.0 NaN 426 222.0 313 ... 799.0 NaN 340 283.0 460.0 NaN NaN 767 NaN NaN

2 rows × 23 columns

*****************
b4
......
date 2008-04-19 00:00:00 2008-04-27 00:00:00 2008-05-05 00:00:00 2008-05-21 00:00:00 2008-05-29 00:00:00 2008-06-06 00:00:00 2008-06-14 00:00:00 2008-06-22 00:00:00 2008-06-30 00:00:00 2008-07-08 00:00:00 ... 2008-08-09 00:00:00 2008-08-17 00:00:00 2008-08-25 00:00:00 2008-09-02 00:00:00 2008-09-18 00:00:00 2008-09-26 00:00:00 2008-10-12 00:00:00 2008-10-28 00:00:00 2008-11-21 00:00:00 2008-12-07 00:00:00
0 NaN NaN NaN 2215 NaN 2891.0 NaN 3128 3571.0 3460 ... 2389.0 NaN 2970 3140.0 3075.0 NaN NaN 2250 NaN NaN
1 NaN NaN 4297.0 2280 NaN 2857.0 NaN 3062 3498.0 3294 ... 2353.0 NaN 2932 2979.0 2856.0 NaN NaN 1995 NaN NaN

2 rows × 23 columns

*****************
b5
......
date 2008-04-19 00:00:00 2008-04-27 00:00:00 2008-05-05 00:00:00 2008-05-21 00:00:00 2008-05-29 00:00:00 2008-06-06 00:00:00 2008-06-14 00:00:00 2008-06-22 00:00:00 2008-06-30 00:00:00 2008-07-08 00:00:00 ... 2008-08-09 00:00:00 2008-08-17 00:00:00 2008-08-25 00:00:00 2008-09-02 00:00:00 2008-09-18 00:00:00 2008-09-26 00:00:00 2008-10-12 00:00:00 2008-10-28 00:00:00 2008-11-21 00:00:00 2008-12-07 00:00:00
0 NaN NaN NaN 1961 NaN 1720.0 NaN 1492 1325.0 1339 ... 1155.0 NaN 1253 1146.0 1472.0 NaN NaN 1788 NaN NaN
1 NaN NaN 506.0 1916 NaN 1608.0 NaN 1448 1369.0 1294 ... 1057.0 NaN 1303 1121.0 1420.0 NaN NaN 1650 NaN NaN

2 rows × 23 columns

Create the custom function

Now with this data we can develop a custom function. This function must take the dataframe of one variable as input and must return a dataframe as output where, again, the pixels are in the rows and the output features / layers are in the columns.

As mentioned above we want to calculate the mean, minimum, maximum, standard deviation, the 10th, 25th, 50th (median), 75th and 90th percentiles, and the inter-quartile range from all valid observations. For these statistics we can make use of pd.DataFrame.describe() and extend it by the inter-quartile range.

Let's try it out:

[8]:
var = "b3"
# sc_chunk.data[var]

def calc_statistical_metrics(var_df, percentiles=None, iqr=True):
    """Calculate statistical metrics and the count of valid observations."""
    metrics_df = var_df.transpose().describe(percentiles=percentiles).transpose()
    if iqr and all(np.isin(["25%", "75%"], metrics_df.columns)) :
        metrics_df["iqr"] = metrics_df["75%"] - metrics_df["25%"]
        metrics_df = metrics_df.drop(labels=["count"], axis=1)
    return metrics_df
metrics_df = calc_statistical_metrics(sc_chunk.data[var],
                                      percentiles=[.1, .25, .50, .75, .9], iqr=True)
metrics_df.head()
[8]:
mean std min 10% 25% 50% 75% 90% max iqr
0 530.833333 334.208654 225.0 244.3 305.5 383.0 711.75 916.5 1274.0 406.25
1 760.615385 930.246790 222.0 285.4 313.0 426.0 767.00 1243.0 3680.0 454.00
2 775.142857 906.464973 240.0 297.1 316.0 452.5 652.00 1570.3 3574.0 336.00
3 691.733333 774.670664 192.0 247.8 303.5 346.0 617.50 1685.6 2875.0 314.00
4 550.200000 553.828648 173.0 219.2 247.5 300.0 490.00 1362.4 2011.0 242.50

This is what we want. Now let's calculate the metrics as raster layers for the whole dataset.

Process the whole dataset

To process the whole dataset we also need the destination paths for the output layers. For our dataset with the three variables and the output dataframe above, the destination paths should be a dictionary with a structure like the following:

[9]:
metrics = ['mean', 'std', 'min', 'p10', 'p25', 'p50', 'p75', 'p90', 'max', 'iqr']
dst_path_pattern = "./xxx_uncontrolled/statistical_metrics/ls_2008_metric_{metric}_{var}.vrt"
dst_paths = {}
for var in variables:
    dst_paths[var] = [dst_path_pattern.format(metric=m, var=var) for m in metrics]
dst_paths
[9]:
{'b3': ['./xxx_uncontrolled/statistical_metrics/ls_2008_metric_mean_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_std_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_min_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p10_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p25_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p50_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p75_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p90_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_max_b3.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_iqr_b3.vrt'],
 'b4': ['./xxx_uncontrolled/statistical_metrics/ls_2008_metric_mean_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_std_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_min_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p10_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p25_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p50_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p75_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p90_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_max_b4.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_iqr_b4.vrt'],
 'b5': ['./xxx_uncontrolled/statistical_metrics/ls_2008_metric_mean_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_std_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_min_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p10_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p25_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p50_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p75_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_p90_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_max_b5.vrt',
  './xxx_uncontrolled/statistical_metrics/ls_2008_metric_iqr_b5.vrt']}

We can (optionally) or have to (where required) define

  • the output data type (one for all variables or different ones per variable by passing dtypes in a dictionary with variable keys),

  • the value to use when there are NaNs in the output data,

  • a compression type for the raster,

  • all the keyword arguments that our custom function requires (here percentiles and iqr).

Now we can run the calculations for the whole raster dataset.

[10]:
scoll.apply_and_write_by_variable(fun=calc_statistical_metrics,
                                  dst_paths=dst_paths,
                                  dtypes="int16",
                                  compress="lzw",
                                  nodata=-32768,
                                  percentiles=[.1, .25, .5, .75, .9], iqr=True)
4it [00:20,  5.11s/it]
Visualize the results

Let us visualize what we calculated:

[11]:
df_feature_layers = pd.Series(np.array([dst_paths[var] for var in variables]).flatten()).to_frame()
df_feature_layers.columns = ["path"]
df_feature_layers = pd.concat(
    [
        df_feature_layers,
        df_feature_layers["path"] \
            .apply(lambda x: Path(x).stem).str.split("_", expand=True)[[3, 4]] \
            .rename({3:"metric", 4:"band"}, axis=1)
    ],
    axis=1
)
df_feature_layers = df_feature_layers.sort_values(["band", "metric"]).reset_index()
print(df_feature_layers.shape)
df_feature_layers.head()
(30, 4)
[11]:
index path metric band
0 9 ./xxx_uncontrolled/statistical_metrics/ls_2008... iqr b3
1 8 ./xxx_uncontrolled/statistical_metrics/ls_2008... max b3
2 0 ./xxx_uncontrolled/statistical_metrics/ls_2008... mean b3
3 2 ./xxx_uncontrolled/statistical_metrics/ls_2008... min b3
4 3 ./xxx_uncontrolled/statistical_metrics/ls_2008... p10 b3
[12]:
eoc_metrics = cube.EOCube(df_feature_layers)
chunk_metrics = eoc_metrics.get_chunk(0)
[13]:
fig, axes = plt.subplots(nrows=3, ncols=10, figsize=(30, 10))

for i, ax in enumerate(axes.flatten()):
    aximg = chunk_metrics.plot_raster(i, spatial_bounds=False, vmin=0, vmax=4000, ax=ax)
    ax.set_title(chunk_metrics.df_layers.band[i] + "-" + chunk_metrics.df_layers.metric[i])
plt.tight_layout()
_images/examples_raster_eocubescenecollection_custom_functions_25_0.png

vector

Create distance-to-polygon-border layer

There are cases in remote sensing data analysis where we have reference data (polygon vector data) which corresponds to the extent of the features it represents on earth. As a consequence, the pixels at the inner border of a polygon are most likely so-called mixed pixels. In such a situation the polygons are often buffered before they are used for the extraction of raster values and the analysis. This is less flexible or more laborious for advanced analysis.

To be more flexible in the analysis of the extracted raster data it can be advantageous to create a distance-to-polygon-border layer. Then we can also extract the distance of each pixel to the nearest polygon border and use this information in the analysis.

The function eobox.vector.calc_distance_to_border makes it easy to calculate such a layer. This tutorial describes how to use it.

[1]:
#import fiona
import geopandas as gpd
import matplotlib.pyplot as plt
#import numpy as np
from pathlib import Path
import rasterio
#import shapely
#import pandas as pd
#from shapely.geometry import LineString

from eobox.sampledata import get_dataset
from eobox.vector import calc_distance_to_border

%matplotlib inline
The sample dataset
[2]:
ds = get_dataset('s2l1c')

src_file_vector = ds['vector_file_osm']
template_file_raster = ds['raster_files'][0]

# *xxx_uncontrolled* is in the eo-box .gitignore
interim_file_lines = "./xxx_uncontrolled_d2b/_interim_sample_vector_dataset_lines.shp"
interim_file_lines_raster = "./xxx_uncontrolled_d2b/_interim_sample_vector_dataset_lines_raster.tif"

dst_file_proximity = "./xxx_uncontrolled_d2b/distance_to_polygon_border__vector_dataset.tif"
Processing
Goal

For a given raster dataset we want to know, for each pixel, how far away it is (in pixel distance) from the nearest polygon border of a vector dataset.

High-Level implementation
  1. Convert the polygon vector dataset into lines and save them in the CRS of the raster.

  2. Rasterize the lines into an empty raster.

  3. Calculate the proximity.
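
As a rough illustration of these three steps - this is not the actual implementation of calc_distance_to_border, just a sketch with geopandas and GDAL utilities, reusing the file paths defined above - the processing could look like this:

from pathlib import Path

import geopandas as gpd
from osgeo import gdal

Path(interim_file_lines).parent.mkdir(parents=True, exist_ok=True)

# 1. polygons -> boundary lines, reprojected to the CRS of the template raster
template = gdal.Open(template_file_raster)
lines = gpd.read_file(src_file_vector)
lines["geometry"] = lines.geometry.boundary
lines = lines.to_crs(template.GetProjection())
lines.to_file(interim_file_lines)

# 2. rasterize the lines onto an empty raster with the template's grid
drv = gdal.GetDriverByName("GTiff")
lines_ras = drv.Create(interim_file_lines_raster,
                       template.RasterXSize, template.RasterYSize, 1, gdal.GDT_Byte)
lines_ras.SetGeoTransform(template.GetGeoTransform())
lines_ras.SetProjection(template.GetProjection())
gdal.Rasterize(lines_ras, interim_file_lines, burnValues=[1])

# 3. distance (in pixels) of every pixel to the nearest rasterized line
prox = drv.Create(dst_file_proximity,
                  template.RasterXSize, template.RasterYSize, 1, gdal.GDT_Float32)
prox.SetGeoTransform(template.GetGeoTransform())
prox.SetProjection(template.GetProjection())
gdal.ComputeProximity(lines_ras.GetRasterBand(1), prox.GetRasterBand(1),
                      ["VALUES=1", "DISTUNITS=PIXEL"])
lines_ras = prox = template = None  # close the datasets to flush to disk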

Usage
[3]:
calc_distance_to_border(polygons=src_file_vector,
                        template_raster=template_file_raster,
                        dst_raster=dst_file_proximity,
                        overwrite=False,
                        keep_interim_files=True)  # stores the vector and rasterized lines
100%|██████████| 604/604 [00:00<00:00, 48161.81it/s]
Interim files are in xxx_uncontrolled_d2b/TEMPDIR_distance_to_polygon_border__vector_dataset_xqijr5sc
[3]:
0
Visualize results
[4]:
import rasterio.plot
[5]:
r = rasterio.open(template_file_raster)
r_d2b = rasterio.open(dst_file_proximity)
v = gpd.read_file(src_file_vector)
[6]:
fig, ax = plt.subplots(figsize=(16, 6))
rasterio.plot.show(r, cmap='pink', ax=ax)
v.plot(ax=ax, facecolor='none', edgecolor='deepskyblue')
[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4317898d0>
_images/examples_vector_calculate_distance_to_polygon_border_9_1.png
[7]:
fig, ax = plt.subplots(figsize=(16, 6))
rasterio.plot.show(r_d2b, vmin=0, vmax=10, cmap='pink', ax=ax)
v.plot(ax=ax, facecolor='none', edgecolor='deepskyblue')
[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc431765fd0>
_images/examples_vector_calculate_distance_to_polygon_border_10_1.png

ml

Forward Feature Group Selection

What are we talking about?

Forward feature group selection works like forward feature selection but enables you to consider feature groups instead of only single features.

Why?

Sometimes features can be grouped logically and it might be of interest how groups of features perform instead of single features.

If this is the case, an additional advantage is that the number of iterations and train/val tasks needed to run through a sequential feature selection is lower.

Example
[2]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pprint as pp
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

from mlxtend import feature_selection
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

from eobox.ml import ffgs
Dataset

We use the well known Iris dataset here but add additional columns without any information to have some more features.

[3]:
iris = load_iris(as_frame=True)
X_informative = iris.data
np.random.seed(0)
X_uninformative = pd.DataFrame(np.random.uniform(X_informative.min(),
                                                 X_informative.max(),
                                                 size=X_informative.shape))
X_uninformative.columns = [f"noinfo-{i}" for i in range(X_uninformative.shape[1])]
X = pd.concat([X_informative, X_uninformative], axis=1)
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
display(X.head())
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) noinfo-0 noinfo-1 noinfo-2 noinfo-3
0 5.1 3.5 1.4 0.2 6.275729 3.716454 4.556304 1.407720
1 4.9 3.0 1.4 0.2 5.825157 3.550146 3.581765 2.240255
2 4.7 3.2 1.3 0.2 7.769186 2.920260 5.671178 1.369348
3 4.6 3.1 1.5 0.2 6.344960 4.221432 1.419113 0.309110
4 5.0 3.6 1.4 0.2 4.372786 3.998288 5.591125 2.188029
Forward Feature Selection (FFS)
FFS on individual features

With FFS we consider and select one feature in each iteration.

[4]:
ffs = feature_selection.SequentialFeatureSelector(knn,
                                                  k_features=X.shape[1],
                                                  forward=True,
                                                  floating=False,
                                                  scoring='accuracy',
                                                  verbose=1
                                                  )
ffs = ffs.fit(X, y, custom_feature_names=X.columns)

print(f"Iteration : Selected Features")
for iter, subset in ffs.subsets_.items():
    print(f"{iter}         : {subset['feature_names']}")

plot_sfs(ffs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Feature Selection (w. StdDev)')
plt.grid()
plt.show()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.1s finished
Features: 1/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.1s finished
Features: 2/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.1s finished
Features: 3/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished
Features: 4/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s finished
Features: 5/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished
Features: 6/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
Features: 7/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
Features: 8/8
Iteration : Selected Features
1         : ('petal width (cm)',)
2         : ('petal length (cm)', 'petal width (cm)')
3         : ('petal length (cm)', 'petal width (cm)', 'noinfo-1')
4         : ('sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'noinfo-1')
5         : ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'noinfo-1')
6         : ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'noinfo-1', 'noinfo-3')
7         : ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'noinfo-1', 'noinfo-2', 'noinfo-3')
8         : ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'noinfo-0', 'noinfo-1', 'noinfo-2', 'noinfo-3')
_images/examples_ml_ffgs_ForwardFeatureGroupSelection_5_2.png
FFGS - Forward Feature Group Selection

With FFGS we can define groups of features and consider and select these groups of features in each iteration. Here we define four groups of features.

[5]:
fgroups_dict = {
    0: [0, 1], # sepal features
    1: [2, 3], # petal features
    2: [4], # a single no-info feature
    3: [5, 6, 7], # three no-info features
}
fgroups = np.zeros(X.shape[1], dtype=int)
for group, feature_indices in fgroups_dict.items():
    # print(group, feature_indices)
    fgroups[np.array(feature_indices)] = group
print("fgroups:", fgroups)
fgroups: [0 0 1 1 2 3 3 3]

We run a forward feature group selection similar to the forward sequential feature selection, except that we additionally have to specify the feature grouping.

[6]:
mod = KNeighborsClassifier(n_neighbors=4)
fsel = ffgs.ForwardFeatureGroupSelection(mod,
                                         k_features=X.shape[1],
                                         scoring='accuracy',
                                         verbose=1)

fsel = fsel.fit(X, y, custom_feature_names=X.columns, fgroups=fgroups)


print(f"Iteration : Selected Features")
for iter, subset in fsel.subsets_.items():
    print(f"{iter}         : {subset['feature_names']}")

plot_sfs(fsel.get_metric_dict(), kind='std_dev')
plt.title('Forward Feature Group Selection (w. StdDev)')
plt.grid()
plt.show()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s finished
Features: 2/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished
Features: 4/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
Features: 5/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
Features: 8/8
Iteration : Selected Features
2         : ('petal length (cm)', 'petal width (cm)')
4         : ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)')
5         : ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'noinfo-0')
8         : ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'noinfo-0', 'noinfo-1', 'noinfo-2', 'noinfo-3')
_images/examples_ml_ffgs_ForwardFeatureGroupSelection_9_2.png
Computational cost

As mentioned above, the number of iterations and train/val tasks is lower if we can consider groups of features instead of single features only.

To evaluate the 4 feature groups we needed to train and estimate the performance for 9 feature sets, while for the feature selection on 8 individual features this number increases to 35.

[7]:
print("Number of CV runs with single features FFS    : ", ffgs.get_number_of_ffs_iterations(8))
print("Number of CV runs with group-wise features FFS: ", ffgs.get_number_of_ffs_iterations(4))
Number of CV runs with single features FFS    :  35
Number of CV runs with group-wise features FFS:  9

SequentialFeatureSelector analysis utilities

The functions shown here are meant to facilitate the analysis of feature selection results. Here they are applied to a trained ForwardFeatureGroupSelection instance, but they can also be used with a trained SequentialFeatureSelector (SFS) instance.

[2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

# from mlxtend import feature_selection

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris


from eobox.ml import ffgs
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Dataset
[3]:
iris = load_iris(as_frame=True)
X_informative = iris.data
np.random.seed(0)
X_uninformative = pd.DataFrame(np.random.uniform(X_informative.min(),
                                                 X_informative.max(),
                                                 size=X_informative.shape))
X_uninformative.columns = [f"noinfo-{i}" for i in range(X_uninformative.shape[1])]
X = pd.concat([X_informative, X_uninformative], axis=1)
y = iris.target
Feature selection
[4]:
mod = KNeighborsClassifier(n_neighbors=4)
fsel = ffgs.ForwardFeatureGroupSelection(mod,
                                         k_features=X.shape[1],
                                         scoring='accuracy',
                                         verbose=1)
fgroups = [0, 0, 1, 1, 2, 3, 3, 3]
#fgroups = None
fsel = fsel.fit(X, y, custom_feature_names=X.columns, fgroups=fgroups)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s finished
Features: 2/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s finished
Features: 4/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
Features: 5/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
Features: 8/8
SFS - metrics dataframes
Default
[5]:
sfs_metrics_df = ffgs.sfs_metrics_to_dataframe(fsel, None)
sfs_metrics_df
[5]:
n_features feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err feature_idx_new feature_names_new
iter
1 2 (2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm))
2 4 (0, 1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.966... 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm))
3 5 (0, 1, 2, 3, 4) [0.9666666666666667, 1.0, 0.9333333333333333, ... 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,)
4 8 (0, 1, 2, 3, 4, 5, 6, 7) [0.8666666666666667, 0.9333333333333333, 0.9, ... 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2)
Wide dataframe

Cross validation scores in additional columns. One row per selected feature. K(-folds) additional columns.

[6]:
sfs_metrics_df_wide = ffgs.sfs_metrics_to_dataframe(fsel, "wide")
sfs_metrics_df_wide
[6]:
n_features feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err feature_idx_new feature_names_new cv_score_0 cv_score_1 cv_score_2 cv_score_3 cv_score_4
iter
1 2 (2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667 0.966667 0.933333 0.966667 1.0
2 4 (0, 1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.966... 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667 0.966667 0.966667 0.966667 1.0
3 5 (0, 1, 2, 3, 4) [0.9666666666666667, 1.0, 0.9333333333333333, ... 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.966667 1.000000 0.933333 0.866667 1.0
4 8 (0, 1, 2, 3, 4, 5, 6, 7) [0.8666666666666667, 0.9333333333333333, 0.9, ... 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.866667 0.933333 0.900000 0.933333 0.8
Long dataframe

Cross-validation scores in additional rows: K (number of folds) additional rows per selection iteration.

[7]:
sfs_metrics_df_long = ffgs.sfs_metrics_to_dataframe(fsel, "long")
sfs_metrics_df_long
[7]:
iter fold n_features feature_idx avg_score feature_names ci_bound std_dev std_err feature_idx_new feature_names_new cv_score
1-0 1 0 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667
1-1 1 1 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667
1-2 1 2 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.933333
1-3 1 3 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667
1-4 1 4 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 1.000000
2-0 2 0 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667
2-1 2 1 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667
2-2 2 2 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667
2-3 2 3 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667
2-4 2 4 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 1.000000
3-0 3 0 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.966667
3-1 3 1 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 1.000000
3-2 3 2 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.933333
3-3 3 3 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.866667
3-4 3 4 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 1.000000
4-0 4 0 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.866667
4-1 4 1 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.933333
4-2 4 2 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.900000
4-3 4 3 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.933333
4-4 4 4 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.800000
Plotting

In the case of a forward feature group selection it makes sense to add a column with group names for plotting.

[8]:
fgroup_names = {0: "sepal", 1: "petal", 2: "noinfo-A", 3: "noinfo-B"}

sfs_metrics_df = ffgs.sfs_metrics_to_dataframe(fsel, explode_cv_scores="wide", fgroups=fgroups, fgroup_names=fgroup_names)
sfs_metrics_df
[8]:
n_features feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err feature_idx_new feature_names_new cv_score_0 cv_score_1 cv_score_2 cv_score_3 cv_score_4 feature_groups feature_group_new feature_group_names feature_group_name_new
iter
1 2 (2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667 0.966667 0.933333 0.966667 1.0 (1,) 1 (petal,) petal
2 4 (0, 1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.966... 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667 0.966667 0.966667 0.966667 1.0 (0, 1) 0 (petal, sepal) sepal
3 5 (0, 1, 2, 3, 4) [0.9666666666666667, 1.0, 0.9333333333333333, ... 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.966667 1.000000 0.933333 0.866667 1.0 (0, 1, 2) 2 (noinfo-A, petal, sepal) noinfo-A
4 8 (0, 1, 2, 3, 4, 5, 6, 7) [0.8666666666666667, 0.9333333333333333, 0.9, ... 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.866667 0.933333 0.900000 0.933333 0.8 (0, 1, 2, 3) 3 (noinfo-A, noinfo-B, petal, sepal) noinfo-B
Cross-validation accuracies and mean
[9]:
colnames_cv_scores = sfs_metrics_df.columns[sfs_metrics_df.columns.str.startswith("cv_score_")]
ax = sfs_metrics_df[colnames_cv_scores].plot(style='-o',
                                             xticks=sfs_metrics_df.index,
                                             figsize=(18, 6))
ax = sfs_metrics_df.avg_score.plot(style='-o', c="k")
ax = ax.set_xticklabels(sfs_metrics_df.feature_names_new)
_images/examples_ml_ffgs_SequentialFeatureSelectorAnalysisUtilities_15_0.png
Cross-validation accuracies as boxplots
[10]:
sfs_metrics_df_long = ffgs.sfs_metrics_to_dataframe(fsel, explode_cv_scores="long", fgroups=fgroups, fgroup_names=fgroup_names)

fig, ax = plt.subplots(figsize=(18, 6))
ax = sns.boxplot(x="feature_group_name_new", y="cv_score", color="w", data=sfs_metrics_df_long)
ax = sns.swarmplot(x="feature_group_name_new", y="cv_score", hue="fold", data=sfs_metrics_df_long)
_images/examples_ml_ffgs_SequentialFeatureSelectorAnalysisUtilities_17_0.png

Early stopping feature selection

Model performance during sequential forward feature selection often reaches a plateau before all features are selected. To save time it makes sense to stop the feature selection process once the relevant features are selected. Depending on the use case, a variety of stopping criteria might make sense.

Here we show how to implement and use a custom early stopping criterion.

[2]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

from eobox.ml import ffgs
Dataset

Let us define a dataset with informative and uninformative features.

[3]:
iris = load_iris(as_frame=True)
X_informative = iris.data
np.random.seed(0)
X_uninformative = pd.DataFrame(np.random.uniform(X_informative.min(),
                                                 X_informative.max(),
                                                 size=X_informative.shape))
X_uninformative.columns = [f"noinfo-{i}" for i in range(X_uninformative.shape[1])]
X = pd.concat([X_informative, X_uninformative], axis=1)
y = iris.target
Full selection

Let's start by running through the full selection process and visualizing the feature learning curve.

[4]:
knn = KNeighborsClassifier(n_neighbors=4)
fsel = ffgs.ForwardFeatureGroupSelection(knn, k_features=X.shape[1])
fsel = fsel.fit(X, y)
plot_sfs(fsel.get_metric_dict())
[4]:
(<Figure size 432x288 with 1 Axes>,
 <AxesSubplot:xlabel='Number of Features', ylabel='Performance'>)
_images/examples_ml_ffgs_EarlyStoppingFeatureSelection_5_1.png
Custom early stopping function
Example implementation

To implement early stopping we need to write a function that takes the metric dictionary (the output of ForwardFeatureGroupSelection.get_metric_dict()) as input and returns a boolean indicating whether to stop (True) or proceed with (False) the feature selection. This function is evaluated internally after every iteration with the metric dictionary as filled up to that point.

For example, a stopping criterion could be to stop if the mean performance does not increase by more than 0.5% (p) for 3 (n) consecutive iterations.

[5]:
p = 0.005
n = 3

One way to implement this is as follows (note that sfs_metrics_to_dataframe is available at execution time and can be used inside the custom function).

[6]:
def early_stop(metrics_dict, p=0.005, n=3, verbosity=0):
    # Convert the metric dictionary into a dataframe with one row per iteration.
    metrics_df = ffgs.sfs_metrics_to_dataframe(metrics_dict)
    # Relative change of the mean CV score between successive iterations.
    metrics_df["perf_change"] = metrics_df["avg_score"].pct_change()
    # Flag iterations whose improvement is less than or equal to p.
    metrics_df["perf_change_let_p"] = metrics_df["perf_change"] <= p

    # Count how many successive iterations the flag has been True.
    metrics_df["perf_change_let_p_n_successive"] = metrics_df["perf_change_let_p"].groupby(metrics_df["perf_change_let_p"].eq(0).cumsum()).cumsum().tolist()
    # Stop if the last iteration is the n-th successive one without sufficient improvement.
    stop = metrics_df.loc[metrics_df.index[-1], "perf_change_let_p_n_successive"] >= n

    if verbosity > 0 and stop:
        print("Stopping criteria met!")
    if verbosity > 1 and stop:
        display(metrics_df[["avg_score", "perf_change", "perf_change_let_p", "perf_change_let_p_n_successive"]])
    return stop

Note that the metric dictionary gains one more element after every iteration, i.e. after every additionally selected feature (group). The following simulates the early stopping evaluation process:

[7]:
full_metric_dict = fsel.get_metric_dict()

metric_dict_after_iter_i = {}
for i, last_md_key in enumerate(full_metric_dict.keys()):
    print(f"i={i}", end=" - ")
    metric_dict_after_iter_i[last_md_key] = full_metric_dict[last_md_key]
    stop = early_stop(metric_dict_after_iter_i, p=0.005, n=3, verbosity=2)
    #print(f"STOP: {stop}")
    if stop:
        break
i=0 - i=1 - i=2 - i=3 - i=4 - i=5 - Stopping criteria met!
avg_score perf_change perf_change_let_p perf_change_let_p_n_successive
iter
1 0.960000 NaN False 0
2 0.966667 0.006944 False 0
3 0.973333 0.006897 False 0
4 0.973333 0.000000 True 1
5 0.966667 -0.006849 True 2
6 0.926667 -0.041379 True 3
Usage

As we can see from the output and the plot, the last iterations are no longer run.

[8]:
fsel = ffgs.ForwardFeatureGroupSelection(knn, k_features=X.shape[1], verbose=True)
fsel = fsel.fit(X, y, custom_early_stop=early_stop)
plot_sfs(fsel.get_metric_dict())
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.1s finished
Features: 1/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.1s finished
Features: 2/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.1s finished
Features: 3/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished
Features: 4/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s finished
Features: 5/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished
Features: 6/8
Stopping early due to custom early stopping criteria.
[8]:
(<Figure size 432x288 with 1 Axes>,
 <AxesSubplot:xlabel='Number of Features', ylabel='Performance'>)
_images/examples_ml_ffgs_EarlyStoppingFeatureSelection_13_3.png

Package content

Subpackages:

eobox.raster

eobox.raster subpackage

Subpackage which mainly builds upon rasterio (and thus gdal) functionality. It provides tools for

  • extracting values of raster data at locations given by a vector dataset,

  • chunk-wise processing (e.g. classification) of multiple single layer files.

For more information on the package content, visit [readthedocs](https://eo-box.readthedocs.io/en/latest/eobox.raster.html).

Submodules:

eobox.raster.cube
class eobox.raster.cube.EOCube(df_layers, chunksize=32, wdir=None)[source]

Bases: eobox.raster.cube.EOCubeAbstract

apply_and_write(fun, dst_paths, **kwargs)[source]
create_vrt_from_chunks(dst_path)[source]
get_chunk(ji)[source]

Get an EOCubeChunk

class eobox.raster.cube.EOCubeAbstract(df_layers, chunksize=32, wdir=None)[source]

Bases: object

property chunksize
property df_layers
get_chunk_path_from_layer_path(path_layer, ji, mkdir=True)[source]
get_df_ilocs(band, date)[source]

Get positions of rows matching specific band(s) and date(s).

The method supports three typical queries:

  • one band and one date (both given as strings)

  • one band and several dates (band given as a string, dates as a list of strings)

  • several bands and one date (date given as a string, bands as a list of strings)

Parameters
  • band (str or list) – Band(s) for which to derive the iloc index.

  • date (str or list) – Date(s) for which to derive the iloc index.

Returns

Integer (if band and date are str) or list of iloc indices.

Return type

int or list

property n_chunks
property wdir
class eobox.raster.cube.EOCubeChunk(ji, df_layers, chunksize=32, wdir=None)[source]

Bases: eobox.raster.cube.EOCubeAbstract

property chunksize
convert_data_to_dataframe()[source]
convert_data_to_ndarray()[source]

Converts the data from dataframe to ndarray format. Assumption: df-columns are ndarray-layers (3rd dim.)

property data
static from_eocube(eocube, ji)[source]

Create an EOCubeChunk object from an EOCube object.

property ji
plot_raster(idx_layer, robust=False, vmin=None, vmax=None, spatial_bounds=False, figsize=None, ax=None)[source]
plot_raster_rgb(idx_layers, robust=False, vmin=None, vmax=None, spatial_bounds=False, figsize=None, ax=None)[source]
read_data()[source]
static robust_data_range(arr, robust=False, vmin=None, vmax=None)[source]

Get a robust data range, i.e. 2nd and 98th percentile for vmin, vmax parameters.

property spatial_bounds
write_dataframe(result, dst_paths, nodata=None, compress='lzw')[source]

Write results (dataframe) to disc.

write_ndarray(result, dst_paths, nodata=None, compress='lzw')[source]

Write results (ndarray) to disc.

class eobox.raster.cube.EOCubeSceneCollection(df_layers, chunksize, variables, qa, qa_valid, wdir=None)[source]

Bases: eobox.raster.cube.EOCubeSceneCollectionAbstract, eobox.raster.cube.EOCube

Handling scene collections, i.e. a set of scenes each with the same layers.

The class enables chunk-wise processing over a set of scenes having the same variables / bands. For this, the df_layers dataframe requires the following columns:

  • sceneid (unique identifier of the scene),

  • date (the acquisition date of the scene as datetime type),

  • band (the layers / bands that exist for all scenes),

  • uname (unique identifier for all layers, i.e. scene + variable/qu-layer),

  • path (the path where the raster for that layer is located).

Parameters
  • df_layers (dataframe) – A dataframe, see description above.

  • chunksize (int) – Size of the spatial window used as processing unit.

  • variables (list of str) – Those values in df_layers['band'] that are treated as variables.

  • qa (str) – The value in df_layers['band'] which is treated as quality assessment layer.

  • qa_valid (list of int) – The values in the quality assessment layer that identify pixels to be considered as valid in the variable rasters, by default None

  • wdir (str, optional) – Working directory

Raises

ValueError – [description]
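
For illustration, a minimal sketch of how df_layers might look and how the collection could be instantiated. All scene IDs, band names and file paths below are made up; since the paths do not exist, the snippet only illustrates the expected structure of df_layers:

    import pandas as pd
    from eobox.raster.cube import EOCubeSceneCollection

    # hypothetical df_layers: one row per single-band raster file of each scene
    df_layers = pd.DataFrame({
        "sceneid": ["S1", "S1", "S1", "S2", "S2", "S2"],
        "date": pd.to_datetime(["2020-01-01"] * 3 + ["2020-01-17"] * 3),
        "band": ["b04", "b08", "fmask"] * 2,
        "uname": ["S1_b04", "S1_b08", "S1_fmask", "S2_b04", "S2_b08", "S2_fmask"],
        "path": ["S1_b04.tif", "S1_b08.tif", "S1_fmask.tif",
                 "S2_b04.tif", "S2_b08.tif", "S2_fmask.tif"],  # placeholder paths
    })
    scoll = EOCubeSceneCollection(df_layers, chunksize=2**5,
                                  variables=["b04", "b08"],
                                  qa="fmask", qa_valid=[0, 1])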

apply_and_write_by_variable(fun, dst_paths, dtypes, compress, nodata, **kwargs)[source]
create_statistical_metrics(percentiles, iqr, diffs, dst_pattern, dtypes, compress='lzw', nodata=None, num_workers=1)[source]
create_virtual_time_series(idx_virtual, dst_pattern, dtypes, compress='lzw', nodata=None, num_workers=1)[source]
get_chunk(ji)[source]

Get an EOCubeSceneCollectionChunk

class eobox.raster.cube.EOCubeSceneCollectionAbstract(df_layers, chunksize, variables, qa, qa_valid, wdir=None)[source]

Bases: eobox.raster.cube.EOCubeAbstract

Handling scene collections, i.e. a set of scenes each with the same layers.

The class enables chunk-wise processing over a set of scenes having the same variables / bands. For this, the df_layers dataframe requires the following columns:

  • sceneid (unique identifier of the scene),

  • date (the acquisition date of the scene as datetime type),

  • band (the layers / bands that exist for all scenes),

  • uname (unique identifier for all layers, i.e. scene + variable/qu-layer),

  • path (the path where the raster for that layer is located).

Parameters
  • df_layers (dataframe) – A dataframe, see description above.

  • chunksize (int) – Size of the spatial window used as processing unit.

  • variables (list of str) – Those values in df_layers['band'] that are treated as variables.

  • qa (str) – The value in df_layers['band'] which is treated as quality assessment layer.

  • qa_valid (list of int) – The values in the quality assessment layer that identify pixels to be considered as valid in the variable rasters, by default None

  • wdir (str, optional) – Working directory

Raises

ValueError – [description]

property qa
property qa_valid
property variables
class eobox.raster.cube.EOCubeSceneCollectionChunk(ji, df_layers, chunksize, variables, qa, qa_valid, wdir=None)[source]

Bases: eobox.raster.cube.EOCubeSceneCollectionAbstract, eobox.raster.cube.EOCubeChunk

Handling scene collections, i.e. a set of scenes each with the same layers.

The class enables chunk-wise processing over a set of scenes having the same variables / bands. For this, the df_layers dataframe requires the following columns:

  • sceneid (unique identifier of the scene),

  • date (the acquisition date of the scene as datetime type),

  • band (the layers / bands that exist for all scenes),

  • uname (unique identifier for all layers, i.e. scene + variable/qu-layer),

  • path (the path where the raster for that layer is located).

Parameters
  • df_layers (dataframe) – A dataframe, see description above.

  • chunksize (int) – Size of the spatial window used as processing unit.

  • variables (list of str) – Those values in df_layers['band'] that are treated as variables.

  • qa (str) – The value in df_layers['band'] which is treated as quality assessment layer.

  • qa_valid (list of int) – The values in the quality assessment layer that identify pixels to be considered as valid in the variable rasters, by default None

  • wdir (str, optional) – Working directory

Raises

ValueError – [description]

read_data_by_variable(mask=True)[source]

Reads and masks (if desired) the data and converts it into one dataframe per variable.

eobox.raster.cube.create_names_statistical_metrics(dst_pattern, percentiles=None, iqr=True, diffs=True, variables=None)[source]
eobox.raster.cube.create_names_virtual_time_series(dst_pattern, idx_virtual, variables=None)[source]
eobox.raster.cube.create_statistical_metrics(df_var, percentiles=None, iqr=True, diffs=True, colname_pattern=None, num_workers=1, rename2p=False, verbosity=0)[source]

Calculate statistical metrics from a dataframe with a pd.DatetimeIndex and an instance (e.g. pixels) dimension.

Parameters
  • df_var ([type]) – [description]

  • percentiles ([type]) – [description]

  • iqr ([type]) – [description]

  • num_workers (int, optional) – [description], by default 1

  • rename2p (bool, optional) – Rename %-columns (e.g. 90%) to p-columns (e.g. p90)

  • verbosity (int, optional) – [description], by default 0

Returns

[description]

Return type

[type]

eobox.raster.cube.create_virtual_time_series(df_var, idx_virtual, colname_pattern=None, num_workers=1, verbosity=0)[source]

Create a virtual time series from a dataframe with a pd.DatetimeIndex and an instance (e.g. pixels) dimension.

Parameters
  • df_var ([type]) – [description]

  • idx_virtual ([type]) – [description]

  • colname_pattern (str) – Pattern for the column name, e.g. ‘ls2010_vts2w_{date}_ndvi’, where the dates from idx_virtual will be inserted.

  • num_workers (int, optional) – [description], by default 1

  • verbosity (int, optional) – [description], by default 0

Returns

[description]

Return type

[type]
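
A hedged sketch of the expected df_var layout, with made-up values. That idx_virtual can be passed as a pd.DatetimeIndex is an assumption here and should be verified against the source:

    import numpy as np
    import pandas as pd
    from eobox.raster.cube import create_virtual_time_series

    # hypothetical input: 12 acquisition dates (rows) x 4 pixels (columns)
    dates = pd.date_range("2020-01-01", periods=12, freq="MS")
    df_var = pd.DataFrame(np.random.rand(12, 4), index=dates)

    # target dates of the equidistant virtual time series (assumed to be a DatetimeIndex)
    idx_virtual = pd.date_range("2020-01-01", "2020-12-01", freq="2MS")
    df_vts = create_virtual_time_series(df_var, idx_virtual, num_workers=1)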

eobox.raster.cube.scoll_df_to_var_dfs(scoll_df, df_layers, qa=None, qa_valid=None, verbose=True)[source]

Convert a dataframe with variable and qa columns into a dictionary with masked variable dataframes.

Parameters
  • scoll_df ([pd.DataFrame]) – dataframe as returned by EOCubeSceneCollectionChunk(…).read_data().convert_data_to_dataframe()

  • df_layers ([pd.DataFrame]) – See EOCubeSceneCollectionAbstract.

  • qa ([str]) – See EOCubeSceneCollectionAbstract.

  • qa_valid ([list]) – See EOCubeSceneCollectionAbstract.

  • verbose ([bool]) –

eobox.raster.extraction

Module for extracting values of raster data at locations given by a vector dataset.

eobox.raster.extraction.add_vector_data_attributes_to_extracted(ref_vector, pid, dir_extracted, overwrite=False)[source]

From the vector dataset used for extraction, save attributes as npy files corresponding to the extracted pixel values.

Parameters
  • ref_vector (str or pathlib.Path) – The vector dataset which has been used in extract().

  • pid (str) – The burn_attribute that has been used in extract(). Note that this only makes sense if the burn attribute is a unique feature (e.g. polygon) identifier.

  • dir_extracted (str or pathlib.Path) – The output directory which has been used in extract().

  • overwrite (bool, optional) – If True existing data will be overwritten, by default False.

eobox.raster.extraction.convert_df_to_geodf(df, crs=None)[source]

Convert dataframe returned by load_extracted to a geodataframe.

Parameters
  • df (dataframe) – Dataframe as returned by load_extracted. It must contain the columns aux_coord_x and aux_coord_y.

  • crs (None, dict, str or Path, optional) – The crs of the saved coordinates given as a dict, e.g. {'init':'epsg:32632'}, or via the dir_extracted. In the latter (str, Path) case, it is assumed that the crs can be derived from any tiff that is located in the folder. The default does not set any crs. By default None.

eobox.raster.extraction.extract(src_vector: str, burn_attribute: str, src_raster: list, dst_names: list, dst_dir: str, dist2pb: bool = False, dist2rb: bool = False, src_raster_template: Optional[str] = None, gdal_dtype: int = 4, n_jobs: int = 1) → int[source]

Extract pixel values of a list of single-band raster files overlaying with a vector dataset.

This function does not return the extracted values but stores them in the dst_dir directory. The extracted values of each raster will be stored as a separate NumPy binary file as well as the values of the burn_attribute. Additionally, the folder will contain one or more intermediate GeoTIFF files, e.g. the rasterized burn_attribute and, if selected, the dist2pb and/or dist2rb layer.

Note that the pixel coordinates will also be extracted and stored as aux_coord_y and aux_coord_x. Therefore these names should be avoided in dst_names.

The function add_vector_data_attributes_to_extracted can be used to add other attributes from src_vector to the store of extracted values such that they can be loaded easily together with the other data.

With load_extracted the data can then be loaded conveniently.

If a file with a given name already exists the raster will be skipped.

Parameters
  • src_vector (str) – Filename of the vector dataset. Currently it must have the same CRS as the raster.

  • burn_attribute (str) – Name of the attribute column in the src_vector dataset to be stored with the extracted data. This should usually be a unique ID for the features (points, lines, polygons) in the vector dataset. Note that this attribute should not contain zeros since this value is internally used for pixels that should not be extracted, or, in other words, that do not overlap with the vector data.

  • src_raster (list) – List of file paths of the single-band raster files from which to extract the pixel values.

  • dst_names (list) – List of names corresponding to src_raster used to store and later identify the extracted values.

  • dst_dir (str) – Directory to store the data to.

Keyword Arguments
  • dist2pb (bool) – Create an additional auxiliary layer containing the distance to the closest polygon border for each extracted pixel. Defaults to False.

  • dist2rb (bool) – Create an additional auxiliary layer containing the distance to the closest raster border for each extracted pixel. Defaults to False.

  • src_raster_template (str) – A template raster to be used for rasterizing the vector file. Usually the first element of src_raster (default: None).

  • gdal_dtype (int) – Numeric GDAL data type, defaults to 4 which is UInt32. See https://github.com/mapbox/rasterio/blob/master/rasterio/dtypes.py for useful look-up tables.

  • n_jobs (int) – Number of parallel processes to be used for extraction. -1 uses all processors. Defaults to 1.

Returns

[int] – If successful the function returns 0 as an exit code and 1 otherwise.
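
A minimal usage sketch with placeholder inputs (the vector file, raster files, output directory and the 'pid' attribute are hypothetical, not part of the package):

    from eobox.raster.extraction import extract, load_extracted

    # hypothetical inputs: a polygon reference file with a unique integer ID column 'pid'
    # and single-band rasters in the same CRS
    src_vector = "reference_polygons.gpkg"
    src_raster = ["b04.tif", "b08.tif", "b11.tif"]
    dst_names = ["b04", "b08", "b11"]

    exit_code = extract(src_vector=src_vector,
                        burn_attribute="pid",
                        src_raster=src_raster,
                        dst_names=dst_names,
                        dst_dir="./extracted")
    df = load_extracted("./extracted")   # one row per extracted pixel, variables in columns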

eobox.raster.extraction.get_paths_of_extracted(src_dir: str, patterns='*.npy', sort=True)[source]

Get the paths of extracted features. Used in load_extracted.

eobox.raster.extraction.load_extracted(src_dir: str, patterns='*.npy', vars_in_cols: bool = True, index: Optional[pandas.core.series.Series] = None, head: bool = False, sort: bool = True)[source]

Load data extracted and stored by extract()

Parameters

src_dir (str) – The directory where the data is stored.

Keyword Arguments
  • patterns (str, or list of str) – A pattern to identify the variables to be loaded. The default loads all variables, i.e. all .npy files (default: '*.npy').

  • vars_in_cols (bool) – Return the variables in columns (default: True).

  • index (pd.Series) – A boolean pandas Series which indicates with True which samples to load.

  • head (bool) – Get a dataframe with the first five samples.

Returns

pandas.DataFrame – A dataframe with the data.
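
A few hedged usage variants, assuming data was previously extracted into a hypothetical ./extracted directory:

    from eobox.raster.extraction import load_extracted

    df_all = load_extracted("./extracted")                # load all .npy variables
    df_head = load_extracted("./extracted", head=True)    # only the first five samples
    df_sel = load_extracted("./extracted", patterns=["aux_*.npy", "b04.npy"])  # selected variables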

eobox.raster.extraction.load_extracted_dask(npy_path_list, index=None)[source]

Create a dask dataframe from a list of single-feature npy paths to be concatenated along the columns.

eobox.raster.extraction.load_extracted_partitions(src_dir: dict, patterns='*.npy', index: Optional[dict] = None, to_crs: Optional[dict] = None, verbosity=0)[source]

Load multiple row-wise appended partitions (same columns) with load_extracted().

Parameters
  • src_dir (dict of str or Path) – Multiple src_dir as in load_extracted(), wrapped in a dictionary where the keys are the partition identifiers. The key will be written as a column in the returned dataframe.

Keyword Arguments
  • patterns (str, or list of str) – See load_extracted().

  • index (dict of pd.Series) – See load_extracted(), but as a dict keyed as in src_dir.

Returns

pandas.DataFrame – A dataframe with the data.

eobox.raster.extraction.load_extracted_partitions_dask(src_dir: dict, global_index_col: str, patterns='*.npy', verbosity=0)[source]

Load multiple row-wise appended partitions (same columns) with load_extracted() as dask dataframe.

Parameters
  • src_dir (dict of str or Path) – Multiple src_dir as in load_extracted(), wrapped in a dictionary where the keys are the partition identifiers. The key will be written as a column in the returned dataframe.

  • global_index_col (str) – One of the columns matched by the patterns should be a global index, i.e. an index where each element is unique over all partitions.

Keyword Arguments
  • patterns (str, or list of str) – See load_extracted().

Returns

dask.dataframe.core.DataFrame – A dask dataframe with the data.

eobox.raster.gdalutils
eobox.raster.gdalutils.buildvrt(input_file_list, output_file, relative=True, **kwargs)[source]

Build a VRT

See also: https://www.gdal.org/gdalbuildvrt.html

You can find the possible BuildVRTOptions (kwargs) here: https://github.com/nextgis/pygdal/blob/78a793057d2162c292af4f6b240e19da5d5e52e2/2.1.0/osgeo/gdal.py#L1051

Parameters
  • input_file_list (list of str or Path objects) – List of input files.

  • output_file (str or Path object) – Output file.

Keyword Arguments
  • relative (bool) – If True, the input_file_list paths are converted to relative paths (relative to the output file) and the VRT works even if the data is moved somewhere else, given that the relative location of the VRT and the input files does not change!

  • **kwargs – BuildVRTOptions, see the function description for a link.

Returns

[int] – If successful, 0 is returned as exit code.
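
A small sketch with placeholder file names; separate=True is passed through as a gdal BuildVRTOptions keyword and stacks the inputs as separate bands instead of mosaicking them:

    from eobox.raster.gdalutils import buildvrt

    # hypothetical single-band rasters stacked into one virtual raster
    buildvrt(["b04.tif", "b08.tif", "b11.tif"], "stack.vrt", relative=True, separate=True)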

eobox.raster.gdalutils.rasterize(src_vector: str, burn_attribute: str, src_raster_template: str, dst_rasterized: str, gdal_dtype: int = 4)[source]

Rasterize the values of a spatial vector file.

Parameters
Returns

None

eobox.raster.gdalutils.reproject_on_template_raster(src_file, dst_file, template_file, resampling='near', compress=None, overwrite=False)[source]

Reproject a one-band raster to fit the projection, extent, pixel size etc. of a template raster.

Function based on https://stackoverflow.com/questions/10454316/how-to-project-and-resample-a-grid-to-match-another-grid-with-gdal-python

Parameters
  • src_file (str) – Filename of the source one-band raster.

  • dst_file (str) – Filename of the destination raster.

  • template_file (str) – Filename of the template raster.

  • resampling (str) – Resampling type: ‘near’ (default), ‘bilinear’, ‘cubic’, ‘cubicspline’, ‘lanczos’, ‘average’, ‘mode’, ‘max’, ‘min’, ‘med’, ‘q1’, ‘q3’; see the -r parameter of https://www.gdal.org/gdalwarp.html.

  • compress (str) – Compression type: None (default), ‘lzw’, ‘packbits’, ‘deflate’.
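
A minimal sketch with placeholder file names:

    from eobox.raster.gdalutils import reproject_on_template_raster

    # hypothetical filenames: warp a DEM onto the grid of a Sentinel-2 band
    reproject_on_template_raster(src_file="dem.tif",
                                 dst_file="dem_on_s2_grid.tif",
                                 template_file="s2_b04.tif",
                                 resampling="bilinear",
                                 compress="lzw")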

eobox.raster.rasterprocessing
class eobox.raster.rasterprocessing.MultiRasterIO(layer_files: list, dst_res: Optional[float] = None, downsampler: str = 'average', upsampler: str = 'nearest')[source]

Bases: object

Read, process and write multiple single layer raster files.

Parameters
  • layer_files (list) – List of paths to single band raster files.

  • dst_res (float) – Destination resolution. If None it is set to the highest resolution of all layers.

Keyword Arguments
  • dst_res (numeric) – Destination resolution in which to return the array. If None, each layer is returned in its native resolution, else each layer is resampled to the destination resolution dst_res (default: None).

  • downsampler (str) – The method used for downsampling layers with higher raster cell size (default: ‘average’).

  • upsampler (str) – The method used for upsampling layers with lower raster cell size (default: ‘nearest’).
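
A minimal sketch of the typical workflow with placeholder file names (the methods used here are documented below):

    from eobox.raster.rasterprocessing import MultiRasterIO

    # hypothetical single-band rasters with different resolutions
    mrio = MultiRasterIO(["b04_10m.tif", "b08_10m.tif", "b11_20m.tif"], dst_res=10)
    mrio.windows_from_blocksize(512)   # define chunk windows covering the raster
    arr = mrio.get_arrays(0)           # 3D array of the first chunk, all layers resampled to dst_res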

apply_and_save(dst_files, func, **kwargs)[source]
block_windows(res=None)[source]

Load windows for chunk-wise processing from the raster-internal tiling (first raster of the given resolution).

Parameters

res (numeric) – Resolution determining the raster.

get_arrays(ji_win)[source]

Get the data of a window given the ji_windows derived with ji_windows().

Parameters

ji_win ([type]) – The index of the window or the

Returns

(list of) array(s) – List of 2D arrays in native resolution in case dst_res is None

or a 3D array where all layers are resampled to dst_res resolution.

get_window_from_xy(xy)[source]

Get the window index given a coordinate (raster CRS).

ji_windows(ij_win)[source]

For a given specific window, i.e. an element of windows, get the windows of all resolutions.

Parameters

ij_win (int) – The index specifying the window for which to return the resolution-windows.

windows_df()[source]

Get the windows' row, column and index (e.g. of windows loaded with block_windows()) as a dataframe.

Returns

[dataframe] – A dataframe with the window information and indices (row, col, index).

windows_from_blocksize(blocksize_xy=512)[source]

Create rasterio.windows.Window instances with given size which fully cover the raster.

Parameters

blocksize_xy (int or list of two int) – Size of the window. If one integer is given it defines the width and height of the window. If a list of two integers is given, the first defines the width and the second the height.

Returns

None – But the attributes windows, windows_row and windows_col are updated.

eobox.raster.rasterprocessing.create_distance_to_raster_border(src_raster, dst_raster, maxdist=None, overwrite=False)[source]

Create a raster whose pixel values are the distance to the raster border (in pixel distances).

eobox.raster.rasterprocessing.window_from_window(window_src, transform_src, transform_dst)[source]
eobox.raster.rasterprocessing.windows_from_blocksize(blocksize_xy, width, height)[source]

Create rasterio.windows.Window instances with given size which fully cover a raster.

Parameters
  • blocksize_xy (int or list of two int) – [description]

  • width (int) – Width of the raster for which to create the windows.

  • height (int) – Height of the raster for which to create the windows.

Returns

list – List of windows according to the following format

[[<row-index>, <column index>], rasterio.windows.Window(<col_off>, <row_off>, <width>, <height>)].
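
For example, windows covering a hypothetical 1000 x 600 pixel raster with 256 x 256 blocks could be generated like this:

    from eobox.raster.rasterprocessing import windows_from_blocksize

    wins = windows_from_blocksize(256, width=1000, height=600)
    print(len(wins))   # number of [row/col indices, Window] pairs covering the raster
    print(wins[0])     # e.g. [[0, 0], Window(col_off=0, row_off=0, width=256, height=256)]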

eobox.raster.utils
eobox.raster.utils.cleanup_df_values_for_given_dtype(df, dtype, lower_as=None, higher_as=None, nan_as=None)[source]
eobox.raster.utils.dtype_checker_df(df, dtype, return_=None)[source]

Check if there are NaN values or values outside of a given datatype range.

Parameters
  • df (dataframe) – A dataframe.

  • dtype (str) – The datatype to check for.

Keyword Arguments

return_ (str) – Returns a boolean dataframe with the values not in the range of the dtype, the row ('rowsums') or column ('colsums') sums of that dataframe, or an exit code 1 (None, default) if any of the values is not in the range.

Returns

[int or DataFrame or Series] – If no value is out of the range exit code 0 is returned, else depends on return_.
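
A small, hedged example of the check (the return values follow the description above; the sample data is made up):

    import numpy as np
    import pandas as pd
    from eobox.raster.utils import dtype_checker_df

    df = pd.DataFrame({"a": [0, 10, 300], "b": [1.0, 2.0, np.nan]})
    dtype_checker_df(df, "uint8")                     # 300 and NaN are out of range -> 1
    dtype_checker_df(df, "uint8", return_="colsums")  # number of offending values per column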

eobox.sampledata

eobox.sampledata subpackage

The subpackage which holds data and functions to get file paths to data. sampledata is used in the examples and tests of other eobox sub-packages.

For more information on the package content, visit [readthedocs](https://eo-box.readthedocs.io/en/latest/eobox.sampledata.html).

Submodules:

eobox.sampledata.base

Base module for loading sampledata.

eobox.sampledata.base.get_dataset(dataset='s2l1c')[source]

Get a specific sample dataset to play around with.

So far the following sampledata exist:

  • ‘s2l1c’: One Sentinel-2 Level 1C scene with a reference dataset.

  • ‘lsts’: A time series of 105 Landsat scenes each with the bands b3 (red), b4 (nir), b5 (swir1) and fmask.

Keyword Arguments

dataset (str) – The name of the dataset (default: ‘s2l1c’).

Returns

[dict] – A dictionary with paths and information about the sampledata.
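
For example, to inspect what the s2l1c sample dataset provides:

    from eobox.sampledata.base import get_dataset

    ds = get_dataset("s2l1c")
    print(sorted(ds.keys()))   # paths and information about the sample dataset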

eobox.vector

eobox.vector subpackage

Subpackage which mainly builds upon shapely, fiona and geopandas (and thus ogr) functionality. It provides tools for

  • clean conversion of polygons to lines.

For more information on the package content, visit [readthedocs](https://eo-box.readthedocs.io/en/latest/eobox.vector.html).

Submodules:

eobox.vector.conversion
eobox.vector.conversion.calc_distance_to_border(polygons, template_raster, dst_raster, overwrite=False, keep_interim_files=False, verbosity=0)[source]

Calculate the distance of each raster cell (inside and outside the polygons) to the nearest polygon border.

Parameters
  • polygons (str) – Filename of a geopandas-readable file with polygon features.

  • template_raster ([type]) – Filename of a rasterio-readable file.

  • dst_raster ([type]) – Destination filename for the distance-to-polygon-border raster file.

Keyword Arguments
  • overwrite (bool) – Overwrite files if they exist? (default: False)

  • keep_interim_files (bool) – Keep the interim line vector and raster files (default: False)

Returns

[type] – [description]
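
A minimal sketch with placeholder file names:

    from eobox.vector.conversion import calc_distance_to_border

    # hypothetical filenames
    calc_distance_to_border(polygons="reference_polygons.gpkg",
                            template_raster="b04.tif",
                            dst_raster="dist2polygonborder.tif",
                            overwrite=False)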

eobox.vector.conversion.convert_polygons_to_lines(src_polygons, dst_lines, crs=None, add_allone_col=False)[source]

Convert polygons to lines.

Parameters
  • src_polygons (path to a geopandas-readable file) – Filename of the polygon vector dataset to be converted to lines.

  • dst_lines ([type]) – Filename where to write the line vector dataset to.

Keyword Arguments
  • crs (dict or str) – Output projection parameters as string or in dictionary format. This will reproject the data when a crs is given (not None) (default: None).

  • add_allone_col (bool) – Add an additional attribute column with all ones. This is useful, e.g. in case you want to use the lines with gdal_proximity afterwards (default: False).

Returns

int – Exit code 0 if successful.
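
A minimal sketch with placeholder file names; add_allone_col=True can be handy if the lines are later used with gdal_proximity:

    from eobox.vector.conversion import convert_polygons_to_lines

    # hypothetical filenames
    convert_polygons_to_lines("reference_polygons.gpkg",
                              "reference_lines.gpkg",
                              add_allone_col=True)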

eobox.ml

eobox.ml subpackage

Subpackage which mainly builds upon sklearn and mlxtend. It provides tools for

  • plotting confusion matrices

For more information on the package content, visit [readthedocs](https://eo-box.readthedocs.io/en/latest/eobox.ml.html).

Submodules:

eobox.ml.ffgs
eobox.ml.plot
eobox.ml.plot.plot_confusion_matrix(cm, class_names=None, switch_axes=False, vmin=None, vmax=None, cmap='RdBu', robust=True, logscale_color=False, fmt=',', annot_kws=None, cbar=False, mask_zeros=False, ax=None)[source]

Plot a confusion matrix with the precision and recall added.

TODO: documentation …

TODO: check https://github.com/wcipriano/pretty-print-confusion-matrix

switch_axes – if False the CM is shown as in the sklearn default, with the rows being the actual class and the columns the predicted class; if True, the axes are switched.
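
A small example with made-up labels and class names:

    from sklearn.metrics import confusion_matrix
    from eobox.ml.plot import plot_confusion_matrix

    y_true = [0, 0, 0, 1, 1, 2, 2, 2]
    y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
    cm = confusion_matrix(y_true, y_pred)
    ax = plot_confusion_matrix(cm, class_names=["water", "forest", "urban"])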

eobox.ml.clf_extension
eobox.ml.clf_extension.predict_extended(df, clf)[source]

Derive probabilities, predictions, and confidence layers.

Parameters
  • df (pandas.DataFrame) – DataFrame containing the data matrix X to be predicted with clf.

  • clf (sklearn.Classifier) – Trained sklearn classifier with a predict_proba method.

Returns

DataFrame with the same number of rows as df and n_classes + 3 columns. The columns contain the class predictions, confidence layers (maximum probability and the difference between the maximum and the second highest probability), and the class probabilities.

Return type

pandas.DataFrame
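
A small example on the iris data (the classifier choice here is arbitrary; any sklearn classifier with a predict_proba method works):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from eobox.ml.clf_extension import predict_extended

    iris = load_iris(as_frame=True)
    clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(iris.data, iris.target)
    pred = predict_extended(iris.data, clf)
    print(pred.columns.tolist())   # class predictions, confidence layers and class probabilities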
