An intro to EOCube

[1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from pathlib import Path
import rasterio
import seaborn as sns

# from eobox.raster import MultiRasterIO
from eobox import sampledata
from eobox.raster import cube



def get_sampledata(year):
    dataset = sampledata.get_dataset("lsts")
    layers_paths = [Path(p) for p in dataset["raster_files"]]
    layers_df = pd.Series([p.stem for p in layers_paths]).str.split("_", expand=True) \
    .rename({0: "sceneid", 1:"band"}, axis=1)

    layers_df["date"] = pd.to_datetime(layers_df.sceneid.str[9:16], format="%Y%j")
    layers_df["uname"] = layers_df.sceneid.str[:3] + "_" + layers_df.date.dt.strftime("%Y-%m-%d") + "_" + layers_df.band.str[::]
    layers_df["path"] = layers_paths

    layers_df = layers_df.sort_values(["date", "band"])
    layers_df = layers_df.reset_index(drop=True)

    layers_df_year = layers_df[(layers_df.date >= str(year)) & (layers_df.date < str(year+1))]
    layers_df_year = layers_df_year.reset_index(drop=True)
    return layers_df_year

Input data constraints

With EOCube we can, for example, calculate temporal features from stacks of data that do not fit into memory.

The input data must satisfy the following constraints:

  • Each layer, i.e. a single band of an acquisition, is available as a single-band, GDAL-readable raster file.

  • All files represent rasters that are exactly spatially aligned, i.e. they share the same resolution, extent, projection, number of rows and columns, etc.

  • (If the above mostly holds, you can band-subset or clip the data as VRTs to meet the spatial-extent and single-layer-file constraints.)

  • The layers need to be listed in a pandas DataFrame that contains at least a path column with the respective file paths.

For calculating temporal features, date, band and sceneid - as in the following sample dataset - are useful additional columns that you can access later in your custom code.
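A minimal layers DataFrame satisfying these constraints could be built like this. The file names here are hypothetical and serve only as a sketch; strictly, only the path column is required:

```python
import pandas as pd
from pathlib import Path

# Hypothetical single-band raster files, one per acquisition and band.
paths = [
    Path("data/LT50350322008110PAC01_b3.tif"),
    Path("data/LT50350322008110PAC01_b4.tif"),
    Path("data/LE70350322008118EDC00_b3.tif"),
]

df = pd.DataFrame({"path": paths})
# Optional but useful columns, parsed from the file names.
df["sceneid"] = [p.stem.split("_")[0] for p in paths]
df["band"] = [p.stem.split("_")[1] for p in paths]
# Landsat scene IDs encode the acquisition date as year + day-of-year.
df["date"] = pd.to_datetime(df["sceneid"].str[9:16], format="%Y%j")
print(df)
```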

[2]:
df_layers = get_sampledata(2008)
display(df_layers.head())
df_layers.band.value_counts()
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/anaconda/envs/eocube/lib/python3.6/s...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
[2]:
b5       23
b4       23
fmask    23
b3       23
Name: band, dtype: int64
[3]:
df_layers.head(5)
[3]:
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
2 LT50350322008110PAC01 b5 2008-04-19 LT5_2008-04-19_b5 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
3 LT50350322008110PAC01 fmask 2008-04-19 LT5_2008-04-19_fmask /home/ben/anaconda/envs/eocube/lib/python3.6/s...
4 LE70350322008118EDC00 b3 2008-04-27 LE7_2008-04-27_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...

EOCube and EOCubeChunk

EOCube

[6]:
eoc = cube.EOCube(df_layers, chunksize=2**5)

Attributes you can get.

[7]:
print("The data frame with your data.")
display(eoc.df_layers.head(2))
print("Chunksize: ", eoc.chunksize)
print("Number of Chunks: ", eoc.n_chunks)
The data frame with your data.
sceneid band date uname path
0 LT50350322008110PAC01 b3 2008-04-19 LT5_2008-04-19_b3 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
1 LT50350322008110PAC01 b4 2008-04-19 LT5_2008-04-19_b4 /home/ben/anaconda/envs/eocube/lib/python3.6/s...
Chunksize:  32
Number of Chunks:  4

Attributes you can set.

[8]:
eoc.chunksize=2**4
print("Chunksize: ", eoc.chunksize)
print("Number of Chunks: ", eoc.n_chunks)

eoc.chunksize=2**5
print("Chunksize: ", eoc.chunksize)
print("Number of Chunks: ", eoc.n_chunks)
Chunksize:  16
Number of Chunks:  16
Chunksize:  32
Number of Chunks:  4
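The relation between chunksize and number of chunks can be sketched with a few lines of arithmetic: the raster is tiled into square windows of chunksize × chunksize pixels. The tiling logic and the 64 × 61-pixel raster size below are assumptions for illustration, chosen so that they reproduce the chunk counts printed above:

```python
import math

def n_chunks(n_rows, n_cols, chunksize):
    """Number of square chunksize x chunksize windows needed to tile a raster."""
    return math.ceil(n_rows / chunksize) * math.ceil(n_cols / chunksize)

# Hypothetical raster dimensions, consistent with the chunk counts above.
n_rows, n_cols = 64, 61
print(n_chunks(n_rows, n_cols, 2**5))  # 4 chunks at chunksize 32
print(n_chunks(n_rows, n_cols, 2**4))  # 16 chunks at chunksize 16
```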

EOCubeChunk

EOCubeChunk is a child class of EOCube; you get an EOCubeChunk object from an EOCube object via its get_chunk() method.

[9]:
ji = 1
eoc_chunk = eoc.get_chunk(ji)
print("Chunk ID: ", eoc_chunk.ji)
Chunk ID:  1

Load data and convert it between ndarray and DataFrame.

[10]:
ji = 1
eoc_chunk = eoc.get_chunk(ji)
print("****************")
print(eoc_chunk.__class__)
print("eochunk.data is : ", eoc_chunk.data)
print("Chunk ID: ", eoc_chunk.ji)

print("****************")
eoc_chunk = eoc_chunk.read_data()
print(eoc_chunk.data.__class__)
print("eochunk.data is ndarray with shape : ", eoc_chunk.data.shape)
data_ndarray = eoc_chunk.data.copy()

print("****************")
eoc_chunk = eoc_chunk.convert_data_to_dataframe()
print("eochunk.data is DataFrame with shape : ", eoc_chunk.data.shape)
print("(column names come from eoc_chunk.df_layers['uname'])")
display(eoc_chunk.data.head())

print("****************")
eoc_chunk = eoc_chunk.convert_data_to_ndarray()
print("eochunk.data is again a ndarray with shape : ", eoc_chunk.data.shape)

****************
<class 'eobox.raster.cube.EOCubeChunk'>
eochunk.data is :  None
Chunk ID:  1
****************
<class 'numpy.ndarray'>
eochunk.data is ndarray with shape :  (32, 29, 92)
****************
eochunk.data is DataFrame with shape :  (928, 92)
(column names come from eoc_chunk.df_layers['uname'])
uname LT5_2008-04-19_b3 LT5_2008-04-19_b4 LT5_2008-04-19_b5 LT5_2008-04-19_fmask LE7_2008-04-27_b3 LE7_2008-04-27_b4 LE7_2008-04-27_b5 LE7_2008-04-27_fmask LT5_2008-05-05_b3 LT5_2008-05-05_b4 ... LT5_2008-10-28_b5 LT5_2008-10-28_fmask LE7_2008-11-21_b3 LE7_2008-11-21_b4 LE7_2008-11-21_b5 LE7_2008-11-21_fmask LE7_2008-12-07_b3 LE7_2008-12-07_b4 LE7_2008-12-07_b5 LE7_2008-12-07_fmask
0 16000 6889 292 3 16000 7468 540 3 16000 6448 ... 2410 0 5609 6227 3191 4 5040 6126 575 3
1 6868 6584 361 3 16000 6865 584 3 6149 6025 ... 2134 0 5609 6140 3231 4 5248 6126 710 3
2 5590 5705 478 3 16000 6045 606 3 4933 5342 ... 2099 0 5550 6271 3272 4 4624 5976 802 3
3 6008 5772 501 3 16000 5795 584 3 5065 5276 ... 2514 0 5550 6227 3311 4 6176 7224 892 3
4 4917 4957 548 3 3874 4683 606 3 4134 4362 ... 2099 0 5639 6271 3391 4 5627 6126 846 3

5 rows × 92 columns

****************
eochunk.data is again a ndarray with shape :  (32, 29, 92)

Write results with ``EOCubeChunk.write_ndarray()``.

Usually we want to process all the data chunk-wise but save each of the resulting layers as a single file, i.e. with all chunks merged.

However, when we want to process the chunks in parallel, it is easier to first read, process and write the single-chunk data, and to merge the chunks once all of them are computed.

Thus, for saving data you need

  • your processing logic starting from an ndarray (if you want to start from a DataFrame see below), and

  • the destination files.
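This read-process-write pattern can be sketched in plain NumPy. Here a dummy 2-D array stands in for the raster, and the per-chunk "processing" is just a threshold; all names are illustrative and not part of the eobox API:

```python
import numpy as np

rng = np.random.default_rng(0)
raster = rng.integers(0, 100, size=(64, 61))  # dummy single-band raster
chunksize = 32

# Process each square window independently (could run in parallel).
chunks = {}
for i0 in range(0, raster.shape[0], chunksize):
    for j0 in range(0, raster.shape[1], chunksize):
        window = raster[i0:i0 + chunksize, j0:j0 + chunksize]
        chunks[(i0, j0)] = (window > 50).astype("uint8")  # per-chunk "processing"

# Later, merge the chunk results back into a full layer
# (with real rasters this merge step would typically be a VRT).
merged = np.zeros(raster.shape, dtype="uint8")
for (i0, j0), res in chunks.items():
    merged[i0:i0 + res.shape[0], j0:j0 + res.shape[1]] = res
```

Because the threshold is applied pixel-wise, merging the chunk results reproduces exactly what processing the whole raster at once would give.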

Let us get the data which will be the input to our custom process:

[11]:
ji = 1
eoc_chunk = eoc.get_chunk(ji)
eoc_chunk = eoc_chunk.read_data()

Apply our logic and get the result as an ndarray.

Here, as an example, we calculate a boolean mask of the valid pixels, derived from the fmask layers.

[12]:
ilocs_fmask = np.where((eoc_chunk.df_layers["band"] == "fmask").values)[0]
result = eoc_chunk.data[:,:,ilocs_fmask]
print("shape of data   : ", eoc_chunk.data.shape)
result = np.isin(result, [0, 1]).astype("uint8")
print("shape of result : ", result.shape)
shape of data   :  (32, 29, 92)
shape of result :  (32, 29, 23)

Note that the first two numbers of the result array shape have to be identical to the ones of the input array.

The number of layers (the third number), in contrast, has to match the number of your destination paths. Note that we only need filenames for the whole layers, not for the chunks; the filenames for the chunk files will be derived in EOCubeWindow. Here we assume that we will later merge the chunks virtually in the form of a VRT, hence the .vrt extension. So let us get the destination paths:
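These shape constraints can be summarized in a quick sanity check with dummy arrays; the shapes and the number of paths mirror those printed above, and the path names are hypothetical:

```python
import numpy as np

data = np.zeros((32, 29, 92))                   # input chunk: rows x cols x layers
result = np.zeros((32, 29, 23), dtype="uint8")  # e.g. one mask per fmask layer
dst_paths = [f"results/scene_{i}_valid.vrt" for i in range(23)]  # hypothetical names

# The spatial dimensions of the result must match the input chunk ...
assert result.shape[:2] == data.shape[:2]
# ... and the number of result layers must match the number of destination paths.
assert result.shape[2] == len(dst_paths)
```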

[13]:
dst_paths = [Path("./results") / (sid + "_valid.vrt") for sid in eoc_chunk.df_layers.iloc[ilocs_fmask]["sceneid"]]
dst_paths[:3]
[13]:
[PosixPath('results/LT50350322008110PAC01_valid.vrt'),
 PosixPath('results/LE70350322008118EDC00_valid.vrt'),
 PosixPath('results/LT50350322008126PAC01_valid.vrt')]

And finally use ``EOCubeChunk.write_ndarray()`` to save the results.

[14]:
eoc_chunk.write_ndarray(result, dst_paths)

**TODO**: Show how to execute this for all chunks with EOCube. For now, see how this is done in the next eocube tutorial: *Calculate virtual time series with EOCube*.