Extract Raster Values with Vector Data¶

Extracting raster values of pixels overlapping with vector data is one of the most common tasks for developers of earth observation applications. This notebook shows on the basis of a small toy dataset how to perform this task in Python with the eobox package.

In general there are tow two main approaches to solve this task in Python. (I am eager to know if you see another approach or additional advantages or disadvantages that should be added to this list.)

The rasterstats-way:
- For each feature (point, line, polygon) in the vector dataset
  - get the geometry of the feature,
  - get the bounding box for that geometry,
  - get the raster data corresponding to the bounding box as array (extraction core),
  - rasterize the geometry such that it correponds to the array,
  - return only the raster values overlapping with the rasterized geometry.

A way of doing this in Python can be found in the gen_zonal_stats function implemented in the rasterstats package.

The rasterize-way:
- Rasterize the whole vector dataset such that the rasterized raster matches the raster data from which to extract the values (e.g. with the gdal_rasterize function).
- load the rasterized data as array and create a mask,
- load the raster from which to extract the data as array and select the values to be kept by the mask.

This tutorial shows how to use the extract function implemented in the eobox package to go this way.

There is a major difference between the two approaches which might help you to find the best for your case: The first approach is memory sparing since the whole raster does not need to be loaded in memory at once, which is necessary in the second approach. However, if your vector dataset contains many features, it might be much faster to go for the second approach since it is not necessary to loop through all the features in Python.

Let’s get stated and import the required modules.

[1]:

%load_ext autoreload
%autoreload 2
%matplotlib inline

import fnmatch
import geopandas as gpd
import os
import pandas as pd
from pathlib import Path

from eobox.raster import extraction
from eobox.raster.extraction import convert_df_to_geodf
from eobox import sampledata

We now load a small sample dataset from which we can derive all requird arguments for running the extract function.

Note that the sample dataset is S2 L1C that comes in different resolutions. Extraction from multiple resolutions is not supported and handling it is (so far) the responsibilit of the user.

Here we only slect the 10m bands for extraction.

[2]:

dataset = sampledata.get_dataset("s2l1c")

src_vector = dataset["vector_file"]
burn_attribute = "pid"  # should be unique for the polygons and not contain zero
src_raster = fnmatch.filter(dataset["raster_files"], "*B0[2,3,4,8]*")  # 10 m bands
dst_names = ["_".join(Path(src).stem.split("_")[1::]) for src in src_raster]
extraction_dir = Path("s2l1c_ref__s2l1c/s2_l1c/10m")

print("***ARGUMENTS***")
print("\nsrc_vector:\n-----------\n -", src_vector)
print("\ndst_dir:\n--------\n -", extraction_dir)
print("\nsrc_raster:\n-----------\n -", "\n - ".join(src_raster))
print("\ndst_names:\n----------\n -", "\n - ".join(dst_names))
print("\n***ARGUMENTS - END***")

***ARGUMENTS***

src_vector:
-----------
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/s2l1c_ref.gpkg

dst_dir:
--------
 - s2l1c_ref__s2l1c/s2_l1c/10m

src_raster:
-----------
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/IMG_DATA/T33UUU_20170216T102101_B08.jp2
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/IMG_DATA/T33UUU_20170216T102101_B02.jp2
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/IMG_DATA/T33UUU_20170216T102101_B03.jp2
 - /home/ben/Devel/Packages/eo-box/eobox/sampledata/data/s2l1c/IMG_DATA/T33UUU_20170216T102101_B04.jp2

dst_names:
----------
 - 20170216T102101_B08
 - 20170216T102101_B02
 - 20170216T102101_B03
 - 20170216T102101_B04

***ARGUMENTS - END***

The vector dataset looks as follows:

[3]:

gdf = gpd.read_file(src_vector)
gdf.plot(column="cid", figsize=(18, 8))

[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f20f4505c50>

../../_images/examples_raster_extract_raster_values_with_vector_data_5_1.png

Note that the cid column represents the classes water (cid:1) and forest (cid:2). Even though we are usually interested in such an attribute we use as burn attribute the pid column contains unique ids for the polygons. Using a unique id for each feature is recommended since it assures that for each extracted pixel the polygon it belongs to can easily been identified. We will see later that it is easy to merge the cid information or any other information of the vector data attribute table can easily be merged into the extrected data.

But now let’s do the extraction.

[4]:

%%time
extraction.extract(src_vector=src_vector,
                   burn_attribute=burn_attribute,
                   src_raster=src_raster,
                   dst_names=dst_names,
                   dst_dir=extraction_dir)

100%|██████████| 4/4 [00:00<00:00, 11.67it/s]

CPU times: user 2.18 s, sys: 343 ms, total: 2.53 s
Wall time: 397 ms

[4]:

So what just happened? - Roughly the function made what was described above under the rasterize-way and stored the extracted data and some auxiliary information to .npy files:

[5]:

list(extraction_dir.glob("*.npy"))

[5]:

[PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/aux_coord_x.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/20170216T102101_B04.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/20170216T102101_B02.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/aux_coord_y.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/20170216T102101_B08.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/aux_vector_pid.npy'),
 PosixPath('s2l1c_ref__s2l1c/s2_l1c/10m/20170216T102101_B03.npy')]

Note that the aux(iliary) variables are * aux_coord_x: the x coordinates of the pixels, * aux_coord_y: the y coordinates of the pixels, * aux_coord_y: the polygon id (here) or whatever attribute has been passed burn_attribute before.

The rest of the variables contain the data extracted from the raster layers.

20170216T102101_B02.npy
20170216T102101_B03.npy
20170216T102101_B04.npy
20170216T102101_B08.npy

It is worth noting that the function checks what has already been extracted by using the dst_names. Thus, only the dst_names that alread exist are skipped. So far, it is the responsibility of the user to delete existing bands in case they should be re-extracted.

With the the function load_extracted it is easy to load the data. It has a couple of options which returns different dataframes:

The defaults return all variable in columns.

patterns can be used to load specific variables matching a list of patterns.

[6]:

extraction.load_extracted(extraction_dir, patterns=["aux_*.npy"]).head()

[6]:

	aux_coord_x	aux_coord_y	aux_vector_pid
0	344545.0	5822035.0	17
1	344555.0	5822035.0	17
2	344615.0	5821885.0	6
3	344625.0	5821885.0	6
4	344615.0	5821875.0	6

[7]:

extraction.load_extracted(extraction_dir, patterns=["aux_vector_pid.npy", "*_B??.npy"]).head()

[7]:

	aux_vector_pid	20170216T102101_B04	20170216T102101_B02	20170216T102101_B08	20170216T102101_B03
0	17	928	1456	672	1072
1	17	928	1392	672	1040
2	6	928	1456	608	1072
3	6	864	1424	608	1072
4	6	928	1424	608	1072

vars_in_cols can be used to load the variables in the columns.

This does not make so much sense here, but if you have e.g. a long time series of the NDVI and you want to make time series analysis you might want to have the dates in the rows.

[8]:

extraction.load_extracted(extraction_dir, patterns=["*B??.npy"], vars_in_cols=False).head()

[8]:

	0	1	2	3	4	5	6	7	8	9	...	3490	3491	3492	3493	3494	3495	3496	3497	3498	3499
20170216T102101_B04	928	928	928	864	928	864	864	864	864	928	...	880	912	880	848	848	848	816	816	816	816
20170216T102101_B02	1456	1392	1456	1424	1424	1424	1424	1392	1424	1392	...	1448	1464	1448	1432	1448	1480	1432	1448	1400	1400
20170216T102101_B08	672	672	608	608	608	608	608	608	608	608	...	624	656	592	560	592	592	592	560	560	560
20170216T102101_B03	1072	1040	1072	1072	1072	1072	1040	1040	1040	1072	...	1056	1120	1056	1056	1056	1056	1056	1056	1056	1056

4 rows × 3500 columns

You can also get the head of the data loaded by load_extracted, i.e. the first five rows.

[9]:

extraction.load_extracted(extraction_dir, vars_in_cols=False, head=True)

[9]:

	0	1	2	3	4	5
aux_coord_x	344545.0	344555.0	344615.0	344625.0	344615.0	344625.0
20170216T102101_B04	928.0	928.0	928.0	864.0	928.0	864.0
20170216T102101_B02	1456.0	1392.0	1456.0	1424.0	1424.0	1424.0
aux_coord_y	5822035.0	5822035.0	5821885.0	5821885.0	5821875.0	5821875.0
20170216T102101_B08	672.0	672.0	608.0	608.0	608.0	608.0
aux_vector_pid	17.0	17.0	6.0	6.0	6.0	6.0
20170216T102101_B03	1072.0	1040.0	1072.0	1072.0	1072.0	1072.0

And finally you can convert the data to a geodataframe.

[11]:

df_extracted = extraction.load_extracted(extraction_dir, patterns=["aux_*.npy", "*_B??.npy"])
gdf_extracted = convert_df_to_geodf(df_extracted, crs=extraction_dir)
gdf_extracted.plot(column='aux_vector_pid', marker='x', figsize=(20, 12))

[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f20bc8b81d0>

../../_images/examples_raster_extract_raster_values_with_vector_data_19_1.png

It has been said above that it is easy to join the extracted values to the values stored in the attribute table of the vector file.

[12]:

df_extracted = df_extracted.merge(gdf[["pid", "cid", "comment"]], how="left", left_on="aux_vector_pid", right_on="pid")
df_extracted.head()

[12]:

	aux_coord_x	aux_coord_y	aux_vector_pid	20170216T102101_B04	20170216T102101_B02	20170216T102101_B08	20170216T102101_B03	geometry	pid	cid	comment
0	344545.0	5822035.0	17	928	1456	672	1072	POINT (344545 5822035)	17	1	partially out; expecting: 2 - with default ras...
1	344555.0	5822035.0	17	928	1392	672	1040	POINT (344555 5822035)	17	1	partially out; expecting: 2 - with default ras...
2	344615.0	5821885.0	6	928	1456	608	1072	POINT (344615 5821885)	6	1	None
3	344625.0	5821885.0	6	864	1424	608	1072	POINT (344625 5821885)	6	1	None
4	344615.0	5821875.0	6	928	1424	608	1072	POINT (344615 5821875)	6	1	None

And now you can start analyzing the data.

[13]:

import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 4, figsize=(20, 6))
for ax, var in zip(axes.flatten(), ["20170216T102101_B02", "20170216T102101_B03", "20170216T102101_B04", "20170216T102101_B08"]):
    df_extracted.boxplot(column=var, by="cid", ax=ax)

../../_images/examples_raster_extract_raster_values_with_vector_data_23_0.png