In this report we compare the performance of the odc-stac (GitHub, Docs) and stackstac (GitHub, Docs) libraries when loading Sentinel-2 COG (Cloud Optimized GeoTIFF) data from Azure blob storage. We analyse the relative performance of the two libraries in two common data access scenarios and measure the effect of Dask chunk-size choice on the observed performance.
The experiment was conducted in the Planetary Computer Pangeo Notebook environment, using a single-node Dask cluster with 4 cores and 32 GiB of RAM.
We load three bands (red, green and blue) in the native projection and resolution of the data (10 m, UTM). We consider two scenarios: deep (temporal processing) and wide (building a mosaic for visualisation). The STAC queries used for each scenario are shown below.
Deep scenario query:

{
    "collections": ["sentinel-2-l2a"],
    "datetime": "2020-06/2020-07",
    "query": {
        "s2:mgrs_tile": {"eq": "35MNM"},
        "s2:nodata_pixel_percentage": {"lt": 10}
    }
}
Wide scenario query:

{
    "collections": ["sentinel-2-l2a"],
    "datetime": "2020-06-06",
    "bbox": [27.345815, -14.98724, 27.565542, -7.710992]
}
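As an illustration, the first (deep-scenario) query above could be issued and loaded along the following lines. This is a sketch rather than the exact benchmark code: the Planetary Computer STAC endpoint URL, the band aliases and the chunk size shown here are assumptions.

```python
# Sketch only: issue the deep-scenario STAC query and load it with odc-stac.
# Endpoint URL, band aliases and chunk size are assumptions, not necessarily
# what the benchmark used.
import planetary_computer
import pystac_client
from odc.stac import load

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,  # sign asset URLs for Azure blob access
)
items = catalog.search(
    collections=["sentinel-2-l2a"],
    datetime="2020-06/2020-07",
    query={
        "s2:mgrs_tile": {"eq": "35MNM"},
        "s2:nodata_pixel_percentage": {"lt": 10},
    },
).item_collection()  # requires a recent pystac-client

rgb = load(
    items,
    bands=["red", "green", "blue"],  # native 10 m UTM resolution
    chunks={"x": 2048, "y": 2048},   # Dask chunk size, one of the values benchmarked
)
```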
To control for variability in data access performance, we run each benchmark several times and keep the fastest run for comparison. Most configurations were run five times; some of the slower ones were run only three times, as we observed low variability in their execution times.
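A hypothetical helper capturing this best-of-N timing approach (not the actual benchmark harness):

```python
# Hypothetical helper illustrating the best-of-N timing described above:
# run the lazy load several times, force computation, and keep the fastest
# wall-clock time to dampen the effect of variable blob-storage latency.
import time

def best_of(n, make_lazy_result):
    elapsed = []
    for _ in range(n):
        t0 = time.perf_counter()
        make_lazy_result().compute()  # force Dask to actually load the data
        elapsed.append(time.perf_counter() - t0)
    return min(elapsed)
```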
A rotated thumbnail of the wide scenario area is displayed below; the image is actually tall and narrow, so to save space it is shown rotated counter-clockwise.
The deep scenario covers the same region, but uses only one granule (35MNM, left side of the image above).
To investigate the impact of configuration choices we consider several chunk sizes. When loading the deep scenario data, we find that for both odc-stac and stackstac a chunk size of 2048 pixels produces the best result. With the optimal configuration, stackstac achieves a peak throughput of 121.8 output megapixels per second (Mpix/s), which is about 15% higher than odc-stac, which peaks at 105 Mpix/s.
Both libraries use GDAL via rasterio to read the data, but stackstac configures GDAL's internal caching to optimise the re-use of image metadata, which has a significant positive impact on performance. Essentially, caching is enabled only when reading the image headers and disabled when reading the compressed pixel data, so the cached headers do not get evicted by pixel data. This is especially helpful when using multiple threads in Dask workers.
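The pattern can be illustrated with rasterio directly; the sketch below is a simplified illustration rather than stackstac's actual implementation, and the URL and window size are placeholders.

```python
# Sketch of the header-caching pattern described above (not stackstac's code).
# The VSI cache is enabled while opening the file, so the COG headers get
# cached, and disabled for bulk pixel reads so compressed pixel data does
# not evict those cached headers.
import rasterio
from rasterio.windows import Window

url = "https://example.blob.core.windows.net/container/B04.tif"  # placeholder

with rasterio.Env(VSI_CACHE=True, GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR"):
    src = rasterio.open(url)  # header/metadata reads go through the VSI cache

with rasterio.Env(VSI_CACHE=False):
    pixels = src.read(1, window=Window(0, 0, 2048, 2048))  # uncached pixel reads
```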
When loading a large mosaic, however, odc-stac has a significant advantage over stackstac. For odc-stac the optimal chunk size remains the same as for the deep scenario: 2048 pixels. For stackstac, performance slowly improves with larger chunk sizes, and the best result was achieved with the largest chunk size we tried: 7168 pixels. The throughput achieved by odc-stac is almost 2.5 times higher than that of stackstac: 74.7 vs 30.1 Mpix/s.
The approach taken by stackstac for loading mosaics is roughly as follows:
xx = stackstac.stack(items, ...)  # one time slice per STAC item
mosaic = xx.groupby("time").map(stackstac.mosaic)  # flatten slices sharing a timestamp
This means that the computational complexity of the wide scenario grows with the number of STAC items being loaded, and not just with the number of output pixels produced. A single large-area STAC item will load significantly faster than the same area partitioned into many STAC items.
This approach to data loading results in a very large Dask graph with a large number of "no-op" chunks. While stackstac includes optimizations for chunks that are "empty", those chunks still need to be processed by the Dask scheduler, resulting in significant overhead.
The approach taken by odc-stac avoids this problem: output chunks that receive no input data are simply filled with nodata values, without adding extra tasks to the graph. A sketch of the equivalent odc-stac mosaic load is shown below.
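The sketch below is illustrative rather than the exact benchmark code; the band aliases, chunk size and groupby setting are assumptions.

```python
# Sketch (assumed band aliases, chunk size and groupby setting): odc-stac
# flattens all items that fall on the same solar day into a single time slice
# during the load itself, so no extra per-item mosaic step is added to the
# Dask graph.
from odc.stac import load

mosaic = load(
    items,                           # STAC items from the wide-scenario query
    bands=["red", "green", "blue"],
    chunks={"x": 2048, "y": 2048},
    groupby="solar_day",
)
```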
As a result, the Dask graph produced by odc-stac is much smaller than the graph produced by stackstac in this scenario, which results in much faster submission and processing by the Dask scheduler. The table below lists the submit time in seconds, as observed for the fastest run at each chunk size.
method | 512 | 1024 | 2048 | 3072 | 4096 | 5120 | 6144 | 7168 |
---|---|---|---|---|---|---|---|---|
odc-stac | 3.881 | 0.927 | 0.272 | 0.116 | 0.084 | 0.238 | 0.034 | 0.029 |
stackstac | 33.366 | 7.950 | 1.967 | 0.778 | 0.484 | 0.426 | 0.156 | 0.136 |
Slow submit times are especially inconvenient in interactive data analysis workflows.
Both odc-stac and stackstac achieve lower throughput when constructing mosaics than when loading a temporal column of pixel data, but the penalty is significantly higher for stackstac.
Sentinel-2 granules contain some duplicated pixels. As a result, for some output pixels one needs to load several input pixels and then combine them into one (usually by picking one of the valid candidates). In the wide scenario we need to load about 1.08 input pixels for every output pixel, while for the deep scenario the input-to-output ratio is exactly 1. More work needs to be done per output pixel in the wide scenario, so one should expect somewhat lower throughput.
For odc-stac, throughput in the wide scenario is about 30% lower than in the deep scenario (74.7 vs 105 Mpix/s). From the pixel overlap alone, one would expect about 8% more work on the read side; the remaining ~22% can be attributed to the computational cost of fusing the overlapping pixels.
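A quick back-of-the-envelope check of that accounting, using only the figures quoted above:

```python
# Rough consistency check using the numbers reported in this section.
deep = 105.0    # odc-stac throughput, deep scenario (Mpix/s)
wide = 74.7     # odc-stac throughput, wide scenario (Mpix/s)
overlap = 1.08  # input pixels read per output pixel in the wide scenario

total_drop = 1 - wide / deep                 # ~0.29: observed throughput drop
read_only_drop = 1 - 1 / overlap             # ~0.07: expected from extra reads alone
fusing_share = total_drop - read_only_drop   # ~0.21: attributable to pixel fusing
```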
For stackstac, building mosaics is significantly more expensive computationally than reading temporal columns of data (30.1 vs 121.8 Mpix/s peak throughput, i.e. around 4 times slower per output pixel in the wide scenario).
Loading data at significantly reduced resolution is a common scenario, and it is especially useful in the exploratory stages of data analysis, which tend to be interactive and thus benefit from faster load times.
The cloud-optimized imagery includes low-resolution overviews and can therefore be read much faster at lower resolutions: for every doubling of the output pixel size in ground units, one needs to load four times fewer input pixels for the same geographic area.
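Both libraries accept a target resolution, which lets GDAL serve the reads from the COG overview levels. The sketch below uses assumed asset names and chunking rather than the exact benchmark configuration.

```python
# Sketch: request 20 m output pixels so reads come from the COG overviews.
# Asset/band names and chunk sizes are assumptions.
import stackstac
from odc.stac import load

rgb_20m_odc = load(
    items,                           # items from the wide-scenario query
    bands=["red", "green", "blue"],
    resolution=20,                   # output pixel size in metres
    chunks={"x": 2048, "y": 2048},
)

rgb_20m_ss = stackstac.stack(
    items,
    assets=["B04", "B03", "B02"],    # red, green, blue asset keys
    resolution=20,
    chunksize=2048,
)
```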
resolution (m) | elapsed (s): odc-stac | elapsed (s): stackstac | throughput (Mpix/s): odc-stac | throughput (Mpix/s): stackstac |
---|---|---|---|---|
10 | 40.1 | 165.9 | 74.7 | 18.1 |
20 | 9.9 | 77.1 | 75.5 | 9.7 |
40 | 6.3 | 56.3 | 29.6 | 3.3 |
80 | 2.3 | 29.3 | 20.6 | 1.6 |
For odc-stac, reading imagery at 20 m pixels is 4 times faster than reading at the native 10 m: throughput per output pixel remains the same, but there are four times fewer pixels to read. At coarser resolutions throughput starts to drop off, but load times still get noticeably shorter the further you zoom out.
For stackstac, lower-resolution reads do result in faster completion times, but throughput per output pixel drops off much more quickly than for odc-stac.
When zooming out to 1/8 of the image side length (1/64 of the number of pixels), odc-stac is more than 10 times quicker than stackstac (2.3 vs 29.3 seconds).
Elapsed time in seconds, by chunk size (pixels):

method | scenario | 512 | 1024 | 2048 | 3072 | 4096 | 5120 | 6144 | 7168 |
---|---|---|---|---|---|---|---|---|---|
odc-stac | wide | 154.2 | 65.4 | 40.1 | 43.9 | 55.2 | 54.6 | 55.1 | 55.0 |
stackstac | wide | 470.1 | 236.3 | 165.9 | 127.4 | 117.8 | 111.5 | 102.8 | 99.6 |
odc-stac | deep | 183.9 | 77.0 | 44.8 | 45.3 | 60.0 | 69.6 | 68.9 | 69.1 |
stackstac | deep | 212.1 | 76.6 | 38.6 | 43.4 | 56.4 | 66.4 | 68.6 | 69.4 |
Submit time in seconds, by chunk size (pixels):

method | scenario | 512 | 1024 | 2048 | 3072 | 4096 | 5120 | 6144 | 7168 |
---|---|---|---|---|---|---|---|---|---|
odc-stac | wide | 3.881 | 0.927 | 0.272 | 0.116 | 0.084 | 0.238 | 0.034 | 0.029 |
stackstac | wide | 33.366 | 7.950 | 1.967 | 0.778 | 0.484 | 0.426 | 0.156 | 0.136 |
odc-stac | deep | 6.057 | 1.478 | 0.407 | 0.182 | 0.110 | 0.262 | 0.054 | 0.219 |
stackstac | deep | 4.253 | 0.973 | 0.223 | 0.110 | 0.063 | 0.067 | 0.038 | 0.049 |
Throughput in Mpix/s, by chunk size (pixels):

method | scenario | 512 | 1024 | 2048 | 3072 | 4096 | 5120 | 6144 | 7168 |
---|---|---|---|---|---|---|---|---|---|
odc-stac | wide | 19.4 | 45.8 | 74.7 | 68.3 | 54.3 | 54.8 | 54.4 | 54.5 |
stackstac | wide | 6.4 | 12.7 | 18.1 | 23.5 | 25.4 | 26.9 | 29.2 | 30.1 |
odc-stac | deep | 25.6 | 61.1 | 105.0 | 103.8 | 78.4 | 67.6 | 68.2 | 68.0 |
stackstac | deep | 22.2 | 61.4 | 121.8 | 108.3 | 83.3 | 70.8 | 68.5 | 67.8 |
The tables below are for the wide scenario only.
Elapsed time in seconds, by resolution and chunk size:

method | chunk | 10 m | 20 m | 40 m | 80 m |
---|---|---|---|---|---|
odc-stac | 512 | 154.2 | 36.9 | 12.7 | 5.3 |
odc-stac | 1024 | 65.4 | 18.2 | 7.7 | 3.6 |
odc-stac | 2048 | 40.1 | 9.9 | 6.3 | 2.3 |
stackstac | 512 | 470.1 | 150.5 | 83.8 | 34.9 |
stackstac | 1024 | 236.3 | 108.0 | 66.9 | 32.6 |
stackstac | 2048 | 165.9 | 77.1 | 56.3 | 29.3 |
Submit time in seconds, by resolution and chunk size:

method | chunk | 10 m | 20 m | 40 m | 80 m |
---|---|---|---|---|---|
odc-stac | 512 | 3.881 | 0.865 | 0.162 | 0.047 |
odc-stac | 1024 | 0.927 | 0.161 | 0.047 | 0.020 |
odc-stac | 2048 | 0.272 | 0.047 | 0.020 | 0.010 |
stackstac | 512 | 33.366 | 8.782 | 2.281 | 0.352 |
stackstac | 1024 | 7.950 | 2.272 | 0.504 | 0.438 |
stackstac | 2048 | 1.967 | 0.660 | 0.434 | 0.039 |
Throughput in Mpix/s, by resolution and chunk size:

method | chunk | 10 m | 20 m | 40 m | 80 m |
---|---|---|---|---|---|
odc-stac | 512 | 19.4 | 20.3 | 14.8 | 8.9 |
odc-stac | 1024 | 45.8 | 41.2 | 24.4 | 13.0 |
odc-stac | 2048 | 74.7 | 75.5 | 29.6 | 20.6 |
stackstac | 512 | 6.4 | 5.0 | 2.2 | 1.3 |
stackstac | 1024 | 12.7 | 6.9 | 2.8 | 1.4 |
stackstac | 2048 | 18.1 | 9.7 | 3.3 | 1.6 |