Parallel and Larger-than-Memory Processing#
Authors: Ian Carroll (NASA, UMBC)
The following notebooks are prerequisites for this tutorial.
Learn with OCI: Data Access
Learn with OCI: Processing with Command-line Tools
Learn with OCI: Project and Format
Summary#
Processing a large collection of PACE granules can seem like a big job! The best way to approach a big data processing job is by breaking it into many small jobs, and then putting all the pieces back together again. The reason we break up big jobs has to do with computational resources, specifically memory (i.e. RAM) and processors (i.e. CPUs or GPUs). That large collection of granules can’t all fit in memory; and even if it could, your computation might only use one processor.
This notebook works toward a better understanding of a tool tightly integrated with XArray that breaks up big jobs. The tool is Dask: a Python library for parallel and distributed computing.
Learning Objectives#
At the end of this notebook you will know:
About the framework we’re calling “chunk-apply-combine”
How to start a dask client for parallel and larger-than-memory pipelines
One method for averaging Level-2 “swath” data over time
1. Setup#
Begin by importing all of the packages used in this notebook. If your kernel uses an environment defined following the guidance on the tutorials page, then the imports will be successful.
import cartopy.crs as ccrs
import dask.array as da
import earthaccess
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from dask.distributed import Client
from matplotlib.patches import Rectangle
We will discuss dask in more detail below, but we use several additional packages behind-the-scenes. They are installed, but we don’t have to import them directly.
rasterio is a high-level wrapper for GDAL which provides the ability to “warp” with the type of geolocation arrays distributed in Level-2 data.
rioxarray is a wrapper that attaches the rasterio tools to XArray data structures.
SatPy is another Python package that could be useful for the processing demonstrated in this notebook, especially through its Pyresample toolkit. SatPy requires dedicated readers for any given instrument, however, and we have not tested the SatPy reader contributed for PACE/OCI.
The data used in the demonstration is the chlor_a product found in the BGC suite of Level-2 ocean color products from OCI.
bbox = (-77, 36, -73, 41)
results = earthaccess.search_data(
short_name="PACE_OCI_L2_BGC",
temporal=("2024-08-01", "2024-08-07"),
bounding_box=bbox,
)
len(results)
16
The search results include all granules from the first week of August 2024 that intersect a bounding box around the Chesapeake and Delaware Bays. The region is much smaller than an OCI swath, so we do not use the cloud cover search filter, which considers the whole swath.
results[0]
paths = earthaccess.open(results[:1])
datatree = xr.open_datatree(paths[0])
dataset = xr.merge(datatree.to_dict().values())
dataset = dataset.set_coords(("latitude", "longitude"))
chla = dataset["chlor_a"]
chla
<xarray.DataArray 'chlor_a' (number_of_lines: 1710, pixels_per_line: 1272)> Size: 9MB [2175120 values with dtype=float32] Coordinates: longitude (number_of_lines, pixels_per_line) float32 9MB ... latitude (number_of_lines, pixels_per_line) float32 9MB ... Dimensions without coordinates: number_of_lines, pixels_per_line Attributes: long_name: Chlorophyll Concentration, OCI Algorithm units: mg m^-3 standard_name: mass_concentration_of_chlorophyll_in_sea_water valid_min: 0.001 valid_max: 100.0 reference: Hu, C., Lee Z., and Franz, B.A. (2012). Chlorophyll-a alg...
As a reminder, the Level-2 data has latitude and longitude arrays that give the geolocation of every pixel. The number_of_lines and pixels_per_line dimensions don’t have any meaningful coordinates that would be useful for stacking Level-2 files over time. In a lot of the granules, like the one visualized here, there will be only a tiny amount of data within the box. But we don’t want to lose a single pixel (if we can help it)!
fig = plt.figure(figsize=(8, 4))
axes = fig.add_subplot()
chla.plot.pcolormesh(x="longitude", y="latitude", robust=True, ax=axes)
axes.add_patch(
Rectangle(
bbox[:2],
bbox[2] - bbox[0],
bbox[3] - bbox[1],
edgecolor="red",
facecolor="none",
)
)
axes.set_aspect("equal")
plt.show()

Here is where we first use rasterio and rioxarray.
Together, those packages add the rio accessor to the chla data array, which allows us to do GDAL processing in Python.
GDAL requires certain metadata that rio records for us, but it needs to know some information!
The Level-2 file uses the EPSG 4326 coordinate reference system, and the spatial dimensions are:
x (or longitude for EPSG 4326) is “pixels_per_line”
y (or latitude for EPSG 4326) is “number_of_lines”
chla = chla.rio.set_spatial_dims("pixels_per_line", "number_of_lines")
chla = chla.rio.write_crs("epsg:4326")
We can now use the reproject method to grid the Level-2 data into something like a Level-3 Mapped product, but for a single granule.
chla_L3M = chla.rio.reproject(
dst_crs="epsg:4326",
src_geoloc_array=(
chla.coords["longitude"],
chla.coords["latitude"],
),
)
chla_L3M = chla_L3M.rename({"x":"longitude", "y":"latitude"})
The plotting can now be done with imshow rather than pcolormesh.
plot = chla_L3M.plot.imshow(robust=True)

Also, we can easily select the area-of-interest using our bounding box. This smaller data array will become our “template” for gridding additional granules below.
chla_L3M_aoi = chla_L3M.sel(
{
"longitude": slice(bbox[0], bbox[2]),
"latitude": slice(bbox[3], bbox[1]),
},
)
plot = chla_L3M_aoi.plot.imshow(robust=True)

When we get to opening multiple datasets, we will use a helper function to create a “time” coordinate extracted from metadata.
def time_from_attr(ds):
"""Set the start time attribute as a dataset variable.
Parameters
----------
ds
a dataset corresponding to a Level-2 granule
"""
datetime = ds.attrs["time_coverage_start"].replace("Z", "")
ds["time"] = ((), np.datetime64(datetime, "ns"))
ds = ds.set_coords("time")
return ds
Before we get to data, we will play with some random numbers. Whenever you use random numbers, a good practice is to set a fixed but unique seed, such as the result of secrets.randbits(64).
random = np.random.default_rng(seed=5179916885778238210)
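For reference, this is how such a seed can be generated with the standard library; run it once and paste the resulting integer into your code (a small aside, not part of the tutorial’s pipeline).
import secrets

# print a 64-bit integer suitable for use as a fixed seed
print(secrets.randbits(64))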
3. Chunk-Apply-Combine#
If you’ve done big data processing, you’ve probably come across the “split-apply-combine” or “map-reduce” frameworks. A simple case is computing group-wise means on a dataset where one variable defines the group and another variable is what you need to average for each group.
split: divide a table into smaller tables, one for each group
apply: calculate the mean on each small table
combine: reassemble the results into a table of group-wise means
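As a toy illustration of these three steps, here is a group-wise mean computed with plain numpy; the values and groups below are made up for this sketch, and a real pipeline would more likely use pandas or xarray group-by operations.
values = np.array([2.0, 4.0, 3.0, 5.0, 7.0])
groups = np.array(["a", "a", "b", "b", "b"])
labels = np.unique(groups)

# split: divide the values into smaller arrays, one per group
split = [values[groups == g] for g in labels]

# apply: calculate the mean of each small array
applied = [v.mean() for v in split]

# combine: reassemble the results into a table of group-wise means
combined = dict(zip(labels, applied))  # {"a": 3.0, "b": 5.0}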
The same framework is also used without a natural group by which a dataset should be divided. The split is on equal-sized slices of the original dataset, which we call “chunks”. Rather than a group-wise mean, you could use “chunk-apply-combine” to calculate the grand mean in chunks.
chunk: divide the array into smaller arrays, or “chunks”
apply: calculate the mean and sample size of each chunk (i.e. skipping missing values)
combine: combine the size-weighted means to compute the mean of the whole array
The apply and combine steps have to be capable of calculating results on a slice that can be combined to equal the result you would have gotten on the full array. If a computation can be shoved through “chunk-apply-combine” (see also “map-reduce”), then we can process an array that is too big to read into memory at once. We can also distribute the computation across processors or across a cluster of computers.
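For example, here is a sketch of the grand mean computed chunk-by-chunk with plain numpy, skipping missing values; the array x and the number of chunks are arbitrary choices for illustration.
# chunk: divide the array into smaller arrays, or "chunks"
x = random.normal(1, 2, size=2**20)
x[random.random(x.size) < 0.1] = np.nan  # add some missing values
chunks = np.array_split(x, 8)

# apply: the mean and sample size of each chunk, skipping missing values
means = [np.nanmean(c) for c in chunks]
sizes = [np.count_nonzero(~np.isnan(c)) for c in chunks]

# combine: the size-weighted mean of the chunk means is the grand mean
grand_mean = np.average(means, weights=sizes)
np.allclose(grand_mean, np.nanmean(x))  # True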
We can represent the framework visually using a task graph, a collection of functions (nodes) linked through input and output data (edges).
flowchart LR
A(random.normal) -->|array| B(mean)
The output of the random.normal function becomes the input to the mean function. We can decide to use chunk-apply-combine if either:
array is going to be larger than available memory
mean can be accurately calculated from intermediate results on slices of array
By the way, numpy arrays have an nbytes attribute that helps you understand how much memory you may need. Note that most computations require several times the size of an input array to do the math.
array = random.normal(1, 2, size=2**24)
array
array([-2.21670758, 0.16399032, 2.69695432, ..., 0.95942182,
0.46135808, -0.72953258], shape=(16777216,))
print(f"{array.nbytes / 2**20} MiB")
128.0 MiB
An array of 128 MiB is still not too big to fit in memory on most laptops. For demonstration, let’s assume we could fit a 128 MiB array into memory but not a 1 GiB array. We will calculate the mean of a 1 GiB array anyway, using 8 chunks of 128 MiB each in a serial pipeline.
A simple way to measure performance (i.e. speed) in notebooks is to use the IPython %%timeit magic.
Begin any cell with %%timeit
on a line by itself to trigger that cell to run many times under a timer.
How long it takes the cell to run on average will be printed along with any result.
On this system, the serial approach can be seen below to take about 3 seconds.
%%timeit -r 3
n = 8
s = 0
for _ in range(n):
array = random.normal(1, 2, size=2**24)
s += array.mean()
mean = s / n
3.15 s ± 37.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
All we can implement in a for-loop is serial computation, but we have multiple processors available!
To help visualize the “chunk-apply-combine” framework, we can make a task graph for the “distributed” approach to calculating the mean of a very large array.
%%{ init: { 'flowchart': { 'curve': 'monotoneY' } } }%%
flowchart LR
A(random.normal)
A -->|array_0| C0(apply-mean)
A -->|array_1| C1(apply-mean)
A -->|array_2| C2(apply-mean)
subgraph cluster
C0
C1
C2
end
C0 -->|result_0| D(combine-mean)
C1 -->|result_1| D
C2 -->|result_2| D
In this task graph, the mean function is never used! Instead, apply-mean and combine-mean functions perform the appropriate computations on chunks and then combine the results into the global mean.
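The rest of this notebook uses dask.array, which provides these chunk-wise equivalents for us, but as an aside the same task graph can be sketched directly with dask.delayed (the chunk size and count below are arbitrary).
import dask

@dask.delayed
def apply_mean(seed, size):
    # generate and reduce one chunk inside a single task
    chunk = np.random.default_rng(seed).normal(1, 2, size=size)
    return chunk.mean(), chunk.size

@dask.delayed
def combine_mean(results):
    # size-weighted combination of the per-chunk means
    means, sizes = zip(*results)
    return np.average(means, weights=sizes)

graph = combine_mean([apply_mean(i, 2**22) for i in range(8)])
# nothing runs until we request the result, e.g. graph.compute()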
The two things that dask provides are the cluster that distributes jobs for computation and the many functions (like apply-mean and combine-mean) that are the “chunk-apply-combine” equivalents of numpy functions.
To begin processing with dask, we start a local client to manage a cluster.
client = Client(n_workers=4, threads_per_worker=1, memory_limit="512MiB")
client
Client connected to a LocalCluster: 4 workers, 1 thread each, 512 MiB memory limit per worker (2.00 GiB total), with a link to the Dask dashboard.
Just like np.random, we can use da.random from dask to generate a data array.
dask_random = da.random.default_rng(random)
dask_array = dask_random.normal(1, 2, size=2**27, chunks=2**22)
dask_array
(dask array repr: 1.00 GiB of float64 values in 32 chunks of 32.00 MiB each)
Applying the mean function with our dask_array is quite a bit different.
dask_array.mean()
No operation has happened on the array. In fact, the random numbers have not even been generated yet!
No resources are put into action until we explicitly demand a computation; for example, by calling compute, requesting a visualization, or writing data to disk.
%%timeit -r 3
mean = dask_array.mean().compute()
2.02 s ± 64.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
We just demonstrated two ways of doing larger-than-memory calculations.
Our synchronous implementation (using a for loop) took the strategy of maximizing the use of available memory while processing one chunk: we used 128 MiB chunks, requiring 8 chunks to get to a 1 GiB array.
Our concurrent implementation (using dask.array) took the strategy of maximizing the use of available processors: we used small chunks of 32 MiB, requiring 32 chunks to get to a 1 GiB array.
The concurrent implementation was about twice as fast, but your mileage may vary.
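If you want to confirm the chunk structure that was used, the dask array created above carries that information in its attributes.
print(f"{dask_array.nbytes / 2**30} GiB in {dask_array.npartitions} chunks")
print(f"{dask_array.chunksize[0] * dask_array.dtype.itemsize / 2**20} MiB per chunk")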
client.close()
4. Stacking Level-2 Granules#
The integration of XArray and Dask is designed to let you work without having to do very much differently. Of course, there is still a lot of work to do when writing any processing pipeline on a collection of Level-2 granules. Stacking them over time, for instance, may sound easy but requires resampling the data to a common grid.
This section demonstrates one method for stacking Level-2 granules. More details are available in the prerequisite notebook noted above.
paths = earthaccess.open(results)
paths
[<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240801T174245.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240802T163900.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240802T164400.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240802T181718.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240803T171333.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240803T171833.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240804T161447.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240804T174805.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240805T164420.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240805T164920.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240805T182239.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240806T171853.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240806T172353.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240807T162007.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240807T175325.L2.OC_BGC.V3_0.nc>,
<File-like object S3FileSystem, ob-cumulus-prod-public/PACE_OCI.20240807T175825.L2.OC_BGC.V3_0.nc>]
kwargs = {"combine": "nested", "concat_dim": "time"}
attrs = xr.open_mfdataset(paths, preprocess=time_from_attr, **kwargs)
attrs
<xarray.Dataset> Size: 128B Dimensions: (time: 16) Coordinates: * time (time) datetime64[ns] 128B 2024-08-01T17:42:45.090000 ... 2024-0... Data variables: *empty* Attributes: (12/45) title: OCI Level-2 Data BGC product_name: PACE_OCI.20240801T174245.L2.OC_BGC.V3_... processing_version: 3.0 history: l2gen par=/data1/sdpsoper/vdc/vpu0/wor... instrument: OCI platform: PACE ... ... geospatial_lon_max: -60.75639 geospatial_lon_min: -94.36226 startDirection: Ascending endDirection: Ascending day_night_flag: Day earth_sun_distance_correction: 0.970880389213562
Now, if you try xr.open_mfdataset on the “geophysical_data” group, you will probably encounter an error because the number_of_lines dimension is not the same in every granule.
products = xr.open_mfdataset(paths, group="geophysical_data", **kwargs)
---------------------------------------------------------------------------
AlignmentError Traceback (most recent call last)
Cell In[25], line 1
----> 1 products = xr.open_mfdataset(paths, group="geophysical_data", **kwargs)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/api.py:1706, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
1702 try:
1703 if combine == "nested":
1704 # Combined nested list by successive concat and merge operations
1705 # along each dimension, using structure given by "ids"
-> 1706 combined = _nested_combine(
1707 datasets,
1708 concat_dims=concat_dim,
1709 compat=compat,
1710 data_vars=data_vars,
1711 coords=coords,
1712 ids=ids,
1713 join=join,
1714 combine_attrs=combine_attrs,
1715 )
1716 elif combine == "by_coords":
1717 # Redo ordering from coordinates, ignoring how they were ordered
1718 # previously
1719 combined = combine_by_coords(
1720 datasets,
1721 compat=compat,
(...)
1725 combine_attrs=combine_attrs,
1726 )
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/combine.py:367, in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join, combine_attrs)
364 _check_shape_tile_ids(combined_ids)
366 # Apply series of concatenate or merge operations along each dimension
--> 367 combined = _combine_nd(
368 combined_ids,
369 concat_dims,
370 compat=compat,
371 data_vars=data_vars,
372 coords=coords,
373 fill_value=fill_value,
374 join=join,
375 combine_attrs=combine_attrs,
376 )
377 return combined
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/combine.py:246, in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join, combine_attrs)
242 # Each iteration of this loop reduces the length of the tile_ids tuples
243 # by one. It always combines along the first dimension, removing the first
244 # element of the tuple
245 for concat_dim in concat_dims:
--> 246 combined_ids = _combine_all_along_first_dim(
247 combined_ids,
248 dim=concat_dim,
249 data_vars=data_vars,
250 coords=coords,
251 compat=compat,
252 fill_value=fill_value,
253 join=join,
254 combine_attrs=combine_attrs,
255 )
256 (combined_ds,) = combined_ids.values()
257 return combined_ds
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/combine.py:278, in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join, combine_attrs)
276 combined_ids = dict(sorted(group))
277 datasets = combined_ids.values()
--> 278 new_combined_ids[new_id] = _combine_1d(
279 datasets, dim, compat, data_vars, coords, fill_value, join, combine_attrs
280 )
281 return new_combined_ids
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/combine.py:301, in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
299 if concat_dim is not None:
300 try:
--> 301 combined = concat(
302 datasets,
303 dim=concat_dim,
304 data_vars=data_vars,
305 coords=coords,
306 compat=compat,
307 fill_value=fill_value,
308 join=join,
309 combine_attrs=combine_attrs,
310 )
311 except ValueError as err:
312 if "encountered unexpected variable" in str(err):
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/concat.py:277, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
264 return _dataarray_concat(
265 objs,
266 dim=dim,
(...)
274 create_index_for_new_dim=create_index_for_new_dim,
275 )
276 elif isinstance(first_obj, Dataset):
--> 277 return _dataset_concat(
278 objs,
279 dim=dim,
280 data_vars=data_vars,
281 coords=coords,
282 compat=compat,
283 positions=positions,
284 fill_value=fill_value,
285 join=join,
286 combine_attrs=combine_attrs,
287 create_index_for_new_dim=create_index_for_new_dim,
288 )
289 else:
290 raise TypeError(
291 "can only concatenate xarray Dataset and DataArray "
292 f"objects, got {type(first_obj)}"
293 )
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/concat.py:523, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
520 # Make sure we're working on a copy (we'll be loading variables)
521 datasets = [ds.copy() for ds in datasets]
522 datasets = list(
--> 523 align(
524 *datasets, join=join, copy=False, exclude=[dim_name], fill_value=fill_value
525 )
526 )
528 dim_coords, dims_sizes, coord_names, data_names, vars_order = _parse_datasets(
529 datasets
530 )
531 dim_names = set(dim_coords)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/alignment.py:944, in align(join, copy, indexes, exclude, fill_value, *objects)
748 """
749 Given any number of Dataset and/or DataArray objects, returns new
750 objects with aligned indexes and dimension sizes.
(...)
934
935 """
936 aligner = Aligner(
937 objects,
938 join=join,
(...)
942 fill_value=fill_value,
943 )
--> 944 aligner.align()
945 return aligner.results
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/alignment.py:637, in Aligner.align(self)
635 self.find_matching_unindexed_dims()
636 self.align_indexes()
--> 637 self.assert_unindexed_dim_sizes_equal()
639 if self.join == "override":
640 self.override_indexes()
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/structure/alignment.py:499, in Aligner.assert_unindexed_dim_sizes_equal(self)
497 add_err_msg = ""
498 if len(sizes) > 1:
--> 499 raise AlignmentError(
500 f"cannot reindex or align along dimension {dim!r} "
501 f"because of conflicting dimension sizes: {sizes!r}" + add_err_msg
502 )
AlignmentError: cannot reindex or align along dimension 'number_of_lines' because of conflicting dimension sizes: {1709, 1710}
Even if they were consistent, each granule has a different array of coordinates for latitude and longitude.
Before doing a calculation over time, we need to project the Level-2 arrays to a common grid.
Here is a function to do the same kind of gridding as above, but allowing us to pass the “destination” (i.e. dst) projection parameters.
def grid_match(path, dst_crs, dst_shape, dst_transform):
"""Reproject a Level-2 granule to match a Level-3M-ish granule."""
dt = xr.open_datatree(path)
da = dt["geophysical_data"]["chlor_a"]
da = da.rio.set_spatial_dims("pixels_per_line", "number_of_lines")
da = da.rio.set_crs("epsg:4326")
da = da.rio.reproject(
dst_crs,
shape=dst_shape,
transform=dst_transform,
src_geoloc_array=(
dt["navigation_data"]["longitude"],
dt["navigation_data"]["latitude"],
),
)
da = da.rename({"x":"longitude", "y":"latitude"})
return da
Let’s try out the function, using chla_L3M_aoi as our template.
crs = chla_L3M_aoi.rio.crs
shape = chla_L3M_aoi.rio.shape
transform = chla_L3M_aoi.rio.transform()
grid_match(paths[0], crs, shape, transform)
<xarray.DataArray 'chlor_a' (latitude: 296, longitude: 237)> Size: 281kB array([[ nan, nan, nan, ..., nan, nan, 1.0616910e+02], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [ nan, nan, nan, ..., nan, nan, 6.8380810e-02], [ nan, nan, nan, ..., nan, nan, 6.6567786e-02], [ nan, nan, nan, ..., nan, nan, 6.5402456e-02]], shape=(296, 237), dtype=float32) Coordinates: * longitude (longitude) float64 2kB -76.99 -76.98 -76.96 ... -73.02 -73.01 * latitude (latitude) float64 2kB 40.99 40.97 40.96 ... 36.04 36.02 36.01 spatial_ref int64 8B 0 Attributes: long_name: Chlorophyll Concentration, OCI Algorithm units: mg m^-3 standard_name: mass_concentration_of_chlorophyll_in_sea_water valid_min: 0.001 valid_max: 100.0 reference: Hu, C., Lee Z., and Franz, B.A. (2012). Chlorophyll-a alg...
Now that we have encapsulated our processing in a function, we can use dask to distribute computation.
As before, we need a client.
client = Client()
client
Client connected to a LocalCluster: 4 workers, 1 thread each, 0.93 GiB memory limit per worker (3.71 GiB total), with a link to the Dask dashboard.
Call client.map to send grid_match to the cluster and prepare to run it independently on each element of paths.
futures = client.map(
grid_match,
paths,
dst_crs=crs,
dst_shape=shape,
dst_transform=transform,
)
futures
[<Future: pending, key: grid_match-ea6f45dda1328be13a653990c0269be2>,
<Future: pending, key: grid_match-0043f538efb6bf582abc5f86f1603750>,
<Future: pending, key: grid_match-b108cd9fc4c0e16d0811c8734f50e71e>,
<Future: pending, key: grid_match-706ca2fb43c1f3a885dfac51397916c2>,
<Future: pending, key: grid_match-7134041f2566f1252e20dc3bb3d6ab43>,
<Future: pending, key: grid_match-58cb9221fe5285a65d90148e16251c37>,
<Future: pending, key: grid_match-02a488c341a138e3395a0c69622dbd16>,
<Future: pending, key: grid_match-0609a2b1be2218dbec3cb35378c38b49>,
<Future: pending, key: grid_match-eab1133578fd3958b15898a3d68857b4>,
<Future: pending, key: grid_match-9b559e4abfbb0fa357edd2a3dabca251>,
<Future: pending, key: grid_match-2b990fd98129a799561f54ad06e7ced6>,
<Future: pending, key: grid_match-e5cd4c1ffb07c3038c48614eec04f4d6>,
<Future: pending, key: grid_match-bd06e29c7221da79764b48ce7e61395a>,
<Future: pending, key: grid_match-7ab512662e9eae710e2186af38b649c4>,
<Future: pending, key: grid_match-be28ed8ebcc9956b072cb8da4e33e1d4>,
<Future: pending, key: grid_match-8a3e23533a455688d82b510065ead46f>]
The futures will remain pending until the tasks have been completed on the cluster.
You don’t need to wait for them; the next call to client.gather will wait for the results.
These results can now be easily stacked.
chla = xr.combine_nested(client.gather(futures), concat_dim="time")
chla["time"] = attrs["time"]
chla
<xarray.DataArray 'chlor_a' (time: 16, latitude: 296, longitude: 237)> Size: 4MB array([[[ nan, nan, nan, ..., nan, nan, 1.0616910e+02], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [ nan, nan, nan, ..., nan, nan, 6.8380810e-02], [ nan, nan, nan, ..., nan, nan, 6.6567786e-02], [ nan, nan, nan, ..., nan, nan, 6.5402456e-02]], [[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ... nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan]], [[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan]]], shape=(16, 296, 237), dtype=float32) Coordinates: * longitude (longitude) float64 2kB -76.99 -76.98 -76.96 ... -73.02 -73.01 * latitude (latitude) float64 2kB 40.99 40.97 40.96 ... 36.04 36.02 36.01 spatial_ref int64 8B 0 * time (time) datetime64[ns] 128B 2024-08-01T17:42:45.090000 ... 20...
plot = chla.mean("time").plot.imshow(robust=True)

client.close()
5. Scaling Out#
The example above relies on a LocalCluster, which only uses the resources on the JupyterLab server running your notebook.
The promise of the commercial cloud is massive scalability.
Again, dask and the CryoCloud come to our aid in the form of a pre-configured dask_gateway.Gateway.
The gateway object created below works with the CryoCloud to launch additional servers and enlist them in your cluster.
from dask_gateway import Gateway
gateway = Gateway()
options = gateway.cluster_options()
options
cluster = gateway.new_cluster(options)
cluster
The cluster starts with no workers, and you must set manual or adaptive scaling in order to get any workers.
The cluster can take several minutes to spin up. Monitor the dashboard to ensure you have workers.
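For example, either of the following calls on the cluster object requests workers; the worker counts here are arbitrary.
# manual scaling: request a fixed number of workers
cluster.scale(4)

# or adaptive scaling: let the cluster grow and shrink with the workload
# cluster.adapt(minimum=2, maximum=8)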
client = cluster.get_client()
client
Client connected to GatewayCluster prod.899a01c54ac14b97acc2bbc1f5b6d35b, with a link to the Dask dashboard.
Use this client exactly as above:
futures = client.map(
grid_match,
paths,
dst_crs=crs,
dst_shape=shape,
dst_transform=transform,
)
chla = xr.combine_nested(client.gather(futures), concat_dim="time")
chla["time"] = attrs["time"]
chla
cluster.close()