<aside> 🌷
“We embody, we learn, we release the idea of failure, because it is all data.”
**—adrienne maree brown**
</aside>
This project builds upon a pre-existing body of work for rewildingCities, an open-source initiative promoting the creation of socio-technical systems for climate resilience in urban communities.
Our pilot study successfully replicated a peer-reviewed study of Park Cooling Intensity (PCI) conducted in Nanjing, China, applying its methodology to New York City. We focused on creating a reproducible analysis pipeline in R that processed multiple large geospatial datasets. This initial work ran locally on a single machine and highlighted the significant computational bottlenecks inherent in complex geospatial analysis for large urban areas.
We will prototype a layered, collaborative research environment (CRE) for running complex socio-environmental models by re-architecting the pilot’s logic into a scalable, cloud-native system. The goal of rewildingCities is to create spaces for local communities to democratically model sustainable ecologies, economies, and infrastructure for climate-resilient futures.
Our collaborative research environment is designed around an initial “recipe-ingredient” model that was validated in our pilot:
Recipes: Methodological blueprints that define the semantic data needs of a specific geospatial or scientific experiment. Each recipe consists of curiosity spaces (the specific questions our system can address), methodologies (the explicit choice points in actualizing the exploration of a curiosity space, or question), and experiments (which bind a researcher’s question to the data that exists and the analytical approach, or method, best suited to answering it).
Manifests (manifest.yml): Local configuration files that map each recipe’s abstract ingredients to a specific city’s real-world data sources (in this instance, we chose New York City).
The Simulation Engine: The cloud-based backend that reads a manifest, processes data according to our internal schema, and executes local experiments.
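As a rough sketch of how these three pieces fit together (every class name, field name, and data-source path below is illustrative, not the project’s actual schema), the recipe–manifest binding could look like:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the recipe-ingredient model. All names and
# source URIs are hypothetical placeholders, not real project values.

@dataclass
class Recipe:
    """A methodological blueprint: a question plus its abstract data needs."""
    question: str                                          # the curiosity space
    ingredients: list[str] = field(default_factory=list)   # abstract data needs

@dataclass
class Manifest:
    """Maps a recipe's abstract ingredients to one city's real data sources."""
    city: str
    sources: dict[str, str] = field(default_factory=dict)  # ingredient -> URI

def bind(recipe: Recipe, manifest: Manifest) -> dict[str, str]:
    """Resolve each abstract ingredient to a concrete source, failing
    loudly on anything the manifest does not provide."""
    missing = [i for i in recipe.ingredients if i not in manifest.sources]
    if missing:
        raise KeyError(f"manifest for {manifest.city} missing: {missing}")
    return {i: manifest.sources[i] for i in recipe.ingredients}

# Example: a PCI-style recipe bound to hypothetical NYC sources.
pci = Recipe(
    question="How intense is park cooling across the city?",
    ingredients=["land_surface_temperature", "park_boundaries"],
)
nyc = Manifest(
    city="New York City",
    sources={
        "land_surface_temperature": "s3://example-bucket/nyc/lst.tif",
        "park_boundaries": "s3://example-bucket/nyc/parks.geojson",
    },
)
experiment_inputs = bind(pci, nyc)
```

The design choice worth noting is that the recipe never names a real file: only the manifest is city-specific, so the same recipe can be re-bound to any city that publishes a compatible manifest.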
With this project, we seek to explore several core scalability needs of experimenting with data in this way:
Computational Scaling (Event-Driven, Parallel Processing): The PCI pilot study for one city required processing terabytes of raw satellite and vector data. To support dozens of communities of scientists running complex models, the system must be able to parallelize these massive computational workloads. A single-machine approach is not viable.
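The fan-out pattern this implies can be sketched in a few lines: split the citywide analysis into independent tiles and map a per-tile computation across a worker pool. `cooling_stat` is a hypothetical stand-in for the real PCI math, and a thread pool is used here only to keep the toy self-contained; CPU-bound raster work would use a process pool or distributed workers, but the shape of the code is the same.

```python
from concurrent.futures import ThreadPoolExecutor

def cooling_stat(tile_id: int) -> tuple[int, float]:
    """Hypothetical stand-in: pretend to read one raster tile and
    compute a summary statistic (placeholder arithmetic only)."""
    return tile_id, tile_id * 0.1

def run_parallel(tile_ids: list[int], workers: int = 4) -> dict[int, float]:
    """Fan the per-tile job out across a worker pool and gather results,
    keyed by tile id since completion order is not guaranteed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(cooling_stat, tile_ids))

results = run_parallel(list(range(8)))
```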
Data Ingestion & Validation Scaling: The platform must be able to ingest and validate a heterogeneous mix of data sources. This requires a robust, decoupled pipeline that can handle diverse formats and gracefully manage failures.
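One common shape for such a pipeline is a per-format validator registry where failures are collected rather than allowed to abort the batch. A minimal sketch (the formats, checks, and file names are illustrative, not our actual validation rules):

```python
import json
from pathlib import Path

# Registry of per-format validators; each raises ValueError on bad input.
VALIDATORS = {}

def validator(ext):
    def register(fn):
        VALIDATORS[ext] = fn
        return fn
    return register

@validator(".geojson")
def check_geojson(raw: bytes) -> None:
    doc = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if "type" not in doc:
        raise ValueError("GeoJSON missing 'type' member")

@validator(".csv")
def check_csv(raw: bytes) -> None:
    if not raw.strip():
        raise ValueError("empty CSV")

def ingest(files: dict[str, bytes]) -> tuple[list[str], dict[str, str]]:
    """Validate a heterogeneous batch; return (accepted, errors) so that
    one bad file never sinks the whole ingest."""
    accepted, errors = [], {}
    for name, raw in files.items():
        check = VALIDATORS.get(Path(name).suffix)
        if check is None:
            errors[name] = "unsupported format"
            continue
        try:
            check(raw)
            accepted.append(name)
        except ValueError as exc:
            errors[name] = str(exc)
    return accepted, errors

accepted, errors = ingest({
    "parks.geojson": b'{"type": "FeatureCollection", "features": []}',
    "stations.csv": b"id,lat,lon\n1,40.7,-74.0\n",
    "broken.geojson": b"{not json",
    "photo.tiff": b"\x00",
})
```

Because validators are registered independently of the ingest loop, new formats can be added without touching the pipeline itself, which is what "decoupled" buys us here.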
Request Scaling (API & Data Serving): The ultimate goal is for the CRE to serve its results to interactive dashboards. The public-facing API must support a growing number of users querying these complex datasets, ensuring a responsive and interactive experience.
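One standard tactic for keeping repeated dashboard queries responsive is to memoize expensive result lookups. In the sketch below, `pci_summary` is a hypothetical stand-in for a query against the processed results store, its latency is simulated with a sleep, and the returned value is a fake placeholder:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def pci_summary(city: str, year: int) -> tuple:
    """Hypothetical stand-in for an expensive datastore query."""
    time.sleep(0.05)  # simulate query latency
    return (city, year, 1.7)  # placeholder values, not real PCI results

# First call pays the full query cost; repeats are served from memory.
t0 = time.perf_counter()
pci_summary("New York City", 2024)
cold = time.perf_counter() - t0

t0 = time.perf_counter()
pci_summary("New York City", 2024)
warm = time.perf_counter() - t0
```

A real deployment would put this caching at the API or datastore layer rather than in-process, but the latency trade-off it demonstrates is the one our request-scaling experiments will measure.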
Our experiments will test the three most critical, user-facing aspects of our prototype’s performance: its raw computational speed for complex analyses, its ability to reliably and quickly ingest user-provided data, and its capacity to serve that data to multiple users with low latency.
→ Focus:
→ Desired Outcome:
→ Method:
→ Evaluation & Metrics: