diff --git a/docs/evaluations/virtual_dataset_performance_assessment.ipynb b/docs/evaluations/virtual_dataset_performance_assessment.ipynb new file mode 100644 index 00000000..f08cf5bc --- /dev/null +++ b/docs/evaluations/virtual_dataset_performance_assessment.ipynb @@ -0,0 +1,920 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0bebc34d-64a9-4961-accd-e3eece6aad5f", + "metadata": {}, + "source": [ + "# Assessing performance of various methods to open NASA Earthdata data\n", + "\n", + "\n", + "## Summary\n", + "\n", + "This notebook started with, and currently uses, TEMPO Level-3 data for its test cases. For access patterns, we begin by focusing on the relatively new `earthaccess.open_virtual_mfdataset()` functionality. Further discussion of this effort can be found in [this `earthaccess` GitHub discussion](https://github.com/nsidc/earthaccess/discussions/987).\n", + "\n", + "## Prerequisites\n", + "\n", + "- **AWS US-West-2 Environment:** This tutorial has been designed to run in an AWS cloud compute instance in AWS region us-west-2. If you run it from your laptop or workstation instead, everything should still work, but without the speed benefits of in-cloud access.\n", + "\n", + "- **Earthdata Account:** A (free!) Earthdata Login account is required to access data from the NASA Earthdata system. 
Before requesting TEMPO data, we first need to set up our Earthdata Login authentication, as described in the Earthdata Cookbook's [earthaccess tutorial (link)](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/tutorials/earthaccess-demo.html).\n", + "\n", + "- **Packages:**\n", + "\n", + " - `cartopy`\n", + " - `dask`\n", + " - `earthaccess` **version 0.14.0 or greater**\n", + " - `matplotlib`\n", + " - `numpy`\n", + " - `xarray`" + ] + }, + { + "cell_type": "markdown", + "id": "d64bfa3b-108e-47d1-a24c-4bdcda59e5c8", + "metadata": {}, + "source": [ + "# Test Report (so far)\n", + "\n", + "\n", + "## Approach for Initial Test Case(s)\n", + "---\n", + "\n", + "#### Data\n", + "\n", + "The tests in this notebook leverage data from the Nitrogen Dioxide ($NO_2$) Level-3 data collection of the [TEMPO air quality mission (link)](https://asdc.larc.nasa.gov/project/TEMPO). TEMPO Level-3 data are stored in granules that are each ~500 MB. Thus, a year's worth of data is about $500*4867/1024/1024 = 2.3 \\text{ TB}$.\n", + "\n", + "#### Methods\n", + "\n", + "We use multiple functions, including the relatively new `earthaccess.open_virtual_mfdataset()` function, to open, stream, or download the data and/or metadata from granules such that we can then calculate means for a subset of the data and visualize the results.\n", + "\n", + "Test cases utilize the following functions/function chains:\n", + "- Single granule cases:\n", + " - **Case 1:** `earthaccess.open_virtual_dataset()`\n", + " - **Case 2:** `xr.open_dataset(earthaccess.open())`\n", + "- Multi-granule cases:\n", + " - **Case 3:** `earthaccess.open_virtual_mfdataset()`\n", + " - **Case 4:** `earthaccess.download()`\n", + "\n", + "#### Benchmarking\n", + "\n", + "To facilitate comparisons, times reported for opening one file are extrapolated to an estimated time it would take to open all 4,867 TEMPO Level-3 granules in the 2024–2025 \"year-long\" analysis scenario, while assuming the year-long scenario would be 
performed using the Openscapes Hub's default of 4 CPUs. Where appropriate, times are converted to more easily readable units. For example, for Cases **1** and **2**, the following equation is used to convert from time (in seconds) for opening one file ($x$) to estimated time (in hours) for all granules in a year-long scenario ($y$): \n", + "$$\n", + " x \\text{ time (s) for one granule} * \n", + " \\frac{4867 \\text{ granules}}{1 \\text{ granule}} * \n", + " \\frac{1 \\text{ min}}{60 \\text{ s}} * \n", + " \\frac{1}{4 \\text{ CPUs}} * \n", + " \\frac{1 \\text{ hr}}{60 \\text{ min}} =\n", + " y \\text{ time (hr) for all granules per CPU}\n", + "$$\n", + "\n", + "For Case **3**, the following equation is used to estimate a time (in milliseconds) for a single granule ($y$) from the time (in seconds) it takes to open all granules ($x$):\n", + "\n", + "$$\n", + " x \\text{ time (s) for all granules} * \n", + " \\frac{1 \\text{ granule}}{4867 \\text{ granules}} * \n", + " \\frac{1000 \\text{ ms}}{1 \\text{ s}} =\n", + " y \\text{ time (ms) for one granule}\n", + "$$\n", + "\n", + "For Case **4**, the following equation is used to estimate time (in hours) for all granules in a year-long scenario ($y$) from the time (in seconds) it takes to download 10 granules ($x$):\n", + "\n", + "$$\n", + " x \\text{ time (s) for 10 granules} * \n", + " \\frac{4867 \\text{ granules}}{10 \\text{ granules}} * \n", + " \\frac{1 \\text{ min}}{60 \\text{ s}} *\n", + " \\frac{1 \\text{ hr}}{60 \\text{ min}} =\n", + " y \\text{ time (hr) for all granules}\n", + "$$\n", + "\n", + "## Results\n", + "---\n", + " \n", + "### Case 1 – Opening as virtual dataset – using `earthaccess.open_virtual_dataset()`:\n", + "\n", + "with \n", + "```python\n", + "open_options = {\n", + " \"access\": \"direct\",\n", + " \"load\": True\n", + "}\n", + "```\n", + "\n", + "| Run | Wall time | Wall time extrapolated to all granules | CPU time | CPU time extrapolated to all granules |\n", + "| :- | :--------- | :- | 
:- | :- |\n", + "| 1 | 235 ms | 0.08 hr = 4.8 min | 46.5 ms | 0.016 hr = 56 s |\n", + "| 2 | 881 ms | 0.30 hr = 17.9 min | 54.9 ms | 0.019 hr = 67 s |\n", + "| 3 | 281 ms | 0.09 hr = 5.7 min | 56.0 ms | 0.019 hr = 68 s |\n", + "| 4 | 249 ms | 0.08 hr = 5.0 min | 53.7 ms | 0.018 hr = 65 s |\n", + "| **Average:** | **411 ms** | **0.14 hr = 8.4 min** | **52.8 ms** | **0.018 hr = 64 s** |\n", + "\n", + "### Case 2 – Streaming the data – using `xr.open_dataset(earthaccess.open([results[0]])[0])`\n", + "\n", + "| Run | Wall time | Wall time extrapolated to all granules | CPU time | CPU time extrapolated to all granules |\n", + "| :- | :--------- | :- | :- | :- |\n", + "| 1 | 17.5 s | 5.9 hr | 6.11 s | 2.1 hr |\n", + "| 2 | 12.3 s | 4.2 hr | 5.16 s | 1.7 hr |\n", + "| 3 | 12.4 s | 4.2 hr | 5.22 s | 1.8 hr |\n", + "| 4 | 12.5 s | 4.2 hr | 5.08 s | 1.7 hr |\n", + "| **Average:** | **13.7 s** | **4.6 hr** | **5.39 s** | **1.8 hr** | \n", + "\n", + "\n", + "### Case 3 – Opening as virtual dataset – using `earthaccess.open_virtual_mfdataset()`\n", + "\n", + "with\n", + "```python\n", + "open_options = {\n", + " \"access\": \"direct\",\n", + " \"load\": True,\n", + " \"concat_dim\": \"time\",\n", + " \"coords\": \"minimal\",\n", + " \"compat\": \"override\",\n", + " \"join\": \"override\",\n", + " \"combine_attrs\": \"override\",\n", + " \"parallel\": True,\n", + "}\n", + "```\n", + "Note that these times represent working with the \"root\" group of the netCDF file.\n", + "\n", + "| Run | Wall time all granules | Wall time estimate for one granule | CPU time all granules |\n", + "| :- | :--------- | :- | :- |\n", + "| 1 | 245 s | 50 ms | 85 s |\n", + "| 2 | 238 s | 49 ms | 84 s |\n", + "| 3 | 235 s | 48 ms | 85 s |\n", + "| 4 | 231 s | 47 ms | 84 s |\n", + "| **Average:** | **237 s** | **48.5 ms** | **84.5 s** |\n", + "\n", + "\n", + "### Case 4 – Downloading – using `earthaccess.download(results[0:10], local_path=\"/tmp/\")`\n", + "\n", + "| Run | Wall time 10 granules | Wall time estimate 
for all granules | CPU time 10 granules |\n", + "| :- | :--------- | :- | :- |\n", + "| 1 | 178 s | 24 hr | 19 s |\n", + "| 2 | 186 s | 25 hr | 26 s |\n", + "| 3 | 160 s | 22 hr | 21 s |\n", + "| 4 | 172 s | 23 hr | 32 s |\n", + "| **Average:** | **174 s** | **23.5 hr** | **24.5 s** |" + ] + }, + { + "cell_type": "markdown", + "id": "9ab9d476-5186-4da8-8329-1270e6e925bd", + "metadata": {}, + "source": [ + "# Setup" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "aa4a11af-f182-4021-ad6f-8b198d6bdbde", + "metadata": {}, + "outputs": [], + "source": [ + "import cartopy.crs as ccrs\n", + "import earthaccess\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import xarray as xr\n", + "from dask.diagnostics import ProgressBar\n", + "from matplotlib import rcParams\n", + "\n", + "%config InlineBackend.figure_format = 'jpeg'\n", + "rcParams[\"figure.dpi\"] = (\n", + " 80 # Reduce figure resolution to keep the saved size of this notebook low.\n", + ")\n", + "\n", + "pbar = ProgressBar()\n", + "pbar.register() # Set the ProgressBar to indicate progress for the minutes-long open steps." 
+ ] + }, + { + "cell_type": "markdown", + "id": "e44ec322-40a3-419d-bd5f-6e86bc5c3103", + "metadata": {}, + "source": [ + "#### Methods used for calculating and converting results' timings" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "8bff4406-e01c-4788-81a5-37293d5adcf8", + "metadata": {}, + "outputs": [], + "source": [ + "# Function for creating table at top of notebook.\n", + "def granule_time_in_seconds_to_year_of_granules_time_in_hours(\n", + " input_time: float | list[float],\n", + ") -> np.ndarray:\n", + " return np.asarray(input_time) * 4867 / 60 / 4 / 60" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "590795f0-167c-4549-9f95-db441ce87e39", + "metadata": {}, + "outputs": [], + "source": [ + "# np.mean(granule_time_in_seconds_to_year_of_granules_time_in_hours([6.11, 5.16, 5.22, 5.08]))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53ce8bc9-b02e-44ef-a151-b036f4020347", + "metadata": {}, + "outputs": [], + "source": [ + "# np.mean([6.11, 5.16, 5.22, 5.08])" + ] + }, + { + "cell_type": "markdown", + "id": "1c5f134f-f9d0-4427-9561-b5e7604748d9", + "metadata": {}, + "source": [ + "## Login using the Earthdata Login" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "198318d6-2850-450e-9df0-dd0895ee985d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.14.0\n" + ] + } + ], + "source": [ + "auth = earthaccess.login() # earthaccess.system.UAT)\n", + "\n", + "if not auth.authenticated:\n", + " auth.login(\n", + " strategy=\"interactive\", persist=True\n", + " ) # ask for credentials and persist them in a .netrc file\n", + "\n", + "print(earthaccess.__version__)" + ] + }, + { + "cell_type": "markdown", + "id": "e961e6f5-25fb-4770-94c8-2f6592225d0b", + "metadata": {}, + "source": [ + "# TEMPO $NO_2$ Level-3 Data Tests" + ] + }, + { + "cell_type": "markdown", + "id": "86a3e22a-3af7-4584-9517-7993a8bad9c0", + "metadata": {}, + 
"source": [ + "## Search for data granules" + ] + }, + { + "cell_type": "markdown", + "id": "4b70e220-893c-4451-bfaa-c01e14e0b577", + "metadata": {}, + "source": [ + "We search for TEMPO Nitrogen Dioxide ($NO_2$) data for a year-long period (note: times are in UTC) betwee January, 2024 and 2025." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "bd50079f-6b8b-424d-8a12-36416f4d69ff", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of results: 4990\n", + "CPU times: user 821 ms, sys: 118 ms, total: 939 ms\n", + "Wall time: 14.8 s\n" + ] + } + ], + "source": [ + "%%time\n", + "results = earthaccess.search_data(\n", + " short_name=\"TEMPO_NO2_L3\",\n", + " version=\"V03\",\n", + " temporal=(\"2024-01-11 12:00\", \"2025-01-11 12:00\"),\n", + ")\n", + "print(f\"Number of results: {len(results)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "040b0a55-45e8-41c3-a36f-94b7e40f7866", + "metadata": {}, + "source": [ + "## Opening a Single Granule" + ] + }, + { + "cell_type": "markdown", + "id": "2175776e-cca3-4ddd-8798-e2d38d3d5174", + "metadata": {}, + "source": [ + "### Case 1" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "595a1f12-7833-4b31-bc8d-c433f9d4b3bc", + "metadata": {}, + "outputs": [], + "source": [ + "open_options = {\"access\": \"direct\", \"load\": True}" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "5c4b0ef4-627d-4ba5-a9b2-1a03a6c5d814", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 33.2 ms, sys: 221 μs, total: 33.4 ms\n", + "Wall time: 219 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "first_result_root = earthaccess.open_virtual_dataset(results[0], **open_options)" + ] + }, + { + "cell_type": "markdown", + "id": "ceeb1e9b-b0ba-46ea-a14e-0dd45f7419cd", + "metadata": {}, + "source": [ + "CPU times: user 46.5 ms, sys: 0 ns, total: 46.5 ms\n", + "Wall time: 235 
ms\n", + "\n", + "CPU times: user 53.3 ms, sys: 1.58 ms, total: 54.9 ms\n", + "Wall time: 881 ms\n", + "\n", + "CPU times: user 56 ms, sys: 0 ns, total: 56 ms\n", + "Wall time: 281 ms\n", + "\n", + "CPU times: user 53.7 ms, sys: 0 ns, total: 53.7 ms\n", + "Wall time: 249 ms" + ] + }, + { + "cell_type": "markdown", + "id": "d84190fc-99ac-4cab-9547-f26303694737", + "metadata": {}, + "source": [ + "### Case 2" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "91238e2c-433d-494c-a45c-20d3b2a65569", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "0ede255e57ef428da0fc242ce3ac74d8", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "QUEUEING TASKS | : 0%| | 0/1 [00:00