40 changes: 24 additions & 16 deletions core/overview.md
As you might know by now, Python is a programming language.
Most geoscience data analysis involves working with numerical arrays.
The default library for dealing with numerical arrays in Python is [NumPy](http://www.numpy.org/).
It has some built-in functions for calculating very simple statistics
(e.g., maximum, mean, standard deviation),
but for more complex analysis
(e.g., interpolation, integration, linear algebra)
the [SciPy](https://scipy.org) library is the default.
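As a rough sketch of that division of labour (the temperature values and time points here are made up for illustration):

```python
import numpy as np
from scipy import interpolate

# Simple statistics are built into NumPy
temps = np.array([15.0, 17.5, 23.0, 20.0, 16.0])  # hypothetical 6-hourly temperatures
print(temps.max(), temps.mean(), temps.std())

# More complex analysis (e.g., interpolation) comes from SciPy
hours = np.array([0, 6, 12, 18, 24])
interp = interpolate.interp1d(hours, temps)  # linear interpolation by default
temp_at_9 = float(interp(9.0))  # estimate the temperature at hour 9
```
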
If you’re dealing with particularly large arrays,
[Dask](https://dask.org/) works with the existing Python ecosystem
is useful for dealing with non-standard calendars.
When it comes to data visualization,
the default library is [Matplotlib](https://matplotlib.org/).
As you can see at the [Matplotlib gallery](https://matplotlib.org/stable/gallery/index.html),
this library is great for any simple (e.g., bar charts, contour plots, line graphs),
static (e.g., .png, .eps, .pdf) plots.
The [Cartopy](https://scitools.org.uk/cartopy/docs/latest/) library
provides additional plotting functionality for common geographic map projections.
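A minimal example of the kind of simple, static plot Matplotlib handles well (the data and output filename are invented; Cartopy would be layered on top of code like this when a map projection is needed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.set_title("A simple line graph")
ax.legend()
fig.savefig("line.png")  # .png, .eps and .pdf are all supported output formats
```
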

These high-level libraries aren’t as flexible
but they can do common tasks with far less effort.

The most popular high-level data science library is undoubtedly [Pandas](http://pandas.pydata.org/).
The key advance offered by Pandas is the concept of the DataFrame, a two-dimensional labeled array:

![Pandas DataFrame schematic](https://github.com/pandas-dev/pandas/blob/c8e8651a1b9d45eb45a3f6cf7fad4c0152cc84bb/doc/source/_static/schemas/01_table_dataframe.svg)

Rather than referring to the individual elements of a data array using a numeric index
(as is required with NumPy),
the actual row and column headings can be used.
That means information from the cardiac ward on 3 July 2005
could be obtained from a medical dataset by asking for `data['cardiac'].loc['2005-07-03']`,
rather than having to remember the numeric index corresponding to that ward and date.
This labeled array feature,
combined with a number of other features that streamline common statistical (e.g., computing averages) and plotting tasks
traditionally performed with SciPy, datetime and Matplotlib,
greatly simplifies the code development process (read: fewer lines of code).
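The medical example above can be sketched as follows (the ward names come from the text; the counts and extra column are made up):

```python
import pandas as pd

# Hypothetical medical dataset: wards as columns, dates as the row index
data = pd.DataFrame(
    {"cardiac": [12, 15, 11], "maternity": [8, 9, 7]},
    index=pd.to_datetime(["2005-07-02", "2005-07-03", "2005-07-04"]),
)

# Label-based access, rather than remembering numeric indices
cardiac_3_july = data["cardiac"].loc["2005-07-03"]

# Built-in statistics, e.g. the average for each ward
ward_means = data.mean()
```
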

One of the limitations of Pandas
is that it’s only able to handle one- or two-dimensional (i.e., tabular) data arrays.
The [Xarray](http://xarray.pydata.org/) library was therefore created
to extend the labeled array concept to N-dimensional arrays:

![Xarray Dataset schematic](https://github.com/pydata/xarray/blob/5f670a74392e3b625dba283c75c6dc2a43be808b/doc/_static/dataset-diagram.png)

Not all of the Pandas functionality is available
(which is a trade-off associated with being able to handle multi-dimensional arrays),
but the ability to refer to array elements by their actual latitude (e.g., 20 South),
longitude (e.g., 50 East), height (e.g., 500 hPa) and time (e.g., 2015-04-27), for example,
makes the Xarray data array far easier to deal with than the NumPy array.
Like Pandas, Xarray has many built-in functions to perform common tasks,
such as rolling averages, computing a max/min along a given dimension, and resampling data.
As an added bonus,
Xarray also has built-in functionality for reading/writing specific geoscience file formats
(e.g., netCDF, GRIB)
and incorporates Dask under the hood to make dealing with large arrays easier.
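A minimal sketch of coordinate-based selection and a named-dimension reduction, using the latitude/longitude/time values from the text and fabricated temperature data:

```python
import numpy as np
import xarray as xr

# Hypothetical 3-D temperature field with labeled dimensions and coordinates
temps = xr.DataArray(
    np.arange(24, dtype=float).reshape(2, 3, 4),
    dims=["time", "lat", "lon"],
    coords={
        "time": np.array(["2015-04-27", "2015-04-28"], dtype="datetime64[ns]"),
        "lat": [-20.0, 0.0, 20.0],   # 20 South is lat=-20
        "lon": [30.0, 40.0, 50.0, 60.0],
    },
    name="temperature",
)

# Select by actual coordinate values rather than numeric indices
point = temps.sel(time="2015-04-27", lat=-20.0, lon=50.0)

# Built-in reductions along a named dimension
zonal_max = temps.max(dim="lon")
```
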

You will occasionally find yourself needing to use a core library directly
So far we’ve considered libraries that do general,
broad-scale tasks like data input/output, common statistics, visualization, etc.
Given their large user base,
these libraries are usually written and supported by large companies/institutions
(e.g., the [Met Office](wiki:Met_Office) supports Cartopy)
or the wider PyData community (e.g., NumPy, Pandas, Xarray).
Within each sub-discipline of the geosciences,
individuals and research groups take these general libraries
and apply them to their very specific data analysis tasks.
Increasingly, these individuals and groups
are formally packaging and releasing their code for use within their community.
For instance, Andrew Dawson (an atmospheric scientist at Oxford)
does a lot of empirical orthogonal function (EOF) analysis and manipulation of wind data,
so he has released his [eofs](https://ajdawson.github.io/eofs/)
and [windspharm](https://ajdawson.github.io/windspharm/) libraries
(which are able to handle data arrays from NumPy or Xarray).
have released their Python ARM Radar Toolkit ([Py-ART](http://arm-doe.github.io/
for analysing weather radar data.

A great place to start learning about use-cases for domain-specific libraries across the geosciences is the [Pythia Cookbook Gallery](https://cookbooks.projectpythia.org). Also check out the [Pythia Resource Gallery](https://projectpythia.org/resource-gallery) and try filtering by domain. The [Python for Atmosphere and Ocean Science (PyAOS) package index](https://pyaos.github.io/packages/)
attempts to keep track of the domain-specific libraries in these subfields.


## Tutorials

- [NumPy](numpy.md): Core package for array computing, the workhorse of the Scientific Python stack
- [Matplotlib](matplotlib.md): Basic plotting, including line, scatter, and contour plots
- [Cartopy](cartopy.md): Plotting on map projections
- [Datetime](datetime.md): Dealing with time and calendar data
- [Pandas](pandas.md): Working with labeled tabular data
- [Data formats](data-formats.md): Working with netCDF data, a common geoscience data format
- [Xarray](xarray.md): Working with gridded and labeled N-dimensional data