A collection of useful, copy-pasteable, standalone PySpark code snippets with corresponding output that explains the behavior of commonly used functions.
The source code of the snippets is rendered as HTML and hosted at http://isabekov.github.io/pyspark-cookbook/.
Emacs with org-mode is used as the development environment. Compared to Jupyter notebooks, the source code is easier to keep in a version control system since it is plain text.
Software | Version | Comment |
---|---|---|
Emacs | 29.2 | main development environment |
Python | 3.11.6 | works with pyspark >= 3.4.0 (see discussion) |
python-pyspark | 3.4.0 | Python API for Spark (large-scale data processing library) |
python-py4j | 0.10.9.7 | enables Python programs to dynamically access Java (dependency of PySpark) |
python-pandas | 2.0.2 | Python data analysis library |
python-pyarrow | 15.0.0 | bindings to Apache Arrow (dependency of PySpark) |
python-tabulate | 0.9.0 | needed to convert dataframes into org-table format |
Java Runtime Environment | 17.0.10 | newer versions do not work with PySpark 3.4.0 |
PYNT (Emacs package) | 20180710.726 | interactive kernel for Python in Emacs; for installation instructions, see repository |
ein (Emacs IPython Notebook) | ab10680a | a Jupyter client (newer versions >2018-10-31 do not work!) |
emacs-epc (Emacs RPC stack) | 20140610.534 | an asynchronous RPC stack for Emacs |
org-export | 64ac299 | command line tool needed for HTML export, requires Emacs (see repository) |
GNU readline | 8.2.13 | library needed for correct invocation of Python in Emacs on macOS |
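To compare an existing environment against the Python rows of the table, the installed package metadata can be queried. The following is a minimal sketch, not part of the cookbook: the `EXPECTED` mapping and the helper name `report_versions` are assumptions that restate the table using PyPI distribution names.

```python
from importlib import metadata

# Versions restated from the table above, keyed by PyPI distribution name
# (assumption: the PyPI names match the package names without "python-").
EXPECTED = {
    "pyspark": "3.4.0",
    "py4j": "0.10.9.7",
    "pandas": "2.0.2",
    "pyarrow": "15.0.0",
    "tabulate": "0.9.0",
}


def report_versions(expected=EXPECTED, lookup=metadata.version):
    """Return {package: (expected, installed or None)} for every entry."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in report_versions().items():
        status = "ok" if installed == wanted else "differs"
        print(f"{name:10s} expected {wanted:10s} installed {installed} ({status})")
```

A result of "differs" is not necessarily a problem; the table only records the versions the recipes were tested with.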
Depending on the operating system, install Python and the packages py4j, pyspark, pandas, pyarrow, tabulate using the corresponding package manager and pip.
Install GNU readline:
$ pip install gnureadline
Replace libedit with readline:
$ python -m override_readline
Details can be found here.
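To check which library backs the `readline` module of a given interpreter, its docstring can be inspected, since libedit-based builds mention "libedit" there. This is a minimal sketch; the helper name `readline_backend` is an assumption and not part of the cookbook.

```python
def readline_backend(doc):
    """Classify a readline module docstring as 'libedit' or 'GNU readline'."""
    if not doc:
        return "unknown"  # docstrings can be stripped, e.g. with python -OO
    return "libedit" if "libedit" in doc else "GNU readline"


if __name__ == "__main__":
    try:
        import readline
    except ImportError:
        print("no readline module available")
    else:
        print(readline_backend(readline.__doc__))
```

If the script still reports libedit after the override step, the override was probably applied to a different Python installation than the one Emacs invokes.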
Install the codebook module with the pip package manager:
$ pip install git+https://github.com/ebanner/pynt
On Arch Linux, pip refuses to install into the system-wide environment by default, so pass an extra argument:
$ pip install --break-system-packages git+https://github.com/ebanner/pynt
Open Emacs. Install pynt in Emacs through MELPA:
M-x package-install RET pynt
where RET is just the “Enter” key.
The MELPA and ELPA repositories should already be added to Emacs's package management configuration. To see the available packages:
M-x list-packages RET
Search for the package epc using Ctrl-s and click on Install. Repeat for the package ein.
PySpark Cookbook's recipes were tested in the Emacs IDE with Java Runtime Environment 17.0.10. Set it as the default:
$ export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
$ sudo ln -s /usr/lib/jvm/java-17-openjdk /usr/lib/jvm/default
Newer versions of Java are not compatible with PySpark v3.4.0.
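Before starting a Spark session, it can help to confirm which Java version is actually picked up from `PATH`. The following is a minimal sketch, assuming the usual `java -version` banner format; the helper name `java_major` is hypothetical.

```python
import re
import subprocess


def java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and older report themselves as 1.x (for example "1.8.0_292").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


if __name__ == "__main__":
    try:
        # `java -version` writes its banner to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        print("java executable not found on PATH")
    else:
        major = java_major(result.stderr)
        print("Java major version:", major, "(the recipes were tested with 17)")
```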
Install org-export, the command line tool needed for HTML export:
$ git clone https://github.com/nhoffman/org-export.git
$ cd org-export
$ sudo install -D -m 755 org-export* /usr/local/bin
To produce the HTML page with the PySpark code snippets, run:
$ make index.html
To render examples of converting PySpark tables displayed in pretty format to orgtbl format (see the tabulate package, which describes the formats), run:
$ make test_ps2org.html
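The idea behind the conversion can be illustrated in a few lines of plain Python. This is a minimal sketch and not the cookbook's own `pretty2orgtbl` implementation: the function name `pretty_to_orgtbl` and the sample table are hypothetical. It assumes the only difference that matters is that rule lines in the pretty format start and end with `+`, while org-mode expects `|` there.

```python
def pretty_to_orgtbl(text):
    """Convert Spark `show()` style output into an org-mode table."""
    converted = []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("+") and line.endswith("+"):
            # Rule line: org-mode uses '|' at both ends and keeps inner '+'.
            line = "|" + line[1:-1] + "|"
        converted.append(line)
    return "\n".join(converted)


# Hypothetical sample in the pretty format.
sample = """
+---+-----+
| id| name|
+---+-----+
|  1|alice|
+---+-----+
"""
print(pretty_to_orgtbl(sample))
```

Data rows pass through unchanged, so org-mode can realign the columns itself when the table is edited.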
Navigate to any snippet outside the “Functions” chapter (which only provides service functions for post-processing the output). Make sure that the cursor is inside a Python code block:
#+begin_src python :post pretty2orgtbl(data=*this*)
...
#+end_src
Press C-c C-c (i.e. Ctrl-c twice). Emacs will execute the source code block inside a Python session and display the output.
Executing code blocks via a Jupyter kernel is possible only when the prerequisite packages are installed in the specified versions. The packages ein and epc must have exactly the versions listed in the table.
To fix the following error during evaluation of code blocks:
ModuleNotFoundError: No module named 'notebook.services'
Find the installation of PyNT:
$ grep -i kernelmanager /usr/lib/python3.11/site-packages/codebook/manager.py
from notebook.services.kernels.kernelmanager import MappingKernelManager
Change that line in manager.py to:
from jupyter_server.services.kernels.kernelmanager import MappingKernelManager
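The edit can also be scripted. The following is a minimal sketch, assuming the failing line imports from `notebook.services`, as the error message suggests; the helper names `patch_import` and `patch_file` are hypothetical, and the path to `manager.py` has to be supplied by the caller.

```python
from pathlib import Path

# Assumption: OLD is the import that raises ModuleNotFoundError.
OLD = "from notebook.services.kernels.kernelmanager import MappingKernelManager"
NEW = "from jupyter_server.services.kernels.kernelmanager import MappingKernelManager"


def patch_import(source):
    """Return source with the old import replaced; unchanged if absent."""
    return source.replace(OLD, NEW)


def patch_file(path):
    """Rewrite the file in place and report whether anything changed."""
    path = Path(path)
    original = path.read_text()
    patched = patch_import(original)
    if patched != original:
        path.write_text(patched)
    return patched != original
```

Calling `patch_file` on a file under `/usr/lib` requires write permission there, and a package upgrade may overwrite the change.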
If you encounter the error
:nowait is incompatible with :server
then search for the installation of the Emacs EPC package and grep for the string :nowait:
$ grep "nowait" ~/.emacs.d/elpa/epc-20140610.534/epcs.el
:family 'ipv4 :server t :nowait t
It is defined in epcs.el. The solution is to change this line to:
:family 'ipv4 :server t :nowait nil
If you encounter the error
Server may raise an error. Use "M-x epc:pop-to-last-server-process-buffer RET" to see the full traceback:
/usr/lib/python3.11/site-packages/codebook/manager.py:41: SyntaxWarning: invalid escape sequence '\d'
p = '.*kernel-(?P<kid>\d+).json'
prefix the string literal in that line with the letter “r” (raw string):
p = r'.*kernel-(?P<kid>\d+).json'
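The raw-string prefix matters because `\d` is not a recognized escape sequence in an ordinary Python string literal, whereas in a raw string the backslash is passed to the regex engine unchanged. The short sketch below shows the corrected pattern in use; the file path is a hypothetical example.

```python
import re

# Raw string: the backslash in \d reaches the regex engine unchanged.
pattern = re.compile(r'.*kernel-(?P<kid>\d+).json')

# Hypothetical kernel connection file path, used only for illustration.
match = pattern.match('/run/user/1000/jupyter/kernel-12345.json')
kernel_id = match.group('kid')
print(kernel_id)  # prints 12345
```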