PySpark Cookbook

A collection of useful, copy-pasteable, standalone PySpark code snippets, each with its corresponding output, explaining the behavior of commonly used functions.

The source code of snippets is rendered as HTML and hosted at http://isabekov.github.io/pyspark-cookbook/.

Development Environment

Emacs with org-mode is used as the development environment. Compared to Jupyter notebooks, the source code is easier to keep in a version control system since it is just plain text.

Example screenshot: ./screenshots/example.png

Development dependencies

| Software | Version | Comment |
|---|---|---|
| Emacs | 29.2 | main development environment |
| Python | 3.11.6 | works with pyspark >= 3.4.0 (see discussion) |
| python-pyspark | 3.4.0 | Python API for Spark (large-scale data processing library) |
| python-py4j | 0.10.9.7 | enables Python programs to dynamically access Java (dependency of PySpark) |
| python-pandas | 2.0.2 | Python data analysis library |
| python-pyarrow | 15.0.0 | bindings to Apache Arrow (dependency of PySpark) |
| python-tabulate | 0.9.0 | needed to convert dataframes into org-table format |
| Java Runtime Environment | 17.0.10 | newer versions do not work with PySpark 3.4.0 |
| PYNT (Emacs package) | 20180710.726 | interactive kernel for Python in Emacs (see repository for installation instructions) |
| ein (Emacs IPython Notebook) | ab10680a | a Jupyter client (newer versions >2018-10-31 do not work!) |
| emacs-epc (Emacs RPC stack) | 20140610.534 | an asynchronous RPC stack for Emacs |
| org-export | 64ac299 | command line tool needed for HTML export, requires Emacs (see repository) |
| GNU readline | 8.2.13 | library needed for correct invocation of Python in Emacs on MacOS |

Install Python and Python Packages

Depending on the operating system, install Python and the packages py4j, pyspark, pandas, pyarrow, and tabulate using the corresponding package manager and pip.
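After installation, it can help to confirm that the installed versions match the ones listed in the dependency table. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `compare_versions` are made up for this illustration and are not part of the cookbook.

```python
from importlib import metadata

# Versions copied from the "Development dependencies" table.
EXPECTED = {
    "pyspark": "3.4.0",
    "py4j": "0.10.9.7",
    "pandas": "2.0.2",
    "pyarrow": "15.0.0",
    "tabulate": "0.9.0",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in compare_versions(EXPECTED).items():
        status = "ok" if ok else "differs"
        print(f"{name:10} expected {wanted:10} found {(found or 'missing'):10} {status}")
```

A mismatch is not necessarily a problem: the table records the versions the recipes were developed with, and the check only reports where the local setup differs.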

MacOS-specific Installation

Install GNU readline:

$ pip install gnureadline

Replace libedit with readline:

$ python -m override_readline

Details can be found here.

PYNT Installation

Install the codebook module with the pip package manager:

$ pip install git+https://github.com/ebanner/pynt

On Arch Linux, pip refuses to install packages system-wide by default, so pass an extra argument:

$ pip install --break-system-packages git+https://github.com/ebanner/pynt

Open Emacs. Install pynt in Emacs through MELPA.

M-x package-install RET pynt

where RET is just the “Enter” key.

Install Emacs Packages

The MELPA and ELPA repositories should already be added to Emacs’s package management configuration. To see the available packages:

M-x list-packages RET

Search for the package epc using Ctrl-s and click on Install. Repeat for the package ein.

Java Runtime Installation

PySpark Cookbook’s recipes were tested in Emacs using Java Runtime Environment 17.0.10. Set it as the default:

$ export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
$ sudo ln -s /usr/lib/jvm/java-17-openjdk /usr/lib/jvm/default

Newer versions of Java are not compatible with PySpark v3.4.0.
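To see which Java version is actually picked up, the output of `java -version` can be parsed. The sketch below is only an illustration under a few assumptions: the function names are invented here, and the regular expression covers the common `version "17.0.10"` and legacy `version "1.8.0_392"` output styles but may miss vendor-specific formats.

```python
import re
import shutil
import subprocess

TESTED_MAJOR = 17  # the Java version the recipes were tested with


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as "1.8", "1.7", ...
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def detect_java_major():
    """Run `java -version` if a java binary is on PATH; return None otherwise."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # `java -version` normally writes to stderr, so both streams are inspected.
    return java_major_version(result.stderr + result.stdout)


if __name__ == "__main__":
    major = detect_java_major()
    if major is None:
        print("Could not determine the Java version")
    elif major > TESTED_MAJOR:
        print(f"Java {major} is newer than the tested version {TESTED_MAJOR}")
    else:
        print(f"Java {major} detected")
```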

Install org-export

$ git clone https://github.com/nhoffman/org-export.git
$ cd org-export
$ sudo install -D -m 755 org-export* /usr/local/bin

Export to HTML

To produce the HTML page with PySpark code snippets, run:

$ make index.html

To render examples of converting PySpark tables from the pretty format to the orgtbl format (see the tabulate package documentation for a description of the formats), run:

$ make test_ps2org.html
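The idea behind the conversion can be sketched in a few lines. The function below is a hypothetical illustration written with the standard library only; it is not the cookbook's own `pretty2orgtbl` implementation, and it assumes that cell values contain no `|` characters and that every row has the same number of cells.

```python
def pretty_to_orgtbl(pretty):
    """Convert text in Spark's pretty table format into an org-mode table."""
    rows = []
    for line in pretty.strip().splitlines():
        line = line.strip()
        if line.startswith("+"):      # border lines such as +---+-----+
            continue
        if not line.startswith("|"):  # anything that is not a table row
            continue
        # Drop the outer pipes, split into cells, and trim the padding.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return ""
    # Column width is the longest cell in that column.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]

    def fmt(row):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    # Org-mode separates the header from the body with a |---+---| rule.
    rule = "|" + "+".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(rows[0]), rule] + [fmt(r) for r in rows[1:]])


if __name__ == "__main__":
    sample = """
+---+-----+
| id| name|
+---+-----+
|  1|alice|
|  2|  bob|
+---+-----+
"""
    print(pretty_to_orgtbl(sample))
```

The first data line is treated as the header row, which matches how the pretty format prints column names above the values.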

Execution of Code Blocks in org-mode

Navigate to any snippet outside the “Functions” chapter (which only provides service functions for post-processing the output). Make sure that the cursor is inside a Python code block:

#+begin_src python :post pretty2orgtbl(data=*this*)
  ...
#+end_src

Press C-c C-c (i.e. Ctrl-c twice). Emacs will execute the source code block inside a Python session and display the output.

Troubleshooting

Execution of code blocks via a Jupyter kernel is possible only when the prerequisite packages are installed in the specified versions. The ein and epc packages must have exactly the versions listed in the table.

To fix the following error during evaluation of code blocks:

ModuleNotFoundError: No module named 'notebook.services'

Find the offending import in the PyNT installation:

$ grep -i kernelmanager /usr/lib/python3.11/site-packages/codebook/manager.py
from notebook.services.kernels.kernelmanager import MappingKernelManager

Change that line in manager.py to

from jupyter_server.services.kernels.kernelmanager import MappingKernelManager
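If the edit has to be repeated (for example after reinstalling PyNT), it can be scripted. This is a rough sketch, assuming the old line imports from `notebook.services` as the error message suggests; the helper names are invented for illustration, and the path to manager.py depends on the local Python installation.

```python
from pathlib import Path

OLD_IMPORT = "from notebook.services.kernels.kernelmanager import MappingKernelManager"
NEW_IMPORT = "from jupyter_server.services.kernels.kernelmanager import MappingKernelManager"


def patch_import(source):
    """Return `source` with the old kernel-manager import replaced."""
    return source.replace(OLD_IMPORT, NEW_IMPORT)


def patch_file(path):
    """Rewrite the file at `path` in place; return True if it changed."""
    target = Path(path)
    original = target.read_text()
    patched = patch_import(original)
    if patched == original:
        return False
    target.write_text(patched)
    return True
```

Calling `patch_file` on the manager.py path reported by grep applies the change; writing to a system-wide site-packages directory usually requires elevated permissions.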

If you encounter the error

:nowait is incompatible with :server

then locate the installed Emacs EPC package and grep for the string :nowait:

$ grep "nowait" ~/.emacs.d/elpa/epc-20140610.534/epcs.el
       :family 'ipv4 :server t :nowait t

The line is defined in epcs.el. The solution is to change it to:

:family 'ipv4 :server t :nowait nil

If you encounter the error

Server may raise an error. Use "M-x epc:pop-to-last-server-process-buffer RET" to see the full traceback:

/usr/lib/python3.11/site-packages/codebook/manager.py:41: SyntaxWarning: invalid escape sequence '\d'
p = '.*kernel-(?P<kid>\d+).json'

prefix the string literal in that line with the letter “r” (raw string):

p = r'.*kernel-(?P<kid>\d+).json'
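The effect of the raw-string prefix can be checked in isolation. The snippet below reuses the pattern from manager.py; the function name `kernel_id` and the example file path are made up for this illustration.

```python
import re

# Raw string: the backslash in \d reaches the regex engine unchanged,
# so Python has no escape sequence to complain about.
KERNEL_FILE = re.compile(r'.*kernel-(?P<kid>\d+).json')


def kernel_id(path):
    """Return the kernel id embedded in a connection file name, or None."""
    match = KERNEL_FILE.match(path)
    return match.group("kid") if match else None


if __name__ == "__main__":
    # Hypothetical path, used only to exercise the pattern.
    print(kernel_id("/run/user/1000/jupyter/kernel-12345.json"))
```

The raw string and the explicitly escaped form `'.*kernel-(?P<kid>\\d+).json'` denote the same pattern; the raw form is simply easier to read.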
