webhdfsmagic is a Python package that provides IPython magic commands to interact with HDFS via WebHDFS/Knox Gateway directly from your Jupyter notebooks.
Simplify your HDFS interactions in Jupyter:
Before (with PyWebHdfsClient):
from pywebhdfs.webhdfs import PyWebHdfsClient
hdfs = PyWebHdfsClient(host='...', port='...', user_name='...', ...)
data = hdfs.read_file('/data/file.csv')
df = pd.read_csv(BytesIO(data))

Now (with webhdfsmagic):
%hdfs get /data/file.csv .
df = pd.read_csv('file.csv')

93% less code!
Complete workflow demo: mkdir → put → ls → cat → get → chmod → rm
| Command | Description |
|---|---|
| `%hdfs ls [path]` | List files and directories (returns a pandas DataFrame) |
| `%hdfs du <path> [-s] [-h]` | Disk usage: real recursive sizes via GETCONTENTSUMMARY. `-s`: summary of the path itself · `-h`: human-readable (KB/MB/GB) |
| `%hdfs stat <path>` | File/directory metadata as a single-row DataFrame (GETFILESTATUS) |
| `%hdfs mv <src> <dst>` | Rename or move a file/directory server-side (RENAME, no data copy) |
| `%hdfs mkdir <path>` | Create a directory (parents created automatically) |
| `%hdfs put <local> <hdfs>` | Upload one or more files (supports wildcards like `*.csv`; `-t, --threads <N>` for parallel uploads) |
| `%hdfs get <hdfs> <local>` | Download files (supports wildcards and `~` for the home directory; `-t, --threads <N>` for parallel downloads) |
| `%hdfs cat <file> [-n lines] [--format type] [--raw]` | Display file content with smart formatting for CSV/Parquet |
| `%hdfs rm [-r] <path>` | Delete files/directories (`-r` for recursive; supports wildcards) |
| `%hdfs chmod [-R] <mode> <path>` | Change permissions (`-R` for recursive) |
| `%hdfs chown [-R] <user:group> <path>` | Change owner (`-R` for recursive; requires superuser) |
pip install webhdfsmagic

Install from source:
git clone https://github.com/ab2dridi/webhdfsmagic.git
cd webhdfsmagic
pip install -e .
# Enable autoload (creates startup script)
jupyter-webhdfsmagic

After installation, enable autoload to have webhdfsmagic load automatically in all Jupyter sessions:

jupyter-webhdfsmagic

This creates ~/.ipython/profile_default/startup/00-webhdfsmagic.py so the extension loads automatically.
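What the autoload hook does can be sketched in a few lines: it drops a startup file into the IPython profile that loads the extension on every session. The helper below is an illustration of that behavior (the function name `install_startup_script` is hypothetical, not the package's actual API):

```python
import os

def install_startup_script(ipython_dir: str) -> str:
    """Write a startup file that loads webhdfsmagic in every IPython session."""
    startup_dir = os.path.join(ipython_dir, "profile_default", "startup")
    os.makedirs(startup_dir, exist_ok=True)
    path = os.path.join(startup_dir, "00-webhdfsmagic.py")
    with open(path, "w") as f:
        f.write(
            "# Auto-generated: load webhdfsmagic automatically\n"
            "try:\n"
            "    get_ipython().run_line_magic('load_ext', 'webhdfsmagic')\n"
            "except Exception:\n"
            "    pass  # extension not installed in this environment\n"
        )
    return path
```

Files named `00-*` sort first, so the extension is available before any other startup code runs.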
Alternative: Load manually in each notebook:
%load_ext webhdfsmagic

Create ~/.webhdfsmagic/config.json:
{
"knox_url": "https://hostname:port/gateway/default",
"webhdfs_api": "/webhdfs/v1",
"username": "your_username",
"password": "your_password",
"verify_ssl": false
}

SSL Options:
- `"verify_ssl": false` disables SSL verification (development only)
- `"verify_ssl": true` uses system certificates
- `"verify_ssl": "/path/to/cert.pem"` uses a custom certificate (supports `~`)
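These three forms map naturally onto the `verify` argument accepted by HTTP clients such as requests. A small sketch of the normalization step, assuming the `~` expansion described above (`resolve_verify` is an illustrative name):

```python
import os

def resolve_verify(value):
    """Map a config verify_ssl value to a requests-style verify argument.

    False disables verification, True uses system certificates,
    and a string is treated as a CA bundle path with ~ expanded.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return os.path.expanduser(value)
    raise TypeError(f"unsupported verify_ssl value: {value!r}")
```

The resulting value can be passed straight to `requests.get(url, verify=...)`.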
Configuration Examples:
See examples/config/ for complete configurations (with/without SSL, custom certificate, etc.)
Sparkmagic Fallback:
If ~/.webhdfsmagic/config.json doesn't exist, the package tries ~/.sparkmagic/config.json and extracts configuration from kernel_python_credentials.url.
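The fallback lookup order can be sketched as follows. Only the lookup order comes from the docs; the exact fields copied out of `kernel_python_credentials` and the defaults are assumptions for illustration:

```python
import json
import os

def load_config(home: str) -> dict:
    """Load webhdfsmagic settings, falling back to sparkmagic's config."""
    own = os.path.join(home, ".webhdfsmagic", "config.json")
    if os.path.exists(own):
        with open(own) as f:
            return json.load(f)

    # Fallback: reuse sparkmagic's kernel_python_credentials section
    spark = os.path.join(home, ".sparkmagic", "config.json")
    with open(spark) as f:
        creds = json.load(f)["kernel_python_credentials"]
    return {
        "knox_url": creds["url"],
        "username": creds.get("username", ""),
        "password": creds.get("password", ""),
        "webhdfs_api": "/webhdfs/v1",   # assumed default
        "verify_ssl": False,            # assumed default
    }
```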
All operations are automatically logged to ~/.webhdfsmagic/logs/webhdfsmagic.log for debugging and auditing purposes.
Log Features:
- Automatic rotation (10 MB per file, keeps 5 backups)
- Detailed HTTP request/response logging
- Operation tracing with timestamps
- Error tracking with full stack traces
- Password masking for security
- File-level DEBUG logging
- Console-level WARNING/ERROR logging
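The rotation and masking behaviors above map directly onto the standard logging module. A minimal sketch, assuming a regex-based masking filter (the class and function names are illustrative, not the package's actual implementation):

```python
import logging
import logging.handlers
import re

class PasswordMaskingFilter(logging.Filter):
    """Mask password values before a record is written to the log file."""
    PATTERN = re.compile(r'("password"\s*:\s*")[^"]*(")')

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = self.PATTERN.sub(r"\1***\2", str(record.msg))
        return True

def make_logger(log_path: str) -> logging.Logger:
    logger = logging.getLogger("webhdfsmagic-demo")
    logger.setLevel(logging.DEBUG)
    # 10 MB per file, 5 rotated backups, as described above
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=10 * 1024 * 1024, backupCount=5
    )
    handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s"
    ))
    handler.addFilter(PasswordMaskingFilter())
    logger.addHandler(handler)
    return logger
```

The formatter string matches the log-line layout shown in the Log Format section below.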
View Recent Logs:
# View last 50 lines
tail -50 ~/.webhdfsmagic/logs/webhdfsmagic.log
# Follow logs in real-time
tail -f ~/.webhdfsmagic/logs/webhdfsmagic.log
# Search for errors
grep "ERROR" ~/.webhdfsmagic/logs/webhdfsmagic.log
# View specific operation
grep "hdfs put" ~/.webhdfsmagic/logs/webhdfsmagic.log

Log Format:
2025-12-08 10:30:15 - webhdfsmagic - INFO - [magics.py:145] - >>> Starting operation: hdfs ls
2025-12-08 10:30:15 - webhdfsmagic - DEBUG - [client.py:85] - HTTP Request: GET http://...
2025-12-08 10:30:15 - webhdfsmagic - DEBUG - [client.py:105] - HTTP Response: 200 from http://...
2025-12-08 10:30:15 - webhdfsmagic - INFO - [magics.py:180] - <<< Operation completed: hdfs ls - SUCCESS
# The extension is already loaded automatically!
%hdfs help
# List files
%hdfs ls /data
# Disk usage: list immediate children with their real recursive sizes
%hdfs du /data/users
# Disk usage: summary of the path itself (single call, no children iteration)
%hdfs du -s /data/users
# Human-readable sizes (KB / MB / GB)
%hdfs du -h /data/users
# Combine both: summary + human-readable
%hdfs du -sh /data/users
# File metadata (name, type, size, owner, permissions, modified, ...)
%hdfs stat /data/events.parquet
# Metadata for a directory
%hdfs stat /data/users
# Rename a file
%hdfs mv /data/old_name.csv /data/new_name.csv
# Move a directory to another location
%hdfs mv /data/tmp /data/archive/tmp
# Create a directory
%hdfs mkdir /user/hdfs/output
# Upload multiple CSV files using wildcards
%hdfs put ~/data/*.csv /user/hdfs/input/
# Download a file to home directory
%hdfs get /user/hdfs/results/output.csv ~/downloads/
# Download multiple files with wildcards
%hdfs get /user/hdfs/results/*.csv ./local_results/
# ===== SMART CAT (File Preview) =====
# Display first 50 lines (default grid table format)
%hdfs cat /user/hdfs/data/file.csv -n 50
# Smart CSV formatting with automatic table display
%hdfs cat /user/hdfs/data/sales.csv
# Display Parquet file as table
%hdfs cat /user/hdfs/data/records.parquet -n 20
# Pandas format (classic DataFrame representation)
%hdfs cat /user/hdfs/data/data.csv --format pandas
# Polars format (shows schema + explicit types, 3.7x faster for Parquet!)
%hdfs cat /user/hdfs/data/records.parquet --format polars
> **Warning:** Using `%hdfs cat /user/hdfs/data/records.parquet -n -1 --raw` will try to load the entire file into memory. For very large Parquet files, this can consume a lot of RAM and may crash your notebook. Use this command with caution and confirm you really want to load the full file before running it.
> **Tip:** For large Parquet files, it is highly recommended to use `%hdfs cat file.parquet --format polars` instead of `%hdfs cat file.parquet --raw` for much better performance and readability.
# Raw text display (unformatted original content)
%hdfs cat /user/hdfs/data/file.csv --raw
# ===== File Management =====
# Delete files with wildcards
%hdfs rm /user/hdfs/temp/*.log
# Delete a directory recursively
%hdfs rm -r /user/hdfs/temp
# Change permissions recursively
%hdfs chmod -R 755 /user/hdfs/data
# Change owner recursively (requires superuser privileges)
%hdfs chown -R hdfs:hadoop /user/hdfs/data

Integration with pandas:
# Download and read directly
%hdfs get /data/sales.csv .
df = pd.read_csv('sales.csv')
df.head()

Upload, download, and delete multiple files using shell-style wildcards:
# Upload all CSV files
%hdfs put data/*.csv /hdfs/input/
# Download specific pattern
%hdfs get /hdfs/output/result_*.csv ./downloads/
# Delete log files
%hdfs rm /hdfs/temp/*.log

Automatically format structured files as readable tables:
# CSV files are automatically detected and formatted
%hdfs cat /data/sales.csv
# ┌────────────┬─────────┬────────┐
# │ date       │ product │ amount │
# ├────────────┼─────────┼────────┤
# │ 2025-12-08 │ laptop  │ 1200   │
# │ 2025-12-09 │ phone   │ 800    │
# └────────────┴─────────┴────────┘
# Parquet files work seamlessly
%hdfs cat /data/records.parquet -n 100
# TSV and other delimiters are auto-detected
%hdfs cat /data/data.tsv # Detects tab delimiter
# Force specific format
%hdfs cat /data/file.csv --format pandas # Pandas DataFrame (classic)
%hdfs cat /data/file.csv --format polars # Polars with schema and types
%hdfs cat /data/file.csv --raw # Raw text, no formatting
# Supported formats:
# - CSV (comma, tab, semicolon, pipe - auto-detected)
# - Parquet (uses Polars for 3.7x faster processing)
# - TSV (tab-separated values)

Format Options Explained:
- Default (grid): ASCII table output, well suited for reports
- `--format pandas`: classic pandas display, familiar to data scientists
- `--format polars`: shows the schema with explicit types (str, i64, f64, bool), ideal for data validation
- `--raw`: original file content without any parsing
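Delimiter auto-detection of the kind described above (comma, tab, semicolon, pipe) can be done with the standard library's `csv.Sniffer`. A minimal sketch, not necessarily how the package implements it:

```python
import csv
import io

def detect_and_parse(text: str, candidates: str = ",\t;|") -> list:
    """Sniff the delimiter from a sample of the file, then parse it into rows."""
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
    return list(csv.reader(io.StringIO(text), dialect))
```

Restricting `delimiters` to the known candidates keeps the sniffer from guessing an arbitrary character on unusual data.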
Performance: Parquet files are processed using Polars, providing ultra-fast reads and minimal memory usage (3.7x faster than PyArrow+Pandas).
Memory Protection: By default, the cat command limits downloads to 50 MB to prevent memory saturation. This protection applies when using the -n <lines> option. To read entire large files, use -n -1:
# Safe: Limited to 50 MB download
%hdfs cat /huge_file.csv -n 100
# Full read: No memory limit (use with caution on large files)
%hdfs cat /small_file.csv -n -1

To retrieve large files in full, prefer %hdfs get for better performance.
LISTSTATUS (used by %hdfs ls) always reports size=0 for directories. %hdfs du fixes this by calling GETCONTENTSUMMARY per entry, which returns the actual recursive size.
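A single GETCONTENTSUMMARY payload can be turned into one du row as sketched below, including the -h size formatting. The JSON field names follow the WebHDFS ContentSummary schema; the helper names are illustrative, not the package's actual code:

```python
def human_readable(num_bytes):
    """Format a byte count as B/KB/MB/GB/TB, matching du -h output."""
    if num_bytes is None:
        return None
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024:
            return f"{num_bytes} B" if unit == "B" else f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.1f} TB"

def summary_to_row(name, file_type, summary_json):
    """Build one du row from a WebHDFS GETCONTENTSUMMARY response."""
    s = summary_json["ContentSummary"]
    return {
        "name": name,
        "type": file_type,
        "size": s["length"],
        "space_consumed": s["spaceConsumed"],
        "file_count": s["fileCount"],
        "dir_count": s["directoryCount"],
        "error": None,
    }
```

Iterating this over each LISTSTATUS child (and catching HTTP 401/403 into the `error` column) yields the DataFrames shown in the examples below.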
# Default: iterate over immediate children, show real recursive size for each
%hdfs du /data/users
# Returns a DataFrame:
# name type size space_consumed file_count dir_count error
# alice DIR 1073741824 3221225472 4820 12 None
# bob DIR 536870912 1610612736 2100 5 None
# reports FILE 10485760 31457280 1 0 None
# Summary: single row for the path itself
%hdfs du -s /data/users
# name type size space_consumed file_count dir_count error
# /data/users DIR 1610612736 4831838208 6921 17 None
# Human-readable sizes
%hdfs du -h /data/users
# name type size space_consumed file_count dir_count error
# alice DIR 1.0 GB 3.0 GB 4820 12 None
# bob DIR 512.0 MB 1.5 GB 2100 5 None
# Combine both
%hdfs du -sh /data/users

Graceful permission handling: directories returning HTTP 401/403 are included in the DataFrame with size=None and an error message; the command never crashes mid-iteration:
%hdfs du /sensitive/data
# name type size space_consumed file_count dir_count error
# public DIR ... ... ... ... None
# private DIR None None None None permission denied (HTTP 403)
# readonly DIR ... ... ... ... None

Starting from version 0.0.4, webhdfsmagic supports parallel file transfers using the --threads (or -t) option for both put and get commands.
This allows you to upload or download multiple files simultaneously, greatly speeding up operations on large datasets or many files.
Key features:
- Multi-threaded transfers for PUT and GET
- Syntax: `%hdfs put --threads N <local_files> <hdfs_dir>`
- Syntax: `%hdfs get --threads N <hdfs_files> <local_dir>`
- N = number of threads (e.g. 4, 8, 16)
Examples:
# Parallel upload: PUT multiple files to HDFS using 4 threads
%hdfs put --threads 4 *.csv /demo/data/
# Parallel download: GET multiple files from HDFS using 4 threads
%hdfs get --threads 4 /demo/data/* ./downloads/
# You can also use the short option -t
%hdfs put -t 8 *.tsv /demo/data/

- The --threads / -t option is available for both put and get commands.
- You can specify any number of threads (e.g. 2, 4, 8, 16) depending on your system and network.
- Parallel transfers are especially useful for large datasets or many small files.
- Error handling is robust: if a file fails to transfer, you'll get a clear error message for that file.
- The command syntax is identical to the single-threaded form; just add --threads N or -t N.
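Per-file parallelism of this kind is typically a thread pool that records a success or an error for every file, so one failure never aborts the batch. A sketch of that pattern, not the package's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_transfer(paths, transfer_one, threads=4):
    """Run transfer_one(path) for every path using a pool of worker threads.

    Returns {path: "OK"} on success or {path: "ERROR: ..."} on failure.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(transfer_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                results[path] = "OK"
            except Exception as exc:
                results[path] = f"ERROR: {exc}"
    return results
```

Since WebHDFS transfers are I/O-bound, threads give near-linear speedups for many small files despite Python's GIL.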
See the notebook demo for a full example.
Get detailed help directly in your notebook:
# Show all available commands with descriptions
%hdfs help

This displays comprehensive interactive help with:
- All available commands (ls, du, stat, mv, mkdir, put, get, cat, rm, chmod, chown)
- Options and flags for each command
- Format descriptions for the cat command
- Auto-detection features explanation
Summary of available commands:
| Command | Description |
|---|---|
| `%hdfs help` | Display this help |
| `%hdfs setconfig {...}` | Set configuration (JSON format) |
| `%hdfs ls [path]` | List files and directories |
| `%hdfs du <path> [-s] [-h]` | Disk usage (real recursive sizes). `-s`: summary of the path itself · `-h`: human-readable sizes |
| `%hdfs stat <path>` | File/directory metadata in one API call (GETFILESTATUS). Columns: name, type, size, owner, group, permissions, block_size, modified, replication |
| `%hdfs mv <src> <dst>` | Rename or move a file/directory (RENAME) server-side, no data copy |
| `%hdfs mkdir <path>` | Create a directory |
| `%hdfs rm <path> [-r]` | Delete a file/directory. `-r`: recursive deletion |
| `%hdfs put <local> <hdfs>` | Upload files (supports wildcards). `-t, --threads <N>`: use N parallel threads for multi-file uploads |
| `%hdfs get <hdfs> <local>` | Download files (supports wildcards). `-t, --threads <N>`: use N parallel threads for multi-file downloads |
| `%hdfs cat <file> [options]` | Smart file preview (CSV/TSV/Parquet). `-n <lines>`: limit to N rows (default: 100) · `--format <type>`: force format (csv, parquet, pandas, polars, raw) · `--raw`: display raw content without formatting |
| `%hdfs chmod [-R] <mode> <path>` | Change permissions (e.g., 644, 755). `-R`: recursive |
| `%hdfs chown [-R] <user:group> <path>` | Change owner and group. `-R`: recursive |
Examples:
- `%hdfs du /data/users` → List children with their real recursive sizes
- `%hdfs du -sh /data/users` → Total size of the path, human-readable
- `%hdfs stat /data/events.parquet` → Metadata for a file (size, owner, permissions, ...)
- `%hdfs stat /data/users` → Metadata for a directory
- `%hdfs mv /data/old.csv /data/new.csv` → Rename a file
- `%hdfs mv /data/tmp /data/archive/tmp` → Move a directory
- `%hdfs cat data.csv -n 10` → Preview first 10 rows
- `%hdfs cat data.parquet --format pandas` → Display in pandas format (classic)
- `%hdfs cat data.parquet --format polars` → Display with schema and types
- `%hdfs put *.csv /data/` → Upload all CSV files
- `%hdfs put -t 4 ./data/*.csv /hdfs/input/` → Upload files with 4 parallel threads
- `%hdfs get -t 8 /hdfs/output/*.parquet ./results/` → Download files with 8 parallel threads
- `%hdfs chmod -R 755 /mydir` → Set permissions recursively
The help command is always available and shows the most up-to-date documentation for your installed version.
Apply permission changes to entire directory trees:
# Recursive chmod
%hdfs chmod -R 755 /hdfs/project/
# Recursive chown (requires superuser)
%hdfs chown -R hdfs:hadoop /hdfs/project/

Use ~ as a shortcut for your home directory:
# Download to home directory
%hdfs get /hdfs/file.csv ~/downloads/
# Works in subdirectories too
%hdfs get /hdfs/data/*.csv ~/projects/analysis/

- examples/demo.ipynb - Full demo with a real HDFS cluster (Docker)
- examples/examples.ipynb - Examples with mocked tests (no cluster needed)
- examples/config/ - Configuration file examples
- ROADMAP.md - Upcoming features
Unit tests (no HDFS cluster required):
pytest tests/ -v

Test with Docker HDFS cluster:
# Start the demo environment
cd demo
docker-compose up -d
# Wait 30 seconds for initialization
sleep 30
# Configure webhdfsmagic (if not already done)
mkdir -p ~/.webhdfsmagic
cat > ~/.webhdfsmagic/config.json << 'EOF'
{
"knox_url": "http://localhost:8080/gateway/default",
"webhdfs_api": "/webhdfs/v1",
"username": "testuser",
"password": "testpass",
"verify_ssl": false
}
EOF
# Test with demo notebook
cd ..
jupyter notebook examples/demo.ipynb

See demo/README.md for complete Docker environment documentation.
All operations are logged to ~/.webhdfsmagic/logs/webhdfsmagic.log:
# View recent activity
tail -50 ~/.webhdfsmagic/logs/webhdfsmagic.log
# Check for errors
grep -i "error" ~/.webhdfsmagic/logs/webhdfsmagic.log
# View specific command execution
grep "hdfs put" ~/.webhdfsmagic/logs/webhdfsmagic.log -A 5

Connection Errors:
- Check the Knox gateway URL in ~/.webhdfsmagic/config.json
- Verify SSL settings (verify_ssl: false for testing)
- Check logs for HTTP error details
Authentication Errors:
- Verify username/password in config
- Check if credentials have expired
- Review authentication errors in logs
File Transfer Issues:
- Check local file paths exist
- Verify HDFS paths are absolute (start with /)
- Review detailed HTTP request/response in logs
- Check disk space on both local and HDFS
Permission Errors:
- Verify HDFS user permissions
- Check file/directory ownership in HDFS
- Review operation logs for specific error messages
Contributions are welcome! To contribute:
- Fork the project
- Create a branch (git checkout -b feature/my-feature)
- Commit your changes (git commit -m 'feat: add new feature')
- Push to the branch (git push origin feature/my-feature)
- Open a Pull Request
Testing and code quality:
# Run tests
pytest tests/ -v
# Check code style
ruff check .
ruff format .

This project is licensed under the MIT License. See LICENSE for details.
- PyPI: https://pypi.org/project/webhdfsmagic/
- GitHub: https://github.com/ab2dridi/webhdfsmagic
- Issues: https://github.com/ab2dridi/webhdfsmagic/issues
For questions or suggestions, open an issue on GitHub.
