webhdfsmagic

webhdfsmagic is a Python package that provides IPython magic commands to interact with HDFS via WebHDFS/Knox Gateway directly from your Jupyter notebooks.

🚀 Why webhdfsmagic?

Simplify your HDFS interactions in Jupyter:

Before (with PyWebHdfsClient):

from io import BytesIO
import pandas as pd
from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host='...', port='...', user_name='...', ...)
data = hdfs.read_file('/data/file.csv')
df = pd.read_csv(BytesIO(data))

Now (with webhdfsmagic):

%hdfs get /data/file.csv .
df = pd.read_csv('file.csv')

93% less code! ✨

🎬 See it in Action

webhdfsmagic demo

Complete workflow demo: mkdir → put → ls → cat → get → chmod → rm

✨ Features

| Command | Description |
| --- | --- |
| `%hdfs ls [path]` | List files and directories (returns a pandas DataFrame) |
| `%hdfs du <path> [-s] [-h]` | Disk usage — real recursive sizes via GETCONTENTSUMMARY. `-s`: summary of the path itself · `-h`: human-readable (KB/MB/GB) |
| `%hdfs stat <path>` | File/directory metadata — single-row DataFrame (GETFILESTATUS) |
| `%hdfs mv <src> <dst>` | Rename or move a file/directory — server-side (RENAME, no data copy) |
| `%hdfs mkdir <path>` | Create directory (parents created automatically) |
| `%hdfs put <local> <hdfs>` | Upload one or more files (supports wildcards such as `*.csv`; `-t, --threads <N>` for parallel uploads) |
| `%hdfs get <hdfs> <local>` | Download files (supports wildcards and `~` for the home directory; `-t, --threads <N>` for parallel downloads) |
| `%hdfs cat <file> [-n lines] [--format type] [--raw]` | Display file content with smart formatting for CSV/Parquet |
| `%hdfs rm [-r] <path>` | Delete files/directories (`-r` for recursive; supports wildcards) |
| `%hdfs chmod [-R] <mode> <path>` | Change permissions (`-R` for recursive) |
| `%hdfs chown [-R] <user:group> <path>` | Change owner (`-R` for recursive; requires superuser) |

📦 Installation

pip install webhdfsmagic

Install from source:

git clone https://github.com/ab2dridi/webhdfsmagic.git
cd webhdfsmagic
pip install -e .

# Enable autoload (creates startup script)
jupyter-webhdfsmagic

🔧 Configuration

Automatic Loading

After installation, enable autoload to have webhdfsmagic load automatically in all Jupyter sessions:

jupyter-webhdfsmagic

This creates ~/.ipython/profile_default/startup/00-webhdfsmagic.py so the extension loads automatically.
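
For reference, such a startup script is typically just a guarded load of the extension. A minimal sketch of what 00-webhdfsmagic.py might contain (illustrative; the generated file may differ):

# Minimal sketch of an autoload startup script; the generated file may differ.
try:
    ip = get_ipython()  # injected into IPython startup scripts
    if ip is not None:
        ip.run_line_magic("load_ext", "webhdfsmagic")
except Exception:
    pass  # never break notebook startup if the extension is unavailable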

Alternative: Load manually in each notebook:

%load_ext webhdfsmagic

Configuration File

Create ~/.webhdfsmagic/config.json:

{
  "knox_url": "https://hostname:port/gateway/default",
  "webhdfs_api": "/webhdfs/v1",
  "username": "your_username",
  "password": "your_password",
  "verify_ssl": false
}

SSL Options:

  • "verify_ssl": false → Disable SSL verification (development only)
  • "verify_ssl": true → Use system certificates
  • "verify_ssl": "/path/to/cert.pem" → Use a custom certificate (supports ~)
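
These values map naturally onto the verify parameter of Python's requests library. A sketch of that translation (illustrative only, not necessarily the package's internal code; URL and credentials are placeholders):

# Illustrative mapping from the verify_ssl config value to requests' `verify`.
import os
import requests

def resolve_verify(verify_ssl):
    if isinstance(verify_ssl, str):
        return os.path.expanduser(verify_ssl)  # custom CA bundle path, '~' supported
    return bool(verify_ssl)                    # True = system certs, False = no verification

# Example: a LISTSTATUS call using the resolved value.
requests.get(
    "https://hostname:port/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS",
    auth=("your_username", "your_password"),
    verify=resolve_verify(False),
)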

Configuration Examples:
See examples/config/ for complete configurations (with/without SSL, custom certificate, etc.)

Sparkmagic Fallback:
If ~/.webhdfsmagic/config.json doesn't exist, the package tries ~/.sparkmagic/config.json and extracts configuration from kernel_python_credentials.url.
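
The fallback is roughly equivalent to the following lookup (a sketch, assuming the sparkmagic file follows its usual kernel_python_credentials layout; the package's key handling may differ):

# Sketch of the config fallback; not the package's exact code.
import json
import os

def load_config():
    primary = os.path.expanduser("~/.webhdfsmagic/config.json")
    fallback = os.path.expanduser("~/.sparkmagic/config.json")
    if os.path.exists(primary):
        with open(primary) as f:
            return json.load(f)
    with open(fallback) as f:
        creds = json.load(f)["kernel_python_credentials"]
    return {
        "knox_url": creds["url"],           # gateway URL extracted from sparkmagic
        "username": creds.get("username"),
        "password": creds.get("password"),
    }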

Logging & Debugging

All operations are automatically logged to ~/.webhdfsmagic/logs/webhdfsmagic.log for debugging and auditing purposes.

Log Features:

  • ✅ Automatic rotation (10 MB per file, keeps 5 backups)
  • ✅ Detailed HTTP request/response logging
  • ✅ Operation tracing with timestamps
  • ✅ Error tracking with full stack traces
  • ✅ Password masking for security
  • ✅ File-level DEBUG logging
  • ✅ Console-level WARNING/ERROR logging
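
This setup corresponds to a standard rotating-file configuration; a minimal sketch using only the standard library (handler layout and format string are assumptions inferred from the log excerpt below, not the package's exact code):

# Sketch of a logger with the behavior described above.
import logging
import os
from logging.handlers import RotatingFileHandler

log_path = os.path.expanduser("~/.webhdfsmagic/logs/webhdfsmagic.log")
os.makedirs(os.path.dirname(log_path), exist_ok=True)

logger = logging.getLogger("webhdfsmagic")
logger.setLevel(logging.DEBUG)

# File handler: DEBUG level, rotated at 10 MB, 5 backups kept.
file_handler = RotatingFileHandler(log_path, maxBytes=10 * 1024 * 1024, backupCount=5)
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(logging.Formatter(
    "%(asctime)s - %(name)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s"))
logger.addHandler(file_handler)

# Console handler: WARNING and above only.
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)
logger.addHandler(console_handler)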

View Recent Logs:

# View last 50 lines
tail -50 ~/.webhdfsmagic/logs/webhdfsmagic.log

# Follow logs in real-time
tail -f ~/.webhdfsmagic/logs/webhdfsmagic.log

# Search for errors
grep "ERROR" ~/.webhdfsmagic/logs/webhdfsmagic.log

# View specific operation
grep "hdfs put" ~/.webhdfsmagic/logs/webhdfsmagic.log

Log Format:

2025-12-08 10:30:15 - webhdfsmagic - INFO - [magics.py:145] - >>> Starting operation: hdfs ls
2025-12-08 10:30:15 - webhdfsmagic - DEBUG - [client.py:85] - HTTP Request: GET http://...
2025-12-08 10:30:15 - webhdfsmagic - DEBUG - [client.py:105] - HTTP Response: 200 from http://...
2025-12-08 10:30:15 - webhdfsmagic - INFO - [magics.py:180] - <<< Operation completed: hdfs ls - SUCCESS

💡 Usage

# With autoload enabled, the extension is already loaded!
%hdfs help

# List files
%hdfs ls /data

# Disk usage — list immediate children with their real recursive sizes
%hdfs du /data/users

# Disk usage — summary of the path itself (single call, no children iteration)
%hdfs du -s /data/users

# Human-readable sizes (KB / MB / GB)
%hdfs du -h /data/users

# Combine both: summary + human-readable
%hdfs du -sh /data/users

# File metadata (name, type, size, owner, permissions, modified…)
%hdfs stat /data/events.parquet

# Metadata for a directory
%hdfs stat /data/users

# Rename a file
%hdfs mv /data/old_name.csv /data/new_name.csv

# Move a directory to another location
%hdfs mv /data/tmp /data/archive/tmp

# Create a directory
%hdfs mkdir /user/hdfs/output

# Upload multiple CSV files using wildcards
%hdfs put ~/data/*.csv /user/hdfs/input/

# Download a file to home directory
%hdfs get /user/hdfs/results/output.csv ~/downloads/

# Download multiple files with wildcards
%hdfs get /user/hdfs/results/*.csv ./local_results/

# ===== SMART CAT (File Preview) =====

# Display first 50 lines (default grid table format)
%hdfs cat /user/hdfs/data/file.csv -n 50

# Smart CSV formatting with automatic table display
%hdfs cat /user/hdfs/data/sales.csv

# Display Parquet file as table
%hdfs cat /user/hdfs/data/records.parquet -n 20

# Pandas format (classic DataFrame representation)
%hdfs cat /user/hdfs/data/data.csv --format pandas

# Polars format (shows schema + explicit types, 3.7x faster for Parquet!)
%hdfs cat /user/hdfs/data/records.parquet --format polars

# Raw text display (unformatted original content)
%hdfs cat /user/hdfs/data/file.csv --raw

> **Warning:** `%hdfs cat /user/hdfs/data/records.parquet -n -1 --raw` loads the entire file into memory. For very large Parquet files this can consume a lot of RAM and may crash your notebook, so confirm you really want the full file before running it.

> **Tip:** For large Parquet files, prefer `%hdfs cat file.parquet --format polars` over `%hdfs cat file.parquet --raw` for much better performance and readability.

# ===== File Management =====

# Delete files with wildcards
%hdfs rm /user/hdfs/temp/*.log

# Delete a directory recursively
%hdfs rm -r /user/hdfs/temp

# Change permissions recursively
%hdfs chmod -R 755 /user/hdfs/data

# Change owner recursively (requires superuser privileges)
%hdfs chown -R hdfs:hadoop /user/hdfs/data

Integration with pandas:

# Download and read directly
%hdfs get /data/sales.csv .
df = pd.read_csv('sales.csv')
df.head()

🎯 Advanced Features

Wildcard Operations

Upload, download, and delete multiple files using shell-style wildcards:

# Upload all CSV files
%hdfs put data/*.csv /hdfs/input/

# Download specific pattern
%hdfs get /hdfs/output/result_*.csv ./downloads/

# Delete log files
%hdfs rm /hdfs/temp/*.log

Smart File Formatting (CSV & Parquet)

Automatically format structured files as readable tables:

# CSV files are automatically detected and formatted
%hdfs cat /data/sales.csv
# ┌────────────┬─────────┬────────┐
# │ date       │ product │ amount │
# ├────────────┼─────────┼────────┤
# │ 2025-12-08 │ laptop  │ 1200   │
# │ 2025-12-09 │ phone   │ 800    │
# └────────────┴─────────┴────────┘

# Parquet files work seamlessly
%hdfs cat /data/records.parquet -n 100

# TSV and other delimiters are auto-detected
%hdfs cat /data/data.tsv  # Detects tab delimiter

# Force specific format
%hdfs cat /data/file.csv --format pandas  # Pandas DataFrame (classic)
%hdfs cat /data/file.csv --format polars  # Polars with schema and types
%hdfs cat /data/file.csv --raw            # Raw text, no formatting

# Supported formats:
#   - CSV (comma, tab, semicolon, pipe - auto-detected)
#   - Parquet (uses Polars for 3.7x faster processing)
#   - TSV (tab-separated values)

📊 Format Options Explained:

  • Default (grid): clean box-drawn text table, perfect for reports
  • --format pandas: Classic pandas display, familiar to data scientists
  • --format polars: Shows schema with explicit types (str, i64, f64, bool) - ideal for data validation
  • --raw: Original file content without any parsing

🚀 Performance: Parquet files are processed using Polars, providing ultra-fast reads and minimal memory usage (3.7x faster than PyArrow+Pandas).
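
If you have already fetched Parquet bytes yourself (for example via your own HTTP read), the same fast path is available directly in Polars; a small sketch, assuming data holds the raw bytes:

# Sketch: load Parquet bytes with Polars (assumes `data: bytes` was fetched earlier).
import io
import polars as pl

df = pl.read_parquet(io.BytesIO(data))  # schema and explicit dtypes included
print(df.head(20))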

Memory Protection: By default, the cat command limits downloads to 50 MB to prevent memory saturation. This protection applies when using the -n <lines> option. To read an entire large file, use -n -1:

# Safe: Limited to 50 MB download
%hdfs cat /huge_file.csv -n 100

# Full read: No memory limit (use with caution on large files)
%hdfs cat /small_file.csv -n -1

⚠️ Large Parquet Files: Files larger than 100 MB trigger a warning recommending that you download them first with %hdfs get for better performance.

Disk Usage — Real Directory Sizes (%hdfs du)

LISTSTATUS (used by %hdfs ls) always reports size=0 for directories. %hdfs du fixes this by calling GETCONTENTSUMMARY per entry, which returns the actual recursive size.
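
Under the hood this is the standard WebHDFS GETCONTENTSUMMARY operation; a sketch of one such call with requests (URL and credentials are placeholders, and the package's own client code may differ):

# One GETCONTENTSUMMARY round trip (WebHDFS REST API).
import requests

base = "https://hostname:port/gateway/default/webhdfs/v1"
resp = requests.get(base + "/data/users",
                    params={"op": "GETCONTENTSUMMARY"},
                    auth=("your_username", "your_password"),
                    verify=False)
summary = resp.json()["ContentSummary"]
print(summary["length"])          # real recursive size in bytes
print(summary["spaceConsumed"])   # length x replication factor
print(summary["fileCount"], summary["directoryCount"])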

# Default: iterate over immediate children, show real recursive size for each
%hdfs du /data/users
# Returns a DataFrame:
# name      type  size         space_consumed  file_count  dir_count  error
# alice     DIR   1073741824   3221225472      4820        12         None
# bob       DIR   536870912    1610612736      2100        5          None
# reports   FILE  10485760     31457280        1           0          None

# Summary: single row for the path itself
%hdfs du -s /data/users
# name          type  size         space_consumed  file_count  dir_count  error
# /data/users   DIR   1610612736   4831838208      6921        17         None

# Human-readable sizes
%hdfs du -h /data/users
# name      type  size      space_consumed  file_count  dir_count  error
# alice     DIR   1.0 GB    3.0 GB          4820        12         None
# bob       DIR   512.0 MB  1.5 GB          2100        5          None

# Combine both
%hdfs du -sh /data/users

Graceful permission handling: directories returning HTTP 401/403 are included in the DataFrame with size=None and an error message — the command never crashes mid-iteration:

%hdfs du /sensitive/data
# name       type  size  space_consumed  file_count  dir_count  error
# public     DIR   ...   ...             ...         ...        None
# private    DIR   None  None            None        None       permission denied (HTTP 403)
# readonly   DIR   ...   ...             ...         ...        None

Parallel Uploads & Downloads (Multi-threaded PUT/GET)

Starting from version 0.0.4, webhdfsmagic supports parallel file transfers using the --threads (or -t) option for both put and get commands.

This allows you to upload or download multiple files simultaneously, greatly speeding up operations on large datasets or many files.

Key features:

  • Multi-threaded transfers for PUT and GET
  • Syntax: %hdfs put --threads N <local_files> <hdfs_dir>
  • Syntax: %hdfs get --threads N <hdfs_files> <local_dir>
  • N = number of threads (e.g. 4, 8, 16)

Examples:

# Parallel upload: PUT multiple files to HDFS using 4 threads
%hdfs put --threads 4 *.csv /demo/data/

# Parallel download: GET multiple files from HDFS using 4 threads
%hdfs get --threads 4 /demo/data/* ./downloads/

# You can also use the short option -t
%hdfs put -t 8 *.tsv /demo/data/

Notes:

  • The --threads/-t option is available for both put and get commands.
  • You can specify any number of threads (e.g. 2, 4, 8, 16) depending on your system and network.
  • Parallel transfers are especially useful for large datasets or many small files.
  • Error handling is robust: if a file fails to transfer, you'll get a clear error message for that file (see the sketch below).
  • The command syntax is identical to single-threaded usage; just add --threads N or -t N.
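
The per-file behavior described above matches the classic thread-pool pattern; a minimal sketch of a parallel upload with 4 workers, where upload() is a hypothetical stand-in for one WebHDFS transfer (not the package's actual function):

# Thread-pool upload pattern (illustrative; upload() is a hypothetical placeholder).
import glob
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload(local_path):
    """Placeholder for one WebHDFS CREATE/write round trip."""
    ...

files = glob.glob("*.csv")
with ThreadPoolExecutor(max_workers=4) as pool:      # equivalent to --threads 4
    futures = {pool.submit(upload, path): path for path in files}
    for future in as_completed(futures):
        path = futures[future]
        try:
            future.result()
            print(f"OK   {path}")
        except Exception as exc:                     # one failure does not stop the rest
            print(f"FAIL {path}: {exc}")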

See the notebook demo for a full example.

📖 Built-in Help System

Get detailed help directly in your notebook:

# Show all available commands with descriptions
%hdfs help

This displays a comprehensive interactive help with:

  • All available commands (ls, du, stat, mv, mkdir, put, get, cat, rm, chmod, chown)
  • Options and flags for each command
  • Format descriptions for the cat command
  • Auto-detection features explanation

Summary of available commands:

| Command | Description |
| --- | --- |
| `%hdfs help` | Display this help |
| `%hdfs setconfig {...}` | Set configuration (JSON format) |
| `%hdfs ls [path]` | List files and directories |
| `%hdfs du <path> [-s] [-h]` | Disk usage (real recursive sizes). `-s`: summary of the path itself · `-h`: human-readable sizes |
| `%hdfs stat <path>` | File/directory metadata — one API call (GETFILESTATUS). Columns: name, type, size, owner, group, permissions, block_size, modified, replication |
| `%hdfs mv <src> <dst>` | Rename or move a file/directory (RENAME) — server-side, no data copy |
| `%hdfs mkdir <path>` | Create directory |
| `%hdfs rm <path> [-r]` | Delete file/directory. `-r`: recursive deletion |
| `%hdfs put <local> <hdfs>` | Upload files (supports wildcards). `-t, --threads <N>`: use N parallel threads for multi-file uploads |
| `%hdfs get <hdfs> <local>` | Download files (supports wildcards). `-t, --threads <N>`: use N parallel threads for multi-file downloads |
| `%hdfs cat <file> [options]` | Smart file preview (CSV/TSV/Parquet). `-n <lines>`: limit to N rows (default: 100) · `--format <type>`: force format (csv, parquet, pandas, polars, raw) · `--raw`: raw content without formatting |
| `%hdfs chmod [-R] <mode> <path>` | Change permissions (e.g., 644, 755). `-R`: recursive |
| `%hdfs chown [-R] <user:group> <path>` | Change owner and group. `-R`: recursive |

Examples:

  • %hdfs du /data/users – List children with their real recursive sizes
  • %hdfs du -sh /data/users – Total size of the path, human-readable
  • %hdfs stat /data/events.parquet – Metadata for a file (size, owner, permissions…)
  • %hdfs stat /data/users – Metadata for a directory
  • %hdfs mv /data/old.csv /data/new.csv – Rename a file
  • %hdfs mv /data/tmp /data/archive/tmp – Move a directory
  • %hdfs cat data.csv -n 10 – Preview first 10 rows
  • %hdfs cat data.parquet --format pandas – Display in pandas format (classic)
  • %hdfs cat data.parquet --format polars – Display with schema and types
  • %hdfs put *.csv /data/ – Upload all CSV files
  • %hdfs put -t 4 ./data/*.csv /hdfs/input/ – Upload files with 4 parallel threads
  • %hdfs get -t 8 /hdfs/output/*.parquet ./results/ – Download files with 8 parallel threads
  • %hdfs chmod -R 755 /mydir – Set permissions recursively

The help command is always available and shows the most up-to-date documentation for your installed version.

Recursive Permissions

Apply permission changes to entire directory trees:

# Recursive chmod
%hdfs chmod -R 755 /hdfs/project/

# Recursive chown (requires superuser)
%hdfs chown -R hdfs:hadoop /hdfs/project/

Home Directory Expansion

Use ~ as a shortcut for your home directory:

# Download to home directory
%hdfs get /hdfs/file.csv ~/downloads/

# Works in subdirectories too
%hdfs get /hdfs/data/*.csv ~/projects/analysis/
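
The expansion behaves like Python's own os.path.expanduser; a two-line sketch:

# '~' expansion, as in the standard library.
import os
print(os.path.expanduser("~/downloads/"))  # e.g. /home/alice/downloads/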

📚 Documentation and Examples

See examples/ in the repository (notably examples/demo.ipynb and examples/config/) and demo/README.md for the Docker test environment.

🧪 Testing

Unit tests (no HDFS cluster required):

pytest tests/ -v

Test with Docker HDFS cluster:

# Start the demo environment
cd demo
docker-compose up -d

# Wait 30 seconds for initialization
sleep 30

# Configure webhdfsmagic (if not already done)
mkdir -p ~/.webhdfsmagic
cat > ~/.webhdfsmagic/config.json << 'EOF'
{
  "knox_url": "http://localhost:8080/gateway/default",
  "webhdfs_api": "/webhdfs/v1",
  "username": "testuser",
  "password": "testpass",
  "verify_ssl": false
}
EOF

# Test with demo notebook
cd ..
jupyter notebook examples/demo.ipynb

See demo/README.md for complete Docker environment documentation.

πŸ› Troubleshooting

Check Logs

All operations are logged to ~/.webhdfsmagic/logs/webhdfsmagic.log:

# View recent activity
tail -50 ~/.webhdfsmagic/logs/webhdfsmagic.log

# Check for errors
grep -i "error" ~/.webhdfsmagic/logs/webhdfsmagic.log

# View specific command execution
grep "hdfs put" ~/.webhdfsmagic/logs/webhdfsmagic.log -A 5

Common Issues

Connection Errors:

  • Check Knox gateway URL in ~/.webhdfsmagic/config.json
  • Verify SSL settings (verify_ssl: false for testing)
  • Check logs for HTTP error details

Authentication Errors:

  • Verify username/password in config
  • Check if credentials have expired
  • Review authentication errors in logs

File Transfer Issues:

  • Check local file paths exist
  • Verify HDFS paths are absolute (start with /)
  • Review detailed HTTP request/response in logs
  • Check disk space on both local and HDFS

Permission Errors:

  • Verify HDFS user permissions
  • Check file/directory ownership in HDFS
  • Review operation logs for specific error messages

🤝 Contributing

Contributions are welcome! To contribute:

  1. Fork the project
  2. Create a branch (git checkout -b feature/my-feature)
  3. Commit your changes (git commit -m 'feat: add new feature')
  4. Push to the branch (git push origin feature/my-feature)
  5. Open a Pull Request

Testing and code quality:

# Run tests
pytest tests/ -v

# Check code style
ruff check .
ruff format .

πŸ“ License

This project is licensed under the MIT License. See LICENSE for details.

📬 Contact

For questions or suggestions, open an issue on GitHub.
