Conversation

@sarahyurick
Contributor

No description provided.

@greptile-apps
Contributor

greptile-apps bot commented Jan 15, 2026

Greptile Summary

This PR relocates AWS credential configuration from the ArXiv section to the beginning of the notebook, ensuring credentials are set before initializing the Ray client. This prevents credentials-related errors that occur when environment variables are set after the Ray cluster starts.
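
The timing issue can be illustrated with ordinary child processes, which only see the environment they were launched with. This is a minimal sketch and an analogy only: it does not use Ray, and it assumes Ray's worker processes behave like plain child processes in this respect, which the summary implies but the sketch does not verify. The variable name `DEMO_AWS_ACCESS_KEY_ID` and the helper `child_sees` are hypothetical.

```python
import os
import subprocess
import sys


def child_sees(var: str, env: dict) -> bool:
    """Return True if a child process launched with `env` can see `var`."""
    code = f"import os, sys; sys.exit(0 if os.environ.get({var!r}) else 1)"
    return subprocess.run([sys.executable, "-c", code], env=env).returncode == 0


# Hypothetical variable name, used so real credentials are never touched.
VAR = "DEMO_AWS_ACCESS_KEY_ID"
os.environ.pop(VAR, None)

# Snapshot taken BEFORE the variable is set. It stands in for a process
# that was started early, such as a cluster launched before credentials exist.
early_env = dict(os.environ)

os.environ[VAR] = "placeholder"

# Snapshot taken AFTER the variable is set.
late_env = dict(os.environ)

early_visible = child_sees(VAR, early_env)
late_visible = child_sees(VAR, late_env)
print(f"started before export: {early_visible}, started after export: {late_visible}")
```

A process started from the earlier snapshot never picks up the later assignment, which is why the credential cell needs to run first.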

Key changes:

  • Moved AWS credential cell block to appear immediately after prerequisites, before Ray client initialization
  • Added clear instructions about credential timing and formatting (no quotes around values)
  • Improved the S3 bucket accessibility check by replacing the shell command with a Python subprocess.run() call that handles errors
  • Updated prerequisite text to reference credential setup location at top of notebook
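
A check of the kind described in the third bullet might look like the sketch below. It is not the notebook's actual code: the helper name `run_check`, the timeout value, and the error messages are assumptions made for illustration. Only `subprocess.run()` and the `s5cmd ls` command come from the summary.

```python
import subprocess
import sys


def run_check(cmd: list[str], timeout: float = 60.0) -> tuple[bool, str]:
    """Run `cmd` without a shell and return (succeeded, output or error text)."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout and stderr instead of printing
            text=True,            # decode bytes to str
            check=True,           # raise CalledProcessError on a non-zero exit
            timeout=timeout,
        )
    except FileNotFoundError:
        return False, f"command not found: {cmd[0]}"
    except subprocess.CalledProcessError as err:
        return False, (err.stderr or "").strip() or f"exit code {err.returncode}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} seconds"
    return True, result.stdout.strip()


# Smoke test with a command that is always available.
ok, output = run_check([sys.executable, "--version"])
print(ok, output)

# For the bucket check, the call would be along these lines
# (bucket and prefix are placeholders, not taken from the PR):
#   ok, output = run_check(["s5cmd", "ls", "s3://<bucket>/<prefix>/"])
```

Passing the command as a list avoids shell interpretation of the arguments, and returning a status flag lets the notebook print a readable message instead of a traceback when the tool is missing or the credentials are rejected.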

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are documentation and configuration improvements that fix a known issue where AWS credentials set after Ray initialization do not propagate correctly. The subprocess implementation includes an explicit noqa comment and proper error handling.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| tutorials/text/download-and-extract/download_extract_tutorial.ipynb | Improves AWS credentials setup by moving it before Ray client initialization and adding proper subprocess handling for S3 verification |

Sequence Diagram

sequenceDiagram
    participant User
    participant Notebook
    participant Ray
    participant AWS_S3
    
    Note over User,Notebook: Prerequisites Setup
    User->>Notebook: Install wget
    User->>Notebook: Set AWS credentials (NEW LOCATION)
    Note over Notebook: AWS_ACCESS_KEY_ID<br/>AWS_SECRET_ACCESS_KEY<br/>AWS_SESSION_TOKEN
    
    Note over Notebook,Ray: Ray Client Initialization
    Notebook->>Ray: Initialize RayClient()
    Ray-->>Notebook: Ray cluster started with env vars
    
    Note over Notebook,AWS_S3: ArXiv Section - Verify Access
    Notebook->>AWS_S3: subprocess.run(s5cmd ls)
    AWS_S3-->>Notebook: List tar files (with credentials)
    
    Note over Notebook,AWS_S3: Download & Extract Pipeline
    Notebook->>AWS_S3: ArxivDownloadExtractStage
    AWS_S3-->>Notebook: Download tar files
    Notebook->>Notebook: Extract LaTeX sources
    Notebook->>Notebook: Write to JSONL

Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick added the r1.1.0 label ("Pick this label for auto cherry-picking into r1.1.0") on Jan 15, 2026