Comprehensive automation scripts for setting up modern data science and machine learning development environments across Ubuntu, WSL, and Hyper-V deployments.
- Latest Tools: Anaconda 2024.02-1, Python 3.12+, Ubuntu 24.04.1 LTS
- Modern ML Stack: PyTorch 2.0+, TensorFlow 2.15+, Transformers, FastAI
- Advanced Analytics: Polars, Vaex, Dask, Xarray, Zarr for big data
- DevOps Integration: Docker, Kubernetes, MicroK8s, Canonical Data Science Stack
- Interactive Apps: Streamlit, Gradio, Panel, Dash for web applications
- MLOps Tools: MLflow, Weights & Biases, Optuna, Ray[Tune]
- Enhanced IDE: VS Code with 20+ data science extensions
- Containerized ML: GPU-enabled environments with Canonical Data Science Stack
- Python Ecosystem: NumPy, Pandas, Matplotlib, Scikit-learn, Plotly, Seaborn
- Machine Learning: XGBoost, LightGBM, CatBoost, Transformers, PyTorch, TensorFlow
- Big Data Tools: Polars, Vaex, Dask, Xarray, Zarr for scalable computing
- NLP & Vision: SpaCy, Gensim, OpenCV, Pillow, Sentence-Transformers
- Interactive Apps: Streamlit, Gradio, Panel, Dash for rapid prototyping
- Statistics: R, RStudio, Statsmodels, Pingouin for advanced analytics
- Containerization: Docker, Docker Compose for reproducible environments
- Orchestration: Kubernetes, Helm, Kubectl for scalable deployments
- GPU Acceleration: CUDA Toolkit, Canonical Data Science Stack with MicroK8s
- MLOps: MLflow, Weights & Biases for experiment tracking
- Code Quality: Black, isort, Flake8, Pylint for professional development
# Clone and setup
git clone https://github.com/tomblanchard312/DSWorkloadInstallScripts.git
cd DSWorkloadInstallScripts
# Make executable and run
chmod +x ubuntu_post_install.sh
./ubuntu_post_install.sh

# Clone and setup
git clone https://github.com/tomblanchard312/DSWorkloadInstallScripts.git
cd DSWorkloadInstallScripts
# Make executable and run
chmod +x wsl_post_install.sh
./wsl_post_install.sh

# Run PowerShell script (requires admin privileges)
Set-ExecutionPolicy Bypass -Scope Process -Force
.\ubuntu-hyper-v-setup.ps1

- Minimum: 4GB RAM, 20GB storage, Ubuntu 24.04.1 LTS or Windows 10/11
- Recommended: 16GB+ RAM, 50GB+ SSD, NVIDIA GPU (optional)
- Network: Internet connection for package downloads
- Windows 10/11 with WSL2 or Hyper-V enabled
- PowerShell with administrative privileges
- Virtualization enabled in BIOS
- Python 3.12+ with pip, venv, and build tools
- Anaconda 2024.02-1 for package management
- Git for version control
- Node.js & npm for web development
- Scientific Computing: NumPy, Pandas, Matplotlib, Polars, Vaex
- Machine Learning: Scikit-learn, XGBoost, LightGBM, CatBoost
- Deep Learning: PyTorch, TensorFlow, Transformers, FastAI
- Big Data: Dask, Xarray, Zarr for scalable computing
- Visualization: Plotly, Seaborn, Bokeh, Altair, Vega Datasets
- NLP & CV: SpaCy, Gensim, OpenCV, Sentence-Transformers
- JupyterLab with enhanced extensions
- Streamlit, Gradio, Panel, Dash for rapid app development
- RStudio for statistical computing
- Azure Data Studio for database management
- Docker & Docker Compose for containerization
- Kubernetes tools (kubectl, Helm) for orchestration
- Canonical Data Science Stack with MicroK8s for GPU-enabled containers
- MLOps tools (MLflow, Weights & Biases) for experiment tracking
- Python & Jupyter: Python, Jupyter, DVC, R, Julia support
- Code Quality: Black, isort, Flake8, Pylint for formatting
- DevOps: Docker, Kubernetes, YAML support
- Database: SQL tools, PostgreSQL integration
- Git: GitLens for enhanced Git workflow
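
Before running any of the installers, the minimum requirements listed earlier (4GB RAM, 20GB storage) can be checked by hand. The following is a minimal sketch, assuming a Linux system with /proc/meminfo and a POSIX df. The helper names kb_to_gb and meets_minimum are hypothetical and are not part of the repository's scripts.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of the repository's scripts.

# Convert a kB count (as reported in /proc/meminfo) to whole GB, rounding up,
# because a "4GB" machine usually reports slightly less than 4 GiB.
kb_to_gb() {
  echo $(( (${1:-0} + 1048575) / 1048576 ))
}

# Succeed when the first argument is greater than or equal to the second.
meets_minimum() {
  [ "$1" -ge "$2" ] 2>/dev/null
}

# Total RAM in kB and free space (kB) on the filesystem holding the home directory.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
disk_kb=$(df -Pk "${HOME:-/}" 2>/dev/null | awk 'NR==2 {print $4}')

if meets_minimum "$(kb_to_gb "${mem_kb:-0}")" 4; then
  echo "RAM: meets the 4GB minimum"
else
  echo "RAM: below the 4GB minimum"
fi

if meets_minimum "$(( ${disk_kb:-0} / 1048576 ))" 20; then
  echo "Storage: meets the 20GB minimum"
else
  echo "Storage: below the 20GB minimum"
fi
```

The check only reports; it never exits with an error, so it can be run safely before any of the setup scripts.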
Our scripts now include the Canonical Data Science Stack for enterprise-grade ML environments:
# After installation, use these commands:
data-science-stack start --gpu # Start GPU-enabled ML environment
data-science-stack list # List available environments
data-science-stack connect # Connect to JupyterLab
data-science-stack stop # Stop environments

Benefits:
- Containerized ML Environments with GPU support
- MicroK8s Integration for Kubernetes-native workflows
- Automatic GPU Detection and driver handling
- MLflow Integration for experiment tracking
- Reproducible Environments across machines
- CUDA Toolkit 11.4+ for NVIDIA GPUs
- Automatic GPU detection and optimization
- Containerized GPU access via Canonical DSS
- Multi-GPU support for advanced workloads
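
To see by hand whether an NVIDIA GPU is visible, a check along the following lines may help. It is only a sketch, assuming the NVIDIA driver's nvidia-smi utility is on the PATH when a GPU is usable. The function names has_nvidia_gpu and detect_mode are hypothetical, and the repository's own detection logic may differ.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the repository's own detection logic may differ.

# Succeed when nvidia-smi exists and lists at least one GPU.
has_nvidia_gpu() {
  command -v nvidia-smi >/dev/null 2>&1 &&
    nvidia-smi -L 2>/dev/null | grep -q '^GPU '
}

# Print "gpu" or "cpu" depending on what was detected.
detect_mode() {
  if has_nvidia_gpu; then
    echo "gpu"
  else
    echo "cpu"
  fi
}

echo "Detected mode: $(detect_mode)"
```

If the result is "cpu" on a machine that has an NVIDIA card, the driver is the first thing to check.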
# Create virtual environment
python3 -m venv myproject
source myproject/bin/activate
# Install additional packages
pip install streamlit dash
# Launch interactive app
streamlit run app.py

# Start GPU-enabled environment
data-science-stack start --gpu
# Access JupyterLab
data-science-stack connect
# Run ML experiments with GPU acceleration
python train_model.py

# Build and run container
docker build -t ml-app .
docker run -p 8501:8501 ml-app
# Use Docker Compose
docker-compose up -d

- Permission Denied: Ensure scripts are executable (chmod +x script.sh)
- DOS Characters: Remove Windows carriage returns with sed -i -e 's/\r$//' script.sh
- CUDA Issues: Check NVIDIA drivers and GPU compatibility
- WSL Issues: Ensure WSL2 is enabled and updated
- Check INDEX.md for detailed documentation
- Review script output for specific error messages
- Ensure all prerequisites are met
- Check system compatibility with Ubuntu version
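
The DOS-character fix mentioned in the troubleshooting items above can be wrapped so that sed only runs on files that actually contain carriage returns. This is a minimal sketch, assuming GNU sed (as shipped with Ubuntu). The helper names has_crlf and fix_crlf are hypothetical, and the demo works on a temporary file rather than a real script.

```shell
#!/usr/bin/env bash
# Illustrative helpers; the names has_crlf and fix_crlf are hypothetical.

# Succeed when the file contains at least one carriage return.
has_crlf() {
  grep -q "$(printf '\r')" "$1"
}

# Strip trailing carriage returns in place (GNU sed syntax).
fix_crlf() {
  if has_crlf "$1"; then
    sed -i -e 's/\r$//' "$1"
  fi
}

# Demo on a throwaway file so that no real script is modified.
demo=$(mktemp)
printf 'echo hello\r\n' > "$demo"
fix_crlf "$demo"
if has_crlf "$demo"; then
  echo "carriage returns still present"
else
  echo "file is clean"
fi
rm -f "$demo"
```

In practice the call would be fix_crlf ubuntu_post_install.sh (or whichever script reports the error) before running it again.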
We welcome contributions! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
- Additional VS Code extensions
- New data science libraries
- Platform-specific optimizations
- Documentation enhancements
This project is licensed under the MIT License - see the LICENSE file for details.
- Canonical for the Data Science Stack
- Microsoft for VS Code and WSL
- Open Source Community for the amazing tools
- Contributors who help improve these scripts
- GitHub Issues: Report bugs or request features
- Discussions: Join the community
- Documentation: Comprehensive INDEX.md
⭐ Star this repository if it helped you set up your data science environment!
Made with ❤️ for the data science community