haesleinhuepf · jackyko1991 · Jul 5, 2024 · Jul 5, 2024 · Jul 5, 2024 · Jul 8, 2024
diff --git a/docs/03_advanced_python/shell_scripting.md b/docs/03_advanced_python/shell_scripting.md
@@ -0,0 +1,13 @@
+# Shell Scripting
+Shell scripting involves writing scripts for the command-line interpreter, known as the shell, on Unix-based systems and Windows. These scripts automate tasks such as file manipulation, program execution, and text processing. Common Unix shells include Bash, Zsh, and Ksh, while Windows uses Command Prompt (cmd) and PowerShell. On Unix, shell scripts start with a shebang (`#!`) line; on Windows, they are identified by a file extension such as `.bat` or `.ps1`. They use variables, loops, and conditionals to control flow. Commands like `echo`, `grep`, and `awk` are common on Unix, while `Write-Output`, `Get-Content`, and `ForEach-Object` are used in PowerShell. Shell scripts are often combined with Python scripts to automate batch analysis.
+
+Windows users are strongly encouraged to run Python alongside Git Bash (https://git-scm.com/downloads), which closely mimics a *nix environment.
+
+With the integration of the Bash terminal, remote SSH, and the Jupyter extension in VS Code, the experience differs little between operating systems. In some special cases, however, such as 2FA logins to computing clusters, Linux or macOS can offer a better experience thanks to *nix-specific features like SSH sockets for connection persistence.
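+
+As a minimal sketch of combining command-line calls with Python for batch analysis, the snippet below loops over a few sample names and launches one subprocess per sample. The sample names and the `run_command` helper are hypothetical, and `sys.executable -c ...` merely stands in for whatever command-line tool you would actually call.
+
+```python
+import subprocess
+import sys
+
+
+def run_command(args):
+    """Run a command given as a list of arguments and return its output as text."""
+    result = subprocess.run(args, capture_output=True, text=True, check=True)
+    return result.stdout.strip()
+
+
+# Hypothetical sample names; in practice these could come from a folder listing
+sample_names = ["sample_01", "sample_02"]
+
+outputs = []
+for name in sample_names:
+    # The child Python process is a placeholder for a real analysis command
+    command = [sys.executable, "-c", f"print('processing {name}')"]
+    outputs.append(run_command(command))
+
+print(outputs)
+```
+
+Passing the command as a list of arguments, rather than as one string with `shell=True`, sidesteps most quoting differences between Bash, cmd and PowerShell.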
+
+## Summary Table on OS and Shell Scripts
+| Operating System | Terminal Emulator            | Default Shell    | Additional Shells                   | Pros                                                                         | Cons                                                               |
+|------------------|------------------------------|------------------|-------------------------------------|------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Linux            | GNOME Terminal, Konsole      | Bash             | Zsh, Fish, Ksh, Tcsh, Dash          | Highly customizable, vast array of tools, strong community support, open-source | Fragmentation in terminal emulators, varying default configurations|
+| macOS            | Terminal, iTerm2             | Zsh (since 10.15) | Bash, Fish, Ksh, Tcsh               | User-friendly, well-integrated with macOS, iTerm2 offers advanced features    | Terminal app is less feature-rich compared to iTerm2              |
+| Windows          | Command Prompt, PowerShell, Windows Terminal | PowerShell       | Bash (via WSL), Git Bash, Cygwin    | Powerful scripting capabilities in PowerShell, WSL brings Linux compatibility | Command Prompt is limited, PowerShell syntax can be complex       |
diff --git a/docs/15_gpu_acceleration/tensor_core.md b/docs/15_gpu_acceleration/tensor_core.md
@@ -0,0 +1,89 @@
+# Tensor Core
+
+A **Tensor Core** is a computing unit in NVIDIA GPUs that multiplies two matrices and then adds a third matrix to the result, providing hardware-accelerated **General Matrix Multiplication** (GEMM). To leverage Tensor Cores in TensorFlow and PyTorch, you need the right hardware, software versions, and configuration. Tensor Cores are specialised processing units available from NVIDIA's Volta architecture onwards. For further details, see [hardware/Neural Processing Unit](../80_hardware/gpu.md).
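+
+To make the multiply-then-add operation concrete, here is a pure-Python sketch of D = A × B + C with half-precision inputs. It only illustrates the arithmetic and is not how the hardware is programmed: the function names are hypothetical, float16 rounding is emulated with the `struct` module's `"e"` format, and an ordinary Python float stands in for the wider accumulator.
+
+```python
+import struct
+
+
+def to_float16(x):
+    """Round a Python float to the nearest IEEE 754 half-precision value."""
+    return struct.unpack("e", struct.pack("e", x))[0]
+
+
+def multiply_accumulate(a, b, c):
+    """Compute D = A x B + C with float16 inputs and a wider accumulator."""
+    rows, inner, cols = len(a), len(b), len(b[0])
+    d = [[0.0] * cols for _ in range(rows)]
+    for i in range(rows):
+        for j in range(cols):
+            acc = c[i][j]  # start from the matrix that is added
+            for k in range(inner):
+                # inputs are rounded to float16, products are summed at higher precision
+                acc += to_float16(a[i][k]) * to_float16(b[k][j])
+            d[i][j] = acc
+    return d
+
+
+a = [[1.0, 2.0], [3.0, 4.0]]
+b = [[5.0, 6.0], [7.0, 8.0]]
+c = [[1.0, 1.0], [1.0, 1.0]]
+d = multiply_accumulate(a, b, c)
+print(d)
+```
+
+Small integers are exactly representable in float16, so this toy input loses no precision; a value such as 0.1 would be rounded by `to_float16`.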
+
+Here's a step-by-step guide to enabling Tensor Cores for TensorFlow and PyTorch:
+
+## Prerequisites
+1. **Hardware**: Ensure you have an NVIDIA GPU that supports Tensor Cores (Volta, Turing, Ampere, or a newer architecture).
+2. **CUDA Toolkit**: Install the CUDA toolkit version supported by your GPU.
+3. **cuDNN Library**: Install the corresponding cuDNN library version.
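+
+One rough way to check the hardware prerequisite is to look at the CUDA compute capability, since Volta corresponds to compute capability 7.0. The sketch below assumes PyTorch may or may not be installed, and the helper name `may_support_tensor_cores` is hypothetical. Treat the result as a heuristic only: some GPUs at or above 7.0 (the GTX 16 series, for example) ship without Tensor Cores.
+
+```python
+def may_support_tensor_cores(major, minor=0):
+    """Heuristic: Tensor Cores first appeared with compute capability 7.0 (Volta)."""
+    return (major, minor) >= (7, 0)
+
+
+try:
+    import torch
+except ImportError:
+    torch = None  # PyTorch is optional for this sketch
+
+if torch is not None and torch.cuda.is_available():
+    major, minor = torch.cuda.get_device_capability(0)
+    name = torch.cuda.get_device_name(0)
+    print(f"{name}: compute capability {major}.{minor}")
+    print("Tensor Cores likely available:", may_support_tensor_cores(major, minor))
+else:
+    print("No CUDA device visible from PyTorch; skipping the device query.")
+```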
+
+## TensorFlow
+1. Install TensorFlow with GPU support:
+
+    ```bash
+    # The separate tensorflow-gpu package is deprecated; the main package provides GPU support
+    pip install tensorflow
+    ```
+2. Enable Mixed Precision Training:
+    TensorFlow provides an easy way to use mixed precision by using the `tf.keras.mixed_precision` API.
+
+    ```python
+    import tensorflow as tf
+    from tensorflow.keras import mixed_precision
+
+    # Enable mixed precision
+    mixed_precision.set_global_policy('mixed_float16')
+    ```
+3. Ensure Optimizers are Compatible:
+
+    Some optimizers may need adjustments for mixed precision, for example explicit loss scaling in custom training loops. TensorFlow's built-in optimizers are already compatible.
+
+4. Model Building:
+
+    Under the `mixed_float16` policy, layers compute in `float16` where appropriate, while TensorFlow keeps variables in `float32` to maintain numerical stability.
+
+5. Training:
+    Train your model as usual. TensorFlow will use Tensor Cores automatically during the mixed-precision operations.
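+
+The TensorFlow steps above can be sketched with a small Keras model. This is a minimal example under stated assumptions rather than a recommended architecture: the input size and layer widths are arbitrary, and keeping the final softmax in `float32` follows common practice for numerically stable outputs.
+
+```python
+import tensorflow as tf
+from tensorflow.keras import layers, mixed_precision
+
+# Compute in float16 while keeping variables in float32
+mixed_precision.set_global_policy("mixed_float16")
+
+# Arbitrary example sizes; replace them with your own model
+inputs = tf.keras.Input(shape=(784,))
+hidden = layers.Dense(64, activation="relu")
+x = hidden(inputs)
+x = layers.Dense(10)(x)
+# Keep the final activation in float32 for numerical stability
+outputs = layers.Activation("softmax", dtype="float32")(x)
+model = tf.keras.Model(inputs, outputs)
+
+# Inspect what the policy did to a hidden layer
+print("compute dtype:", hidden.compute_dtype)
+print("variable dtype:", hidden.variable_dtype)
+```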
+
+TensorFlow also allows Tensor Core usage to be controlled through environment variables. See the official documentation for the settings: https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/index.html#tf_disable_tensor_op_math
+
+
+## PyTorch
+1. Install PyTorch with GPU support:
+
+    ```bash
+    pip install torch torchvision
+    ```
+
+2. Enable AMP (Automatic Mixed Precision):
+
+    PyTorch provides AMP through the `torch.cuda.amp` module.
+
+    ```python
+    import torch
+    from torch.cuda.amp import autocast, GradScaler
+
+    # YourModel, train_loader and num_epochs are placeholders for your own code
+    model = YourModel().cuda()
+    criterion = torch.nn.CrossEntropyLoss()  # Example loss; choose one that suits your task
+    optimizer = torch.optim.Adam(model.parameters())
+    scaler = GradScaler()  # Scales the loss to avoid float16 gradient underflow
+
+    for epoch in range(num_epochs):
+        for data, target in train_loader:
+            data, target = data.cuda(), target.cuda()
+
+            optimizer.zero_grad()
+
+            with autocast():  # Runs the forward pass in mixed precision
+                output = model(data)
+                loss = criterion(output, target)
+
+            scaler.scale(loss).backward()  # Backpropagates the scaled loss
+            scaler.step(optimizer)  # Unscales gradients, then steps the optimizer
+            scaler.update()  # Adjusts the scale factor for the next iteration
+    ```
+3. Ensure the model is on GPU:
+
+    Make sure your model and data are on the GPU by using `.cuda()`.
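+
+The reason for scaling the loss in the loop above can be illustrated without a GPU. In this sketch, float16 rounding is emulated with the `struct` module's `"e"` format, the gradient value is made up for illustration, and the scale of 65536 is only an example chosen as a power of two.
+
+```python
+import struct
+
+
+def to_float16(x):
+    """Round a Python float to the nearest IEEE 754 half-precision value."""
+    return struct.unpack("e", struct.pack("e", x))[0]
+
+
+true_gradient = 1e-8   # made-up gradient, too small for float16
+loss_scale = 65536.0   # example scale factor (2**16)
+
+# Without scaling, the gradient underflows to zero in float16
+unscaled = to_float16(true_gradient)
+
+# With scaling, the value survives in float16 and is unscaled at higher precision
+scaled = to_float16(true_gradient * loss_scale)
+recovered = scaled / loss_scale
+
+print("without scaling:", unscaled)
+print("with scaling:   ", recovered)
+```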
+
+## Additional Tips
+- **Performance Monitoring**: Use NVIDIA’s nvprof, nsight, or nvidia-smi tools to monitor GPU usage and ensure Tensor Cores are being utilised.
+- **Profile Your Code**: Use TensorFlow's or PyTorch's built-in profilers to understand where your model spends most of its time and ensure mixed precision is being applied correctly.
+
+By following these steps, you should be able to take advantage of Tensor Cores in both TensorFlow and PyTorch, significantly accelerating the training process for deep learning models.
+
+
+
+
+
+
diff --git a/docs/80_hardware/SoC.png b/docs/80_hardware/SoC.png
diff --git a/docs/80_hardware/arch.md b/docs/80_hardware/arch.md
@@ -0,0 +1,24 @@
+# Processor Chipset Architecture
+<div style="text-align: center;">
+  <img src="./SoC.png" alt="Placeholder Image" style="width:50%;">
+  <p><em>Intel Meteor Lake processor architecture. Modern IC vendors tend to integrate various computing components on a single chip to improve performance. Bioimage analysis often uses several of the processor's computational units. Certain processor architectures suit specific tasks better, e.g. image decoding/encoding can take advantage of the Intel Integrated Performance Primitives (IPP) library with hardware-level acceleration.</em></p>
+</div>
+
+Modern computer CPUs increasingly resemble a System-on-a-Chip (SoC) that integrates the major components of a computing device, including the CPU, GPU, NPU and RAM. This physical compactness shortens the communication paths between computing units and thereby improves performance.
+
+However, SoCs can still be classified by their CPU instruction set, mainly x86 and ARM. Python libraries built natively for one architecture may not run directly on the other without OS-level translation or compilation from source; for example, legacy x86 Python libraries may not run on ARM computers. For power-efficiency reasons, chipset manufacturers are releasing new SoCs based on the ARM architecture, yet most existing bioimage analysis software is pre-compiled for x86. Apple's Rosetta 2 alleviates the issue, but compatibility is not 100%. Bear this in mind when choosing a CPU for your analysis work.
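+
+A quick way to see which instruction set your Python interpreter runs on is the standard-library `platform` module. The sketch below is a simplification: the helper `classify_architecture` and its list of machine names are assumptions covering common cases only. Note that an x86 build of Python running under a translation layer such as Rosetta 2 will typically still report an x86 machine name.
+
+```python
+import platform
+
+
+def classify_architecture(machine):
+    """Map a machine string (as returned by platform.machine()) to a CPU family."""
+    name = machine.lower()
+    if name in {"x86_64", "amd64", "i386", "i686", "x86"}:
+        return "x86"
+    if name.startswith(("arm", "aarch64")):
+        return "ARM"
+    return "other"
+
+
+machine = platform.machine()
+print(f"{machine} -> {classify_architecture(machine)}")
+```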
+
+When necessary, consult the software developers about supported CPU platforms. The following table summarises common CPU architectures:
+
+| Feature                          | Apple Silicon      | Intel               | AMD               | Snapdragon X Elite | NVIDIA            |
+|----------------------------------|--------------------|---------------------|-------------------|--------------------|--------------------|
+| **Architecture**                 | ARM-based          | x86/x86-64          | x86/x86-64        | ARM-based          | ARM-based (Grace CPU) |
+| **Notable Series**               | M1, M2             | Core, Xeon          | Ryzen, EPYC       | Snapdragon 8cx     | Grace CPU          |
+| **Manufacturing Process**        | 5nm (TSMC)         | 10nm, 7nm, 14nm     | 7nm, 6nm, 5nm     | 7nm (TSMC)         | 5nm (TSMC)         |
+| **Performance Cores**            | High-performance   | High-performance    | High-performance  | High-performance   | High-performance   |
+| **Efficiency Cores**             | High-efficiency    | Yes (hybrid designs) | Not typical       | High-efficiency    | Not typical        |
+| **Integrated Graphics**          | Yes (Apple GPU)    | Yes (Intel Iris, UHD) | Yes (Radeon Graphics) | Yes (Adreno GPU)   | Yes (NVIDIA GPU)   |
+| **Thermal Design Power (TDP)**   | Low to moderate    | Moderate to high    | Moderate to high  | Low                | Moderate to high   |
+| **Primary Use Cases**            | Laptops, Desktops  | Laptops, Desktops, Servers | Laptops, Desktops, Servers | Laptops, Mobile Devices | HPC, AI, Data Centers |
+| **Special Features**             | Unified Memory     | Hyper-Threading, vPro | Simultaneous Multithreading (SMT) | AI Engine, 5G      | AI Acceleration, NVLink |
+| **Compatibility**                | macOS              | Windows, Linux, macOS | Windows, Linux   | Windows, Android   | Linux              |
diff --git a/docs/80_hardware/gpu.md b/docs/80_hardware/gpu.md
@@ -0,0 +1,23 @@
+# GPU Support
+## AI Training
+Although all SoC manufacturers embed a GPU in their chipsets, AI-based analysis relies largely on NVIDIA CUDA as the base software stack. Common Python neural network libraries (PyTorch and TensorFlow) are the foundation of popular models such as U-Net, Cellpose and StarDist. PyTorch has recently gained support for AMD ROCm and Intel oneAPI acceleration, but community support is still fairly limited compared with CUDA. Considering training scalability and infrastructure support across major GPU farms and research clusters, NVIDIA remains the dominant choice for training new models.
+
+## AI Inference
+Machine learning workflows consist of two parts: model training and inference. Applying a fixed AI model to new data requires far fewer computational resources than training from scratch. For smaller AI tasks, non-CUDA chipsets therefore widen the options for bioimage analysis. Inference with neural networks can be accelerated in hardware by purpose-built circuits, often referred to as neural processing units (NPUs). NVIDIA, for its part, added Tensor Cores to its later GPU products, bundled with optimised packages such as cuDNN and Transformer Engine. A quick guide to Tensor Core based acceleration is available [here](../15_gpu_acceleration/tensor_core.md).
+
+<div style="text-align: center;">
+  <img src="./npu.png" alt="Placeholder Image" style="width:50%;">
+  <p><em>General Matrix Multiplication (GEMM) is the fundamental building block of neural network (NN) operations. NNs and image manipulation share a similar mathematical basis: embarrassingly parallel tasks involving matrices, which is why GPUs are widely used in machine learning.</em></p>
+</div>
+
+## GPGPU Acceleration
+Apart from AI applications, bioimage analysis tasks such as single plane illumination fluorescence correlation spectroscopy (SPIM-FCS) perform [pixelwise fitting of the autocorrelation function](https://github.com/bpi-oxford/Gpufit/blob/master/Gpufit/models/spim_acfN.cuh). Such image analysis can use the parallel processing power of GPUs to speed up research.
+
+One high-level analysis package, [pyclesperanto](https://github.com/clEsperanto/pyclesperanto_prototype), provides GPU acceleration based on OpenCL. This kind of computing uses the GPU not only for graphics processing but also for more generic calculations, which is why such a GPU is often called a general-purpose GPU (GPGPU). In this sense, vendors like AMD and Intel are alternatives to NVIDIA.
+
+|                 | **NVIDIA**                                                                                                                                                 | **AMD**                                                                                                                               | **Intel**                                                                                                                            | **Apple**                                                                                                               |
+|---------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
+| **OpenCL Support**        | Strong support but secondary to CUDA. Performance is robust.                                                                                               | Major proponent with excellent support and performance.                                                                               | Improved support, especially with newer architectures. Performance historically lagged but is improving.                             | Limited support, focus shifting to Metal.                                                                 |
+| **OpenGL Support**        | Excellent performance and highly optimized drivers.                                                                                                        | Strong support, competitive performance, though historically lagged in specific cases.                                               | Solid support, especially for integrated graphics. Significant improvements with newer discrete GPUs.                                | Historically good, but deprecated in favor of Metal.                                                    |
+| **Vendor-Specific Toolkit** | **CUDA**: Most mature and widely-used. Extensive libraries and tools. Highly optimized hardware.                                                           | **ROCm**: Comprehensive platform, open-source, flexible. Primarily supports newer GPUs.                                               | **oneAPI**: Unified model supporting multiple architectures. Provides DPC++, integrates OpenCL. Relatively new, still gaining traction. | **Metal**: Modern API designed specifically for Apple hardware. Provides low-level access to GPU features.               |
+| **Performance**           | Typically leads in raw performance, especially for CUDA-based applications. Highly optimized.                                                               | Excels in OpenCL applications. High memory bandwidth and compute capabilities.                                                       | Emerging player with promising solutions. Strong integration with CPUs.                                                               | Excellent performance on Apple hardware, particularly optimized for Apple Silicon (M1, M2, etc.).       |
diff --git a/docs/80_hardware/npu.md b/docs/80_hardware/npu.md
@@ -0,0 +1,25 @@
+
+# Neural Processing Unit (NPU)
+<div style="text-align: center;">
+  <img src="./npu_2.png" alt="Placeholder Image" style="width:50%;">
+  <p><em>Schematic depiction of the outer matrix product AB of two matrices A and B. NPUs implement GEMMs by partitioning the output matrix into tiles, which are loaded in parallel from the memory buffer, multiplied and accumulated into the output.</em></p>
+</div>
+
+A Neural Processing Unit (NPU) is a specialized hardware accelerator designed to efficiently handle the computational demands of AI and machine learning tasks, particularly neural network inference and training. NPUs are optimized for the types of operations commonly used in deep learning, such as matrix multiplications, convolutions, and activation functions. As of mid-2024, NPUs are embedded in various SoCs, giving a wider choice of hardware for AI applications.
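+
+The tiling idea mentioned in the figure caption can be sketched in plain Python. This is an illustration of the loop structure only, not vendor code: the function name `tiled_matmul` and the tile size are hypothetical, and real NPUs process the tiles in parallel hardware rather than in nested loops.
+
+```python
+def tiled_matmul(a, b, tile=2):
+    """Multiply a (m x k) by b (k x n), accumulating one output tile at a time."""
+    m, k, n = len(a), len(b), len(b[0])
+    out = [[0.0] * n for _ in range(m)]
+    for i0 in range(0, m, tile):            # tile rows of the output
+        for j0 in range(0, n, tile):        # tile columns of the output
+            for k0 in range(0, k, tile):    # slices of the shared dimension
+                for i in range(i0, min(i0 + tile, m)):
+                    for j in range(j0, min(j0 + tile, n)):
+                        acc = out[i][j]     # continue the running sum for this element
+                        for kk in range(k0, min(k0 + tile, k)):
+                            acc += a[i][kk] * b[kk][j]
+                        out[i][j] = acc
+    return out
+
+
+a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
+identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
+result = tiled_matmul(a, identity)
+print(result)
+```
+
+Because each output tile depends only on a strip of rows from one matrix and a strip of columns from the other, the tiles can be computed independently, which is what makes the scheme suitable for parallel hardware.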
+
+| Feature                          | Google TPU (USB/M.2)      | Apple Silicon      | AMD                      | Intel (after Meteor Lake)       | NVIDIA (Grace Hopper)     | NVIDIA (Jetson)           | Snapdragon X Elite        |
+|----------------------------------|---------------------------|--------------------|--------------------------|---------------------------|---------------------------|---------------------------|---------------------------|
+| **Product Name**                 | Edge TPU                  | Apple Neural Engine| 3rd Gen Ryzen AI| VPU, GNA, AI Engine       | TensorRT, DLA, Grace Hopper| Jetson Xavier, Nano, TX2  | Qualcomm AI Engine        |
+| **Primary Use Case**             | Edge AI, Low Power Devices| Mobile, Desktop    | GPUs with AI Capabilities| Mobile, Desktop, Edge AI  | Data Center, HPC, Embedded | Embedded AI     | Mobile, Edge Computing    |
+| **Performance**                  | Moderate                  | High               | Moderate to High         | Moderate to High          | Very High                 | Moderate to High          | Moderate                  |
+| **Efficiency**                   | High                      | High               | Moderate                 | High                      | Moderate to High          | High                      | High                      |
+| **Special Features**             | Google Cloud Compatible, Tensor Operations| Unified Memory, Tight OS Integration | APUs, ROCm | Low Power, Vision Processing, Integrated AI | CUDA Integration, Tensor Cores | Low Power, Integrated AI | Integrated 5G, AI on Device |
+| **Flexibility**                  | Specialized for TensorFlow| General Purpose    | AI with General Compute  | Specialized for AI and Vision| Highly Specialized        | General Purpose           | General Purpose           |
+| **Compatibility**                | TensorFlow Lite           | macOS              | Windows, Linux           | Windows, Linux            | Windows, Linux            | Linux                     | Android, Windows          |
+| **Scalability**                  | High                      | Moderate           | Moderate                 | Moderate                  | High                      | Moderate                  | Moderate                  |
+| **Integration**                  | Edge Devices              | Mobile, Desktop    | GPUs                     | Mobile, Desktop, Edge Devices | HPC, Cloud, Embedded      | Embedded Systems| Mobile SoCs               |
+| **Availability**                 | USB, M.2 Modules          | Built-in (A-series, M-series)| Radeon Instinct GPUs | Integrated in Meteor Lake CPUs | Available in GPUs, Servers | Available in Embedded Modules | Snapdragon SoCs           |
+
+## External Reading:
+- [Get started with tensorflow-metal (AI on Apple Neural Engine)](https://developer.apple.com/metal/tensorflow-plugin/)
+- [PluggableDevice: Device Plugin for Tensorflow](https://blog.tensorflow.org/2021/06/pluggabledevice-device-plugins-for-TensorFlow.html)
diff --git a/docs/80_hardware/npu.png b/docs/80_hardware/npu.png
diff --git a/docs/80_hardware/npu_2.png b/docs/80_hardware/npu_2.png