notebook updates
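
The optional `og_mask_directory` handling added to `0.3_normalize_and_crop.ipynb` can be read as "use the original mask when a valid directory is given, otherwise fall back to a copy of the watershed label". Below is a minimal sketch of that logic as a standalone helper. The names `resolve_mask` and `reader` are hypothetical and do not appear in the notebook; `reader` stands in for the `imread` call the notebook already uses. Unlike the notebook cell, the sketch takes the directory as an argument, so it does not need to guard against an undefined variable.

```python
import os


def resolve_mask(og_mask_directory, file, lbl, reader):
    """Return the original mask if available, otherwise a copy of the label.

    Hypothetical helper for illustration only. `reader` is any callable that
    loads an image from a path (for example, the notebook's `imread`).
    `lbl` must provide a `.copy()` method, as NumPy arrays do.
    """
    # Only a str or path-like value that points at an existing directory counts
    is_path = isinstance(og_mask_directory, (str, os.PathLike))
    if is_path and os.path.isdir(og_mask_directory):
        return reader(os.path.join(og_mask_directory, file))
    # None, an empty string, or a missing directory: reuse the watershed label
    return lbl.copy()
```

As in the notebook cell, a valid directory that lacks the requested file still raises whatever error the reader raises; only a missing or unset directory triggers the fallback.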

mariana-gferreira · mariana-gferreira · commit a94f40e1549c · 2025-08-01T15:09:00.000+01:00
diff --git a/notebooks/0.1_summary_fov_per_sample_id.ipynb b/notebooks/0.1_summary_fov_per_sample_id.ipynb
@@ -5,20 +5,20 @@
    "id": "1415df36",
    "metadata": {},
    "source": [
-    "# 0. Split the dataset into train and test subsets\n",
+    "# 0.1. Generate a summary of the available data\n",
     "Data from the same origin should be kept in the same subset to avoid data leakage.\n",
     "\n",
     "This notebook crosses the image names with the identifiers in the csv file to get the number of FOVs (Fields of View) for each sample. Also counting the number of empty FOVs.\n",
     "\n",
-    "These counts can then be used to split the dataset into train and test subsets. Usually 80% for training and 20% for testing.\n"
+    "These counts can then be used to make an informed split of the dataset into train and test subsets, typically 80% for training and 20% for testing.\n"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "d1ebf2ae",
    "metadata": {},
    "source": [
-    "## 0.1. Load libraries and custom functions\n",
+    "## 0.1.1. Load libraries and custom functions\n",
     "\n",
     "Load the `pandas`, `os`, and `skimage` libraries.\n"
    ]
@@ -95,7 +95,7 @@
    "id": "0baddb91",
    "metadata": {},
    "source": [
-    "## 0.2. Code\n"
+    "## 0.1.2. Code\n"
    ]
   },
   {
diff --git a/notebooks/0.2_label_watershed.ipynb b/notebooks/0.2_label_watershed.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# 0.1. 3D Seeded Watershed Segmentation\n",
+    "# 0.2. 3D Seeded Watershed Segmentation\n",
     "\n",
     "This script uses the `watershed` algorithm from `skimage.segmentation` to perform seeded segmentation on a label image where a cluster of objects is classified with a single label. \n",
     "\n",
@@ -15,7 +15,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 0.1.1. Load Python Libraries\n",
+    "## 0.2.1. Load Python Libraries\n",
     "\n",
     "Load the necessary Python libraries for image processing and visualization. \n",
     "\n",
@@ -50,7 +50,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 0.1.2. Load Functions\n",
+    "## 0.2.2. Load Functions\n",
     "\n",
     "Load custom functions to handle image processing tasks."
    ]
@@ -267,7 +267,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 0.1.3. Pipeline"
+    "## 0.2.3. Pipeline"
    ]
   },
   {
@@ -306,7 +306,51 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Processing C4-16122021_Label45_367L_w3_1076_100x_0p21_01_scaled_oriScale.tif...\n",
+      "Processing label 1...\n",
+      "Only one conected component found for label 1, no markers for watershed applied.\n",
+      "Transfering to relabeled array\n",
+      "Processing label 2...\n",
+      "Only one conected component found for label 2, no markers for watershed applied.\n",
+      "Transfering to relabeled array\n",
+      "Processing label 3...\n",
+      "Processing label 4...\n",
+      "Only one conected component found for label 4, no markers for watershed applied.\n",
+      "Transfering to relabeled array\n",
+      "Processing label 5...\n",
+      "Only one conected component found for label 5, no markers for watershed applied.\n",
+      "Transfering to relabeled array\n",
+      "Processing label 6...\n",
+      "Processing label 7...\n",
+      "Processing label 8...\n",
+      "Only one conected component found for label 8, no markers for watershed applied.\n",
+      "Transfering to relabeled array\n",
+      "Processing label 9...\n",
+      "Only one conected component found for label 9, no markers for watershed applied.\n",
+      "Transfering to relabeled array\n",
+      "Processing C4-12012022_Label46_367L_Cd16_100x_0p21_01_scaled_oriScale.tif...\n",
+      "Processing label 1...\n",
+      "Processing C3-03022022_Label49_t1_100x_0p21_02_POS_current_scaled_oriScale.tif...\n",
+      "Processing label 1...\n",
+      "Processing label 2...\n",
+      "Only one conected component found for label 2, no markers for watershed applied.\n",
+      "Transfering to relabeled array\n",
+      "Processing label 3...\n",
+      "Processing label 4...\n",
+      "Processing C4-26012022_Label48_t1strep_100x_0p21_03_POS_current_scaled_oriScale.tif...\n",
+      "Processing label 1...\n",
+      "Only one conected component found for label 1, no markers for watershed applied.\n",
+      "Transfering to relabeled array\n",
+      "Processing C4-16122021_Label45_367L_w3_1076_100x_0p21_01_scaled_oriScale 2.tif...\n",
+      "C4-16122021_Label45_367L_w3_1076_100x_0p21_01_scaled_oriScale 2.tif is likely not a label image. Skipping...\n"
+     ]
+    }
+   ],
    "source": [
     "# Create The save directory if it does not exist\n",
     "# If it exists, it will not raise an error\n",
diff --git a/notebooks/0.3_normalize_and_crop.ipynb b/notebooks/0.3_normalize_and_crop.ipynb
@@ -5,7 +5,7 @@
    "id": "7cfb11f5",
    "metadata": {},
    "source": [
-    "# 0.2. Normalize and Crop Training Data\n",
+    "# 0.3. Normalize and Crop Training Data\n",
     "\n",
     "Due to the large size of the training data, we will normalize and crop the images to a smaller size. This will help in reducing the computational load and make it easier to work with the data.\n",
     "\n",
@@ -17,7 +17,7 @@
    "id": "b21b2780",
    "metadata": {},
    "source": [
-    "## 0.2.1. Load Python Libraries"
+    "## 0.3.1. Load Python Libraries"
    ]
   },
   {
@@ -39,7 +39,7 @@
    "id": "e44e1ad1",
    "metadata": {},
    "source": [
-    "## 0.2.2. Load Custom Functions"
+    "## 0.3.2. Load Custom Functions"
    ]
   },
   {
@@ -119,7 +119,7 @@
    "id": "105eb73d",
    "metadata": {},
    "source": [
-    "## 0.2.3. Code to Normalize and Crop Training Data"
+    "## 0.3.3. Code to Normalize and Crop Training Data"
    ]
   },
   {
@@ -143,19 +143,24 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# The path to the images and masks should be specified in the variables `img_directory`, `og_mask_directory`, and `watershed_label_directory`.\n",
+    "# The paths to the images and watershed labels should be specified in the variables `img_directory` and `watershed_label_directory`.\n",
     "img_directory = \"directory/to/images\"\n",
     "\n",
-    "og_mask_directory = \"directory/to/original/masks\"\n",
-    "\n",
     "watershed_label_directory = \"directory/to/watershed/labels\"\n",
     "\n",
+    "# OPTIONAL: If you have an original mask directory, specify it here.\n",
+    "# This is used to reduce the number of similar crops generated by cropping each label in a cluster individually.\n",
+    "og_mask_directory = \"directory/to/original/masks\"  # Set to None if not used\n",
+    "\n",
     "# Provide the directories to store the cropped images and labels\n",
     "# They will be created if they do not exist\n",
     "cropped_img_directory = \"directory/to/cropped/images\"\n",
     "cropped_lbl_directory = \"directory/to/cropped/labels\"\n",
     "\n",
     "# Size of the crop in the x and y dimensions\n",
+    "# Should be at least twice the patch size used for training the model\n",
+    "# This is to ensure that an edge crop still has a sufficient size for training\n",
+    "# For example, if the patch size is 128, a crop size of 256 is recommended.\n",
     "crop_size_xy = 256\n",
     "\n",
     "# Minimum size of the crop in the z-dimension\n",
@@ -181,9 +186,64 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "48a21f98",
+   "id": "b60a1b77",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "DONE: 08082024_rLabel_014.2_TAMRA_sense_P00002_C4scaled_oriScale.tif\n",
+      "DONE: 08082024_rLabel_014.2_TAMRA_sense_P00006_C4scaled_oriScale.tif\n",
+      "DONE: 08082024_rLabel_014.2_TAMRA_sense_P00009_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00013_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00017_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00020_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00023_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00027_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00031_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00033_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00045_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00049_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00051_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00052_C4scaled_oriScale.tif\n",
+      "DONE: 21082024_rLabel_024_TAMRA_sense_P00055_C4scaled_oriScale.tif\n",
+      "DONE: C3-03022022_Label49_t1_100x_0p21_02_POS_current_scaled_oriScale.tif\n",
+      "DONE: C3-03022022_Label49_t1_100x_0p21_03_POS_current_scaled_oriScale.tif\n",
+      "DONE: C3-03022022_Label49_t3_100x_0p21_03_POS_current_scaled_oriScale.tif\n",
+      "DONE: C3-03022022_Label49_t3_100x_0p21_04_POS_current_scaled_oriScale.tif\n",
+      "DONE: C3-26012022_Label48_t2_100x_0p21_03_POS_current_scaled_oriScale.tif\n",
+      "DONE: C4-02122021_Label43_label1_343_0p25_100x_0p21_01_scaled_oriScale.tif\n",
+      "DONE: C4-02122021_Label43_label1_343_0p25_100x_0p21_02_scaled_oriScale.tif\n",
+      "DONE: C4-02122021_Label43_label1_343_0p25_100x_0p21_04_scaled_oriScale.tif\n",
+      "DONE: C4-03022022_Label49_t2_100x_0p21_03_POS_current_scaled_oriScale.tif\n",
+      "DONE: C4-03022022_Label49_t2_100x_0p21_04_POS_current_scaled_oriScale.tif\n",
+      "DONE: C4-12012022_Label46_367L_BCd16low_100x_0p21_02_scaled_oriScale.tif\n",
+      "DONE: C4-12012022_Label46_367L_Cd16_100x_0p21_01_scaled_oriScale.tif\n",
+      "DONE: C4-12012022_Label46_367L_Cd16_100x_0p21_03_scaled_oriScale.tif\n",
+      "DONE: C4-16122021_Label44_CD16_367L_w1_closetolabel_100x_0p21_01_scaled_oriScale.tif\n",
+      "DONE: C4-16122021_Label44_CD16_367L_w1_closetolabel_100x_0p21_02_scaled_oriScale.tif\n",
+      "DONE: C4-16122021_Label44_CD16_367L_w1_closetolabel_100x_0p21_03_scaled_oriScale.tif\n",
+      "DONE: C4-16122021_Label44_CD16_367L_w1_closetolabel_100x_0p21_04_scaled_oriScale.tif\n",
+      "DONE: C4-16122021_Label45_367L_w3_1076_100x_0p21_01_scaled_oriScale.tif\n",
+      "DONE: C4-16122021_Label45_367L_w3_1076_100x_0p21_03_scaled_oriScale.tif\n",
+      "DONE: C4-16122021_Label45_367L_w3_1077_100x_0p21_01_scaled_oriScale.tif\n",
+      "DONE: C4-16122021_Label45_367L_w3_1079_100x_0p21_03_scaled_oriScale.tif\n",
+      "DONE: C4-26012022_Label48_t1_100x_0p21_03_POS_current_scaled_oriScale.tif\n",
+      "DONE: C4-26012022_Label48_t1strep_100x_0p21_02_POS_current_scaled_oriScale.tif\n",
+      "DONE: C4-26012022_Label48_t1strep_100x_0p21_03_POS_current_scaled_oriScale.tif\n",
+      "DONE: L72_w6_P00124_scaled_oriScale.tif\n",
+      "DONE: L74_w9_P00101_scaled_oriScale.tif\n",
+      "DONE: L74_w9_P00107_scaled_oriScale.tif\n",
+      "DONE: rLabel_012.2_TAMRA_sense_P00019_C4scaled_oriScale.tif\n",
+      "DONE: rLabel_012.2_TAMRA_sense_P00028_C4scaled_oriScale.tif\n",
+      "DONE: rLabel_012.2_TAMRA_sense_P00033_C4scaled_oriScale.tif\n",
+      "DONE: rLabel_012.2_TAMRA_sense_P00034_C4scaled_oriScale.tif\n",
+      "DONE: rLabel_012.2_TAMRA_sense_P00041_C4scaled_oriScale.tif\n",
+      "DONE: rLabel_012.2_TAMRA_sense_P00042_C4scaled_oriScale.tif\n"
+     ]
+    }
+   ],
    "source": [
     "# Get the list of files in the specified image directory\n",
     "img_dir_list = sorted(os.listdir(img_directory))\n",
@@ -193,9 +253,19 @@
     "    # only process files with .tif or .tiff extensions\n",
     "    if file.endswith((\".tif\", \".tiff\")):\n",
     "        img = imread(os.path.join(img_directory, file))\n",
-    "        mask = imread(os.path.join(og_mask_directory, file))\n",
     "        lbl = imread(os.path.join(watershed_label_directory, file))\n",
     "\n",
+    "        # Handle different cases for og_mask_directory\n",
+    "        try:\n",
+    "            # Check if variable exists and is a valid directory path\n",
+    "            if og_mask_directory and os.path.isdir(og_mask_directory):\n",
+    "                mask = imread(os.path.join(og_mask_directory, file))\n",
+    "            else:\n",
+    "                mask = np.copy(lbl)\n",
+    "        except (NameError, TypeError):\n",
+    "            # Variable is undefined or is not a valid path type (None is handled by the check above)\n",
+    "            mask = np.copy(lbl)\n",
+    "\n",
     "        # Normalize the image from 1 to 99.8 percentile\n",
     "        img = normalize(img, 1, 99.8, axis=(0, 1, 2))\n",
     "\n",