Fix images for Docusaurus (#3512)

CristianLara · facebook-github-bot · commit 1eb535559023 · 2025-03-19T08:46:46.000-07:00
Summary:
# Images by path
In the old tutorials we included images as base64 attachments, but in the new ones we specify them via file path. Relative file paths in the notebook were breaking during the conversion process because the tutorial MDX file ends up in a different filesystem location while the image file was left behind. We fix this by copying the image file to the tutorials docs location and updating the relative file path as appropriate. A half-baked implementation of this was already present in the script but was being bypassed.

# Images as base64 attachments
The old tutorials included images using base64 attachments, which were not properly supported by the conversion script. These image attachments are stored in the notebook cell's "attachments" field in base64 format with their associated mime_type and are referenced in the Markdown via attachment name.

The pattern we search for in the Markdown is `![alt_text](attachment:attachment_name title)` with three groups:
- group 1 = alt_text (optional)
- group 2 = attachment_name
- group 3 = title (optional)

To represent this in MD we replace the attachment reference with the base64 encoded string as `![{alt_text}](data:{mime_type};base64,{img_as_base64})`.

This fix won't automatically propagate to the broken old tutorials. In a separate commit I'll fix them by updating the pre-built tutorial MDX stored in the `docusaurus-versions` branch.

Pull Request resolved: #3512
Reviewed By: mpolson64
Differential Revision: D71322665
Pulled By: CristianLara
fbshipit-source-id: f25fe4c1fb057539f79e40eeebef3fa040b3f84d
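As a rough illustration of the attachment handling described in the summary, here is a minimal standalone sketch. It is not the code from this commit: `inline_attachments` and `ATTACHMENT_IMAGE` are hypothetical names, and it uses `re.sub` with a callback rather than slicing by match spans. It also strips whitespace from the captured attachment name, on the assumption that a title may follow the name after a space.

```python
import re

# Pattern shape from the summary: alt text, attachment name, optional quoted title.
ATTACHMENT_IMAGE = re.compile(
    r"""!\[([^\]]*)\]\(attachment:(.*?)(?=\"|\))(\".*\")?\)"""
)


def inline_attachments(markdown: str, attachments: dict[str, dict[str, str]]) -> str:
    """Replace `attachment:` image references with inline base64 data URIs."""

    def to_data_uri(match: re.Match) -> str:
        alt_text, name, _title = match.groups()
        # Assumption: the captured name can carry a trailing space when a
        # title follows it, so strip before looking up the attachment.
        mime_type, data = next(iter(attachments[name.strip()].items()))
        return f"![{alt_text}](data:{mime_type};base64,{data})"

    # re.sub substitutes every non-overlapping match in a single pass, so
    # replacements of different lengths cannot invalidate later matches.
    return ATTACHMENT_IMAGE.sub(to_data_uri, markdown)
```

Using a substitution callback keeps each replacement tied to its own match, which matters when a single cell references more than one attachment.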
diff --git a/scripts/convert_ipynb_to_mdx.py b/scripts/convert_ipynb_to_mdx.py
@@ -196,10 +196,53 @@ def create_buttons(
     return f'<LinkButtons\n  githubUrl="{github_url}"\n  colabUrl="{colab_url}"\n/>\n\n'
 
 
-def handle_images_found_in_markdown(
+def handle_image_attachments(
+    markdown: str,
+    attachments: dict[str, dict[str, str]],
+) -> str:
+    """
+    Image attachments are stored in the notebook cell's "attachments" field in base64
+    format with their associated mime_type and referenced in the markdown via
+    attachment name.
+
+    The pattern we search for in the Markdown is
+    `![alt_text](attachment:attachment_name title)` with three groups:
+
+    - group 1 = alt_text (optional)
+    - group 2 = attachment_name
+    - group 3 = title (optional)
+
+    To represent this in MD we replace the attachment reference with the base64 encoded
+    string as `![{alt_text}](data:{mime_type};base64,{img_as_base64})`
+
+    Args:
+        markdown (str): The markdown content containing image attachments.
+        attachments (dict[str, dict[str, str]]): A dictionary of attachments with their
+            corresponding MIME types and base64 encoded data.
+
+    Returns:
+        str: The markdown content with images converted to base64 format.
+    """
+    markdown_image_pattern = re.compile(
+        r"""!\[([^\]]*)\]\(attachment:(.*?)(?=\"|\))(\".*\")?\)"""
+    )
+    # Replace from last match to first so earlier spans stay valid.
+    for search in reversed(list(markdown_image_pattern.finditer(markdown))):
+        alt_text, attachment_name, _ = search.groups()
+        mime_type, base64 = next(iter(attachments[attachment_name].items()))
+        start, end = search.span()
+        markdown = (
+            markdown[:start]
+            + generate_img_base64_md(base64, mime_type, alt_text)
+            + markdown[end:]
+        )
+    return markdown
+
+
+def handle_image_paths_found_in_markdown(
     markdown: str,
     new_img_dir: Path,
-    lib_dir: Path,
+    nb_path: Path,
 ) -> str:
     """
     Update image paths in the Markdown, and copy the image to the docs location.
@@ -210,6 +253,9 @@ def handle_images_found_in_markdown(
     - group 1 = path/to/image.png
     - group 2 = "title"
 
+    We explicitly exclude matches whose path starts with `attachment:`, as this
+    indicates that the image is embedded as a base64 attachment, not a file path.
+
     The first group (the path to the image from the original notebook) will be replaced
     with ``assets/img/{name}`` where the name is `image.png` from the example above. The
     original image will also be copied to the new location
@@ -219,12 +265,15 @@ def handle_images_found_in_markdown(
         markdown (str): Markdown where we look for Markdown flavored images.
         new_img_dir (Path): Path where images are copied to for display in the
             MDX file.
-        lib_dir (Path): The location for the Bean Machine repo.
+        nb_path (Path): The location of the notebook; relative image paths are
+            resolved against its parent directory.
 
     Returns:
         str: The original Markdown with new paths for images.
     """
-    markdown_image_pattern = re.compile(r"""!\[[^\]]*\]\((.*?)(?=\"|\))(\".*\")?\)""")
+    markdown_image_pattern = re.compile(
+        r"""!\[[^\]]*\]\((?!attachment:)(.*?)(?=\"|\))(\".*\")?\)"""
+    )
     searches = list(re.finditer(markdown_image_pattern, markdown))
 
     # Return the given Markdown if no images are found.
@@ -250,11 +299,11 @@ def handle_images_found_in_markdown(
 
         # Copy the original image to the new location.
         if old_path.exists():
+            # use the path as given if it exists (absolute or relative to the cwd)
             old_img_path = old_path
         else:
-            # Here we assume the original image exists in the same directory as the
-            # notebook, which should be in the tutorials folder of the library.
-            old_img_path = (lib_dir / "tutorials" / old_path).resolve()
+            # fall back to path relative to the notebook
+            old_img_path = (nb_path.parent / old_path).resolve()
         new_img_path = str(new_img_dir / name)
         shutil.copy(str(old_img_path), new_img_path)
 
@@ -359,7 +408,7 @@ def get_source(cell: NotebookNode) -> str:
 def handle_markdown_cell(
     cell: NotebookNode,
     new_img_dir: Path,
-    lib_dir: Path,
+    nb_path: Path,
 ) -> str:
     """
     Handle the given Jupyter Markdown cell and convert it to MDX.
@@ -368,17 +417,17 @@ def handle_markdown_cell(
         cell (NotebookNode): Jupyter Markdown cell object.
         new_img_dir (Path): Path where images are copied to for display in the
             Markdown cell.
-        lib_dir (Path): The location for the Bean Machine library.
+        nb_path (Path): The location of the notebook, used to resolve relative
+            image paths.
 
     Returns:
         str: Transformed Markdown object suitable for inclusion in MDX.
     """
     markdown = get_source(cell)
 
-    # Update image paths in the Markdown and copy them to the Markdown tutorials folder.
-    # Skip - Our images are base64 encoded, so we don't need to copy them to the docs
-    # folder.
-    # markdown = handle_images_found_in_markdown(markdown, new_img_dir, lib_dir)
+    # Handle the different ways images are included in the Markdown.
+    markdown = handle_image_paths_found_in_markdown(markdown, new_img_dir, nb_path)
+    markdown = handle_image_attachments(markdown, cell.get("attachments", {}))
 
     markdown = sanitize_mdx(markdown)
     mdx = mdformat.text(markdown, options={"wrap": 88}, extensions={"myst"})
@@ -411,6 +460,26 @@ def handle_cell_input(cell: NotebookNode, language: str) -> str:
     return f"```{language}\n{cell_source}\n```\n\n"
 
 
+def generate_img_base64_md(
+    img_as_base64: int | str | NotebookNode,
+    mime_type: int | str | NotebookNode,
+    alt_text: str = "",
+) -> str:
+    """
+    Generate a markdown image tag from a base64 encoded image.
+
+    Args:
+        img_as_base64 (int | str | NotebookNode): The base64 encoded image data.
+        mime_type (int | str | NotebookNode): The MIME type of the image.
+        alt_text (str, optional): The alternative text for the image. Defaults to an
+            empty string.
+
+    Returns:
+        str: A markdown formatted image tag.
+    """
+    return f"![{alt_text}](data:{mime_type};base64,{img_as_base64})"
+
+
 def handle_image(
     values: list[dict[str, int | str | NotebookNode]],
 ) -> list[tuple[int, str]]:
@@ -431,7 +500,7 @@ def handle_image(
         index = value["index"]
         mime_type = value["mime_type"]
         img = value["data"]
-        output.append((index, f"![](data:image/{mime_type};base64,{img})\n\n"))
+        output.append((index, f"{generate_img_base64_md(img, mime_type)}\n\n"))
     return output
 
 
@@ -880,7 +949,7 @@ def transform_notebook(path: Path, nb_metadata: object) -> str:
 
         # Handle a Markdown cell.
         if cell_type == "markdown":
-            mdx += handle_markdown_cell(cell, img_folder, LIB_DIR)
+            mdx += handle_markdown_cell(cell, img_folder, path)
 
         # Handle a code cell.
         if cell_type == "code":
diff --git a/tutorials/closed_loop/closed_loop.ipynb b/tutorials/closed_loop/closed_loop.ipynb
@@ -460,7 +460,7 @@
         "\n",
         "Internally, Ax uses a class named `Scheduler` to orchestrate the trial deployment, polling, data fetching, and candidate generation.\n",
         "\n",
-        "![Scheduler state machine](../../assets/scheduler_state_machine.png)\n",
+        "![Scheduler state machine](scheduler_state_machine.png)\n",
         "\n",
         "The `OrchestrationConfig` provides users with control over various orchestration settings:\n",
         "* `parallelism` defines the maximum number of trials that may be run at once. If your external system supports multiple evaluations in parallel, increasing this number can significantly decrease experimentation time. However, it is important to note that as parallelism increases, optimiztion performance often decreases. This is because adaptive experimentation methods rely on previously observed data for candidate generation -- the more tirals that have been observed prior to generation of a new candidate, the more accurate Ax's model will be for generation of that candidate.\n",
diff --git a/tutorials/closed_loop/scheduler_state_machine.png b/tutorials/closed_loop/scheduler_state_machine.png