epub images in lfs#2144

Merged
jo-elimu merged 2 commits into main from 2060-epub-images-in-lfs
Apr 18, 2025
@jo-elimu
Member

Issue Number

Purpose

Technical Details

Testing Instructions

Screenshots


Format Checks

Note

Files in PRs are automatically checked for format violations with mvn spotless:check.

If this PR contains files with format violations, run mvn spotless:apply to fix them.

jo-elimu self-assigned this Apr 18, 2025
jo-elimu requested a review from a team as a code owner April 18, 2025 08:08
jo-elimu linked an issue (6 tasks) Apr 18, 2025 that may be closed by this pull request
jo-elimu requested review from AshishBagdane, shiv810 and vuriaval and removed request for a team April 18, 2025 08:08
@codecov

codecov Bot commented Apr 18, 2025

Codecov Report

Attention: Patch coverage is 0% with 16 lines in your changes missing coverage. Please review.

Project coverage is 14.98%. Comparing base (f1b3e89) to head (a40d873).
Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
...t/storybook/StoryBookCreateFromEPubController.java | 0.00% | 12 Missing ⚠️
...java/ai/elimu/entity/content/multimedia/Image.java | 0.00% | 3 Missing ⚠️
src/main/java/ai/elimu/util/GitHubLfsHelper.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2144   +/-   ##
=========================================
  Coverage     14.97%   14.98%           
  Complexity      387      387           
=========================================
  Files           232      232           
  Lines          6089     6085    -4     
  Branches        703      701    -2     
=========================================
  Hits            912      912           
+ Misses         5127     5123    -4     
  Partials         50       50           


@coderabbitai
Contributor

coderabbitai Bot commented Apr 18, 2025

Walkthrough

This set of changes standardizes the storage and retrieval of images by updating the filename generation and URL construction to use the image's MD5 checksum and format extension, rather than relying on entity IDs or revision numbers. The image upload process to GitHub LFS is now performed immediately after image extraction, particularly for ePUB imports, rather than after persisting the parent entity. The JSP views for images and videos have been updated to display additional metadata, including checksums and URLs, with improved formatting. No changes were made to public method signatures or exported entities.

Changes

File(s) | Change Summary
pom-dependency-tree.txt | Updated main artifact version from 2.5.92-SNAPSHOT to 2.5.93-SNAPSHOT.
src/main/java/ai/elimu/entity/content/multimedia/Image.java | Simplified getUrl() to always generate a GitHub raw content URL using the image's MD5 checksum and format extension, removing conditional logic and local fallback.
src/main/java/ai/elimu/util/GitHubLfsHelper.java | Changed image filename generation in uploadImageToLfs to use the image's MD5 checksum and format extension instead of ID and revision.
src/main/java/ai/elimu/web/content/storybook/StoryBookCreateFromEPubController.java | Moved image storage (upload to GitHub LFS, database save, contribution event) to immediately after extraction from ePUB, eliminating deferred storage after StoryBook persistence. Adjusted chapter image handling accordingly.
src/main/webapp/WEB-INF/jsp/content/multimedia/image/edit.jsp | Added display of image's MD5 checksum, URL, and GitHub checksum in the aside section.
src/main/webapp/WEB-INF/jsp/content/multimedia/video/edit.jsp | Added line breaks after labels for checksum, URL, and GitHub checksum for improved layout in the aside section.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant StoryBookController
    participant GitHubLfsHelper
    participant Database

    User ->> StoryBookController: Upload ePUB file
    StoryBookController ->> StoryBookController: Extract images (cover, chapters)
    loop For each extracted image
        StoryBookController ->> GitHubLfsHelper: uploadImageToLfs(image bytes, ...)
        GitHubLfsHelper ->> StoryBookController: Return CID/checksum
        StoryBookController ->> Database: Save image with CID/checksum
        StoryBookController ->> Database: Record contribution event
    end
    StoryBookController ->> Database: Persist StoryBook and chapters with associated images

Assessment against linked issues

Objective Addressed Explanation
Store images in the webapp-lfs repo; Store ePUB images in the webapp-lfs repo (#2060)
Delete image download controller (#2060) No evidence of controller deletion in this PR; not addressed here.
Calculate checksum of bytes, and store in new database column (#2060) No changes related to checksum calculation/storage; assumed handled in previous PRs.
Delete bytes from the database (#2060) No changes related to deleting image bytes field in this PR.

Possibly related PRs

Suggested reviewers

  • AshishBagdane
  • SnehaHS65
  • Souvik-Cyclic

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/main/java/ai/elimu/web/content/storybook/StoryBookCreateFromEPubController.java (1)

291-297: 🛠️ Refactor suggestion

Replicate the robustness improvements for chapter‑level images

The same failure/duplication risk exists here. Please wrap the upload in try/catch and check for an existing image with the same MD5 before creating a new DB row, mirroring the cover‑image logic suggested above.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8daa59e and a40d873.

📒 Files selected for processing (6)
  • pom-dependency-tree.txt (1 hunks)
  • src/main/java/ai/elimu/entity/content/multimedia/Image.java (1 hunks)
  • src/main/java/ai/elimu/util/GitHubLfsHelper.java (1 hunks)
  • src/main/java/ai/elimu/web/content/storybook/StoryBookCreateFromEPubController.java (4 hunks)
  • src/main/webapp/WEB-INF/jsp/content/multimedia/image/edit.jsp (1 hunks)
  • src/main/webapp/WEB-INF/jsp/content/multimedia/video/edit.jsp (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: build (ubuntu-latest, 21)
  • GitHub Check: build (macos-latest, 17)
  • GitHub Check: build (windows-latest, 17)
  • GitHub Check: build (windows-latest, 21)
  • GitHub Check: build (ubuntu-latest, 17)
  • GitHub Check: build (macos-latest, 21)
  • GitHub Check: test_rest_localhost
🔇 Additional comments (5)
pom-dependency-tree.txt (1)

1-1: Version bump reflects new image handling implementation

The version update from 2.5.92-SNAPSHOT to 2.5.93-SNAPSHOT corresponds to the changes implemented for standardized image storage and retrieval using MD5 checksums.

src/main/webapp/WEB-INF/jsp/content/multimedia/video/edit.jsp (1)

361-361: Improved label formatting for better readability

The addition of line breaks after each label ensures consistent formatting and improves the visual separation between metadata fields in the UI.

Also applies to: 365-365, 369-369

src/main/webapp/WEB-INF/jsp/content/multimedia/image/edit.jsp (1)

413-426: Added display of image metadata for better transparency

This addition displays important image metadata (MD5 checksum, file URL, and GitHub checksum) that was previously not visible in the UI. This improves transparency and helps users understand how images are stored and accessed in the system.

src/main/java/ai/elimu/util/GitHubLfsHelper.java (1)

38-38: Standardized filename generation using content-based identifiers

Changed the filename generation strategy from using ID and revision number to using the MD5 checksum. This approach:

  1. Creates more stable and content-based identifiers
  2. Ensures consistent naming across the application
  3. Potentially reduces storage by avoiding duplicate content with different names

This aligns with the Image entity's URL generation to use MD5 checksums throughout the system.
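A minimal sketch of the content-based naming described above, assuming the checksum is the hex-encoded MD5 digest of the raw image bytes. The class and method names are hypothetical and not part of the elimu.ai codebase; only the `<md5>.<format>` shape comes from the change itself.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class ContentAddressedFilename {

    /** Returns the MD5 digest of the bytes as 32 lowercase hex characters. */
    static String md5Hex(byte[] bytes) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(bytes)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected on standard JDKs; treated as fatal in this sketch
            throw new IllegalStateException(e);
        }
    }

    /** Builds "<md5>.<format>", so identical bytes always map to the same name. */
    static String filename(byte[] bytes, String formatName) {
        return md5Hex(bytes) + "." + formatName.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        byte[] imageBytes = "hello".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(filename(imageBytes, "PNG"));
    }
}
```

Because the name depends only on the bytes and the format, two uploads of the same file resolve to the same path regardless of database ID or revision number.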

src/main/java/ai/elimu/web/content/storybook/StoryBookCreateFromEPubController.java (1)

382-386: 👍 Condition now correctly persists image‑only chapters

Allowing chapters that contain an illustration but zero paragraphs fixes loss of visual‑only pages (e.g. title pages). The sort‑order update still happens only after a successful create, so numbering remains consistent.

Comment on lines 45 to +49
 public String getUrl() {
-    String filename = getId() + "_r" + getRevisionNumber() + "." + getImageFormat().toString().toLowerCase();
-    if (cid != null) {
-        return "https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main/" +
-            "lang-" + EnvironmentContextLoaderListener.PROPERTIES.getProperty("content.language") + "/" +
-            "images/" +
-            filename;
-    } else {
-        return "/image/" + filename;
-    }
+    return "https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main" +
+        "/lang-" + EnvironmentContextLoaderListener.PROPERTIES.getProperty("content.language") +
+        "/images" +
+        "/" + getChecksumMd5() + "." + getImageFormat().toString().toLowerCase();
Contributor

💡 Verification agent

🧩 Analysis chain

Avoid hard‑coding the GitHub branch & add null‑safety for env property

getUrl() now always points to
https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main/....
Hard‑coding the branch (main) and the repository owner/name makes production/staging migrations brittle and blocks the possibility of running against forks.
In addition, if content.language is missing in EnvironmentContextLoaderListener.PROPERTIES, getProperty returns null and the string concatenation silently produces a lang-null path segment instead of failing.

-    return "https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main" +
-        "/lang-" + EnvironmentContextLoaderListener.PROPERTIES.getProperty("content.language") +
-        "/images" +
-        "/" + getChecksumMd5() + "." + getImageFormat().toString().toLowerCase();
+    String language = EnvironmentContextLoaderListener.PROPERTIES.getProperty("content.language");
+    Objects.requireNonNull(language, "Missing env property: content.language");
+    String lfsRepo = EnvironmentContextLoaderListener.PROPERTIES.getProperty(
+            "lfs.repo.url",
+            "https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main");
+    return String.format("%s/lang-%s/images/%s.%s",
+            lfsRepo,
+            language.toLowerCase(),
+            getChecksumMd5(),
+            getImageFormat().name().toLowerCase());

This keeps the default behaviour intact while letting ops override the LFS repository and failing fast when content.language is missing.


🏁 Script executed:

#!/bin/bash
# Locate the listener class and inspect the PROPERTIES field
rg -n "class EnvironmentContextLoaderListener" -n src/main/java
rg -n "PROPERTIES" -n src/main/java

Length of output: 14825


Make the GitHub LFS URL configurable and enforce a non‑null content.language

Hard‑coding the LFS URL and branch (https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main) makes deploys and forks brittle, and if content.language isn’t set you’ll silently end up with lang-null. In src/main/java/ai/elimu/entity/content/multimedia/Image.java replace getUrl() with:

 public String getUrl() {
-    return "https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main" +
-        "/lang-" + EnvironmentContextLoaderListener.PROPERTIES.getProperty("content.language") +
-        "/images" +
-        "/" + getChecksumMd5() + "." + getImageFormat().toString().toLowerCase();
+    // ensure language is defined
+    String language = EnvironmentContextLoaderListener.PROPERTIES.getProperty("content.language");
+    Objects.requireNonNull(language, "Missing env property: content.language");
+
+    // allow overriding the LFS repo/branch via properties
+    String lfsRepoUrl = EnvironmentContextLoaderListener.PROPERTIES.getProperty(
+        "lfs.repo.url",
+        "https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main");
+
+    return String.format(
+        "%s/lang-%s/images/%s.%s",
+        lfsRepoUrl,
+        language.toLowerCase(),
+        getChecksumMd5(),
+        getImageFormat().name().toLowerCase()
+    );
 }

This preserves the current behavior while letting lfs.repo.url be overridden and failing fast if content.language is missing.
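The difference between the two behaviours can be checked in isolation. The sketch below is an assumption-laden illustration, not the project's code: it uses a plain `java.util.Properties` in place of `EnvironmentContextLoaderListener.PROPERTIES`, the `lfs.repo.url` key is the hypothetical override proposed in the suggestion, and `"ENG"` is a placeholder value. Unlike the suggestion, it leaves the language code unchanged, so that the URL keeps matching the `"lang-" + languageCode` path that `uploadImageToLfs` builds.

```java
import java.util.Locale;
import java.util.Objects;
import java.util.Properties;

// Hypothetical stand-alone sketch of the URL construction.
final class LfsUrlSketch {

    // Default taken from the URL that getUrl() currently hard-codes
    static final String DEFAULT_BASE_URL =
            "https://raw.githubusercontent.com/elimu-ai/webapp-lfs/main";

    /** Mirrors the current concatenation: a missing property becomes the text "null". */
    static String naiveLanguageSegment(Properties props) {
        return "/lang-" + props.getProperty("content.language");
    }

    /** Fails fast on a missing language and allows the base URL to be overridden. */
    static String imageUrl(Properties props, String checksumMd5, String formatName) {
        String language = Objects.requireNonNull(
                props.getProperty("content.language"),
                "Missing property: content.language");
        // "lfs.repo.url" is an assumed key; it does not exist in the project today
        String baseUrl = props.getProperty("lfs.repo.url", DEFAULT_BASE_URL);
        return String.format("%s/lang-%s/images/%s.%s",
                baseUrl, language, checksumMd5, formatName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(naiveLanguageSegment(props)); // property not set
        props.setProperty("content.language", "ENG");    // placeholder value
        System.out.println(imageUrl(props, "d41d8cd98f00b204e9800998ecf8427e", "PNG"));
    }
}
```

Whether the language code should be lower-cased depends on how the directories in webapp-lfs are actually named, which this page does not show.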

Comment on lines +192 to +199

// Store the cover image
storyBookCoverImage.setTitle(storyBook.getTitle() + "_cover");
String checksumGitHub = GitHubLfsHelper.uploadImageToLfs(storyBookCoverImage, coverImageBytes);
storyBookCoverImage.setCid(checksumGitHub);
imageDao.create(storyBookCoverImage);
storeImageContributionEvent(storyBookCoverImage, session, request);

Contributor

@coderabbitai coderabbitai Bot Apr 18, 2025

⚠️ Potential issue

Add error handling & duplicate detection when uploading the cover image

GitHubLfsHelper.uploadImageToLfs can fail due to I/O/network issues or GitHub‐side problems.
Currently an unchecked exception will bubble up, aborting the whole EPUB import and leaving partial DB artefacts (e.g. an image row without a StoryBook). It also stores the image unconditionally even if another image with identical MD5 already exists.

-        String checksumGitHub = GitHubLfsHelper.uploadImageToLfs(storyBookCoverImage, coverImageBytes);
-        storyBookCoverImage.setCid(checksumGitHub);
-        imageDao.create(storyBookCoverImage);
+        try {
+            // Re‑use existing image if the same MD5 is already present
+            Image existing = imageDao.readByChecksumMd5(storyBookCoverImage.getChecksumMd5());
+            if (existing != null) {
+                storyBookCoverImage = existing; // reuse, no extra upload
+            } else {
+                String checksumGitHub = GitHubLfsHelper.uploadImageToLfs(
+                        storyBookCoverImage, coverImageBytes);
+                storyBookCoverImage.setCid(checksumGitHub);
+                imageDao.create(storyBookCoverImage);
+            }
+        } catch (IOException | RuntimeException ex) {
+            log.error("Failed to upload cover image to GitHub LFS", ex);
+            result.reject("coverImageUploadFailed",
+                    "Unable to store cover image. Please try again");
+            return "content/storybook/create-from-epub";
+        }

This both hardens the flow and prevents duplicate rows.
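One detail worth checking before adopting the suggestion: the `GitHubLfsHelper` source quoted later on this page returns `null` on a non-success response rather than throwing, and `uploadImageToLfs` declares no checked exceptions, so a `catch (IOException ...)` clause around it may not compile. The following is a simplified sketch of the reuse-or-upload flow that treats a `null` return as a failure. The in-memory map stands in for a lookup such as the proposed `readByChecksumMd5`, and the `Function` stands in for the upload call; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical in-memory model of "reuse an existing image, otherwise upload".
final class ImageStoreSketch {

    /** MD5 checksum -> GitHub checksum (cid) of images already stored. */
    private final Map<String, String> cidByMd5 = new HashMap<>();

    /** Stand-in for the upload call; returns null when the upload fails. */
    private final Function<byte[], String> uploader;

    ImageStoreSketch(Function<byte[], String> uploader) {
        this.uploader = uploader;
    }

    /** Returns the cid for the bytes, or empty if the upload failed. */
    Optional<String> store(String checksumMd5, byte[] bytes) {
        String existingCid = cidByMd5.get(checksumMd5);
        if (existingCid != null) {
            return Optional.of(existingCid); // same content already stored: no upload
        }
        String cid = uploader.apply(bytes);
        if (cid == null) {
            return Optional.empty(); // upload failed: nothing is recorded
        }
        cidByMd5.put(checksumMd5, cid);
        return Optional.of(cid);
    }

    public static void main(String[] args) {
        ImageStoreSketch store = new ImageStoreSketch(bytes -> "example-sha");
        System.out.println(store.store("md5-a", new byte[] {1, 2, 3}));
        System.out.println(store.store("md5-a", new byte[] {1, 2, 3})); // reused
    }
}
```

In the controller, the empty case would correspond to rejecting the form and returning to the upload view before any image row is created.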

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        // Store the cover image
-        storyBookCoverImage.setTitle(storyBook.getTitle() + "_cover");
-        String checksumGitHub = GitHubLfsHelper.uploadImageToLfs(storyBookCoverImage, coverImageBytes);
-        storyBookCoverImage.setCid(checksumGitHub);
-        imageDao.create(storyBookCoverImage);
-        storeImageContributionEvent(storyBookCoverImage, session, request);
+        // Store the cover image
+        storyBookCoverImage.setTitle(storyBook.getTitle() + "_cover");
+        try {
+            // Re‑use existing image if the same MD5 is already present
+            Image existing = imageDao.readByChecksumMd5(storyBookCoverImage.getChecksumMd5());
+            if (existing != null) {
+                storyBookCoverImage = existing; // reuse, no extra upload
+            } else {
+                String checksumGitHub = GitHubLfsHelper.uploadImageToLfs(
+                        storyBookCoverImage, coverImageBytes);
+                storyBookCoverImage.setCid(checksumGitHub);
+                imageDao.create(storyBookCoverImage);
+            }
+        } catch (IOException | RuntimeException ex) {
+            log.error("Failed to upload cover image to GitHub LFS", ex);
+            result.reject("coverImageUploadFailed",
+                    "Unable to store cover image. Please try again");
+            return "content/storybook/create-from-epub";
+        }
+        storeImageContributionEvent(storyBookCoverImage, session, request);

Member Author

It also stores the image unconditionally even if another image with identical MD5 already exists.

@coderabbitai Please create a new GitHub issue for avoiding storage of image files with existing MD5 checksums 🙏

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's check if the ImageDao class already has a method to read by MD5 checksum
rg "readByChecksumMd5|findByChecksumMd5|findByMd5|readByMd5" -A 2 -B 2

Length of output: 72


🏁 Script executed:

#!/bin/bash
# Let's look at the ImageDao interface structure to see what methods are available
rg -A 10 -B 5 "interface ImageDao" 

# And check the Image entity structure
rg -A 15 -B 5 "class Image" --type java

# Let's see if there are any existing issues related to image deduplication
gh issue list --state all --search "image duplicate MD5 checksum"

Length of output: 25610


🏁 Script executed:

#!/bin/bash
# Let's examine what is needed to create a method for finding an image by MD5 checksum
# First, let's look at the Image entity more closely to understand its structure
cat src/main/java/ai/elimu/entity/content/multimedia/Image.java

# Then let's look at the full ImageDao interface
cat src/main/java/ai/elimu/dao/ImageDao.java

# And the ImageDaoJpa implementation to see how other finder methods are implemented
cat src/main/java/ai/elimu/dao/jpa/ImageDaoJpa.java

# Now create the GitHub issue
gh issue create \
  --title "Feature: Add method to avoid duplicate storage of images with identical MD5 checksums" \
  --body "## Description
When uploading images to the application (particularly during ePUB import in \`StoryBookCreateFromEPubController\`), the system currently stores images unconditionally, even if another image with the same MD5 checksum already exists in the database. This leads to:

1. Duplicate storage of identical images
2. Redundant uploads to GitHub LFS 
3. Wasted storage space
4. Potential inconsistencies in image references

## Proposed Solution
1. Add a \`readByChecksumMd5(String checksumMd5)\` method to \`ImageDao\` interface
2. Implement the method in \`ImageDaoJpa\` to find images by their MD5 checksum
3. Update image upload workflows (especially in \`StoryBookCreateFromEPubController\`) to check for existing images before storing new ones

## Implementation Details
The new method should:
- Check if an image with the same MD5 checksum exists
- Return the existing image if found, or null if not found

Example implementation for \`ImageDaoJpa\`:
\`\`\`java
@Override
public Image readByChecksumMd5(String checksumMd5) throws DataAccessException {
    try {
        return (Image) em.createQuery(
            \"SELECT i \" +
            \"FROM Image i \" +
            \"WHERE i.checksumMd5 = :checksumMd5\")
            .setParameter(\"checksumMd5\", checksumMd5)
            .getSingleResult();
    } catch (NoResultException e) {
        return null;
    }
}
\`\`\`

## Context
This issue was identified during a code review in [PR #2144](https://github.com/elimu-ai/webapp/pull/2144#discussion_r2050317598) which points out that images are stored unconditionally without checking for existing identical images.

## Acceptance Criteria
- [ ] Add \`readByChecksumMd5\` method to \`ImageDao\` interface
- [ ] Implement the method in \`ImageDaoJpa\`
- [ ] Update image storage logic in \`StoryBookCreateFromEPubController\` to use this method
- [ ] Update other image upload flows to use this method
- [ ] Add appropriate unit tests
- [ ] Verify that duplicate images are not stored when the same image is uploaded multiple times" \
  --assignee "jo-elimu"

Length of output: 5749


@jo-elimu I've created a GitHub issue for you as requested: Issue #2145 - Feature: Add method to avoid duplicate storage of images with identical MD5 checksums.

The issue includes details on the problem, a proposed solution with example implementation for readByChecksumMd5(), and comprehensive acceptance criteria. I've assigned it to you as requested.


🧬 Code Graph Analysis Results

The following snippets are relevant to understanding the file src/main/java/ai/elimu/web/content/storybook/StoryBookCreateFromEPubController.java and the user comment about storing images unconditionally even if another image with identical MD5 already exists:


From src/main/java/ai/elimu/web/content/storybook/StoryBookCreateFromEPubController.java (lines 1-350 approx)

Summary:

  • This controller handles the creation of a StoryBook from an uploaded ePUB file.
  • The handleSubmit method processes the uploaded ePUB, extracts metadata, cover image, chapters, chapter images, and paragraphs.
  • For each image (cover and chapter images), it reads the bytes, calculates the MD5 checksum, determines the image format, and uploads the image to GitHub LFS.
  • The image is then stored in the database via imageDao.create(image).
  • There is no check to avoid storing duplicate images based on MD5 checksum; images are stored unconditionally.
  • The method storeImageContributionEvent is called after storing each image to log the contribution event and send a Discord notification.
  • The user comment points out that the image is stored unconditionally even if another image with the same MD5 checksum exists, suggesting a potential improvement.

From src/main/java/ai/elimu/util/GitHubLfsHelper.java (lines 20-111)

@Slf4j
public class GitHubLfsHelper {

    private static final String API_BASE_URL = "https://api.github.com/repos/elimu-ai/webapp-lfs/contents/";

    /**
     * Upload image to LFS.
     * 
     * @param image The Image representing the file bytes.
     * @param bytes The file bytes to be stored.
     * @return The checksum (SHA) generated by GitHub.
     */
    public static String uploadImageToLfs(Image image, byte[] bytes) {
        log.info("uploadImageToLfs");

        String languageCode = EnvironmentContextLoaderListener.PROPERTIES.getProperty("content.language");
        log.info("languageCode: " + languageCode);
        
        String filename = image.getChecksumMd5() + "." + image.getImageFormat().toString().toLowerCase();
        log.info("filename: " + filename);

        String path = "lang-" + languageCode + "/images/" + filename;
        log.info("path: " + path);
        
        return uploadFileToLfs(path, bytes);
    }

    /**
     * Store file in webapp-lfs
     * https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#create-or-update-file-contents
     */
    private static String uploadFileToLfs(String path, byte[] bytes) {
        log.info("uploadFileToLfs");

        String url = API_BASE_URL + path;
        log.info("url: " + url);

        JsonObject body = new JsonObject();
        body.addProperty("message", "chore: add " + path );
        body.addProperty("content", Base64.getEncoder().encodeToString(bytes));
        
        JsonObject committer = new JsonObject();
        committer.addProperty("name", "Nya Ξlimu");
        committer.addProperty("email", "info@elimu.ai");
        body.add("committer", committer);
        log.debug("body: " + body);

        String accessToken = ConfigHelper.getProperty("github.lfs.token");
        HttpResponse<String> httpResponse = Unirest.put(url)
            .header("Authorization", "Bearer " + accessToken)
            .body(body.toString())
            .asString();
        log.info("httpResponse: " + httpResponse);
        log.info("httpResponse.getStatus(): " + httpResponse.getStatus());
        log.info("httpResponse.isSuccess(): " + httpResponse.isSuccess());
        
        JsonObject responseAsJson = JsonParser.parseString(httpResponse.getBody()).getAsJsonObject();
        log.info("responseAsJson.keySet(): " + responseAsJson.keySet());
        if (!httpResponse.isSuccess()) {
            log.warn("responseAsJson: " + responseAsJson);
            return null;
        } else {
            JsonElement contentElement = responseAsJson.get("content");
            log.info("contentElement: " + contentElement);
            String sha = contentElement.getAsJsonObject().get("sha").getAsString();
            log.info("sha: " + sha);
            return sha;
        }
    }
}

Summary:

  • This helper class uploads images to GitHub LFS.
  • The filename is based on the MD5 checksum of the image bytes.
  • The upload is done via GitHub API PUT request.
  • The method returns the SHA checksum from GitHub.
  • There is no explicit check here to avoid uploading duplicate files. Because the filename is derived from the checksum, identical bytes map to the same path, but the code still attempts the PUT every time; on a non-success response it logs a warning and returns null.
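The SHA returned by the helper is stored as the image's `cid`. Assuming it is the Git blob object ID of the uploaded content (which is how the `sha` field of the Contents API response is generally understood), it can be recomputed locally from the bytes as SHA-1 over a `blob <length>\0` header followed by the content. The class below is a hypothetical sketch for cross-checking a stored `cid`, not part of the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes the Git blob object ID for raw file bytes.
final class GitBlobSha {

    static String blobSha1(byte[] content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            // Git hashes the header "blob <size>\0" followed by the file content
            String header = "blob " + content.length + "\0";
            sha1.update(header.getBytes(StandardCharsets.US_ASCII));
            sha1.update(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected on standard JDKs; treated as fatal in this sketch
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The empty blob has a well-known ID in every Git repository
        System.out.println(blobSha1(new byte[0]));
    }
}
```

If the assumption holds, comparing this value with the stored `cid` gives a way to detect an image whose bytes no longer match what was uploaded, without calling the GitHub API.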

From src/main/java/ai/elimu/web/content/multimedia/image/ImageCreateController.java (lines 39-173)

Summary:

  • This controller handles manual creation of images.
  • It checks if an image with the same title already exists and rejects duplicates by title.
  • It calculates the MD5 checksum of the uploaded image bytes.
  • It stores the image in the database and uploads it to GitHub LFS.
  • No explicit check for duplicate images by MD5 checksum before storing.
  • This is relevant as a comparison to the ePUB controller's behavior.

From src/main/java/ai/elimu/web/content/storybook/StoryBookCreateController.java (lines 29-113)

Summary:

  • This controller handles creation of StoryBooks (not from ePUB).
  • It checks for duplicate StoryBook titles before creating.
  • It stores the StoryBook and creates contribution events.
  • No direct image handling here, but relevant for understanding StoryBook creation flow.

From src/main/java/ai/elimu/web/content/storybook/StoryBookEditController.java (lines 49-249)

Summary:

  • This controller handles editing of StoryBooks.
  • It checks for duplicate titles on edit.
  • It updates StoryBook metadata and contribution events.
  • It manages images and chapters but does not show image deduplication logic.

From src/main/java/ai/elimu/web/content/storybook/StoryBookChapterCreateController.java (lines 29-122)

Summary:

  • Handles creation of StoryBook chapters.
  • Stores chapters and contribution events.
  • Manages images associated with chapters but no deduplication logic shown.

From src/main/java/ai/elimu/web/content/storybook/StoryBookParagraphCreateController.java (lines 31-121)

Summary:

  • Handles creation of StoryBook paragraphs.
  • Stores paragraphs and contribution events.
  • No image handling here.

From src/main/java/ai/elimu/web/content/storybook/StoryBookPeerReviewEventCreateController.java (lines 23-98)

Summary:

  • Handles peer review events for StoryBooks.
  • No image handling or deduplication.

From src/main/java/ai/elimu/util/ml/ReadingLevelUtil.java (lines 11-64)

Summary:

  • Utility class to predict reading level based on chapter, paragraph, and word counts.
  • Used in StoryBookCreateFromEPubController to set reading level after import.
  • No image handling.

From src/main/java/ai/elimu/web/context/EnvironmentContextLoaderListener.java (lines 32-167)

Summary:

  • Loads environment properties including GitHub LFS token and content language.
  • Used by GitHubLfsHelper and controllers for configuration.
  • No image deduplication logic.

From src/main/java/ai/elimu/web/content/multimedia/image/ImageEditController.java (lines 53-348)

Summary:

  • Handles editing of images.
  • Checks for duplicate image titles on edit.
  • Updates image metadata and uploads to GitHub LFS.
  • No explicit deduplication by MD5 checksum.

From src/main/java/ai/elimu/web/content/storybook/StoryBookCsvExportController.java (lines 30-138)

Summary:

  • Exports StoryBooks and related data as CSV.
  • Removes duplicate image content by replacing each image object with an object containing only the image ID.
  • This shows awareness of duplicate image content in export but not in storage.
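
A minimal sketch of the ID-only replacement described above, assuming a simplified image type. The `ImageIdOnlySketch` class, its nested `Image` class, and the `idOnly` method are hypothetical names for illustration and are not taken from StoryBookCsvExportController:

```java
// Hypothetical sketch: not the actual export code from the repository.
class ImageIdOnlySketch {

    // Simplified stand-in for the real Image entity (assumption).
    static class Image {
        final Long id;
        final String title;
        final byte[] bytes;

        Image(Long id, String title, byte[] bytes) {
            this.id = id;
            this.title = title;
            this.bytes = bytes;
        }
    }

    // Returns a copy that keeps only the ID, so the exported row
    // references the image without repeating its content.
    static Image idOnly(Image full) {
        return new Image(full.id, null, null);
    }
}
```

With this shape, an export row carries only the image ID and the full image data is exported once elsewhere.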

From src/main/java/ai/elimu/web/content/storybook/StoryBookParagraphEditController.java (lines 29-119)

Summary:

  • Handles editing of StoryBook paragraphs.
  • No image handling or deduplication.

From src/main/java/ai/elimu/web/content/storybook/StoryBookParagraphDeleteController.java (lines 25-110)

Summary:

  • Handles deletion of StoryBook paragraphs.
  • No image handling or deduplication.

From src/main/java/ai/elimu/web/content/storybook/chapter/StoryBookChapterDeleteController.java (lines 31-149)

Summary:

  • Handles deletion of StoryBook chapters and associated images.
  • Deletes images and contribution events related to the chapter image.
  • No deduplication logic.

From src/main/java/ai/elimu/web/content/storybook/StoryBookListController.java (lines 12-32)

Summary:

  • Lists StoryBooks by reading level.
  • No image handling or deduplication.

From src/main/java/ai/elimu/web/content/letter_sound/LetterSoundCreateController.java (lines 30-112)

Summary:

  • Handles creation of letter-sound correspondences.
  • Checks for duplicates before creation.
  • No image handling.

From src/main/java/ai/elimu/web/content/letter_sound/LetterSoundEditController.java (lines 34-130)

Summary:

  • Handles editing of letter-sound correspondences.
  • Checks for duplicates before update.
  • No image handling.

From src/main/java/ai/elimu/web/servlet/CustomDispatcherServlet.java (lines 58-259)

Summary:

  • Custom DispatcherServlet for initializing web app context and populating database.
  • No image handling or deduplication.

From src/main/java/ai/elimu/web/servlet/CustomSimpleMappingExceptionResolver.java (lines 7-14)

Summary:

  • Custom exception resolver for logging exceptions.
  • No image handling.

From src/main/java/ai/elimu/web/WelcomeController.java (lines 9-20)

Summary:

  • Simple welcome controller.
  • No image handling.

From test files (various)

Summary:

  • Selenium UI tests for various pages.
  • No image handling or deduplication.

Summary of Relevant Context for the User Comment

  • The StoryBookCreateFromEPubController stores images extracted from ePUB files unconditionally, without checking if an image with the same MD5 checksum already exists in the database.
  • The GitHubLfsHelper uploads images to GitHub LFS using a filename based on the MD5 checksum, but it does not prevent duplicate uploads at the application level.
  • Other image-related controllers (ImageCreateController, ImageEditController) check for duplicate image titles but not for duplicate image content by checksum.
  • The CSV export controller removes duplicate image content references by replacing image objects with only their IDs, indicating awareness of duplicate images in export but not in storage.
  • There is no existing code in the provided snippets that checks for existing images by MD5 checksum before storing a new image in the database or uploading to GitHub LFS.

This context supports the user comment about the need to avoid storing duplicate image files whose MD5 checksums already exist in storage.
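
A minimal sketch of a checksum lookup before storing an image, under stated assumptions. The `ImageDedupSketch` class, its in-memory map (a stand-in for a database lookup by checksum), and the `storeIfAbsent` method are hypothetical and do not reflect the actual DAO or GitHubLfsHelper API in this repository:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an in-memory map stands in for the Image table.
class ImageDedupSketch {

    // Maps an MD5 checksum (lowercase hex) to the ID of the stored image.
    private final Map<String, Long> imageIdByChecksum = new HashMap<>();
    private long nextId = 1;

    // Computes the MD5 checksum of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] bytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(bytes)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Returns the ID of an existing image with the same checksum,
    // or stores the new image and returns its freshly assigned ID.
    long storeIfAbsent(byte[] imageBytes) {
        String checksum = md5Hex(imageBytes);
        Long existingId = imageIdByChecksum.get(checksum);
        if (existingId != null) {
            return existingId; // Duplicate content: skip storage and upload.
        }
        long id = nextId++;
        imageIdByChecksum.put(checksum, id);
        return id;
    }

    // Number of distinct images stored so far.
    int storedCount() {
        return imageIdByChecksum.size();
    }
}
```

In the real controller the map lookup would presumably be a database query by checksum, and the upload to GitHub LFS would happen only on the branch that stores a new image.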


[End of relevant code snippets and summaries.]

@jo-elimu jo-elimu merged commit fb8d1f0 into main Apr 18, 2025
10 checks passed
@jo-elimu jo-elimu deleted the 2060-epub-images-in-lfs branch April 18, 2025 08:13
Development

Successfully merging this pull request may close these issues.

Move storage of images from database to external platform

1 participant