Implement Hardlink Feature for Cache Optimization and Data Deduplication #1953

Open
@ChengyuZhu6

Description

I would like to propose the implementation of a hardlink feature in the caching mechanism to optimize memory usage, improve performance and save disk space.

Background

The current caching system stores files in memory, which can lead to high memory usage, especially when dealing with large datasets. By utilizing hardlinks, we can reduce memory consumption and storage redundancy by allowing multiple references to the same file on disk without duplicating the file content.

Design

Key Components

  1. HardlinkManager: Manages the creation, validation, and persistence of hardlinks.
     • CreateLink: Attempts to create a hardlink for a given cache key.
     • HasHardlink: Checks if a hardlink exists for a given key.
     • Persist and Restore: Manages the persistence of hardlink metadata to disk and restores it on startup.
  2. DirectoryCache: Implements the cache logic, including hardlink support.
     • CreateHardlink: Invokes the HardlinkManager to create a hardlink.
     • HasHardlink: Checks for the existence of a hardlink using the HardlinkManager.
  3. Configuration: The EnableHardlink flag in the configuration determines whether hardlinking is enabled.
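
To make the component list more concrete, here is a minimal sketch of what a HardlinkManager could look like. Only the method names (CreateLink, HasHardlink, Persist, Restore) come from this proposal. The constructor, the field names, the `links` subdirectory, and the `hardlinks.json` state file are assumptions made for illustration, and locking is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// HardlinkManager tracks hardlinks per cache key (hypothetical layout).
type HardlinkManager struct {
	linkDir   string            // directory holding the hardlinks
	stateFile string            // persisted key -> link path map
	links     map[string]string // cache key -> hardlink path
}

// NewHardlinkManager prepares the directories and restores persisted state.
func NewHardlinkManager(root string) (*HardlinkManager, error) {
	m := &HardlinkManager{
		linkDir:   filepath.Join(root, "links"),
		stateFile: filepath.Join(root, "hardlinks.json"),
		links:     make(map[string]string),
	}
	if err := os.MkdirAll(m.linkDir, 0o755); err != nil {
		return nil, err
	}
	return m, m.Restore()
}

// CreateLink hardlinks sourcePath under key. Keys are assumed to be safe file
// names; a link that already exists is treated as success in this sketch.
func (m *HardlinkManager) CreateLink(key, sourcePath string) error {
	linkPath := filepath.Join(m.linkDir, key)
	if err := os.Link(sourcePath, linkPath); err != nil && !os.IsExist(err) {
		return err
	}
	m.links[key] = linkPath
	return nil
}

// HasHardlink reports whether key is tracked and its link still exists.
func (m *HardlinkManager) HasHardlink(key string) bool {
	p, ok := m.links[key]
	if !ok {
		return false
	}
	_, err := os.Stat(p)
	return err == nil
}

// Persist writes the key -> path metadata to disk as JSON.
func (m *HardlinkManager) Persist() error {
	data, err := json.Marshal(m.links)
	if err != nil {
		return err
	}
	return os.WriteFile(m.stateFile, data, 0o600)
}

// Restore loads persisted metadata; a missing state file is not an error.
func (m *HardlinkManager) Restore() error {
	data, err := os.ReadFile(m.stateFile)
	if os.IsNotExist(err) {
		return nil // first start: nothing to restore
	}
	if err != nil {
		return err
	}
	return json.Unmarshal(data, &m.links)
}

// demoRoundTrip links a file, persists, and reopens the manager.
func demoRoundTrip() (bool, error) {
	dir, err := os.MkdirTemp("", "hardlink-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	src := filepath.Join(dir, "blob")
	if err := os.WriteFile(src, []byte("layer data"), 0o644); err != nil {
		return false, err
	}
	m, err := NewHardlinkManager(dir)
	if err == nil {
		err = m.CreateLink("key1", src)
	}
	if err == nil {
		err = m.Persist()
	}
	if err != nil {
		return false, err
	}
	restored, err := NewHardlinkManager(dir) // simulates a restart
	if err != nil {
		return false, err
	}
	return restored.HasHardlink("key1"), nil
}

func main() { fmt.Println(demoRoundTrip()) }
```

HasHardlink re-checks the link on disk rather than trusting the restored map, so a link that was deleted while the process was down is reported as missing.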

Work Flow

[Start] 
   |
   v
[Initialize Cache]
   |
   v
[Check if Hardlinking is Enabled]
   |
   v
[Access Cached File] 
   |
   v
[Check if Hardlink Exists] -- No --> [Create Hardlink]
   |                                   |
  Yes                                  v
   |                             [Verify Hardlink]
   v                                   |
[Use Hardlink]                         v
   |                             [Rename to Final Location]
   v                                   |
[Persist Hardlink State] <-------------|
   |
   v
[Restore Hardlink State on Startup]
   |
   v
[End]
  1. Cache Write:
     • When a file is added to the cache, the system checks if hardlinking is enabled.
     • If enabled, it attempts to create a hardlink for the cached file.
  2. Cache Read:
     • When accessing a cached file, the system checks if a hardlink exists.
     • If a hardlink exists, it uses the hardlink path to access the file.
  3. Persistence:
     • Hardlink metadata is periodically persisted to disk.
     • On startup, the system restores hardlink metadata from disk.

Benefits

  • Reduced Memory Usage: By leveraging hardlinks, we can significantly decrease the memory footprint of the caching system.
  • Improved Performance: Creating a hardlink is a metadata-only operation, so adding a cache entry avoids the cost of copying file data.
  • Data Deduplication: Hardlinks inherently support data deduplication by allowing multiple cache entries to reference the same physical file, reducing storage redundancy.
  • Scalability: This feature will enable the caching system to handle larger datasets more efficiently.
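
As an illustration of the deduplication point, the sketch below stores each distinct payload once and hardlinks every cache key to it. Content addressing by SHA-256 digest is an assumption added here to decide when two entries are "the same file"; the proposal itself does not say how duplicates are detected. `dedupStore`, its `blobs` and `entries` directories, and `demoDedup` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// dedupStore keeps one blob per distinct content and one hardlink per key.
type dedupStore struct{ root string }

func newDedupStore(root string) (*dedupStore, error) {
	for _, d := range []string{"blobs", "entries"} {
		if err := os.MkdirAll(filepath.Join(root, d), 0o755); err != nil {
			return nil, err
		}
	}
	return &dedupStore{root: root}, nil
}

// Put writes data once under its digest and links the cache key to that blob.
func (s *dedupStore) Put(key string, data []byte) (string, error) {
	sum := sha256.Sum256(data)
	blob := filepath.Join(s.root, "blobs", hex.EncodeToString(sum[:]))
	if _, err := os.Stat(blob); os.IsNotExist(err) {
		if err := os.WriteFile(blob, data, 0o644); err != nil {
			return "", err
		}
	} else if err != nil {
		return "", err
	}
	entry := filepath.Join(s.root, "entries", key)
	_ = os.Remove(entry) // replace any previous entry for this key
	if err := os.Link(blob, entry); err != nil {
		return "", err
	}
	return entry, nil
}

// demoDedup stores three keys, two of them with identical content, and
// returns the number of blobs on disk and whether the two identical entries
// share one underlying file.
func demoDedup() (int, bool, error) {
	root, err := os.MkdirTemp("", "dedup-demo")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(root)
	s, err := newDedupStore(root)
	if err != nil {
		return 0, false, err
	}
	a, err := s.Put("a", []byte("same bytes"))
	if err != nil {
		return 0, false, err
	}
	b, err := s.Put("b", []byte("same bytes"))
	if err != nil {
		return 0, false, err
	}
	if _, err := s.Put("c", []byte("other bytes")); err != nil {
		return 0, false, err
	}
	blobs, err := os.ReadDir(filepath.Join(root, "blobs"))
	if err != nil {
		return 0, false, err
	}
	ai, err := os.Stat(a)
	if err != nil {
		return 0, false, err
	}
	bi, err := os.Stat(b)
	if err != nil {
		return 0, false, err
	}
	return len(blobs), os.SameFile(ai, bi), nil
}

func main() {
	n, shared, err := demoDedup()
	fmt.Println("blobs:", n, "a and b share a file:", shared, "err:", err)
}
```

Because every entry is a hardlink, removing one cache key only drops a directory entry. The data stays on disk until the last link to it is removed.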
