quobyte/io500-collaboration

Overview

Compute

The 10 node io500 run was executed on the "Alpha Centauri" GPU cluster at Technical University Dresden. This cluster is designed for AI and ML workloads and features 2x AMD EPYC 7352 (Zen 2 / Rome, released 2019) CPUs per node, running at 2.3 GHz. The compute cluster was deployed in 2021. Since 2025-09-04, Quobyte has been available to HPC users as a parallel file system on the Alpha Centauri and Romeo compute clusters.

Storage

The storage component is a Quobyte cluster running on repurposed all-flash storage hardware. The storage devices have roughly six years of "power on hours" per device. Each storage node has 64 GB of RAM and 7 NVMe devices (3.2 TB each) used for data, plus one NVMe device used to store metadata. Storage nodes are connected via two InfiniBand interfaces, one for client traffic and one for internal storage traffic (i.e. redundancy and maintenance traffic). The storage cluster consists of 78 storage nodes. Quobyte is available as a free version; the full feature set and capacity can be used with a commercial subscription.

Benchmark Execution

The benchmark was run during normal operations. Data and metadata were stored fully redundantly (replicated three times at the Quobyte software layer). The benchmark was logically isolated from other workloads using a dedicated storage tenant. It was started via SLURM as a regular user with no special privileges on the compute or storage side. The SLURM job description can be found in reproducability/io500_job_leg0.sh. The job description refers to two other scripts: reproducability/mount_wrapper.sh and reproducability/mount_checker.sh. The first is responsible for mounting Quobyte (all volumes below the tenant io500) on the compute node. The second adds an access key to that mount scope to enable dynamic volume creation.
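For orientation, here is a minimal sketch of what such a submission could look like; node counts, task counts, and file names are illustrative assumptions, and the authoritative script is reproducability/io500_job_leg0.sh:

    #!/bin/bash
    # Illustrative sketch only -- see reproducability/io500_job_leg0.sh for the real job.
    #SBATCH --job-name=io500
    #SBATCH --nodes=10                    # the 10 node run
    #SBATCH --ntasks-per-node=32          # hypothetical process count per node

    # Mount Quobyte (tenant io500) on every node and inject the access key
    # (this is what the two helper scripts referenced above take care of).
    srun --ntasks-per-node=1 ./reproducability/mount_wrapper.sh
    srun --ntasks-per-node=1 ./reproducability/mount_checker.sh

    # Start the benchmark as a regular user; config-io500.ini is a placeholder name.
    srun ./io500 config-io500.ini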

Storage System Adjustments

The behavior of a Quobyte file system can be controlled through policies. All io500-relevant storage policy rules were triggered by the tenant label io500. Further control was applied by assigning a value to that label. For example, to switch between replicated storage (i.e. fully redundant storage mode / production mode) and unreplicated storage (i.e. scratch storage mode), only the tenant-label value replicated needs to be set. Tenant labels can be controlled by regular storage users (with the "tenant_admin" privilege) and require no intervention from the storage admin team.

The relevant filter part in policies.txt looks like this:

    tenant {
      label_pattern {
        name_regex: "io500"
        value_regex: "^replicated$"
      }
    }

To study the policies in detail (or use them for reproduction), have a look at policies.txt.

Policies

Metadata control

To synchronize metadata efficiently across all clients, a policy rule is used that matches a specific file name pattern (files created by mdtest).

    file {
      filter_type: NAME
      operator: STARTS_WITH
      text_value: "file.mdtest"
    }
    ...
    policies {
      o_sync_behavior {
        mode: ENABLE_ALWAYS
      }
    }

The relevant rule is called mdtest_sync in policies.txt.

Not specific to any file name pattern, but valid for the whole tenant, is the rule that disables negative metadata caching (i.e. the information that a file does not exist, which would otherwise be cached, is forgotten instantly).

    file_metadata_cache {
      cache_ttl_ms: 10000
      negative_cache_ttl_ms: 0
      enable_write_back_cache: false
    }

This rule is called io500_no_negative_metacache in policies.txt.

Throughput optimization (targeted at ior-hard)

For ior-hard (or any other application with a known write pattern), Quobyte can be adjusted to change the default block size (the smallest write I/O in Quobyte) as well as the object and segment size. This again happens at file-level granularity by matching a name:

    file {
      filter_type: NAME
      operator: REGEX_MATCHES
      text_value: "^file$"
    }

...and then adjusting the sizes:

      file_structure {
        block_size_bytes: 47008
        object_size_bytes: 7991360
        segment_size_bytes: 43472998400
      }

To get a broad distribution of writes and reads, the data stripe count is also increased:

      distribution_schema {
        data_stripe_count: 170
        striping_method: BLOCK_LEVEL
      }

This rule set is called io500_ior_hard_s170_replicated in policies.txt.
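These values appear to line up with the ior-hard I/O pattern: 47008 bytes matches the ior-hard transfer size, the object size equals the block size times the stripe count of 170, and the segment then holds 5440 such objects. A quick arithmetic check (an observation, not part of policies.txt):

    # How the configured sizes relate to each other (observation only):
    echo $((47008 * 170))       # 7991360     = object_size_bytes (block size x stripe count)
    echo $((7991360 * 5440))    # 43472998400 = segment_size_bytes (5440 objects per segment)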

A minor optimization targets the read path: by default, Quobyte verifies checksums for each write as well as for each read. For io500, the write checks are kept, but the read checks are disabled.

    checksum {
      enable_server_checksum_computation_before_write: true
      enable_server_checksum_verification_after_read: false
      enable_client_checksum_computation_before_write: true
      enable_client_checksum_verification_after_read: false
    }

This rule set is called io500_read_path_crc_disabled in policies.txt.

Caching

Client read caches are disabled throughout the whole run (which is the default in Quobyte). The client cache component on the Quobyte side was only used as a prefetching target.

    policies {
      page_cache {
        mode: DISABLE_PAGECACHE
        enable_direct_writebacks: true
      }
    }

This rule set is called io500_client_pgcache_bypass in policies.txt.

Metadata Sharding

Distributing metadata load in Quobyte works by sharding data across multiple volumes. To automate this sharding, "mkdir" calls can be translated into "create volume" calls for the first layer of the storage hierarchy (the "tenant" level). This way, volumes can be created on the fly (and removed after use), and all available metadata services can be used. The user executing these mkdir / rmdir calls needs tenant admin privileges in Quobyte (to be able to create volumes within a dedicated tenant). These privileges were granted using access keys (see reproducability/mount_checker.sh for how to inject those keys).
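As an illustration of the resulting workflow (the mount point and directory name below are hypothetical, not taken from the benchmark scripts): a plain mkdir directly below the tenant mount point creates a new volume, and rmdir removes it again.

    # Hypothetical mount point for the io500 tenant; adjust to the actual setup.
    MNT=/mnt/quobyte/io500

    mkdir "$MNT/shard-01"    # translated into a "create volume" call by Quobyte
    # ... run I/O against the newly created volume ...
    rmdir "$MNT/shard-01"    # removes the dynamically created volume again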

Fault Tolerance

The tested storage system does not have a single point of failure.

Control Plane Availability

The system runs metadata services distributed across all storage nodes. The metadata system is a three-times-replicated database with automated failover in case of an error. The metadata failure domain is the machine level, so one machine can fail and metadata is still served. Since the data for this test is sharded across many distributed (and dynamically created) volumes, a failure of two failure domains affects only a fraction of the metadata.

Data Plane Availability

All data generated throughout the io500 run is replicated three times. Each replica is stored in a different failure domain (i.e. machine). The loss of a single machine triggers automated failover / client redirects to another failure domain within seconds. A double machine failure would affect only a fraction of the data generated throughout this test, since all data is spread across all failure domains.

About

Documentation of the 10 node io500 benchmark run at TU Dresden
