Purpose

To systematically reduce the user inconvenience and administrative workload caused by SSD capacity shortages on computing nodes.

Method

Please note that in the following description, "system SSD" refers to the SSD where /local_dataset is located, while "additional SSD" refers to SSDs other than the system SSD, such as those where /data2/local_dataset is located.

  1. Server Classification

    Computing nodes and storage have been classified as follows.

    A. Low-Capacity Node: A node equipped with only a system SSD of less than 1 TB.

    B. Low-Capacity Node with Additional SSD: A node equipped with a system SSD of less than 1 TB and an additional SSD.

    C. High-Capacity Node: A node equipped with a system SSD of 1 TB or more.

    D. High-Capacity Node with Additional SSD: A node equipped with a system SSD of 1 TB or more and an additional SSD.

    E. Remote Storage: NAS and Ceph storage, as opposed to the SSDs of computing nodes.
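The node classification above can be sketched as a small function. This is only an illustration, not an official tool: the function names are hypothetical, 1 TB is assumed to mean decimal terabytes (the policy does not say TB or TiB), and the helper that inspects the local machine assumes that the size of the filesystem holding /local_dataset approximates the system SSD size and that the presence of /data2/local_dataset indicates an additional SSD.

```python
import os
import shutil

# Assumption: "1TB" means 10**12 bytes; the policy does not specify TB vs TiB.
ONE_TB = 1000**4


def classify_node(system_ssd_bytes: int, has_additional_ssd: bool) -> str:
    """Return the node type letter (A-D) from the classification above."""
    high_capacity = system_ssd_bytes >= ONE_TB  # "1 TB or more" is high capacity
    if high_capacity:
        return "D" if has_additional_ssd else "C"
    return "B" if has_additional_ssd else "A"


def classify_this_node(
    system_path: str = "/local_dataset",
    additional_path: str = "/data2/local_dataset",
) -> str:
    """Hypothetical helper: classify the machine this script runs on.

    Assumptions: the filesystem size reported for `system_path` stands in for
    the system SSD size, and an existing `additional_path` directory means an
    additional SSD is installed.
    """
    total_bytes = shutil.disk_usage(system_path).total
    return classify_node(total_bytes, os.path.isdir(additional_path))
```

Type E (remote storage) is not a computing node, so it is intentionally not returned by this function.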

  2. Classification of Datasets

    The types of datasets that can be uploaded to computing nodes and the applicable policies are summarized as follows.

    Note: In the context below, a "dataset" refers to the directory containing files loaded for model training and inference, or files saved during training and inference.

    Please delete the original containers of datasets (such as .tar and .zip files) so that they do not remain on the computing node.

    Additionally, datasets that do not comply with the upload policy may be deleted at any time without prior notice. Please take note of this.
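As an aid for the request to remove original container files, the following sketch lists .tar and .zip files left under a dataset directory. The function name is hypothetical and the script only reports files; it never deletes anything. Note that streaming datasets (described below) legitimately consist of container files, so the output should be reviewed by hand.

```python
from pathlib import Path

# Suffixes named in the policy text; extend as needed (assumption: case-insensitive match).
ARCHIVE_SUFFIXES = (".tar", ".zip")


def find_leftover_archives(root):
    """Return a sorted list of archive files found anywhere under `root`."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.name.lower().endswith(ARCHIVE_SUFFIXES)
    )


if __name__ == "__main__":
    # /local_dataset is the system SSD dataset directory named in this document.
    for archive in find_leftover_archives("/local_dataset"):
        print(archive)
```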

    1. Low-capacity datasets
      • This refers to datasets of 50 GB or less.
      • They can be uploaded only to the system SSDs of Type A, B, C, and D nodes.
      • These datasets will be subject to deletion every three months following a public announcement, without individual notification.
    2. High-Capacity Datasets
      • This refers to datasets larger than 50 GB.
      • Uploads are permitted only to the additional SSD on Type B nodes and to the system and additional SSDs on Type C and D nodes.
      • Every three months, a list of datasets scheduled for deletion from each computing node will be announced via the Slack channel.
        1. Datasets subject to deletion are those uploaded more than 6 months prior to the announcement.
        2. A two-week grace period is granted from the announcement until deletion.
        3. During the grace period, users can post a deletion prevention request as a comment on the announcement.
        4. The administrator will delete the remaining datasets for which no deletion prevention requests were received and will announce the results in the Slack channel.
    3. Streaming Datasets (datasets such as webdatasets that are split into several large containers)
      • This refers to datasets stored in the form of several large containers, where only the necessary portions are streamed from within each container for model training.
      • When used on computing nodes, the policies in items 1 and 2 above apply depending on the dataset's capacity.
      • (Pilot Operation) Because this type of dataset can be used for model training without placing a load on remote storage, it can also be used with NAS, Ceph, and other remote storage.
        • However, during the pilot operation period, the minimum allowed container file size and the number of splits may be adjusted depending on circumstances, and permission to use NAS or Ceph storage may be revoked.
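The deletion cycle for high-capacity datasets (six-month age threshold, two-week grace period, deletion prevention requests) can be sketched as follows. This is a hedged illustration rather than the administrator's actual procedure: all function names are hypothetical, "6 months" is assumed to mean calendar months, and how upload dates are obtained is left to the caller.

```python
import calendar
from datetime import date, timedelta

RETENTION_MONTHS = 6                # "uploaded more than 6 months prior to the announcement"
GRACE_PERIOD = timedelta(weeks=2)   # "two-week grace period"


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the last day of the target month
    (for example, Aug 31 minus 6 months becomes the last day of February).
    """
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last_day))


def is_deletion_candidate(uploaded: date, announced: date) -> bool:
    """True if the dataset was uploaded more than six months before the announcement."""
    return uploaded < months_before(announced, RETENTION_MONTHS)


def deletion_date(announced: date) -> date:
    """Earliest date on which listed datasets may be deleted."""
    return announced + GRACE_PERIOD


def datasets_to_delete(uploads: dict, announced: date, protected: set) -> list:
    """Names of listed datasets with no deletion prevention request.

    `uploads` maps dataset name to upload date; `protected` holds the names
    for which a prevention request was posted during the grace period.
    """
    return sorted(
        name
        for name, uploaded in uploads.items()
        if is_deletion_candidate(uploaded, announced) and name not in protected
    )
```

For example, `datasets_to_delete({"a": date(2024, 1, 10), "b": date(2024, 7, 1)}, date(2024, 9, 1), set())` yields only `"a"`, since `"b"` is newer than the six-month threshold. Low-capacity datasets follow the simpler rule in item 1 and are not modeled here.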
  3. Other Notes

The policies applied to each node according to the above classifications are summarized in the table below.

• 🟧 : Upload allowed only to system SSD
• 🟦 : Upload allowed only to additional SSD
• 🟩 : Upload allowed to both system and additional SSD
• ⭕ : Upload allowed to remote storage (/data, /ceph_data)
• ❌ : Upload not allowed to this node or storage

| Type | Node List | Low-capacity datasets (<50GB) | Large datasets (>50GB) | Streaming Dataset | Pretrained weights, compile cache |
| --- | --- | --- | --- | --- | --- |
| A. Low capacity | moana-u[2-6] | 🟧 | ❌ | Varies by size | ❌ |
| B. Low capacity + additional | ariel-v[1-13] | 🟧 | 🟦 | Varies by capacity | ❌ |
| C. High capacity | moana-r[2, 5], y[1-7], u[1, 8], ariel-m2, n1, aurora-g1 | 🟧 | 🟩 | Varies by capacity | ❌ |
| D. High capacity + Additional | moana-r[1, 3, 4], ariel-k[1, 2], g[1-5], aurora-g[2-8] | 🟧 | 🟩 | Varies by capacity | ❌ |
| E. Ceph, NAS | - | ❌ | ❌ | ⭕ (Pilot Operation) | ⭕ |
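The table can also be expressed as a lookup, which may help when checking where a given dataset is allowed before uploading. This is a minimal sketch under stated assumptions: the names (`allowed_targets`, `SSD_TARGETS`, the `kind` labels) are hypothetical, a dataset of exactly 50 GB is treated as low capacity following "50 GB or less", and "varies by size" for streaming datasets is read as "same rule as an ordinary dataset of that size".

```python
LOW_CAPACITY_LIMIT_GB = 50  # assumption: exactly 50 GB counts as low capacity

# Size class -> node type -> allowed SSD targets, transcribed from the table.
SSD_TARGETS = {
    "low": {"A": {"system"}, "B": {"system"}, "C": {"system"}, "D": {"system"}},
    "high": {
        "A": set(),
        "B": {"additional"},
        "C": {"system", "additional"},
        "D": {"system", "additional"},
    },
}


def size_class(size_gb: float) -> str:
    """Classify a dataset as 'low' or 'high' capacity by its size in GB."""
    return "low" if size_gb <= LOW_CAPACITY_LIMIT_GB else "high"


def allowed_targets(node_type: str, kind: str, size_gb: float) -> set:
    """Return the allowed upload targets: 'system', 'additional', and/or 'remote'.

    `node_type` is one of "A".."E"; `kind` is "dataset", "streaming", or
    "weights" (pretrained weights and compile caches). An empty set means
    the upload is not allowed.
    """
    if kind == "weights":
        # Weights and compile caches are allowed only on remote storage (E).
        return {"remote"} if node_type == "E" else set()
    if node_type == "E":
        # Remote storage accepts only streaming datasets (pilot operation).
        return {"remote"} if kind == "streaming" else set()
    # On computing nodes, ordinary and streaming datasets follow the size rule.
    return set(SSD_TARGETS[size_class(size_gb)][node_type])
```

For instance, `allowed_targets("B", "dataset", 200)` returns `{"additional"}`, matching the 🟦 cell for Type B nodes.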