I/O model based on HDF5 - Hua Xu, Gnosis Research Center (IIT) - HUG24

  • Published: Sep 15, 2024
  • From the 2024 HDF5 User Group Meeting (#HUG24) held August 5-7, 2024 in Chicago, IL.
    I/O model based on HDF5 - Hua Xu, Gnosis Research Center (IIT)
    As applications become more data-intensive, their demands on storage systems for efficient storage and retrieval have increased significantly. Compute resources on clusters are often allocated exclusively to individual users to maximize performance, while storage resources are shared across users for better utilization. In such shared environments, an application’s I/O performance can vary significantly due to interference from other jobs. A related problem is scheduling user jobs on a cluster to maximize resource utilization and minimize total execution time. A data acquisition system (DAC) deployed on a cluster can help job schedulers make informed scheduling decisions. In this work we propose a DAC with predictive models that learn the I/O workloads on clusters and predict system performance.
    Modeling the performance of the storage layer on clusters is challenging because of multiple interacting software layers, sophisticated hardware, variable file types and on-disk layouts, and variable I/O traffic from users. User-observed I/O performance depends on the I/O library and how it uses the file system. The I/O library’s metadata APIs and the parallelism available in the file system both affect parallel I/O performance. The file layout on disks (stripe count and stripe size) is another significant factor, since it affects load balance and parallelism in the storage layer. The impact of interference from other users is hard to model accurately. This interference is one of the reasons why empirical models of I/O performance and storage systems have not been successful for modern HPC systems.
    We propose a supervised-learning-based I/O model that updates itself with feedback from the cluster. The model will predict the I/O time (read/write) per process for a given file layout, average I/O request size (in bytes), number of concurrent readers/writers, number of I/O servers, and number of storage disks. The learning framework will consist of a trained base performance model that is continually updated as new data arrives. Updates will be incorporated into the base model in a way that minimizes the influence of outliers, so that predictions remain accurate in the presence of interference. The predicted I/O performance is an indicator of the current load on the storage servers. The job scheduling algorithm can use it to minimize the total I/O time of a set of I/O jobs on the cluster.
    This work will be carried out on the Ares cluster, which consists of one rack of compute nodes. All nodes share a 48 TB RAID-5 storage pool composed of eight 8 TB 7200 RPM SAS hard drives. Nodes within each rack are connected with 40 Gbps Ethernet with RoCE support. The model will be built and analyzed for the HDF5 file format, with ROMIO extensions for MPI-IO and PVFS2 (Parallel Virtual File System). Key HDF5 and PVFS parameters, such as the number of processes, the number of PVFS servers and clients, and the stripe size, are treated as significant inputs to the model.
    For more information on this conference including all sessions and slide decks, visit www.hdfgroup.o... To learn more about upcoming HUG events, please visit www.hdfgroup.o...
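
The abstract notes that the file layout (stripe count and stripe size) affects load balance in the storage layer. A minimal sketch of why, assuming plain round-robin striping that starts at server 0 (the abstract does not describe the exact PVFS2 distribution used; the function name and parameters here are hypothetical):

```python
from collections import Counter


def bytes_per_server(offset, length, stripe_size, stripe_count):
    """Count how many bytes of one contiguous request land on each server.

    Assumption: stripe k of the file lives on server k % stripe_count
    (simple round-robin). Real file systems may use other distributions.
    """
    load = Counter()
    pos = offset
    end = offset + length
    while pos < end:
        stripe_index = pos // stripe_size
        server = stripe_index % stripe_count
        stripe_end = (stripe_index + 1) * stripe_size
        chunk = min(end, stripe_end) - pos  # bytes left in this stripe
        load[server] += chunk
        pos += chunk
    return dict(load)


# Same 256 KiB request, same four servers, two different stripe sizes.
wide = bytes_per_server(0, 4 * 65536, stripe_size=65536, stripe_count=4)
narrow = bytes_per_server(0, 4 * 65536, stripe_size=4 * 65536, stripe_count=4)
print(wide)    # the request is spread evenly over all four servers
print(narrow)  # the whole request falls into one stripe on one server
```

With the smaller stripe size the request is served by four servers in parallel; with the larger one a single server carries the full load, which is the kind of layout effect the model takes as input.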
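
The abstract describes a base model that is continually updated while limiting the influence of outliers, but it does not name a model family or loss. One possible reading, sketched here as an assumption rather than the authors' method, is an online linear model trained with a Huber-style clipped gradient. The class name, the two normalized features, and the synthetic data are all hypothetical:

```python
import random


class RobustOnlineModel:
    """Online linear regressor with a Huber-style (clipped) gradient.

    Assumption: a linear model and Huber loss stand in for the unspecified
    base model and outlier handling described in the abstract.
    """

    def __init__(self, n_features, lr=0.05, delta=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.delta = delta  # residuals beyond this are treated as outliers

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        err = self.predict(x) - y
        # Clipping the residual caps how far one sample can move the model.
        g = max(-self.delta, min(self.delta, err))
        self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * g


# Synthetic feedback stream: I/O time = 2*x0 + 1*x1 + 0.5 plus small noise,
# where x0 and x1 are hypothetical normalized features (e.g. request size
# and number of writers). Every 50th sample is inflated to mimic interference.
rng = random.Random(0)
model = RobustOnlineModel(n_features=2)
for i in range(10000):
    x = [rng.random(), rng.random()]
    y = 2.0 * x[0] + 1.0 * x[1] + 0.5 + rng.gauss(0.0, 0.05)
    if i % 50 == 0:
        y += 100.0  # interference spike
    model.update(x, y)

print(model.predict([0.5, 0.5]))  # interference-free value would be 2.0
```

Because each update is bounded by `lr * delta`, an interference spike shifts the prediction only slightly, whereas a squared-error update would move it in proportion to the spike.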
