IMTM HPC Cluster

IMTM operates its own high-performance computing (HPC) facility, designed to handle the demanding needs of cutting-edge research in areas like bioinformatics, chemoinformatics, and genome analysis. Since many of these projects involve sensitive patient data, the entire system is managed solely by IMTM staff—no external vendors or third parties are involved, ensuring maximum data security and privacy.

IMTM Datacenter

Compute Resources

Node class   CPU                        Nodes   Cores / node   RAM / node
Density      AMD EPYC 9654 (96-core)    12      192            512 GiB
Frequency    AMD EPYC 9474F (48-core)   5       96             768 GiB

Total capacity: 2,784 CPU cores and 9.8 TiB of RAM.

Theoretical Peak FP64 Performance

Clock profile   12 × AMD EPYC 9654 nodes   5 × AMD EPYC 9474F nodes   Whole cluster
Base clocks     88.5 TFLOPS                27.6 TFLOPS                116 TFLOPS
Max boost       136.4 TFLOPS               31.5 TFLOPS                168 TFLOPS
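
The figures above can be reproduced from cores × clock × FLOPs/cycle: Zen 4 cores (the EPYC 9004 series) execute AVX-512 FMAs double-pumped over 256-bit datapaths, yielding 16 FP64 FLOPs per cycle per core. A quick sanity check, using AMD's published clocks (EPYC 9654: 2.4 GHz base / 3.7 GHz boost; EPYC 9474F: 3.6 GHz base / 4.1 GHz boost):

```python
# Peak FP64 = nodes x cores/node x clock (GHz) x FLOPs/cycle.
# Zen 4: 16 FP64 FLOPs per cycle per core (double-pumped AVX-512 FMA).
FP64_FLOPS_PER_CYCLE = 16

def peak_tflops(nodes, cores_per_node, clock_ghz):
    """Theoretical peak FP64 throughput in TFLOPS for a set of identical nodes."""
    return nodes * cores_per_node * clock_ghz * FP64_FLOPS_PER_CYCLE / 1000

# 12 density nodes (192 cores each) + 5 frequency nodes (96 cores each).
base = peak_tflops(12, 192, 2.4) + peak_tflops(5, 96, 3.6)
boost = peak_tflops(12, 192, 3.7) + peak_tflops(5, 96, 4.1)
print(f"base: {base:.0f} TFLOPS, boost: {boost:.0f} TFLOPS")  # 116 and 168
```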

Networking

Every node is connected to our core switches with two bonded 25 GbE SFP28 ports, giving 50 Gbit/s of effective bandwidth per node with RoCE v2 RDMA support. Aggregate leaf bandwidth is 850 Gbit/s across the 17 compute nodes.
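
The aggregate figure is simply the per-node bond multiplied by the node count:

```python
# 2 x 25 GbE bonded per node; 17 compute nodes (12 density + 5 frequency).
per_node_gbit = 2 * 25
aggregate_gbit = (12 + 5) * per_node_gbit
print(per_node_gbit, aggregate_gbit)  # 50 850
```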

The cluster has a 10 Gbit/s internet uplink and a dedicated 100 Gbit/s line to the IT4I supercomputer.

Storage

Tier       Drives          Raw capacity
NVMe SSD   192 × 7.68 TB   1.48 PB
HDD        128 × 18 TB     2.30 PB

Total raw capacity: 3.78 PB across the two-tier SSD/HDD setup.
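
The totals follow directly from the drive counts (in the decimal terabytes drive vendors specify):

```python
# Raw capacity per tier, in decimal TB.
nvme_tb = 192 * 7.68   # 1474.56 TB
hdd_tb = 128 * 18      # 2304 TB
total_pb = (nvme_tb + hdd_tb) / 1000
print(f"{total_pb:.2f} PB")  # 3.78 PB
```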

Software

Our cluster comes preinstalled with a wide range of scientific and development tools. It uses Lmod, a Lua-based environment module system, to manage and load software modules from the preinstalled library. These modules are built using EasyBuild, ensuring reproducibility, compatibility, and ease of deployment across the system.

Scheduler

The cluster's workload orchestration relies on the latest stable release of OpenPBS, augmented with a suite of custom Python hooks that tailor scheduling to IMTM's research priorities. These hooks enforce fair-share policies, dynamic QoS levels, cgroup-based resource isolation, and scratch-space management.
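
To illustrate the kind of logic such a hook might contain (a hypothetical sketch, not IMTM's actual hook code), the snippet below sizes a per-job scratch quota from the requested core count; in a real OpenPBS queuejob hook this function would be fed from the job's resource request via the server-side `pbs` module, and violations rejected there. The policy constants are assumptions for illustration only:

```python
# Hypothetical scratch-space policy: scale the per-job scratch quota with
# the number of requested cores, capped at a per-node maximum.
SCRATCH_GB_PER_CORE = 10   # assumed policy value, illustrative only
SCRATCH_GB_CAP = 1000      # assumed per-node cap, illustrative only

def scratch_quota_gb(ncpus: int) -> int:
    """Return the scratch quota (GB) granted to a job requesting `ncpus` cores."""
    if ncpus < 1:
        raise ValueError("a job must request at least one core")
    return min(ncpus * SCRATCH_GB_PER_CORE, SCRATCH_GB_CAP)

print(scratch_quota_gb(8))    # 80
print(scratch_quota_gb(192))  # 1000 (capped)
```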

Monitoring

The cluster is actively monitored using Prometheus, which collects real-time metrics on system performance and resource usage. These metrics are visualized through Grafana dashboards, allowing operators to monitor cluster health, identify performance bottlenecks, and make informed decisions about resource allocation and workload management.
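
Beyond the dashboards, the same metrics can be queried programmatically through Prometheus's HTTP API (`/api/v1/query` for instant queries). The sketch below builds such a request for per-node CPU utilization; the server address is a placeholder, and the PromQL assumes the standard `node_exporter` metric `node_cpu_seconds_total`:

```python
from urllib.parse import urlencode

# Placeholder Prometheus address, for illustration only.
PROM_URL = "http://prometheus.example:9090"

def instant_query_url(promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROM_URL}/api/v1/query?{urlencode({'query': promql})}"

# Average non-idle CPU fraction per node over the last 5 minutes.
url = instant_query_url(
    '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'
)
print(url)
```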