Computing System

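Use sinfo to view the cluster's resources. For example, it may report two partitions, vgpu and dcu, which hold the NVIDIA GPUs and the domestic Sugon DCUs respectively, along with each partition's node count and node state.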
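The query described above can be wrapped in a small helper. This is a minimal sketch, not the site's official tooling: the function name `show_partitions` is hypothetical, and the chosen output columns (partition, availability, node count, node state, GRES) are only one reasonable selection of standard sinfo format fields.

```shell
# Print one line per partition/state: name, availability, node count, state, GRES.
# Returns 1 with a message when sinfo is not on PATH (e.g. off the cluster).
show_partitions() {
  if ! command -v sinfo >/dev/null 2>&1; then
    echo "sinfo not found; run this on a cluster login node" >&2
    return 1
  fi
  # $1 is an optional partition name; without it, all partitions are listed.
  if [ -n "${1:-}" ]; then
    sinfo --partition="$1" --format="%P %a %D %T %G"
  else
    sinfo --format="%P %a %D %T %G"
  fi
}
```

On a login node, `show_partitions` lists every partition and `show_partitions dcu` restricts the output to the dcu partition.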

1 Computing Resources

Partition	Number of nodes	Physical resources per node
dcu	4	Each DCU server is configured as follows: - 8 * K100AI 64GB PCIe domestic Hygon DCU cards - 2 * Intel(R) Xeon(R) Gold 6430 (32 cores) - 1 TB memory, 7.68 TB NVMe local disk - 2 * 200 Gbps IB NICs, 1 * 25 Gbps Ethernet NIC - 8 * 196 TFlops = 1.568 Pflops FP16 compute
gpu	2	2 GPU servers in total: one A800 GPU server and one L40 GPU server. The A800 GPU server is configured as follows: - 8 * A800 80GB PCIe NVIDIA GPU cards - 2 * Intel(R) Xeon(R) Gold 6430 (32 cores) - 1 TB memory, 7.68 TB NVMe local disk - 2 * 200 Gbps IB NICs, 1 * 25 Gbps Ethernet NIC - 8 * 625 TFlops = 5 Pflops FP16 compute. The L40 GPU server is configured as follows: - 8 * L40 48GB PCIe NVIDIA GPU cards - 2 * Intel(R) Xeon(R) Gold 6430 (32 cores) - 1 TB memory, 7.68 TB NVMe local disk - 2 * 200 Gbps IB NICs, 1 * 25 Gbps Ethernet NIC - 8 * 362 TFlops = 2.896 Pflops FP16 compute

Total: 14.168 Pflops of FP16 compute
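The total follows from the table: four DCU servers at 8 * 196 TFlops each, plus one A800 server at 8 * 625 TFlops and one L40 server at 8 * 362 TFlops. A minimal sketch of that arithmetic (the function name `total_pflops` is hypothetical):

```shell
# Sum the FP16 compute of all servers and convert TFlops to Pflops.
total_pflops() {
  awk 'BEGIN {
    dcu  = 4 * 8 * 196   # 4 DCU servers, 8 K100AI cards each
    a800 = 1 * 8 * 625   # 1 A800 server, 8 cards
    l40  = 1 * 8 * 362   # 1 L40 server, 8 cards
    printf "%.3f\n", (dcu + a800 + l40) / 1000
  }'
}

total_pflops   # prints 14.168
```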

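The AI computing platform uses Slurm to manage resources and schedule jobs. A job script must specify --partition, --qos, and --account when it is submitted; the table below shows which values go together: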

--partition	--qos	--account
dcu	dcudebug, dcunormal, dcuintera (choose one)	ihepai
gpu	gpudebug, gpunormal, gpuintera (choose one)	ihepai
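As an illustration of how the three options fit together, the snippet below writes a minimal job script for the dcu partition. It is a sketch under assumptions: the file name `job.sh`, the job name, and the requested CPU, memory, and time values are illustrative, and the GRES string `dcu:k100ai:1` is inferred from the QoS limit names rather than confirmed by the site. The sample scripts shipped with the platform remain the authoritative reference.

```shell
# Write a minimal Slurm job script; the quoted 'EOF' keeps the text literal.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo          # illustrative name
#SBATCH --partition=dcu          # partition from the table
#SBATCH --qos=dcudebug           # one QoS that belongs to that partition
#SBATCH --account=ihepai         # account from the table
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # illustrative request
#SBATCH --gres=dcu:k100ai:1      # assumed GRES name and type
#SBATCH --mem=80G                # illustrative request
#SBATCH --time=00:15:00          # dcudebug jobs may run at most 15 minutes

srun hostname
EOF

# Submit on a login node with: sbatch job.sh
```

For the gpu partition, the same skeleton applies with `--partition=gpu` and one of gpudebug, gpunormal, or gpuintera as the QoS.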

QoS name	Priority (higher value = higher priority)	Maximum resources per user	Maximum submitted jobs per user	Maximum job run time
dcudebug	20	cpu=64,gres/dcu:k100ai=16,mem=1280G	24	15 minutes
dcunormal	0	cpu=48,gres/dcu:k100ai=12,mem=960G	16	2 days
dcuintera	10	cpu=1,gres/dcu:k100ai=1,mem=80G	1	4 hours
gpudebug	20	cpu=16,gres/gpu:a800=4,gres/gpu:l40=4,gres/gpu=4,mem=960G	8	15 minutes
gpunormal	0	cpu=8,gres/gpu:a800=2,gres/gpu:l40=2,gres/gpu=2,mem=480G	4	2 days
gpuintera	10	cpu=1,gres/gpu:a800=1,gres/gpu:l40=1,gres/gpu=1,mem=80G	1	4 hours
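The per-user limits in the table are comma-separated key=value lists. When sizing a request, it can help to pull a single limit out of such a string. The helper below is a hypothetical sketch (`tres_get` is not a Slurm command), and the sacctmgr invocation in the comment uses format field names from the sacctmgr manual that have not been checked against this cluster.

```shell
# On the cluster, the live limits can likely be listed with:
#   sacctmgr show qos format=Name,Priority,MaxTRESPU,MaxSubmitPU,MaxWall

# Print the value of one key from a limit string.
# $1 = limit string, $2 = key (e.g. cpu, mem, gres/dcu:k100ai)
tres_get() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v key="$2" '$1 == key { print $2 }'
}

limits='cpu=64,gres/dcu:k100ai=16,mem=1280G'   # dcudebug row of the table
tres_get "$limits" cpu               # prints 64
tres_get "$limits" gres/dcu:k100ai   # prints 16
tres_get "$limits" mem               # prints 1280G
```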

# Sample script for training jobs
/aifs/public/data/sample_job_script/gpu_train.sh

# Sample script for inference jobs
/aifs/public/data/sample_job_script/gpu_inference.sh