Creating a SageMaker HyperPod cluster
Follow these steps to create a new SageMaker HyperPod cluster using the SageMaker HyperPod console UI.
-
Open the Amazon SageMaker AI console at https://console.aws.amazon.com/sagemaker/.
-
Choose HyperPod Clusters in the left navigation pane and then Cluster Management.
-
In the SageMaker HyperPod landing page, choose Create HyperPod cluster.
-
From the drop-down menu of Create HyperPod cluster, choose Orchestrated by Amazon EKS.
-
From the Amazon EKS cluster list, choose the EKS cluster with which you want to configure the new HyperPod cluster.
-
If you need to create a new EKS cluster, choose Create EKS cluster. You can create it from the EKS cluster list page without having to open the Amazon EKS console.
Note
The VPC subnet you choose for HyperPod has to be private.
-
After submitting a new EKS cluster creation request, wait until the EKS cluster becomes Active.
-
Install the Helm chart as instructed in Installing packages on the Amazon EKS cluster using Helm.
-
After the EKS cluster creation has completed, choose Create HyperPod cluster and then Orchestrated by EKS again. You should be able to find and select the new EKS cluster. To proceed, choose Select.
-
-
On the Configure a new HyperPod cluster page, set up the basic information for the cluster such as name, options to enable the HyperPod cluster resiliency features, and tags.
-
For Cluster name, specify a name for the new cluster.
-
For Cluster resiliency - node recovery, specify Automatic to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.
-
For Tags, add key and value pairs to the new cluster and manage the cluster as an AWS resource. To learn more, see Tagging your AWS resources.
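If you later script cluster creation instead of using the console, the fields in this step map roughly onto top-level request settings. The following is a minimal sketch, assuming the field names ClusterName, NodeRecovery, and Tags as used by the SageMaker CreateCluster API. The helper name and the example values are hypothetical, and the dictionary is only built locally, not sent to AWS.

```python
def basic_cluster_settings(name, auto_recovery=True, tags=None):
    """Collect the basic cluster fields into a plain dictionary.

    Field names are assumed to match the CreateCluster API; verify them
    against the API reference before using them in a real request.
    """
    if not name:
        raise ValueError("Cluster name must not be empty")
    return {
        "ClusterName": name,
        # "Automatic" enables automatic node recovery; "None" is assumed
        # to be the value that disables it.
        "NodeRecovery": "Automatic" if auto_recovery else "None",
        # Tags are key and value pairs attached to the cluster resource.
        "Tags": [{"Key": k, "Value": v} for k, v in (tags or {}).items()],
    }


# Hypothetical example values for illustration only.
settings = basic_cluster_settings("demo-cluster", tags={"team": "ml"})
print(settings)
```

Keeping these settings in one small function makes it easier to reuse the same name, recovery mode, and tags across environments.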
-
In Step 2: Specify networking, configure network settings within the cluster and for traffic into and out of the cluster. When a SageMaker HyperPod cluster is orchestrated with Amazon EKS, the VPC is automatically set to the one configured for the EKS cluster you selected.
-
In Step 3: Configure instance groups, choose Create instance group. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. In the Create an instance group configuration pop-up window, fill in the instance group configuration information.
In the Create an instance group pop-up window, choose Standard to configure a new instance group following the UI guidance.
-
For Instance group name, specify a name for the instance group.
-
For Select instance type, choose the instance type for the instance group.
-
For Quantity, specify an integer not exceeding the instance quota for cluster usage.
-
Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/. For a quick start, download the sample script on_create.sh from the AWSome Distributed Training GitHub repository, and upload it to the S3 bucket. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.
-
For S3 bucket URI for lifecycle scripts, enter the Amazon S3 path in which the lifecycle scripts are stored.
-
For Directory path to the entrypoint script in the base Amazon S3 path, enter the file name of the lifecycle script under Amazon S3 path to lifecycle script files. If you use the provided sample script, enter on_create.sh.
-
For IAM role, choose the IAM role you have created for SageMaker HyperPod resources, following the section IAM role for SageMaker HyperPod.
-
Under Advanced configuration, you can set up the following optional configurations.
-
(Optional) For Threads per core, specify 1 for disabling multi-threading and 2 for enabling multi-threading. To find which instance type supports multi-threading, see the reference table of CPU cores and threads per CPU core per instance type in the Amazon EC2 User Guide.
-
(Optional) For Additional instance storage configs, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is /opt/sagemaker. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the df -h command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the Amazon EBS volumes section in the Amazon Elastic Block Store User Guide.
-
-
For Deep health check, select the advanced health checks you want to run on the instances. To learn more, see Deep health checks.
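The Standard instance group fields above can be gathered and checked before you submit them. The following is a minimal sketch, assuming the field names InstanceGroupName, InstanceType, InstanceCount, LifeCycleConfig, ExecutionRole, ThreadsPerCore, and InstanceStorageConfigs as used by the SageMaker CreateCluster API. The helper name, the instance type, and the role ARN are hypothetical placeholders, and nothing is sent to AWS.

```python
def standard_instance_group(name, instance_type, count, s3_uri, role_arn,
                            entrypoint="on_create.sh",
                            threads_per_core=None, ebs_size_gb=None):
    """Build one instance group entry and validate the documented limits."""
    if count < 1:
        raise ValueError("Quantity must be a positive integer")
    if not s3_uri.startswith("s3://"):
        raise ValueError("Lifecycle script location must be an s3:// URI")

    group = {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": count,
        # Base S3 path plus the entrypoint file name stored under it.
        "LifeCycleConfig": {"SourceS3Uri": s3_uri, "OnCreate": entrypoint},
        "ExecutionRole": role_arn,
    }
    if threads_per_core is not None:
        # 1 disables multi-threading, 2 enables it.
        if threads_per_core not in (1, 2):
            raise ValueError("Threads per core must be 1 or 2")
        group["ThreadsPerCore"] = threads_per_core
    if ebs_size_gb is not None:
        # Additional EBS volume size must be between 1 and 16384 GB.
        if not 1 <= ebs_size_gb <= 16384:
            raise ValueError("EBS volume size must be between 1 and 16384 GB")
        group["InstanceStorageConfigs"] = [
            {"EbsVolumeConfig": {"VolumeSizeInGB": ebs_size_gb}}
        ]
    return group


# Hypothetical example values for illustration only.
group = standard_instance_group(
    name="worker-group-1",
    instance_type="ml.g5.8xlarge",
    count=2,
    s3_uri="s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/",
    role_arn="arn:aws:iam::111122223333:role/ExampleHyperPodRole",
    threads_per_core=1,
    ebs_size_gb=500,
)
print(group)
```

Validating the thread count and volume size locally catches out-of-range values before a creation request is rejected. The instance quota check for Quantity is not modeled here because the quota depends on your account.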
-
-
In Step 3: Configure instance groups, choose Create instance group. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. In the Create an instance group configuration pop-up window, fill in the instance group configuration information.
In the Create an instance group pop-up window, choose Restricted Instance Group (RIG) to configure a new restricted instance group following the UI guidance. A RIG is required only when you want to create a cluster for Amazon Nova model customization. For more information, see Amazon Nova customization on Amazon SageMaker HyperPod.
-
For Instance group name, specify a name for the restricted instance group.
-
For Select instance type, choose the instance type for the restricted instance group.
-
For Quantity, specify an integer not exceeding the instance quota for cluster usage.
-
For Instance group IAM role, choose the IAM role you have created for SageMaker HyperPod resources, following the section IAM role for SageMaker HyperPod.
-
Under Environment Config - FSx for Lustre, you can set up the following optional configurations.
-
For Throughput per unit of storage, choose the throughput level you need.
-
For Storage capacity, enter the value you need.
-
-
For Cluster resiliency - deep health checks of accelerated computing instances - optional, choose the option(s) based on your use case. To learn more, see Deep health checks.
-
For Advanced configuration:
-
In Threads per core, choose the number you need.
-
In Additional storage volume per instance size (GB) - optional, specify the size of an additional Elastic Block Store (EBS) volume to attach to each instance in your instance group.
-
In Override cluster-level subnet and security group settings, turn this setting on or off as needed.
-
For Subnet, choose private subnets in an availability zone supported by SageMaker AI. To create new subnets, go to the Amazon VPC console.
-
For Security group(s), choose security groups that are either attached to the Amazon EKS cluster or whose inbound traffic is permitted by the security group associated with the Amazon EKS cluster. To create new security groups, go to the Amazon VPC console.
-
-
-
Choose Save.
-
-
In Step 4: Review and create, review the configuration you set in Step 1 through Step 3, and then submit the cluster creation request.
-
After the status of the cluster turns to InService, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see Jobs on SageMaker HyperPod clusters.
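Rather than refreshing the console, you can poll for the InService status from a script. The following is a minimal sketch of such a polling loop. The function name, the get_status callable, and the "Failed" status value are assumptions for illustration, and the example runs against a fake status source so that it does not need AWS credentials.

```python
import time


def wait_for_in_service(get_status, poll_seconds=30, max_polls=120,
                        sleep=time.sleep):
    """Poll get_status() until it returns "InService".

    get_status is any zero-argument callable that returns the current
    cluster status as a string. "Failed" is assumed to be a terminal
    error status.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Cluster creation failed")
        sleep(poll_seconds)
    raise TimeoutError("Cluster did not reach InService in time")


# Demonstration with a fake status sequence and no real waiting.
fake_statuses = iter(["Creating", "Creating", "InService"])
result = wait_for_in_service(lambda: next(fake_statuses),
                             sleep=lambda seconds: None)
print(result)  # prints "InService"
```

In a real script, get_status would likely wrap a call such as the boto3 SageMaker client's describe_cluster and read the returned cluster status. Treat that mapping as an assumption and confirm it in the API reference.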