batchRebootClusterNodes method
Reboots specific nodes within a SageMaker HyperPod cluster using a soft
recovery mechanism. BatchRebootClusterNodes performs a
graceful reboot of the specified nodes by calling the Amazon Elastic
Compute Cloud RebootInstances API, which attempts to cleanly
shut down the operating system before restarting the instance.
This operation is useful for recovering from transient issues or applying certain configuration changes that require a restart.
- Rebooting a node may cause temporary service interruption for workloads running on that node. Ensure your workloads can handle node restarts or use appropriate scheduling to minimize impact.
- You can reboot up to 25 nodes in a single request.
- For SageMaker HyperPod clusters using the Slurm workload manager, ensure rebooting nodes will not disrupt critical cluster operations.
May throw ResourceNotFound.
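Because a single request accepts at most 25 nodes, a larger set of nodes has to be split across several calls. The following is a minimal sketch of that splitting step only. `chunkNodeIds` and its `batchSize` parameter are hypothetical names introduced for illustration and are not part of the client library.

```dart
/// Splits [ids] into consecutive batches of at most [batchSize] entries.
/// Hypothetical helper for illustration; not part of the client library.
List<List<String>> chunkNodeIds(List<String> ids, {int batchSize = 25}) {
  if (batchSize < 1) {
    throw ArgumentError.value(batchSize, 'batchSize', 'must be at least 1');
  }
  final batches = <List<String>>[];
  for (var start = 0; start < ids.length; start += batchSize) {
    // Clamp the end index so the final batch may be shorter than batchSize.
    final end =
        start + batchSize < ids.length ? start + batchSize : ids.length;
    batches.add(ids.sublist(start, end));
  }
  return batches;
}
```

Each resulting batch could then be passed as the `nodeIds` argument of a separate `batchRebootClusterNodes` call, ideally with a pause between calls so that not all nodes restart at once.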
Parameter clusterName :
The name or Amazon Resource Name (ARN) of the SageMaker HyperPod cluster
containing the nodes to reboot.
Parameter nodeIds :
A list of EC2 instance IDs to reboot using soft recovery. You can specify
between 1 and 25 instance IDs.
- Either NodeIds or NodeLogicalIds must be provided (or both), but at least one is required.
- Each instance ID must follow the pattern i- followed by 17 hexadecimal characters (for example, i-0123456789abcdef0).
Parameter nodeLogicalIds :
A list of logical node IDs to reboot using soft recovery. You can specify
between 1 and 25 logical node IDs.
The NodeLogicalId is a unique identifier that persists
throughout the node's lifecycle and can be used to track nodes that are
still being provisioned and don't yet have an EC2 instance ID assigned.
- This parameter is only supported for clusters using Continuous as the NodeProvisioningMode. For clusters using the default provisioning mode, use NodeIds instead.
- Either NodeIds or NodeLogicalIds must be provided (or both), but at least one is required.
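The parameter constraints above can be checked on the client before a request is sent. The sketch below is one possible way to do that, not the library's own validation. `validateRebootArgs` and `_instanceIdPattern` are hypothetical names, and matching lowercase hexadecimal digits only is an assumption based on the documented example ID.

```dart
/// "i-" followed by 17 hexadecimal characters, per the parameter description.
/// Lowercase-only matching is an assumption based on the documented example.
final RegExp _instanceIdPattern = RegExp(r'^i-[0-9a-f]{17}$');

/// Returns the problems found in the arguments; an empty list means they
/// satisfy the documented constraints. Hypothetical helper for illustration.
List<String> validateRebootArgs({
  List<String>? nodeIds,
  List<String>? nodeLogicalIds,
}) {
  final problems = <String>[];
  // At least one of the two lists must be supplied.
  if (nodeIds == null && nodeLogicalIds == null) {
    problems.add('Provide nodeIds, nodeLogicalIds, or both.');
  }
  if (nodeIds != null) {
    // Each list, when present, must hold between 1 and 25 entries.
    if (nodeIds.isEmpty || nodeIds.length > 25) {
      problems.add('nodeIds must contain between 1 and 25 entries.');
    }
    for (final id in nodeIds) {
      if (!_instanceIdPattern.hasMatch(id)) {
        problems.add('Malformed instance ID: $id');
      }
    }
  }
  if (nodeLogicalIds != null &&
      (nodeLogicalIds.isEmpty || nodeLogicalIds.length > 25)) {
    problems.add('nodeLogicalIds must contain between 1 and 25 entries.');
  }
  return problems;
}
```

Returning a list of messages rather than throwing on the first failure lets a caller report every problem at once. The check for Continuous provisioning mode is omitted because it depends on cluster state that is not available from the arguments alone.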
Implementation
Future<BatchRebootClusterNodesResponse> batchRebootClusterNodes({
  required String clusterName,
  List<String>? nodeIds,
  List<String>? nodeLogicalIds,
}) async {
  final headers = <String, String>{
    'Content-Type': 'application/x-amz-json-1.1',
    'X-Amz-Target': 'SageMaker.BatchRebootClusterNodes'
  };
  final jsonResponse = await _protocol.send(
    method: 'POST',
    requestUri: '/',
    exceptionFnMap: _exceptionFns,
    // TODO queryParams
    headers: headers,
    payload: {
      'ClusterName': clusterName,
      if (nodeIds != null) 'NodeIds': nodeIds,
      if (nodeLogicalIds != null) 'NodeLogicalIds': nodeLogicalIds,
    },
  );
  return BatchRebootClusterNodesResponse.fromJson(jsonResponse.body);
}