Artificial Intelligence

Category: Amazon SageMaker HyperPod

Streamline machine learning workflows with SkyPilot on Amazon SageMaker HyperPod

This post is co-written with Zhanghao Wu, co-creator of SkyPilot. The rapid advancement of generative AI and foundation models (FMs) has significantly increased computational resource requirements for machine learning (ML) workloads. Modern ML pipelines require efficient systems for distributing workloads across accelerated compute resources, while making sure developer productivity remains high. Organizations need infrastructure solutions […]

New capabilities in Amazon SageMaker AI continue to transform how organizations develop AI models

In this post, we share some of the new innovations in SageMaker AI that can accelerate how you build and train AI models. These innovations include new observability capabilities in SageMaker HyperPod, the ability to deploy JumpStart models on HyperPod, remote connections to SageMaker AI from local development environments, and fully managed MLflow 3.0.

Accelerate foundation model development with one-click observability in Amazon SageMaker HyperPod

With a one-click installation of the Amazon Elastic Kubernetes Service (Amazon EKS) add-on for SageMaker HyperPod observability, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter (EFA), integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators. In this post, we walk you through installing and using the unified dashboards of the out-of-the-box observability feature in SageMaker HyperPod. We cover the one-click installation from the Amazon SageMaker AI console, navigating the dashboard and metrics it consolidates, and advanced topics such as setting up custom alerts.

Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle

In this post, we announce Amazon SageMaker HyperPod support for deploying foundation models from SageMaker JumpStart, as well as custom or fine-tuned models from Amazon S3 or Amazon FSx. This new capability allows customers to train, fine-tune, and deploy models on the same HyperPod compute resources, maximizing resource utilization across the entire model lifecycle.

Accelerate foundation model training and inference with Amazon SageMaker HyperPod and Amazon SageMaker Studio

In this post, we discuss how SageMaker HyperPod and SageMaker Studio can improve and speed up the development experience of data scientists by combining the IDEs and tooling of SageMaker Studio with the scalability and resiliency of SageMaker HyperPod with Amazon EKS. The solution also simplifies setup for the system administrator of the centralized system by using the governance and security capabilities offered by AWS services.

Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod

The Institute of Science Tokyo has successfully trained Llama 3.3 Swallow, a 70-billion-parameter large language model (LLM) with enhanced Japanese capabilities, using Amazon SageMaker HyperPod. The model demonstrates superior performance in Japanese language tasks, outperforming GPT-4o-mini and other leading models. This technical report details the training infrastructure, optimizations, and best practices developed during the project.

Accelerating Articul8’s domain-specific model development with Amazon SageMaker HyperPod

Learn how Articul8 is redefining enterprise generative AI with domain-specific models that outperform general-purpose LLMs in real-world applications. In our latest blog post, we dive into how Amazon SageMaker HyperPod accelerated the development of Articul8’s industry-leading semiconductor model, achieving 2X higher accuracy than top open source models while slashing deployment time by 4X.

Multi-account support for Amazon SageMaker HyperPod task governance

In this post, we discuss how an enterprise with multiple accounts can access a shared Amazon SageMaker HyperPod cluster to run its heterogeneous workloads. We use SageMaker HyperPod task governance to enable this feature.

How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod

Building on this foundation of specialized information extraction solutions and using the capabilities of SageMaker HyperPod, we collaborate with Apoidea Group to explore the use of large vision language models (LVLMs) to further improve table structure recognition performance on banking and financial documents. In this post, we present our work and step-by-step code for fine-tuning the Qwen2-VL-7B-Instruct model using LLaMA-Factory on SageMaker HyperPod.

Customize DeepSeek-R1 671b model using Amazon SageMaker HyperPod recipes – Part 2

In this post, we use the recipes to fine-tune the original 671-billion-parameter DeepSeek-R1 model. We demonstrate this through a step-by-step implementation of these recipes using both SageMaker training jobs and SageMaker HyperPod.