Accelerate deep learning training and simplify orchestration with AWS Trainium and AWS Batch


In training large language models (LLM), effective orchestration and compute resource management is a key challenge. Automation of resource provisioning, scaling, and workflow management is critical to optimizing resource utilization and streamlining complex workflows, thereby achieving efficient deep learning training processes. Simple orchestration enables researchers and practitioners to focus more on model experimentation, hyper-parameter tuning, and data analysis rather than dealing with cumbersome infrastructure management tasks. Straightforward orchestration also accelerates innovation, shortens time to market for new models and applications, and ultimately increases the overall efficiency and effectiveness of LLM research and development efforts.

This post explores the seamless integration of AWS Trainium with AWS Batch, demonstrating how Trainium's powerful machine learning (ML) acceleration capabilities can be combined with the orchestration functionality that AWS Batch offers. Trainium provides massive scalability, enabling training jobs to scale easily from small models to LLMs, and offers affordable access to computational power, making LLM training cost-effective and accessible. AWS Batch is a managed service that facilitates batch computing workloads on the AWS Cloud, handling tasks such as infrastructure management and job scheduling while enabling you to focus on application development and results analysis. AWS Batch provides comprehensive features, including managed batch computing, containerized workloads, customizable compute environments, and prioritized job queues, along with seamless integration with other AWS services.

Solution overview

The following diagram describes the architecture of the solution.

The training process proceeds as follows:

  1. The user creates a Docker image according to the basic training task requirements.
  2. The image is sent to Amazon Elastic Container Registry (Amazon ECR) to prepare it for deployment.
  3. The user submits the training job to AWS Batch, referencing the Docker image.
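The three steps above can be sketched with Docker and the AWS CLI as follows. This is a minimal sketch only: the image tag, job name, queue, and job definition names are placeholders, not the repo's actual values (the repo's scripts automate these steps for you).

```shell
#!/usr/bin/env bash
# Sketch of the high-level workflow: build the image, push it to ECR,
# then submit the AWS Batch job. All resource names are placeholders.
set -euo pipefail

build_and_push_image() {
  local region="$1" ecr_repo="$2"
  # Authenticate Docker against ECR (the registry host is the part of
  # the repo URI before the first slash), then build and push the image.
  aws ecr get-login-password --region "$region" |
    docker login --username AWS --password-stdin "${ecr_repo%%/*}"
  docker build -t "${ecr_repo}:llama2-training" ./docker
  docker push "${ecr_repo}:llama2-training"
}

submit_training_job() {
  local job_queue="$1" job_def="$2"
  # Submit the containerized training job to AWS Batch.
  aws batch submit-job \
    --job-name llama2-7b-training \
    --job-queue "$job_queue" \
    --job-definition "$job_def"
}
```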

Let's dive deep into this solution to see how you can integrate Trainium with AWS Batch. The following example shows how to train the Llama 2-7B model using AWS Batch with Trainium.


It is recommended not to run the following scripts on your local machine. Instead, clone the GitHub repository and run the provided scripts on an x86_64-based instance, preferably with a Linux/Ubuntu operating system, using the c5.xlarge instance type. For this post, we run the example on an Amazon Linux 2023 instance.

Before you start training on AWS Batch, install the following tools:

sudo yum install -y docker 
sudo yum install -y jq

Clone the repo.

Clone the GitHub repo and move to the desired directory:

git clone https://github.com/aws-neuron/aws-neuron-samples.git
cd aws-neuron-samples/torch-neuronx/training/aws-batch/llama2

Update the configuration.

First, update the config.txt file to specify values for the following variables:

REGION                          # your aws region 
SUBNET                          # your subnet in which the Trainium instances would be launched 
SG                              # your security group you want to associate with your instances 
ECR_REPO                        # your ECR repo where the docker container image will be pushed to 
INSTANCE_ROLE                   # Instance profile ARN for your IAM Instance Role 
DO_PRE_COMPILATION              # boolean value (true/false) indicating whether to do Neuron pre-compilation for your training job 
TOKENIZED_DATASET_URI           # s3 uri to store the tokenized dataset 
NEURON_COMPILE_CACHE_URI        # s3 uri to store the neuron compile caches 
CHECKPOINT_SAVE_URI             # s3 uri to store the checkpoints

After providing these values, your config.txt file should look something like the following code.
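Every value below is a placeholder; substitute your own Region, network resources, IAM instance profile, ECR repository, and S3 URIs.

```shell
REGION=us-west-2
SUBNET=subnet-0123456789abcdef0
SG=sg-0123456789abcdef0
ECR_REPO=123456789012.dkr.ecr.us-west-2.amazonaws.com/llama2-training
INSTANCE_ROLE=arn:aws:iam::123456789012:instance-profile/BatchInstanceRole
DO_PRE_COMPILATION=true
TOKENIZED_DATASET_URI=s3://my-bucket/llama2/tokenized_dataset
NEURON_COMPILE_CACHE_URI=s3://my-bucket/llama2/neuron_cache
CHECKPOINT_SAVE_URI=s3://my-bucket/llama2/checkpoints
```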


Get the Llama Tokenizer.

To tokenize the dataset, you need the Llama tokenizer from Hugging Face. Follow the instructions on Hugging Face to request access to the Llama tokenizer. (You need to acknowledge and accept the license terms.) After you are granted access, download the tokenizer from Hugging Face and place the tokenizer.model file in the root directory (llama2).
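One way to fetch the tokenizer after your access request is approved is the `huggingface-cli` tool that ships with the huggingface_hub package. The repository ID and flags below are assumptions for illustration; adapt them to the model variant you were granted access to.

```shell
#!/usr/bin/env bash
# Sketch: download tokenizer.model from Hugging Face into the current
# directory. Assumes HF_TOKEN holds a Hugging Face access token and that
# you have been granted access to the (assumed) meta-llama/Llama-2-7b-hf repo.
download_llama_tokenizer() {
  pip install --quiet "huggingface_hub[cli]"
  huggingface-cli login --token "$HF_TOKEN"
  huggingface-cli download meta-llama/Llama-2-7b-hf tokenizer.model --local-dir .
}
```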

Set up Llama training.

Run the setup script, which streamlines the steps necessary to start training on AWS Batch. The script downloads the Python files required for training the Llama 2-7B model. Additionally, it performs environment variable substitution within the provided templates and scripts that set up the AWS Batch resources. When it runs, it makes sure your directory structure conforms to the following setup.

├── build
│   ├── compute_env.json
│   ├── job_def.json
│   ├── job_queue.json
│   └── launch_template.json
├── config.txt
├── data
│   ├──
│   ├── config.json
│   └── tokenizer.model
├── docker
│   ├── Dockerfile
│   ├── llama2
│   │   ├──
│   │   ├── config.json
│   │   ├──
│   │   ├──
│   │   ├── requirements.txt
│   │   └──
│   └──
├── images
│   └── aws-batch.png
├── scripts
│   ├──
│   ├──
│   ├──
│   ├──
│   └──
└── templates
    ├── compute_env.json
    ├── job_def.json
    ├── job_queue.json
    └── launch_template.json

Tokenize the dataset.

Next, run the data preprocessing script to complete the preprocessing steps for Llama 2-7B training. In this example, we use the Wikicorpus dataset from Hugging Face. After retrieving the dataset, the script performs tokenization and uploads the tokenized dataset to the S3 location defined by TOKENIZED_DATASET_URI in the config.txt configuration file. The following screenshots show the preprocessing results.
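After the script finishes, you can confirm that the tokenized dataset landed in S3. A minimal sketch, assuming config.txt sits in the current directory:

```shell
#!/usr/bin/env bash
# Sketch: read TOKENIZED_DATASET_URI from config.txt and list the
# uploaded objects under that S3 prefix.
verify_tokenized_dataset() {
  # config.txt uses KEY=VALUE lines, so it can be sourced directly
  source ./config.txt
  aws s3 ls "$TOKENIZED_DATASET_URI" --recursive
}
```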

Provision the resources.

Next, run the resource provisioning script, which orchestrates the creation of the resources required for the training task: a placement group, launch template, compute environment, job queue, and job definition. The following screenshots illustrate the process.
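Under the hood, provisioning those resources maps onto a handful of AWS CLI calls against the JSON templates rendered into the build directory. A sketch only: the placement group name is a placeholder, and the repo's script may order or parameterize these calls differently.

```shell
#!/usr/bin/env bash
# Sketch: create the AWS Batch resources from the rendered JSON templates.
# "llama2-pg" is an illustrative placement group name.
provision_batch_resources() {
  aws ec2 create-placement-group --group-name llama2-pg --strategy cluster
  aws ec2 create-launch-template --cli-input-json file://build/launch_template.json
  aws batch create-compute-environment --cli-input-json file://build/compute_env.json
  aws batch create-job-queue --cli-input-json file://build/job_queue.json
  aws batch register-job-definition --cli-input-json file://build/job_def.json
}
```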

Build and push the Docker image.

Now you can run the build script, which creates a custom Docker container image for your specific training task. The script uses the deep learning container image published by the Neuron team, which contains the required software stack, and adds the instructions for running Llama 2-7B training on top of it. The training script uses the neuronx_distributed library with tensor parallelism, along with the ZeRO-1 optimizer. Afterwards, the newly created Docker container image is uploaded to your designated ECR repository, as specified by the ECR_REPO variable in the config.txt configuration file.

If you want to modify any of the Llama training hyperparameters, make the desired changes to the training files under ./docker/ before building the Docker image.

The following screenshots illustrate the process of building and deploying a Docker image.

Submit the training job.

Run the job submission script to submit the AWS Batch job and start the Llama 2-7B model training, as shown in the following screenshots.

Upon batch job submission, an Amazon Elastic Container Service (Amazon ECS) cluster is dynamically provisioned. When it is operational, you can navigate to the cluster to monitor all tasks actively running on the trn1.32xlarge instances launched by this job. By default, this example is configured to use 4 trn1.32xl instances. To customize this setting, edit the node count parameter in the job submission script.
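If you'd rather change the node count from the command line than in an editor, a one-line sed works. This is illustrative only: the variable is assumed here to be named NUM_NODES; check the submission script in ./scripts for the actual parameter name.

```shell
#!/usr/bin/env bash
# Sketch: rewrite a NUM_NODES-style assignment in a given script.
# The variable name NUM_NODES is an assumption for illustration.
set_node_count() {
  local script="$1" nodes="$2"
  sed -i "s/^NUM_NODES=.*/NUM_NODES=${nodes}/" "$script"
}
```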

Log and monitor

After submitting a job, you can use Amazon CloudWatch Logs to comprehensively monitor, store, and view all logs generated by AWS Batch. Complete the following steps to access the logs:

  1. On the CloudWatch console, choose Log groups under Logs in the navigation pane.
  2. Choose /aws/batch/job to view the batch job logs.
  3. Look for log groups that match your AWS Batch job names or job definitions.
  4. Choose a job to view its details.
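The console steps above can also be done from the command line. The `aws logs tail` command requires AWS CLI v2; the one-hour window is just an example.

```shell
#!/usr/bin/env bash
# Sketch: tail the AWS Batch log group from the CLI (AWS CLI v2).
fetch_batch_job_logs() {
  aws logs tail /aws/batch/job --since 1h
}
```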

The following screenshot shows an example.


Checkpoints generated during training are stored in the S3 location defined by CHECKPOINT_SAVE_URI in the config.txt file. By default, the checkpoint is saved when training is complete. However, you can adjust this behavior by choosing to save the checkpoint after every N steps in the training loop. For detailed instructions on this customization, refer to Checkpointing.

Clean up

When you're done, run the cleanup script to remove the resources created during this post. The script removes components such as the launch template, placement group, job definition, job queue, and compute environment. AWS Batch automatically handles the cleanup of the ECS cluster and Trainium instances, so there is no need to manually remove or stop them.
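For reference, teardown roughly mirrors provisioning in reverse. A sketch with placeholder resource names; in practice, each delete must wait until the preceding disable/update has settled (DISABLED or VALID state), which the repo's cleanup script handles for you.

```shell
#!/usr/bin/env bash
# Sketch: tear down the AWS Batch resources in reverse order of creation.
# All resource names are placeholders. Real cleanup must poll for state
# transitions between the disable and delete calls.
cleanup_batch_resources() {
  aws batch update-job-queue --job-queue llama2-queue --state DISABLED
  aws batch delete-job-queue --job-queue llama2-queue
  aws batch update-compute-environment --compute-environment llama2-ce --state DISABLED
  aws batch delete-compute-environment --compute-environment llama2-ce
  aws batch deregister-job-definition --job-definition llama2-jobdef:1
  aws ec2 delete-launch-template --launch-template-name llama2-lt
  aws ec2 delete-placement-group --group-name llama2-pg
}
```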


Conclusion

Trainium's seamless integration with AWS Batch represents a significant advancement in the realm of ML training. By combining Trainium's capabilities with AWS Batch's powerful orchestration functionality, you can benefit in a number of ways. First, you gain massive scalability, with the ability to easily scale training jobs from small models to LLMs. With up to 16 Trainium chips per instance and the capacity to distribute training across tens of thousands of accelerators, you can handle even the most demanding training tasks on Trainium instances. Additionally, Trainium offers a cost-effective solution, giving you the compute power you need at an attractive price point. With the fully managed service that AWS Batch offers for computing workloads, you can offload operational complexities such as infrastructure provisioning and job scheduling, and focus your efforts on building applications and analyzing results. Ultimately, Trainium's integration with AWS Batch empowers you to accelerate innovation, reduce time to market for new models and applications, and increase the overall efficiency and effectiveness of your ML efforts.

Now that you've learned about orchestrating Trainium using AWS Batch, we encourage you to try it for your next deep learning training job. You can explore more tutorials that will help you gain experience with AWS Batch and Trainium, and enable you to manage your deep learning training workloads and resources for better performance and cost efficiency. So why wait? Start exploring these tutorials today and take your deep learning training to the next level with Trainium and AWS Batch!

About the authors

Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.

Sadaf Rasool is a machine learning engineer with the Annapurna ML accelerator team at AWS. As a passionate and optimistic AI/ML professional, he firmly believes that the ethical and responsible application of AI has the potential to improve society for years to come, promoting both economic growth and social well-being.
