Overview

All cluster management operations are available through multiple interfaces for programmatic control and automation:
  • Together CLI: Command-line tool for cluster operations.
  • REST API: Full HTTP API for custom integrations. See the GPU Clusters API reference.
  • SkyPilot: Orchestrate AI workloads across clusters.

Together CLI

The Together CLI provides a command-line interface for managing clusters, storage, and scaling. It’s included with the Together Python SDK.

Installation

# Install
uv tool install "together[cli]"

# List commands
tg --help

Authentication

The CLI authenticates with the TOGETHER_API_KEY environment variable. Find your API key in your account settings, then export it:
export TOGETHER_API_KEY=<your_key>

Common commands

Create a cluster:
tg beta clusters create \
  --name my-cluster \
  --num-gpus 8 \
  --gpu-type H100_SXM \
  --region us-central-8 \
  --billing-type ON_DEMAND \
  --cluster-type KUBERNETES
Specify billing type (reserved vs on-demand):
# Reserved capacity
tg beta clusters create \
  --name my-cluster \
  --num-gpus 8 \
  --gpu-type H100_SXM \
  --region us-central-8 \
  --billing-type RESERVED \
  --duration-days 30 \
  --cluster-type KUBERNETES

# On-demand capacity
tg beta clusters create \
  --name my-cluster \
  --num-gpus 8 \
  --gpu-type H100_SXM \
  --region us-central-8 \
  --billing-type ON_DEMAND \
  --cluster-type KUBERNETES
Delete a cluster:
tg beta clusters delete [CLUSTER_ID]
List clusters:
tg beta clusters list
Scale a cluster:
tg beta clusters update [CLUSTER_ID] --num-gpus 16
Download cluster credentials (kubeconfig):
tg beta clusters get-credentials [CLUSTER_ID] --set-default-context
Run tg beta clusters create with no flags to launch an interactive prompt that walks through the required fields. See the clusters CLI reference for the full command and flag list.

SkyPilot Integration

Orchestrate AI workloads on GPU Clusters using SkyPilot for simplified cluster management and job scheduling.

Installation

uv pip install "skypilot[kubernetes]"

Setup

  1. Launch a Kubernetes cluster via Together Cloud
  2. Configure kubeconfig:
Download the cluster credentials with the Together CLI. This merges the cluster context into your local ~/.kube/config:
tg beta clusters get-credentials [CLUSTER_ID] --set-default-context
  3. Verify SkyPilot access:
sky check k8s
Expected output:
Checking credentials to enable infra for SkyPilot.
  Kubernetes: enabled [compute]
    Allowed contexts:
    └── t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6-admin: enabled.

🎉 Enabled infra 🎉
  Kubernetes [compute]
  4. Check available GPUs:
sky show-gpus --infra k8s

Example: Launch a Workload

Create a SkyPilot task file (task.yaml):
resources:
  accelerators: H100:8
  cloud: kubernetes

setup: |
  pip install torch transformers

run: |
  python train.py
Launch the task:
sky launch -c my-job task.yaml

Example: Fine-tune GPT OSS

Download the gpt-oss-20b.yaml configuration, then launch fine-tuning:
sky launch -c gpt-together gpt-oss-20b.yaml

Benefits

  • Simplified orchestration – Abstract away Kubernetes complexity.
  • Multi-cloud support – Same workflow across different clouds.
  • Cost optimization – Auto-select cheapest available resources.
  • Job management – Easy monitoring and cancellation.
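
The job management point can be sketched with SkyPilot's standard CLI commands. The two wrapper functions below are hypothetical helpers, not part of SkyPilot or the Together CLI; the cluster name you pass is whatever you gave to sky launch -c (my-job in the example above).

```shell
# Show cluster state, the job queue, and logs for a SkyPilot cluster.
# $1 is the SkyPilot cluster name (for example, my-job).
sky_inspect() {
  sky status       # list SkyPilot clusters and their state
  sky queue "$1"   # list jobs queued or running on the cluster
  sky logs "$1"    # stream logs of the most recent job
}

# Cancel all jobs on the cluster, then tear down its SkyPilot resources.
sky_teardown() {
  sky cancel "$1" --all --yes
  sky down "$1" --yes
}
```

Note that sky down removes the resources SkyPilot created for the task. It does not delete the underlying Together GPU cluster, which you would still delete with tg beta clusters delete.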

Automation Patterns

CI/CD Integration

GitHub Actions example:
name: Train Model

on: push

jobs:
  train:
    runs-on: ubuntu-latest
    env:
      TOGETHER_API_KEY: ${{ secrets.TOGETHER_API_KEY }}
    steps:
      - uses: actions/checkout@v4

      - name: Install the Together CLI
        run: uv tool install "together[cli]"

      - name: Create GPU Cluster
        run: |
          tg beta clusters create \
            --name training-${{ github.sha }} \
            --num-gpus 8 \
            --billing-type ON_DEMAND \
            --gpu-type H100_SXM \
            --region us-central-8 \
            --cluster-type KUBERNETES \
            --non-interactive

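      - name: Run Training
        run: |
          # Fetch the kubeconfig so kubectl can reach the new cluster.
          # Replace [CLUSTER_ID] with the ID of the cluster created above.
          tg beta clusters get-credentials [CLUSTER_ID] --set-default-context
          # Submit training job to cluster
          kubectl apply -f training-job.yaml

      - name: Cleanup
        if: always()
        run: |
          # Replace [CLUSTER_ID] with the ID of the cluster created above.
          tg beta clusters delete [CLUSTER_ID]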

Scheduled Jobs

Cron-based cluster creation:
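# Create cluster daily at 6 AM for batch processing.
# A crontab entry must fit on one line; cron does not support backslash line continuations.
# cron runs with a minimal environment, so TOGETHER_API_KEY must be set in the crontab or in a wrapper script.
0 6 * * * tg beta clusters create --name daily-batch --num-gpus 16 --billing-type ON_DEMAND --gpu-type H100_SXM --region us-central-8 --cluster-type KUBERNETES --non-interactive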

Auto-scaling Scripts

Scale a cluster up or down based on demand with the Together CLI:
# Scale based on job queue length
if [ "$JOB_QUEUE_LENGTH" -gt 100 ]; then
  tg beta clusters update [CLUSTER_ID] --num-gpus 16
else
  tg beta clusters update [CLUSTER_ID] --num-gpus 8
fi
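
A single threshold can cause the cluster to flip between sizes when the queue hovers around 100. One way to dampen that is hysteresis: scale up above a high threshold, scale down only below a lower one, and otherwise keep the current size. The sketch below assumes illustrative thresholds (100 and 50) and sizes (8 and 16); next_gpu_count and CURRENT_GPUS are hypothetical names, and you would need to track the current GPU count yourself.

```shell
# next_gpu_count CURRENT_GPUS QUEUE_LENGTH
# Prints the GPU count to use next. Thresholds and sizes are
# illustrative assumptions, not recommendations.
next_gpu_count() {
  local current="$1"
  local queue_len="$2"
  if [ "$queue_len" -gt 100 ]; then
    echo 16            # heavy load: scale up
  elif [ "$queue_len" -lt 50 ]; then
    echo 8             # light load: scale down
  else
    echo "$current"    # in between: keep the current size
  fi
}

# Example usage (not executed here); replace [CLUSTER_ID] with a real ID:
#   target=$(next_gpu_count "$CURRENT_GPUS" "$JOB_QUEUE_LENGTH")
#   if [ "$target" -ne "$CURRENT_GPUS" ]; then
#     tg beta clusters update [CLUSTER_ID] --num-gpus "$target"
#   fi
```

Calling update only when the target differs from the current size also avoids issuing redundant API requests on every run.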

Best Practices

API usage

  • Use environment variables for API keys (never hardcode).
  • Implement retry logic for transient failures.
  • Check cluster status before submitting jobs.
  • Clean up resources after completion.
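
For the cleanup point, a shell script can guarantee that teardown runs even when the job fails by using an EXIT trap. This is a minimal sketch: run_with_cleanup is a hypothetical helper, and the cleanup command is passed as a shell string.

```shell
# run_with_cleanup CLEANUP_COMMAND COMMAND [ARGS...]
# Runs COMMAND, then always runs CLEANUP_COMMAND (a shell string),
# and returns COMMAND's exit status.
run_with_cleanup() (
  cleanup_cmd="$1"
  shift
  # The function body is a subshell, so this EXIT trap fires when the
  # subshell ends, whether COMMAND succeeded or failed.
  trap "$cleanup_cmd" EXIT
  "$@"
)
```

For example, run_with_cleanup "tg beta clusters delete [CLUSTER_ID]" ./run_training.sh would delete the cluster after the script finishes or fails, where ./run_training.sh stands in for your own job script and [CLUSTER_ID] for a real cluster ID.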

CLI usage

  • Set TOGETHER_API_KEY in your environment so commands authenticate automatically.
  • Use cluster IDs for cluster references (more reliable than names).
  • Pass --non-interactive (or --json) to skip prompts in scripts and CI.
  • Script common operations for team consistency.
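
As one way to script a common operation, a team could share a small wrapper that pins the flags everyone should use. The sketch below is hypothetical: create_team_cluster is not part of the Together CLI, and the defaults (H100_SXM, us-central-8, on-demand Kubernetes) are simply copied from the examples on this page.

```shell
# create_team_cluster NAME [NUM_GPUS]
# Creates an on-demand Kubernetes cluster with shared team defaults.
create_team_cluster() {
  local name="${1:-}"
  local num_gpus="${2:-8}"   # default to 8 GPUs when not given
  if [ -z "$name" ]; then
    echo "usage: create_team_cluster NAME [NUM_GPUS]" >&2
    return 2
  fi
  tg beta clusters create \
    --name "$name" \
    --num-gpus "$num_gpus" \
    --gpu-type H100_SXM \
    --region us-central-8 \
    --billing-type ON_DEMAND \
    --cluster-type KUBERNETES \
    --non-interactive
}
```

Keeping the wrapper in a shared repository means a change to the default region or GPU type is made once rather than in every script.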

Troubleshooting

Authentication issues

  • Verify your API key is set: echo $TOGETHER_API_KEY
  • Confirm the key is valid in your account settings
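
Echoing the key prints the secret to your terminal and, in CI, possibly to logs. A small check like the following confirms the variable is set without revealing its value. check_api_key is a hypothetical helper name.

```shell
# Report whether TOGETHER_API_KEY is set without printing its value.
check_api_key() {
  if [ -z "${TOGETHER_API_KEY:-}" ]; then
    echo "TOGETHER_API_KEY is not set" >&2
    return 1
  fi
  # Print only the length, never the key itself.
  echo "TOGETHER_API_KEY is set (${#TOGETHER_API_KEY} characters)"
}
```

Calling check_api_key at the top of a script makes it fail early with a clear message instead of failing later with an authentication error.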

API rate limits

  • Implement exponential backoff
  • Batch operations when possible
  • Contact support for higher limits
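
Exponential backoff can be sketched in plain shell as a wrapper that reruns a command and doubles the wait after each failure. retry_with_backoff is a hypothetical helper. It treats any non-zero exit status as retryable, which is an assumption: it cannot tell a rate-limit error from a permanent one, so it is safest with read-only commands such as tg beta clusters list.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Reruns COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry_with_backoff() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # 2s, 4s, 8s, ... for a base delay of 2
    attempt=$((attempt + 1))
  done
}
```

For example, retry_with_backoff 5 2 tg beta clusters list tries up to five times, waiting 2, 4, 8, and 16 seconds between attempts.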
