GitHub - treesan/vcutclaw: 面向批量长视频的多智能体协作音乐同步剪辑系统 Agentic Batch Long-Form Video Editing System with Music Synchronization

🦞 vcutclaw: Agentic Batch Long-Form Video Editing System with Music Synchronization

🎬 Your personal editor: turn a few clips into a cinematic montage, instantly.

Overview • Roadmap • Features • Gallery • Quick Start • CLI Reference • Troubleshooting • Citation • Star History

💡 Overview

vcutclaw is an end-to-end editing system for long-form footage + music.

It first deconstructs raw video/audio into structured captions, then uses a multi-agent pipeline to plan shots (shot_plan), select clip timestamps (shot_point), and validate final quality before rendering.

🗺️ Roadmap

We warmly welcome new issues and ideas from the community. If you have suggestions, please open an issue. Your feedback will help shape our future plans and be the fuel that helps this project take off. 🔥

Short-Term Goals

🧩 ARC-Chapter Integration
Bring in ARC-Chapter to reduce the cost of long-form footage deconstruction.
💸 Low-Cost Mode
Add a budget-friendly mode that proactively reads only relevant footage instead of fully processing all source material.

Long-Term Goals

Broader product and ecosystem directions for the next stage of vcutclaw.

🎯 Clip Preference System
Allow users to specify which clips, time ranges, or subjects (people/landscapes) should receive more shots in the generated plan. For example: "keep more shots from clip DSC_8324 between 2-5s", "preserve more frames of the mountain landscape", or "prioritize shots with the main character". The web UI will support multi-clip selection with visual time range editors.
📱 JianYing Pro / CapCut Draft Export
Generate JianYing Pro (剪映专业版) draft projects from vcutclaw's shot_plan/shot_point, enabling users to further refine edits in a professional NLE. Leverages the jianying-editor-skill API for draft creation, media import, and timeline assembly.
🌐 Set up an online service page
Build a web-based online service interface to lower the barrier to entry and improve deployment convenience.

✨ Key Features

🎬 One-Click Deconstruction

Effortlessly transforms hours-long raw video and audio into structured, searchable assets with a single click.

🎯 Instruction Control

Requires only one text instruction to steer the editing style—easily generating fast-paced character montages or slow-paced emotional narratives.

📱 Smart Auto-Cropping

Content-aware cropping automatically identifies core subjects and adjusts aspect ratios to fit various social platforms.

🎵 Music-Aware Sync

Extracts musical beats and energy signals to build rhythm-aware cuts that perfectly match the music's pacing.

🖼️ Gallery（remember to turn on the audio）

……

🚀 Quick Start

1. Install

git clone https://github.com/treesan/vcutclaw.git
cd vcutclaw
conda create -n CutClaw python=3.12
conda activate CutClaw
pip install -r requirements.txt

We strongly recommend the GPU-accelerated Decord/NVDEC build for faster video decoding. Build from source.

2. Add your files

resource/
├── video/      ← put your .mp4 / .mkv here
├── audio/      ← put your .mp3 / .wav here
└── subtitle/   ← optional .srt (skips ASR, saves time)

3. Run

UI (recommended)

streamlit run app.py

Then open http://localhost:8501 in your browser. (*If http://localhost:8501 does not work well, try http://127.0.0.1:8501)

Place your footage in the paths above, then you can directly select those files in the UI.

Model selection guidance:

Video model
- Role: shot/scene understanding and visual captioning.
- Recommended: Gemini-3, Qwen3.5, GPT-5.3
Audio model
- Role: ASR plus music-structure parsing (beat/downbeat, pitch, energy) for music-aware segmentation.
- Recommended: Gemini-3
Agent model
- Role: drives the Screenwriter + Editor + Reviewer loop to generate shot_plan and shot_point.
- Recommended: MiniMax-2.7, Kimi-2.5, Claude-4.5

We leverage LiteLLM as the api manager gateway, the typical Model name is e.g. 'openai/MiniMax-2.7' which means using openai protocol to call the given model, more information see LiteLLM documents.

CLI (advanced)

python local_run.py \
  --Video_Path "resource/video/xxxx.mp4" \
  --Audio_Path "resource/audio/xxxx.mp3" \
  --Instruction "xxxx"

Common config overrides

Any src/config.py parameter can be overridden with --config.PARAM_NAME VALUE.

Parameter	Default	Effect
`VIDEO_PATH`	`"resource/video/The_Dark_Knight.mkv"`	Default input video path used by UI remembered inputs
`AUDIO_PATH`	`"resource/audio/Way_Down_We_Go.mp3"`	Default input audio path used by UI remembered inputs
`INSTRUCTION`	`"Joker's crazy that want to change the world."`	Default editing instruction prompt
`ASR_BACKEND`	`"litellm"`	ASR engine (`litellm` cloud or `whisper_cpp` local)
`VIDEO_FPS`	`2`	Sampling FPS for preprocessing
`MAIN_CHARACTER_NAME`	`"Joker"`	Protagonist name for character-focused edits
`AUDIO_MIN_SEGMENT_DURATION`	`3.0`	Minimum beat segment duration (seconds)
`AUDIO_MAX_SEGMENT_DURATION`	`5.0`	Maximum beat segment duration (seconds)
`AUDIO_DETECTION_METHODS`	`["downbeat", "pitch", "mel_energy"]`	Audio keypoint detection methods
`PARALLEL_SHOT_MAX_WORKERS`	`4`	Parallel shot selection workers

Example:

python local_run.py \
  --Video_Path "resource/video/xxxx.mp4" \
  --Audio_Path "resource/audio/xxxx.mp3" \
  --Instruction "xxxx" \
  --config.MAIN_CHARACTER_NAME "Batman" \
  --config.VIDEO_FPS 2 \
  --config.AUDIO_TOTAL_SHOTS 50

Then render manually:

python render/render_video.py \
  --shot-plan  "Output/<video_audio>/shot_plan_*.json" \
  --shot-json  "Output/<video_audio>/shot_point_*.json" \
  --video  "resource/video/xxxx.mp4" \
  --audio  "resource/audio/xxxx.mp3" \
  --output "output/final.mp4" \
  --crop-ratio "9:16" \
  --no-labels --render-hook-dialogue

🚀 CLI Quick Reference

All commands must be run from the vcutclaw project directory with the correct conda environment:

cd vcutclaw
conda activate CutClaw

1. Video Preprocessing

Analyze video without BGM (used when BGM is not yet selected):

python local_run.py \
  --Video_Path "resource/video/sample.MOV" \
  --Instruction "video analysis only" \
  --type vlog \
  --preprocess-only

Output: Output/Video/{VIDEO_ID}/captions/scene_summaries_video/ + shot_scenes.txt

2. BGM Rhythm Analysis

Analyze BGM structure (after the content strategist has downloaded the BGM):

python -c "
from src.audio.audio_caption_madmom import caption_audio_with_madmom_segments

caption_audio_with_madmom_segments(
    audio_path='resource/audio/bgm.mp3',
    output_path='Output/Audio/{BGM_ID}/captions/captions.json',
)
"

Output: Output/Audio/{BGM_ID}/captions/captions.json (BPM, structure segments, keypoints)

3. Combined Video + BGM Preprocessing

Run both video and BGM analysis together (preprocess only, no creative generation):

python local_run.py \
  --Video_Path "resource/video/sample.MOV" \
  --Audio_Path "resource/audio/bgm.mp3" \
  --Instruction "preprocess only" \
  --type vlog \
  --preprocess-only

4. BGM Download (Pixabay)

Search and download BGM from Pixabay (free commercial use, no API key required):

# Search
python3 ~/.openclaw/skills/pixabay-music-skill/scripts/pixabay_music.py \
  search "upbeat travel vlog" --max-duration 120

# Download
python3 ~/.openclaw/skills/pixabay-music-skill/scripts/pixabay_music.py \
  download "upbeat travel vlog" \
  -o vcutclaw/resource/audio/bgm.mp3

5. Generate Shot Plan (shot_plan)

Based on scene analysis + BGM structure, the content strategist generates a shot plan:

python src/planner_agent.py \
  --video "resource/video/sample.MOV" \
  --scene-summaries "Output/Video/{VIDEO_ID}/captions/scene_summaries_video" \
  --audio-captions "Output/Audio/{BGM_ID}/captions/captions.json" \
  --subtitle "Output/Video/{VIDEO_ID}/subtitles_with_characters.srt" \
  --bgm-name "bgm.mp3" \
  --output-dir "Output/Output/{VIDEO_ID}_{BGM_ID}" \
  --strategy "fast cuts in first 4s, warm interaction in middle 6s, emotional climax in last 5s" \
  --action shot_plan

6. Generate Shot Point

Generate precise clip timestamps from the confirmed shot plan:

python src/short_video_editor.py \
  --video "resource/video/sample.MOV" \
  --shot-plan "Output/Output/{VIDEO_ID}_{BGM_ID}/shot_plan_xxx.json" \
  --scene-summaries "Output/Video/{VIDEO_ID}/captions/scene_summaries_video" \
  --audio-captions "Output/Audio/{BGM_ID}/captions/captions.json" \
  --scene-cuts "Output/Video/{VIDEO_ID}/frames/shot_scenes.txt" \
  --instruction "warm family outing, 15s beat-sync short video" \
  --shot-point-context "prioritize shots with children laughing" \
  --action shot_point

7. Preview Shot Point (dry-run)

Preview the generated composition without rendering:

python src/short_video_editor.py ... --action dry_run

8. Render Final Video

Once shot points are confirmed, render the final video:

python src/short_video_editor.py \
  --video "resource/video/sample.MOV" \
  --shot-plan "Output/Output/{VIDEO_ID}_{BGM_ID}/shot_plan_xxx.json" \
  --scene-summaries "Output/Video/{VIDEO_ID}/captions/scene_summaries_video" \
  --audio-captions "Output/Audio/{BGM_ID}/captions/captions.json" \
  --action render

9. Batch Editing (Multi-Clip Project)

For projects with multiple source clips (e.g. a trip with 40+ DJI drone videos), use the --project commands to create, preprocess, plan, edit, and render from a unified workflow.

Step 1 — Create a project from a video directory:

python local_run.py --project create \
  --video-dir "/path/to/your/videos" \
  --project-name "My Trip"

Scans all .mp4/.mov files, extracts metadata via ffprobe, and groups clips by recording date.

Step 2 — Review source media consistency:

python local_run.py --project review-sources \
  --project-path "Output/Projects/<project_id>/project.json"

Checks codec, resolution, fps, and colorspace across all clips. Flags issues and reports whether normalization is needed during rendering.

Step 3 — Batch preprocess all clips:

python local_run.py --project preprocess \
  --project-path "Output/Projects/<project_id>/project.json" \
  --type vlog \
  --max-workers 2

Runs shot detection, captioning, scene merge, and scene analysis for every clip in parallel. Supports checkpoint-based resume — if interrupted, rerun the same command to skip completed clips.

Step 4 — Build global material index:

python local_run.py --project build-index \
  --project-path "Output/Projects/<project_id>/project.json"

Aggregates all clip scene summaries into a flat material_index.json for the planner agent to select shots across the entire project.

Step 5 — Generate shot plan (BGM rhythm auto-analysis):

python local_run.py --project plan \
  --project-path "Output/Projects/<project_id>/project.json" \
  --profile bilibili_1080p \
  --strategy "epic drone shots with cinematic transitions"

The planner agent automatically analyzes the BGM (madmom keypoint detection + LLM section/sub-segment captioning), selects scenes from the material index, and generates a cross-clip shot plan. BGM analysis results are cached to bgm_captions/.

Step 6 — Generate shot points (precise timestamps):

python local_run.py --project edit \
  --project-path "Output/Projects/<project_id>/project.json" \
  --profile bilibili_1080p

Reads the shot plan, groups shots by source clip, and runs DirectShotSelector (LLM) per clip to generate precise start/end timestamps. Outputs shot_point_<profile>.json with clip_file_path per shot.

Step 7 — Render the final video:

python local_run.py --project render \
  --project-path "Output/Projects/<project_id>/project.json" \
  --profile bilibili_1080p \
  --extract-timeout 600

Multi-source renderer: validates → extracts → stitches → BGM mix → subtitles → ending video. Shot point auto-discovered from shot_points/ directory. Supports --with-ending, --ending-path, --ending-duration, --ending-fade for appending an outro clip.

Check project status at any time:

python local_run.py --project status \
  --project-path "Output/Projects/<project_id>/project.json"

Batch workflow summary:

# Full pipeline (5 commands)
PROJECT="Output/Projects/MyTrip/project.json"
python local_run.py --project create --video-dir "/videos" --project-name "MyTrip"
python local_run.py --project review-sources --project-path "$PROJECT"
python local_run.py --project preprocess --project-path "$PROJECT" --type vlog --max-workers 2
python local_run.py --project build-index --project-path "$PROJECT"
python local_run.py --project plan --project-path "$PROJECT" --profile bilibili_1080p --strategy "travel vlog"
python local_run.py --project edit --project-path "$PROJECT" --profile bilibili_1080p
python local_run.py --project render --project-path "$PROJECT" --profile bilibili_1080p

10. Key Config Overrides

Common runtime configuration overrides:

python local_run.py ... \
  --config.VIDEO_FPS 2 \
  --config.AUDIO_TOTAL_SHOTS 50 \
  --config.MAIN_CHARACTER_NAME "Tree" \
  --config.MIN_PROTAGONIST_RATIO 0.7 \
  --config.AUDIO_MIN_SEGMENT_DURATION 1.8 \
  --config.AUDIO_MAX_SEGMENT_DURATION 3.8

Output Files

Operation	Output Path	Description
Video Analysis	`Output/Video/{ID}/captions/scene_summaries_video/`	Per-scene descriptions
Scene Cuts	`Output/Video/{ID}/frames/shot_scenes.txt`	Shot boundaries
BGM Analysis	`Output/Audio/{ID}/captions/captions.json`	Rhythm structure + captions
ASR Subtitles	`Output/Video/{ID}/subtitles.srt`	Speech-to-text
Shot Plan	`Output/Output/{ID}/{BGM}/shot_plan_xxx.json`	Creative plan
Shot Point	`Output/Output/{ID}/{BGM}/shot_point_xxx.json`	Precise timestamps
Final Video	`Output/Output/{ID}/{BGM}/output_9x16.mp4`	Rendered video

Batch Editing Outputs

Operation	Output Path	Description
Project	`Output/Projects/{ID}/project.json`	Project metadata + clip list
Source Review	`Output/Projects/{ID}/source_review.json`	Codec/resolution/fps audit
Clip Preprocess	`Output/Projects/{ID}/Clips/{clip_id}/`	Per-clip scene analysis
Checkpoints	`Output/Projects/{ID}/checkpoints/`	Resumable stage state
Material Index	`Output/Projects/{ID}/material_index.json`	Global scene index for planner
BGM Captions	`Output/Projects/{ID}/bgm_captions/`	Auto-generated BGM rhythm analysis
Shot Plan	`Output/Projects/{ID}/shot_plans/shot_plan_<profile>.json`	Cross-clip creative plan
Shot Points	`Output/Projects/{ID}/shot_points/shot_point_<profile>.json`	Per-shot timestamps with source clip
Render Output	`Output/Projects/{ID}/output/<profile>.mp4`	Final multi-source rendered video

🛠️ Troubleshooting

Very slow runtime

API latency — the pipeline sends a large number of concurrent requests to vision/language APIs. Speed is heavily dependent on your API provider's response time and rate limits.
First-run Footage Deconstruction — the first time you process a video, shot detection, captioning, ASR, and scene analysis all run from scratch. This is a one-time cost per video; subsequent edits with the same footage reuse the cached results and are much faster.
GPU acceleration — a CUDA-capable GPU significantly speeds up video decoding and encoding. We recommend building Decord with NVDEC support (see Install section).
Video codec compatibility — if the pipeline appears to hang during video-related steps, the source video's encoding may be the cause. In our testing, videos encoded with libx264 worked reliably.

⭐ Citation

If you find vcutclaw useful for your research, welcome to cite the original work:

@article{cutclaw,
 title={CutClaw: Agentic Hours-Long Video Editing via Music Synchronization},
 author={Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun},
 journal={arXiv preprint arXiv:2603.29664},
 year={2026}
}

📜 License & Attribution

vcutclaw is a derivative work of GVCLab/CutClaw, the original academic research project by Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, and Xiaodong Cun from Beijing Jiaotong University, Great Bay University, and Tencent ARC Lab.

The original codebase and research are (c) GVCLab and its authors.
New features, modifications, and extensions by @treesan are released under the MIT License (see LICENSE).
Please cite the original CutClaw paper if you use this work in your research.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
asset		asset
docs		docs
render		render
resource		resource
src		src
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
app.py		app.py
local_run.py		local_run.py
readme.md		readme.md
readme_zh.md		readme_zh.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

🦞 vcutclaw: Agentic Batch Long-Form Video Editing System with Music Synchronization

💡 Overview

🗺️ Roadmap

Short-Term Goals

Long-Term Goals

✨ Key Features

🎬 One-Click Deconstruction

🎯 Instruction Control

📱 Smart Auto-Cropping

🎵 Music-Aware Sync

🖼️ Gallery（remember to turn on the audio）

……

🚀 Quick Start

1. Install

2. Add your files

3. Run

🚀 CLI Quick Reference

1. Video Preprocessing

2. BGM Rhythm Analysis

3. Combined Video + BGM Preprocessing

4. BGM Download (Pixabay)

5. Generate Shot Plan (shot_plan)

6. Generate Shot Point

7. Preview Shot Point (dry-run)

8. Render Final Video

9. Batch Editing (Multi-Clip Project)

10. Key Config Overrides

Output Files

Batch Editing Outputs

🛠️ Troubleshooting

⭐ Citation

📜 License & Attribution

📈 Star History

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages