Creating Composite Performance Videos in Small Rooms with Meta's SAM2 (Segment Anything Model 2)

Category: Tech Blog
Tags: #Python #Computer Vision #SAM2 #YOLO #Ultralytics

Published: 2026-01-15

Japanese houses are small. But I want to play multiple instruments and make an ensemble video. So, I created a system to segment a person using SAM2 and Ultralytics, and composite multiple performance videos.

Background: Wanting to Make an Ensemble Video in a Small Room

In Japan, rooms are inevitably small, and there is almost no space to line up multiple instruments. I play several of them (guitar, bass, drums, keyboards, and more), so I always run into this problem.

Even if I want to make a "solo ensemble video" or "loop performance video" often seen on YouTube, I need to shoot each instrument in separate cuts and composite them. The traditional approach is to use a green screen (chroma key), but setting up a green screen in a small room is simply not practical.

Isn't there an easier way? I figured that SAM2 (Segment Anything Model 2), released in 2024, should let me cleanly cut out a person (and their instrument, which is actually the harder part) from ordinary indoor footage without any green screen.

What is SAM2?

Segment Anything Model 2 (SAM2), released by Meta in 2024, is a model that can perform real-time segmentation not only on images but also on videos.

Just by specifying the target in the first frame, it tracks it in subsequent frames and generates a mask.
It also handles temporary occlusion of objects.
Paper: SAM 2: Segment Anything in Images and Videos (Ravi et al., 2024)

Why Use Ultralytics?

You can use Meta's official SAM2 API directly, but setting up its dependencies is complicated, and the API is somewhat difficult to handle.

Therefore, I use Ultralytics. Ultralytics is best known as the framework for the YOLO series, and recent versions let you handle YOLO and SAM with the same Python API.

pip install ultralytics

from ultralytics import SAM

model = SAM("sam2_b.pt")  # Load SAM2 Base model

With just this, you can use SAM2. The model weights are automatically downloaded on the first run.
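
Because YOLO and SAM share one API, one possible way to avoid hand-picking a click point is to let YOLO find the person and pass that box to SAM2 as the prompt. The following is only a minimal sketch under assumptions: the `yolo11n.pt` weights, the helper names `pick_person_boxes` and `segment_people`, and the 0.5 confidence threshold are illustrative choices, not part of the setup described in this article. Class id 0 is "person" in COCO-trained YOLO models.

```python
import numpy as np

PERSON_CLASS_ID = 0  # "person" in COCO-trained YOLO models


def pick_person_boxes(xyxy, cls, conf, min_conf=0.5):
    """Keep only confident person boxes (hypothetical helper).

    xyxy: (N, 4) array of [x1, y1, x2, y2]; cls and conf: (N,) arrays.
    """
    xyxy = np.asarray(xyxy, dtype=float).reshape(-1, 4)
    cls = np.asarray(cls).reshape(-1)
    conf = np.asarray(conf, dtype=float).reshape(-1)
    keep = (cls == PERSON_CLASS_ID) & (conf >= min_conf)
    return xyxy[keep]


def segment_people(image_path, yolo_weights="yolo11n.pt", sam_weights="sam2_b.pt"):
    """Detect people with YOLO, then prompt SAM2 with their boxes (sketch)."""
    from ultralytics import SAM, YOLO  # imported lazily: heavy dependency

    det = YOLO(yolo_weights)(image_path)[0]
    boxes = pick_person_boxes(
        det.boxes.xyxy.cpu().numpy(),
        det.boxes.cls.cpu().numpy(),
        det.boxes.conf.cpu().numpy(),
    )
    if len(boxes) == 0:
        return None  # nobody detected in this image
    return SAM(sam_weights)(image_path, bboxes=boxes.tolist())[0]
```

Calling `segment_people("first_frame.jpg")` (a hypothetical file name) would download both sets of weights on first use. The filtering step is kept separate so it can be checked without loading any model.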

Processing Pipeline

The overall flow is as follows.

Input video (performance video of each instrument)
        ↓
  Generate person mask frame by frame with SAM2
        ↓
  Cut out the person area using the mask → Export as RGBA video
        ↓
  Overlay and composite each part video on top of the background video (or image)
        ↓
  Finished composite video

Segment Person with SAM2

from ultralytics import SAM
import cv2
import numpy as np

model = SAM("sam2_b.pt")

# Track the person by specifying a click point in the first frame
results = model.track(
    source="guitar_take.mp4",
    points=[[320, 240]],   # Click the person near the center of the screen
    labels=[1],            # 1 = foreground
    stream=True,
)

masks = []
for r in results:
    if r.masks is not None:
        masks.append(r.masks.data[0].cpu().numpy())
    else:
        masks.append(None)

This collects one mask per frame, or None for frames where no mask was returned.
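
The loop above stores None whenever a frame has no mask, which would leave a hole in the cutout. One simple option is to reuse the most recent valid mask for those frames. This is a minimal sketch: `fill_missing_masks` is a hypothetical helper, and holding the previous mask is an assumption that is only reasonable for short dropouts.

```python
import numpy as np


def fill_missing_masks(masks):
    """Replace None entries with the nearest earlier mask (hypothetical helper).

    Leading None entries, which have no earlier mask, are filled with the
    first valid mask that follows. Returns a new list and raises ValueError
    if every entry is None.
    """
    first_valid = next((m for m in masks if m is not None), None)
    if first_valid is None:
        raise ValueError("no valid mask in the sequence")

    filled = []
    last = first_valid
    for m in masks:
        if m is not None:
            last = m
        filled.append(last)
    return filled


# Tiny demonstration with 2x2 masks instead of real SAM2 output
a = np.array([[1, 0], [0, 0]], dtype=bool)
b = np.array([[0, 1], [0, 0]], dtype=bool)
demo = fill_missing_masks([None, a, None, b, None])
```

In the demonstration, the first three entries end up as `a` and the last two as `b`, so every frame has a usable mask.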

Figure: Cutting out the bass performance

Figure: Cutting out the drum performance

Generate RGBA Video Using Masks

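import os

cap = cv2.VideoCapture("guitar_take.mp4")
# mp4v cannot store an alpha channel and VideoWriter expects 3-channel frames,
# so each RGBA frame is saved as a PNG, which keeps the alpha channel
os.makedirs("guitar_masked", exist_ok=True)

for i, mask in enumerate(masks):
    ret, frame = cap.read()
    if not ret:
        break
    if mask is None:
        # No mask for this frame: write it fully transparent and keep going
        alpha = np.zeros(frame.shape[:2], dtype=np.uint8)
    else:
        alpha = (mask * 255).astype(np.uint8)
    rgba = cv2.cvtColor(frame, cv2.COLOR_BGR2BGRA)
    rgba[:, :, 3] = alpha
    cv2.imwrite(f"guitar_masked/{i:06d}.png", rgba)

cap.release()

This writes each frame with the background made transparent. Because the mp4v codec has no alpha channel, the RGBA video is stored here as a numbered PNG sequence.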
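
The core of this step, attaching a mask to a frame as its alpha channel, can be tried on synthetic data without any video file. The sketch below uses only NumPy. `attach_alpha` and `resize_mask_nearest` are hypothetical helpers, and the nearest-neighbour rescale is an assumption for the case where a mask and its frame differ in size.

```python
import numpy as np


def resize_mask_nearest(mask, height, width):
    """Nearest-neighbour resize of a 2-D mask using plain NumPy indexing."""
    rows = np.arange(height) * mask.shape[0] // height
    cols = np.arange(width) * mask.shape[1] // width
    return mask[rows[:, None], cols]


def attach_alpha(frame_bgr, mask):
    """Return a BGRA frame whose alpha is 255 inside the mask, 0 outside."""
    h, w = frame_bgr.shape[:2]
    mask = np.asarray(mask)
    if mask.shape != (h, w):
        # Assumption: a size mismatch only needs a simple rescale
        mask = resize_mask_nearest(mask, h, w)
    alpha = np.where(mask > 0, 255, 0).astype(np.uint8)
    return np.dstack([frame_bgr, alpha])


# Synthetic 4x4 grey frame and a 2x2 mask covering the top-left quadrant
frame = np.full((4, 4, 3), 128, dtype=np.uint8)
mask = np.array([[1, 0], [0, 0]], dtype=bool)
bgra = attach_alpha(frame, mask)
```

Here `bgra` has four channels, and only the top-left 2x2 block of its alpha channel is opaque.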

Figure: Alpha channel

Composite on Background

# Overlay each part on the background image with alpha blending
def composite(bg, fg_rgba):
    alpha = fg_rgba[:, :, 3:4] / 255.0
    fg_rgb = fg_rgba[:, :, :3]
    return (fg_rgb * alpha + bg * (1 - alpha)).astype(np.uint8)
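
The function above blends a single part onto the background. For an ensemble, the same blend can be repeated for each part in turn. The following is a minimal sketch, assuming every layer is a BGRA array with the same height and width as the background. `over` and `composite_layers` are hypothetical names, and unlike the function above this version rounds instead of truncating.

```python
import numpy as np


def over(bg_bgr, fg_bgra):
    """Alpha-blend one BGRA layer over a BGR background of the same size."""
    alpha = fg_bgra[:, :, 3:4].astype(np.float32) / 255.0
    fg = fg_bgra[:, :, :3].astype(np.float32)
    out = fg * alpha + bg_bgr.astype(np.float32) * (1.0 - alpha)
    return np.clip(out + 0.5, 0, 255).astype(np.uint8)


def composite_layers(bg_bgr, layers):
    """Blend layers in order; later layers end up in front."""
    out = bg_bgr.copy()
    for layer in layers:
        out = over(out, layer)
    return out


# 2x2 demo: black background, an opaque red layer on the left column,
# then a roughly half-transparent blue layer over everything
bg = np.zeros((2, 2, 3), dtype=np.uint8)
red = np.zeros((2, 2, 4), dtype=np.uint8)
red[:, 0] = (0, 0, 255, 255)   # BGR red, opaque, left column only
blue = np.zeros((2, 2, 4), dtype=np.uint8)
blue[:, :] = (255, 0, 0, 128)  # BGR blue, about 50% opaque
result = composite_layers(bg, [red, blue])
```

The order of `layers` decides who stands in front, so the part that should appear closest to the camera goes last.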

Here is a comparison of SAM2 and YOLOv11 after compositing onto the background.

Figure: Composite result

Aside

GPU vs Apple Silicon Inference Speed

I compared inference speed between Google Colab's T4 GPU and a MacBook Pro M4 (Apple Silicon), and there wasn't a huge difference.

Environment	Inference time per frame (approx.)
Google Colab (T4 GPU)	30–50 ms
MacBook Pro M4 (MPS)	40–60 ms

Why is this? One likely reason is that Ultralytics supports many inference optimizations, such as model quantization, conversion to TensorRT / CoreML, and batch processing. The Metal Performance Shaders (MPS) backend on Apple Silicon also appears to be used effectively. This is a hypothesis, not a measured cause.

In reality, for this use case (short performance videos of tens of seconds to a few minutes), both environments provided sufficient throughput.
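
Per-frame figures like those in the table can be collected with a small timing harness. This is a sketch of one way to do it, not the measurement code used for this article. `time_per_frame_ms` is a hypothetical helper, and using the median with a short warm-up is an assumption. On GPU or MPS backends work may be queued asynchronously, so the timed callable should only return once its result is actually available (for example after copying it to the CPU).

```python
import statistics
import time


def time_per_frame_ms(process_frame, frames, warmup=2):
    """Return the median latency in milliseconds of process_frame per frame.

    The first `warmup` frames are run but not timed, since the first calls
    often include one-off costs such as model loading or kernel compilation.
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        process_frame(frame)

    timings = []
    for frame in frames[warmup:]:
        start = time.perf_counter()
        process_frame(frame)
        timings.append((time.perf_counter() - start) * 1000.0)
    if not timings:
        raise ValueError("need more frames than warmup iterations")
    return statistics.median(timings)


# Stand-in workload; in practice process_frame would wrap the model call
calls = []
latency = time_per_frame_ms(lambda f: calls.append(f), range(10), warmup=2)
```

The median is less sensitive than the mean to the occasional slow frame, which matters when a run lasts only a few seconds.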

Summary

With Ultralytics, you can load SAM2 in one line and use it without dependency troubles.
You can combine person detection by YOLO and segmentation by SAM2 with the same API.
The inference speeds of GPU (Colab T4) and Apple Silicon M4 are surprisingly close, and a practical pipeline can be built with just an M4 Mac.
It has become possible to easily create composite performance videos from indoor shooting without a green screen.

Give it a try and composite your own performance videos at home.

atsuya koba
