BERT GPU Retention & Batch Inference

On the GPU path, two key optimization points stand out.

  • Removing the unnecessary CPU transfer from the original SBV2 to keep BERT output as a GPU tensor
  • Batching multi-sentence BERT calls into a single invocation to reduce kernel launch overhead (the fixed cost incurred each time an operation is dispatched to the GPU)

Why It Matters

The original SBV2 fundamentally synthesizes the entire input text in one pass (line_split=False).

Since BERT is called only once, batching was not needed.

HayaKoe introduced punctuation-based sentence splitting for prosody stability, which created the new problem of BERT being called once per sentence.

Unnecessary CPU Transfer of BERT Output

The original SBV2's BERT feature extraction code contains this:

python
# Original SBV2 (style_bert_vits2/nlp/japanese/bert_feature.py)
res = torch.cat(res["hidden_states"][-3:-2], -1)[0].cpu()

After running BERT forward on GPU, it calls .cpu() on the output tensor, transferring it to CPU every time.

This output is then passed to the Synthesizer, which runs on GPU, requiring another transfer back to GPU.

The result is a GPU -> CPU -> GPU round-trip per sentence, and this unnecessary round-trip itself becomes a bottleneck.
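
As a rough, standalone illustration (not HayaKoe code), the sketch below times an avoidable GPU -> CPU -> GPU round-trip against simply keeping the tensor on the GPU; the tensor shape and iteration count are arbitrary.

python
import time

import torch

# Standalone sketch: compare forcing a GPU -> CPU -> GPU round-trip (what a stray
# .cpu() causes downstream) against keeping the feature tensor on the GPU.
device = "cuda"
feats = torch.randn(256, 1024, device=device)  # stand-in for a BERT feature tensor

torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    _ = feats.cpu().to(device)   # round-trip per "sentence"
torch.cuda.synchronize()
t1 = time.perf_counter()
for _ in range(100):
    _ = feats.float()            # stays on the GPU
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"round-trip: {(t1 - t0) * 10:.3f} ms/call, on-GPU: {(t2 - t1) * 10:.3f} ms/call")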

Per-sentence Individual BERT Calls

When calling BERT separately for each sentence after splitting, GPU kernel launches repeat once per sentence.

A kernel launch is the fixed cost incurred each time an operation is dispatched to the GPU.

For short sentences, this overhead can exceed the actual computation time, so the inefficiency accumulates in proportion to the number of sentences.
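
As a minimal, self-contained illustration (again not HayaKoe code; sizes and counts are arbitrary), the sketch below contrasts many small GPU operations with one batched operation, showing why per-call launch overhead dominates when each piece of work is small.

python
import time

import torch

# Standalone sketch: 64 tiny matmuls (one launch each) vs a single batched matmul.
device = "cuda"
w = torch.randn(768, 768, device=device)
pieces = [torch.randn(16, 768, device=device) for _ in range(64)]  # short "sentences"

torch.cuda.synchronize()
t0 = time.perf_counter()
_ = [p @ w for p in pieces]        # one kernel launch per piece
torch.cuda.synchronize()
t1 = time.perf_counter()

_ = torch.cat(pieces, dim=0) @ w   # all pieces in a single launch
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"per-piece: {(t1 - t0) * 1e3:.2f} ms, batched: {(t2 - t1) * 1e3:.2f} ms")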

Implementation

Removing .cpu() — Keeping GPU Tensors

The original .cpu() call was removed so that the BERT output is passed to the Synthesizer directly as a GPU tensor.

python
# Original SBV2
res = torch.cat(res["hidden_states"][-3:-2], -1)[0].cpu()    # GPU -> CPU

# HayaKoe
res = torch.cat(res["hidden_states"][-3:-2], -1)[0].float()  # stays on GPU

The BERT model itself is loaded to GPU at prepare() time and stays there through inference.

The BERT model is managed as a global singleton, so loading multiple speakers still loads BERT only once, shared by all.
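
A minimal sketch of this pattern, assuming Hugging Face Transformers (the model name, function name, and prepare-time hook here are illustrative, not HayaKoe's actual API):

python
import torch
from transformers import AutoModel, AutoTokenizer

# Module-level singletons: loaded once at prepare() time, shared by all speakers.
_TOKENIZER = None
_BERT = None

def get_bert(model_name: str = "ku-nlp/deberta-v2-large-japanese-char-wwm",
             device: str = "cuda"):
    """Load the BERT (DeBERTa) model once and keep it resident on the GPU."""
    global _TOKENIZER, _BERT
    if _BERT is None:
        _TOKENIZER = AutoTokenizer.from_pretrained(model_name)
        _BERT = AutoModel.from_pretrained(model_name).to(device).eval()
    return _TOKENIZER, _BERT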

Multi-sentence BERT Batching

The BERT (DeBERTa) used by HayaKoe is a HuggingFace Transformer model that natively supports batch input.

Leveraging this, instead of calling BERT individually for each sentence in multi-sentence synthesis, all sentences are grouped into a single batch for one call.

Multiple sentences are fed to the tokenizer at once to create a padded batch input, and BERT is called only once.
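
A minimal sketch of the idea, assuming a Hugging Face tokenizer and model as above (the function name mirrors the layer slice from the earlier snippet but is not the exact HayaKoe implementation):

python
import torch

# Tokenize all sentences into one padded batch, run BERT once, then slice the
# padded positions off per sentence. Everything stays on the GPU.
@torch.inference_mode()
def batched_bert_features(sentences, tokenizer, bert, device="cuda"):
    enc = tokenizer(sentences, padding=True, return_tensors="pt").to(device)
    out = bert(**enc, output_hidden_states=True)
    hidden = torch.cat(out.hidden_states[-3:-2], dim=-1)  # same layer slice as above
    feats = []
    for i, _ in enumerate(sentences):
        length = int(enc["attention_mask"][i].sum())       # real (unpadded) token count
        feats.append(hidden[i, :length].float())           # GPU tensor per sentence
    return feats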

The same batch logic is implemented on the ONNX path as well.
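
On the ONNX path, the same padded-batch idea might look like the sketch below (the session's input names and single-output layout are assumptions, not the actual HayaKoe exporter contract):

python
import numpy as np
import onnxruntime as ort

# One session.run() for all sentences; slice off the padding afterwards.
def batched_bert_features_onnx(sentences, tokenizer, session: ort.InferenceSession):
    enc = tokenizer(sentences, padding=True, return_tensors="np")
    wanted = {i.name for i in session.get_inputs()}
    inputs = {k: v.astype(np.int64) for k, v in enc.items() if k in wanted}
    (hidden,) = session.run(None, inputs)          # assumed output: (batch, seq, dim)
    lengths = enc["attention_mask"].sum(axis=1)
    return [hidden[i, : int(lengths[i])] for i in range(len(sentences))]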

Improvement Results

GPU Batch Inference Speed

Sequential vs batched comparison on the same hardware (5-run average).

Sentences | Sequential | Batched  | Speedup
2         | 0.447 s    | 0.364 s  | 1.23x
4         | 0.812 s    | 0.566 s  | 1.43x
8         | 1.598 s    | 1.121 s  | 1.43x
16        | 2.972 s    | 2.264 s  | 1.31x

With the per-sentence kernel launches consolidated into a single call, speedups of 23% to 43% are observed.

GPU Memory

We verified that batching does not consume additional memory.

Sentences | Sequential peak | Batched peak | Difference
2         | 1,662.2 MB      | 1,661.9 MB   | −0.3 MB
4         | 1,661.8 MB      | 1,662.2 MB   | +0.4 MB
8         | 1,697.7 MB      | 1,699.0 MB   | +1.3 MB
16        | 1,934.3 MB      | 1,934.3 MB   | 0 MB

The difference between sequential and batched is within 1.3 MB, essentially identical.

No Effect on CPU

Repeating the same experiment on CPU (ONNX) shows virtually no batching benefit.

Sentences | Sequential | Batched  | Speed difference
2         | 2.566 s    | 2.564 s  | 1.00x
4         | 5.464 s    | 4.855 s  | 1.13x
8         | 10.647 s   | 11.783 s | 0.90x
16        | 24.559 s   | 24.195 s | 1.01x

ONNX Runtime's graph optimization is already strong enough that Python-level dispatch overhead is not the bottleneck, and padding overhead in batching offsets the gains.

Batching is kept on the GPU path; since neither the gains nor the losses are significant on CPU, the same path is used there as well for code consistency across backends.