Speed & Prosody Controls

Quickstart recommendation: just use speed

In most cases, leaving parameters other than speed at their defaults sounds most natural.

pitch_scale and intonation_scale can introduce slight quality degradation when moved away from 1.0.

If you just want to make things a bit faster or slower, try speed alone, and come back to the Advanced Parameters section when you need more.

generate() accepts six keyword arguments for adjusting speech speed, pitch, intonation, and variability.

All are optional, and if none are passed, the trained defaults are used for synthesis.
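
As a quick reference, the six keyword arguments and the defaults listed on this page can be collected in one place. This is only an illustrative sketch: `DEFAULTS` and `build_kwargs` are hypothetical helper names, not part of HayaKoe, and the sketch assumes that passing a default explicitly behaves the same as omitting it.

```python
# Defaults as documented on this page. DEFAULTS and build_kwargs are
# illustrative helpers, not part of the HayaKoe API.
DEFAULTS = {
    "speed": 1.0,
    "pitch_scale": 1.0,
    "intonation_scale": 1.0,
    "sdp_ratio": 0.2,
    "noise": 0.6,
    "noise_w": 0.8,
}


def build_kwargs(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


kwargs = build_kwargs(speed=0.9)  # every other value stays at its default
```

The resulting dictionary could then be unpacked into a call such as `speaker.generate(text, **kwargs)`.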

The samples below all use the same sentence ("今日はどんな国に辿り着くのでしょうか。新しい出会いが楽しみです。") with the tsukuyomi_chan speaker, varying only the parameter in question.

Parameter behavior is identical to that of the infer() function in Style-Bert-VITS2, the project HayaKoe was forked from.

speed — Speech Speed

The default is 1.0; smaller values make speech faster, and larger values make it slower.

Internally, this multiplies the phoneme durations predicted by the Duration Predictor directly by speed, so pronunciation itself is well preserved.
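
To make the scaling concrete, here is a minimal sketch of the idea. It is not HayaKoe's actual code: the function name, the per-phoneme frame counts, and the round-up step are assumptions made for illustration.

```python
import math


def scale_durations(frames_per_phoneme, speed=1.0):
    """Multiply each predicted duration by `speed` and round up to whole frames."""
    return [math.ceil(d * speed) for d in frames_per_phoneme]


predicted = [4.0, 7.5, 3.2, 10.0]          # hypothetical predictor output (frames)
faster = scale_durations(predicted, 0.9)   # speed < 1.0 shortens every phoneme
slower = scale_durations(predicted, 1.1)   # speed > 1.0 lengthens every phoneme
```

Because only the durations change, the phoneme sequence itself is untouched, which matches the observation that pronunciation is well preserved.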

[Audio sample: つくよみちゃん, speed = 1.0 (default)]

Below 0.8, pronunciation starts to blur, and above 1.3, it sounds more "dragged out" than simply "slow".

In practice, 0.9 to 1.1 sounds most natural.

```python
speaker.generate(text, speed=0.9)   # slightly faster
speaker.generate(text, speed=1.1)   # slightly slower
```

Advanced Parameters

The settings below are already natural at their defaults, but can be adjusted when fine-tuning is needed.

Summary

| Parameter | Default | Recommended Range | Effect |
| --- | --- | --- | --- |
| pitch_scale | 1.0 | 0.95 ~ 1.05 | Pitch multiplier. Slight quality loss away from 1.0 |
| intonation_scale | 1.0 | 0.8 ~ 1.3 | Intonation range. Slight quality loss away from 1.0 |
| sdp_ratio | 0.2 | 0.0 ~ 0.5 | Blend ratio of deterministic DP and stochastic SDP |
| noise | 0.6 | 0.3 ~ 0.9 | Voice variability (tonal randomness) |
| noise_w | 0.8 | 0.5 ~ 1.2 | Rhythm variability (SDP noise) |

We recommend moving one parameter at a time.
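
One way to follow this advice systematically is to generate trial settings that each differ from the defaults in exactly one parameter. The sketch below assumes nothing about HayaKoe beyond the parameter names and defaults on this page; `one_at_a_time` and the candidate values are hypothetical.

```python
def one_at_a_time(defaults, candidates):
    """Yield (name, value, kwargs) where kwargs differs from defaults in one key."""
    for name, values in candidates.items():
        for value in values:
            if value == defaults[name]:
                continue  # skip the baseline itself
            yield name, value, {**defaults, name: value}


defaults = {"pitch_scale": 1.0, "intonation_scale": 1.0,
            "sdp_ratio": 0.2, "noise": 0.6, "noise_w": 0.8}
candidates = {"intonation_scale": [0.8, 1.0, 1.3], "noise": [0.3, 0.9]}

trials = list(one_at_a_time(defaults, candidates))
```

Each `kwargs` dictionary in `trials` could be passed as `speaker.generate(text, **kwargs)`, so any audible change can be attributed to a single parameter.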

In the samples below, we intentionally pushed values beyond the recommended range so you can hear the differences.

pitch_scale — Pitch

A simple multiplier that raises or lowers the overall pitch.

Moving away from 1.0 introduces slight quality degradation, so it is recommended to adjust this more narrowly than other parameters.

[Audio sample: つくよみちゃん, pitch_scale = 1.0 (default)]

In the 0.95 to 1.05 range, speaker identity is mostly preserved, but at extreme values the voice sounds like a different person or quality noticeably drops.

```python
speaker.generate(text, pitch_scale=1.05)
```

intonation_scale — Intonation Range

Controls the "width" of intonation variation.

At 0.0 the output is an almost completely monotone, robotic tone, while 2.0 produces an exaggerated reading tone.

Like pitch_scale, moving away from 1.0 introduces slight quality degradation.
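
One way to picture pitch_scale and intonation_scale is as operations on the fundamental-frequency (F0) contour: the first multiplies every voiced frame, the second stretches or flattens the deviation from the mean. The sketch below is an assumption about the mechanism, not a copy of HayaKoe's code. `adjust_f0` is a hypothetical helper, and unvoiced frames are represented by 0.0.

```python
def adjust_f0(f0, pitch_scale=1.0, intonation_scale=1.0):
    """Scale overall pitch, then stretch or flatten deviation from the mean.

    `f0` is a list of per-frame F0 values in Hz; 0.0 marks unvoiced frames.
    """
    scaled = [v * pitch_scale for v in f0]
    voiced = [v for v in scaled if v > 0]
    if not voiced:
        return scaled
    mean = sum(voiced) / len(voiced)
    return [
        (v - mean) * intonation_scale + mean if v > 0 else 0.0
        for v in scaled
    ]


contour = [0.0, 200.0, 240.0, 220.0, 0.0]        # hypothetical contour
flat = adjust_f0(contour, intonation_scale=0.0)  # voiced frames collapse to the mean
wide = adjust_f0(contour, intonation_scale=2.0)  # deviations from the mean double
```

Under this model, an intonation_scale of 0.0 leaves every voiced frame at the mean pitch, which is consistent with the monotone result described above.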

[Audio sample: つくよみちゃん, intonation_scale = 1.0 (default)]

In practice, 0.85 to 1.3 sounds natural.

```python
speaker.generate(text, intonation_scale=1.2)
```

sdp_ratio — Deterministic/Stochastic Duration Blend

HayaKoe (and Style-Bert-VITS2) uses two types of duration predictors together.

  • DP (Deterministic Duration Predictor) — Always produces the same duration for the same text
  • SDP (Stochastic Duration Predictor) — Produces slightly different durations each time

sdp_ratio is the blend ratio between the two, where 0.0 uses DP only and 1.0 uses SDP only.

Higher values increase rhythm variation within sentences, and results differ with each run for the same text.
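
The blend described above can be sketched as a simple linear interpolation between the two predictors' outputs. This is a simplified illustration: `blend` is a hypothetical name, the numbers are made up, and whether the real model blends raw or log durations is not stated on this page.

```python
def blend(dp_out, sdp_out, sdp_ratio):
    """Linear blend: 0.0 keeps the DP output, 1.0 keeps the SDP output."""
    return [s * sdp_ratio + d * (1.0 - sdp_ratio)
            for d, s in zip(dp_out, sdp_out)]


dp_out = [1.0, 2.0, 1.5]    # hypothetical deterministic predictions
sdp_out = [1.4, 1.6, 1.5]   # hypothetical stochastic predictions (one draw)

dp_only = blend(dp_out, sdp_out, 0.0)
sdp_only = blend(dp_out, sdp_out, 1.0)
halfway = blend(dp_out, sdp_out, 0.5)
```

Since only the SDP term changes between runs, its weight determines how much of that run-to-run variation reaches the final durations.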

[Audio sample: つくよみちゃん, sdp_ratio = 0.25 (DP-dominant)]

For services where reproducibility matters (e.g., fixed subtitle timing), set it to 0.0; for one-off generation, 0.2 ~ 0.4 sounds natural.

```python
speaker.generate(text, sdp_ratio=0.0)   # durations (timing) are identical on every run
```

noise / noise_w — Voice & Rhythm Variability

The two parameters control noise injected at different stages of the model; neither adds audible noise to the output audio itself.

  • noise — Voice variability. Controls overall tonal randomness in the Flow stage. Always has an effect regardless of sdp_ratio.
  • noise_w — Rhythm variability. Noise fed into the SDP (stochastic predictor). Has no effect when sdp_ratio is 0.
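
The division of labor between the two parameters can be sketched as follows. This is a toy model under stated assumptions, not HayaKoe's implementation: both functions are hypothetical, the SDP is replaced by a stand-in that adds scaled Gaussian noise, and the Flow stage is reduced to sampling around a mean.

```python
import random


def sample_latent(mean, std, noise, rng):
    """Flow-stage sketch: mean plus Gaussian noise scaled by `noise`."""
    return [m + rng.gauss(0.0, 1.0) * s * noise for m, s in zip(mean, std)]


def sample_durations(dp_out, noise_w, sdp_ratio, rng):
    """Duration sketch: the SDP branch is perturbed by `noise_w`, then blended."""
    # Stand-in for the stochastic predictor, for illustration only.
    sdp_out = [d + rng.gauss(0.0, 1.0) * noise_w for d in dp_out]
    return [s * sdp_ratio + d * (1.0 - sdp_ratio)
            for d, s in zip(dp_out, sdp_out)]


rng = random.Random(0)
dp_out = [1.0, 2.0, 1.5]
fixed = sample_durations(dp_out, noise_w=0.8, sdp_ratio=0.0, rng=rng)   # noise_w is ignored
varied = sample_durations(dp_out, noise_w=0.8, sdp_ratio=0.5, rng=rng)  # noise_w perturbs timing
still = sample_latent([0.0, 0.0], [1.0, 1.0], noise=0.0, rng=rng)       # stays at the mean
```

In this toy model, `noise_w` only reaches the output through the SDP branch, so a `sdp_ratio` of 0 cancels it, while `noise` acts on a separate stage and is unaffected by `sdp_ratio`.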

The samples below were generated with all other parameters at their defaults, changing only the respective noise value.

[Audio sample: つくよみちゃん, noise = 0.6 (default)]
[Audio sample: つくよみちゃん, noise_w = 0.8 (default)]

In most cases, leaving the defaults (0.6, 0.8) sounds most natural.

If you feel the output is "wobbling too much", try lowering the corresponding noise slightly; if it sounds "too mechanical", try raising it a bit.