Advanced aspects of karaoke timing

On this page, we go into more detail about the cues the blue sound spectrum in Aegisub can give you to make your timing precise and, hopefully, more efficient. Several aspects that were mentioned in the general tutorial are explained in more depth here.

Beat

Most syllables in most songs will be on the beat and should therefore also be timed there. However, recognizing the beat is not always easy.

What is beat

Beat is a musical term and not extremely sharply defined. “The rhythm a listener taps their toes to when listening to a song” is a very practical interpretation. For more, you can always consult the Wikipedia article.

In the image above the beats are marked with semi-transparent red lines. Sometimes beats are high peaks and other times they are after an empty space that makes a dent in the bottom line of the spectrum. When this is timed, the single syllables fall directly onto the beats as shown below.

Of course, there are exceptions to this concept. However, if you recognize the beat structures, you will more easily place the syllable at the right spot, leading to fast, precise timing.

Start and end times of lines

Usually the start of a line is also on a beat. However, the end is not always so clear. Sometimes it is also very hard to hear because the voice is fading out in a way. As a rule of thumb, one can generally time until the next beat like this:


If you time both start and end time precisely, most songs will have a small gap of 1 to 2 beats between the lines.
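The rule of thumb for line ends can be sketched in a few lines of Python. This is only an illustration that assumes a constant tempo: the tempo in BPM, the position of the first beat and the function names are assumptions made for this sketch and are not something Aegisub provides. All times are in centiseconds (cs).

```python
import math


def beat_length_cs(bpm: float) -> float:
    """Length of one beat in centiseconds (one minute is 6000 cs)."""
    return 6000.0 / bpm


def next_beat_cs(time_cs: float, first_beat_cs: float, bpm: float) -> float:
    """Return the first beat at or after time_cs, assuming a constant tempo."""
    beat = beat_length_cs(bpm)
    beats_elapsed = math.ceil((time_cs - first_beat_cs) / beat)
    return first_beat_cs + beats_elapsed * beat


# Hypothetical song: 120 BPM with the first beat at 10.00 s.
# A voice that fades out around 12.30 s is timed until the beat at 12.50 s.
line_end = next_beat_cs(1230, first_beat_cs=1000, bpm=120)
```

In practice you read the beat positions off the spectrum rather than computing them, since a recording may not keep a perfectly constant tempo.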

s, f and h sounds

Some consonants show a “sound cloud” in the spectrum, of which usually only a fraction, or nothing at all, should be included in the actual syllable timing. This can commonly be seen for s sounds (also z, x and c in most languages) and, a little less pronounced, for f and h sounds.

The structure between 27.0 and 27.1 here is such a sound cloud belonging to an s sound (shi in this case).

In the timing of this audio stream, you can see that the majority of this sound cloud is spoken before the beat and therefore the syllable start. The proportion of the sound cloud at the beat varies from approximately half to nothing. It is very rare to see more than half of the sound cloud actually belonging to the syllable after the beat.

Slower or faster

As a rule of thumb, one generally sees more of the sound cloud belonging to the syllable the faster the song is. So for very slow songs, the whole sound cloud tends to be outside the syllable and for faster songs, usually about half.
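As a small worked example of this rule, the syllable start can be computed from the sound cloud boundaries and the fraction of the cloud you decide to include. The function name and the idea of expressing the choice as a number are assumptions made for illustration. In practice you judge the fraction by eye and ear.

```python
def syllable_start_cs(cloud_start_cs: int, cloud_end_cs: int,
                      included_fraction: float) -> int:
    """Place the syllable start so that the trailing part of the sound
    cloud (included_fraction of its length) belongs to the syllable."""
    if not 0.0 <= included_fraction <= 1.0:
        raise ValueError("included_fraction must be between 0 and 1")
    cloud_length = cloud_end_cs - cloud_start_cs
    return round(cloud_end_cs - included_fraction * cloud_length)


# s sound cloud between 27.0 s and 27.1 s (2700 cs to 2710 cs):
slow_song_start = syllable_start_cs(2700, 2710, 0.0)    # whole cloud excluded
faster_song_start = syllable_start_cs(2700, 2710, 0.5)  # about half included
```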

The image above shows an h sound cloud between 49.8 and 49.9. It is very similar to an s sound cloud but usually a lot smaller. In this particular example you can also see some kind of microstructure within the sound cloud showing you where the beat is. To make it clearer which patterns are meant, please look at the timed version below.

F sounds usually have a smaller sound cloud intensity compared with the beat. Here you see the f sound cloud at 30.7.

h and f sounds

Please note that f and h sounds commonly (maybe in about 50% of the cases) do not have a sound cloud, whereas s sounds nearly always have one.

st, sp and similar sounds

Whenever you have st or sp sounds, the timing looks best if you exclude the s sound cloud from the syllable completely, as there is usually a smaller or larger gap between the s and t sounds (and other following consonants). Time the syllable to the t sound. Here you can see an example for strange, where the s sound cloud between 22.0 and 22.1 fits better in the previous syllable (it is a merged s sound cloud here between the s from is – 21.9 to 22.0 – and the s from strange).

If these sounds occur at the beginning of a line, it can make sense to separate s-trange if the s at the beginning is emphasized and very long. Usually, it will look better to give the singled out s a \kf effect even if the syllable is shorter than 100cs.
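To illustrate what such a line looks like in the subtitle text, here is a minimal sketch that assembles karaoke tags for s-trange. The helper name and the durations are hypothetical. Only the \k and \kf tags with durations in centiseconds are taken from the karaoke format itself.

```python
def karaoke_text(syllables):
    """Join (tag, duration_cs, text) triples into karaoke-tagged text."""
    return "".join(
        f"{{\\{tag}{duration}}}{text}" for tag, duration, text in syllables
    )


# A long, emphasized s gets \kf although it is shorter than 100cs;
# the rest of the word keeps the regular \k tag. Durations are made up.
line = karaoke_text([("kf", 60, "s"), ("k", 45, "trange")])
```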

If those sounds occur within a word, it usually looks better to separate the syllable between them like for fas-ter here:

m, n and l sounds

Syllables that start with m, n or l often only appear as a baseline in the spectrum and mostly have a dent or other structural abnormality in this baseline to indicate the beat.

I cannot find the beginning of n, m and l syllables

As the beginning of m, n and l sounds is not very visible in the spectrum, it is often also hard to hear because the sounds are so muted. If in doubt, time the syllable to the first vowel and ignore m, n and l completely.

Especially in l sounds, the l sound is audible a little while before the actual beat. In that case, we recommend timing the syllable to the beat.

y sounds

y sounds can be worse than n and m sounds. There are plenty that have a regular beat at the beginning with an easily visible peak, but you might also encounter one that is absolutely invisible in the spectrum.

There might be a slight intensity “break” in the spectrum like in the example above but sometimes you don’t even have that. y sounds can be among the most difficult to time precisely. Unfortunately, only repeated moving, listening, moving, listening will give you experience and a good result. Here is an example where no real structure anomaly can be seen when the y sound starts:

Can this sound also be represented by a different letter?

In some languages this sound is not written with y but j. For clarification: The sound refers to the y in English you.

Syllables starting with a vowel

The start of syllables starting with a vowel is often not easily visible in the spectrum. Here is an example of ho‑u, where the syllables are clearly separated when listening to it but not visually. In this case, you can only listen, move, listen, move until you like the result. There are usually some hints in the spectrum where the separation might be, but there is still some guessing involved.

Here is a second example where the beginning of und is not really visible.

k and t sounds and everything else

k and t sounds are the perfect examples of easily visible, on-time syllables where everything that belongs to the sound also belongs to the syllable.

They usually appear with a pronounced peak that marks the beginning of the syllable.

All other syllables that do not produce a sound cloud are usually somewhere between k and t vs. m, n and l sounds. Most lean more towards the side of k and t, fortunately.

Pauses

When to time pauses cannot be defined in a strictly universal way. Generally, it is only mandatory if you are using the \kf effect on the syllable before the pause. If you do so, there are 2 cases in which pauses are definitely needed. The first one is when the breathing of the singer is audible and there is no singing at the same time.

In the image above you can see such an audible breath from 7.5 to 7.7 (from 7.7 to 8.0 there is no singing). This requires a pause.

The second case concerns the duration of the pause. As a rule of thumb: If the pause is at least one beat long, time it in. For medium-fast to slow songs, timing pauses of even half a beat is recommended.
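The duration rule can be written down as a tiny check. This is a sketch only: the function and parameter names are hypothetical, and whether a song counts as medium-fast or slower remains your own judgement.

```python
def should_time_pause(pause_cs: float, beat_cs: float,
                      medium_fast_or_slower: bool = False) -> bool:
    """Apply the rule of thumb: time a pause of at least one beat,
    or of at least half a beat for medium-fast to slow songs."""
    threshold = beat_cs / 2 if medium_fast_or_slower else beat_cs
    return pause_cs >= threshold


# Hypothetical beat length of 50 cs:
full_beat_pause = should_time_pause(50, beat_cs=50)
short_pause_fast_song = should_time_pause(30, beat_cs=50)
short_pause_slow_song = should_time_pause(30, beat_cs=50,
                                          medium_fast_or_slower=True)
```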

Can a pause be within a word?

Although most pauses are between words, in some songs you will also find pauses within words. It is a bit counterintuitive but don’t be confused by that. If the singer makes a pause between two syllables of one word, that is how it is. Time the pause as it is as the audio is the absolute judge of your timing and no rule can ever be more correct than that.

When should I split lines due to a pause?

As a rule of thumb, splitting a line if the pause is longer than 50cs works really well. Smaller pauses are usually between 15 and 30cs. Those are quite common and make nice splitting points for lines that take up too much space on the screen. If the lyrics fit on one line on screen, you can generally keep them together.
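As an illustration of the 50cs rule, the following sketch groups timed syllables into lines and starts a new line whenever the gap to the previous syllable exceeds the threshold. The data layout (tuples of text, start and end in centiseconds) and the function name are assumptions made for this example.

```python
def split_at_pauses(syllables, min_pause_cs=50):
    """Group (text, start_cs, end_cs) syllables into lines, starting a
    new line when the pause before a syllable is longer than min_pause_cs."""
    lines = []
    current = []
    for syllable in syllables:
        start_cs = syllable[1]
        if current and start_cs - current[-1][2] > min_pause_cs:
            lines.append(current)
            current = []
        current.append(syllable)
    if current:
        lines.append(current)
    return lines


# Made-up timings: a 20 cs pause is kept, a 60 cs pause splits the line.
timed = [("I ", 0, 40), ("sing ", 60, 110), ("a", 170, 200), ("lone", 200, 260)]
lines = split_at_pauses(timed)
```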

Background vocals and overlapping singing

Especially at the beginning, it can be really challenging to time background vocals that cannot be seen well in the spectrum. This is also true for multiple singers who sing different lyrics. In these cases, timing 1 audible voice line like the foreground is a must. However, if you don’t hear the background or second singer’s vocals well, it is better not to time them than to have them out of place. Nevertheless, the amount of background vocals you can hear and how precisely you can time them gets better with time and practice.

There are also some “lyrics” that are commonly considered background and often not included in (official) lyrics you find online like oh, ah, uh etc. Generally, it is very nice to have them in karaoke subs if there are no other lyrics sung at that moment. But at the end, it is personal taste whether to include them or not.

Copying background singing?

In a lot of songs background vocals like oh, oh oh are very similar in timing all the time. Therefore, it can really help to copy the timed lyrics into other lines and just adapt it instead of timing it from scratch for every line.
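Copying works because the durations in the karaoke tags are counted from the start of the line, so the tagged text can be reused unchanged at a different start time. The sketch below sums the tag durations of a timed template to get a first guess for the end time of the copy. The function names and the example timings are hypothetical, and the result still has to be adapted by ear.

```python
import re

# Matches the karaoke tags \k, \kf, \ko and \K and captures the duration.
K_TAG = re.compile(r"\{\\(?:kf|ko|k|K)(\d+)\}")


def karaoke_length_cs(text: str) -> int:
    """Sum of all karaoke tag durations in a line, in centiseconds."""
    return sum(int(duration) for duration in K_TAG.findall(text))


def reuse_timing(template_text: str, new_start_cs: int):
    """Return (start, end, text) for a copy of the template at a new start."""
    return (new_start_cs,
            new_start_cs + karaoke_length_cs(template_text),
            template_text)


template = r"{\k30}oh, {\k25}oh {\k45}oh"
copy = reuse_timing(template, new_start_cs=5000)
```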

If you have 2 lines being sung with overlap, time the overlap exactly as it is. It can be challenging to style them nicely so that everything fits on the screen, but it is better to have it timed as is than not.

Artistically repeated vowels

In some songs you will find words with artistically repeated vowels like in David Bowie’s Starman. Some karaoke subs display it with a tilde (~) but there is a better option that tells the singer more precisely what to expect:

If the repeat is e.g. on the i of light, it could look like this: li-i-ight

This is of course optional but it is a nice cue for the singer.
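The li-i-ight pattern can be produced mechanically once you know which vowel is repeated and how often you hear it. This is a minimal sketch with a hypothetical helper name. The position of the vowel and the number of repeats come from listening to the song.

```python
def split_repeated_vowel(word: str, vowel_index: int, repeats: int):
    """Split a word so that the vowel at vowel_index appears `repeats`
    times, once per piece, e.g. light -> li, i, ight."""
    if repeats < 2:
        return [word]
    head = word[:vowel_index + 1]                  # "li"
    middle = [word[vowel_index]] * (repeats - 2)   # ["i"]
    tail = word[vowel_index:]                      # "ight"
    return [head] + middle + [tail]


display = "-".join(split_repeated_vowel("light", vowel_index=1, repeats=3))
```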

Spoken text

To be clear: This is not about rap! This is about text that is really just said. For example, sometimes you have dialogue at the beginning or in the middle of a song.

If you want, you can include this but then do not time the syllables. Just time the lines, make sure they are single lines and ideally place them at the bottom, so that they are not mistaken for singing.

It will look good like this if you adapt the style so that both primary and secondary colors are white. In that way, you can still apply the karaoke template with lead-in but do not get a color change when the lead-in is over.

Language specifics

Every language has its own characteristics when it comes to sounds and syllables which can be useful to know when separating and timing syllables in a karaoke subtitle. Some languages will be covered here. For the more general rules, please refer to making the karaoke and everything discussed above.

Japanese

The Japanese language has a limited and strictly defined set of syllables that dictates where to separate syllables for karaoke timing. Here is the overview of Japanese syllables and some rules on how to transfer this knowledge into k‑timing:

Most times, 1 syllable is 1 kana (see the table above). The exception is compound syllables with small や、ゆ、よ (ya, yu, yo) like rya りゃ or hyo ひょ displayed on the right side of the chart.
There is one case where romaji transcription might be misleading: n ん followed by a vowel or ya, yu, yo, which could in theory also be one of the n syllables (na, ni, nu, ne, no) or compound syllables with them. In this case, it is common to have an apostrophe separating n and the vowel. If it is missing in the transcript, it is highly recommended to add it. The most common occurrences of this generally rare case are ren’ai and kon’ya.
For double consonants (e.g. somatta), you have two options depending on what you can hear: if you hear the a twice like so-ma-a-ta, you cut so|ma|t|ta. If the a is said once, you cut so|mat|ta (and not “so|ma|tta”).

2 syllables in 1

In any case, you must never ever have two distinct sound bits timed in one syllable if it can be prevented. For example, if it is sung as “so-ma-a-ta” but you cut as “so | mat | ta”, you’ll have “-ma-a-” on your timing for mat and we lose the visual cue for one whole syllable. You must absolutely avoid that kind of error.

If there is a pause within the double consonant like zu - pause - to, the pause reflects the first t of zutto. In this case, separate zut | to and time the pause with the first syllable. It then can look like this:

A very puzzling case can arise when the pause in the above example of zutto becomes very long. As a rule of thumb: Pauses in these cases that are 1.5 times as long as the sound of the first syllable should be timed as a real pause.
Some kana might be hard to distinguish from each other. If in doubt, better keep them together. For example: According to the hiragana chart above, shin should be cut as shi | n but if it is sung as shin, that is only one syllable for our timing. Other common examples are any kind of double vowels like “ii”, “au” and long o sounds ending with “ou”.
Syllables like “tsu”, “chi” or those starting with k and t can also have sound clouds. Those however are nearly always included into the syllable.

Sometimes you will find syllables where only the consonant is audible. This happens commonly for shi > sh, tsu > ts and syllables starting with k. In these cases you have 2 options: either make it visible in the lyrics and don’t separate it like nats’ | ka | shi | i or – the more common option – separate regularly and time the sound cloud as the whole syllable (na | tsu | ka | shi | i).
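The basic separation rules can be sketched as a small romaji splitter. It is a starting point only and rests on several assumptions: lowercase romaji, one word at a time, an apostrophe after a syllabic n where needed, and the so|mat|ta variant for double consonants. It does not listen to the song, so cases like shin sung as one syllable, double vowels kept together, or so|ma|t|ta still need to be adjusted by hand. The function name is hypothetical.

```python
VOWELS = set("aeiou")
APOSTROPHES = set("'’")


def split_romaji(word: str):
    """Split a lowercase romaji word into default karaoke syllables."""
    syllables = []
    current = ""
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in APOSTROPHES:
            # Keep the apostrophe with the syllabic n before it (ren'ai).
            if syllables:
                syllables[-1] += ch
        elif ch in VOWELS:
            # A vowel closes the current syllable.
            syllables.append(current + ch)
            current = ""
        elif ch == "n" and not current and nxt not in VOWELS and nxt != "y":
            # n not followed by a vowel or y is the separate kana n.
            syllables.append("n")
        elif not current and nxt == ch and syllables:
            # Double consonant: the first one joins the previous syllable.
            syllables[-1] += ch
        else:
            current += ch
    if current:
        syllables.append(current)
    return syllables
```

For example, `split_romaji("natsukashii")` yields the regular separation na | tsu | ka | shi | i, and `split_romaji("zutto")` yields zut | to.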

Korean

Like Japanese, the Korean writing system has well-defined syllables. Basic knowledge about the writing system is helpful here as well to understand the correct syllable separation.

The Korean characters are called hangul (or hangeul) and look like this:

The syllables are formed by combining: consonant + vowel + (optional) 1-2 consonants. So the syllables look like this (c – consonant, v - vowel):

You always have to start a Korean syllable with a consonant, however, if it starts with ng, this sound is silent. So syllables starting with ng actually start with a vowel.

The beginning of a syllable can also contain a double consonant. Those are: kk, tt, pp, ss and jj. Additionally, there is a rule that the consonant at the end of a syllable is pronounced together with the next syllable if there is no leading consonant in the next syllable. So if 2 vowels are separated by 1 consonant, the consonant belongs to the latter syllable.

Here is an example for Korean syllable separation:

With this information in mind, there are some cases where the syllables are clear in Hangul but not in romanization. This can happen with 2 vowels that follow each other. But more common is a combination with the consonant “ng”. The unclear cases are ng-(vowel) and ngg-(vowel). Here is a simple example of how the Hangul look, so that you can check in the Korean lyrics. It is highly recommended to indicate the separation of these ambiguous cases with an apostrophe in the transcript to support the karaoke singer.

anga can be either an-ga 안가 or ang-a 앙아
angga can be either an-gga 안까 or ang-ga 앙가

Can I just hear for the difference?

Although it is possible to hear the difference, we strongly advise you to look it up in the hangul lyrics because it is hard to hear for someone not proficient in Korean.
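If you have the hangul lyrics as text, the lookup can also be done programmatically, because each precomposed hangul syllable in Unicode encodes its leading consonant, vowel and final consonant arithmetically. The sketch below uses that layout. The function and constant names are chosen for this example only.

```python
HANGUL_BASE = 0xAC00       # code point of the first precomposed syllable
SYLLABLE_COUNT = 11172     # 19 initials * 21 vowels * 28 finals
VOWEL_COUNT, FINAL_COUNT = 21, 28
FINAL_NONE, FINAL_N, FINAL_NG = 0, 4, 21   # indices in the final table
INITIAL_SILENT = 11                        # silent leading consonant


def decompose(syllable: str):
    """Return the (initial, vowel, final) indices of one hangul syllable."""
    code = ord(syllable) - HANGUL_BASE
    if not 0 <= code < SYLLABLE_COUNT:
        raise ValueError("not a precomposed hangul syllable")
    initial, rest = divmod(code, VOWEL_COUNT * FINAL_COUNT)
    vowel, final = divmod(rest, FINAL_COUNT)
    return initial, vowel, final


def ends_in_ng(syllable: str) -> bool:
    """True if the syllable block ends with the ng consonant."""
    return decompose(syllable)[2] == FINAL_NG


# anga: the first block tells you where to put the apostrophe.
an_ga = ends_in_ng("안가"[0])   # an-ga
ang_a = ends_in_ng("앙아"[0])   # ang-a
```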

There are some more occurrences in the Korean language important for karaoke timing:

The common syllables neun and neul are sometimes hardly audible if at all. Time them to the best of your ability.
Do NOT separate double consonants in Korean. They belong together because they are one sound. Do not confuse this with Japanese.
s at the beginning of a syllable is often transcribed as sh, as it is also pronounced like that. Do not mistake it for s-h. You can easily identify it by looking for the s/sh sound cloud. If in doubt, check the lyrics in hangul.

Korean ch and j produce a sound cloud. Depending on the syllable, excluding anywhere from half to all of the sound cloud from the syllable is recommended.

English

English does not have strictly defined syllables which gives you a bit of freedom in separating them. This freedom often leads to the fact that the timer does not know where to separate the syllables. A good orientation can be found in a dictionary indicating syllables in the phonetic (tran-)script like dictionary.com. For a commonly confusing word like everyone, you find for example:

As the second e in the word is not explicitly spoken, you have to decide where to put it and both options are equally valid. One version is eve-ry-one. You might also encounter a song with a more unique pronunciation that is not presented in the dictionary that rather sounds like e-ve-ry-one. In that case, you of course separate as fits the song best.

Some more things to take note of when timing English songs:

th sounds have a similar sound cloud to h sounds and should also be handled the same

It might be counterintuitive at first to separate contractions like didn’t into two syllables, but it is actually in accordance with dictionary.com and in most cases also makes sense with the beats. Some contractions, like don’t, are usually one syllable.

t sounds can produce rather large sound clouds in English. Those are usually best included in the syllable like the one here at 0.4

y, especially in you, often produces a sound cloud. In most cases, it is best to exclude it from the syllable but as the sound cloud stems from a t-esque sound, sometimes it also looks a lot better when including it. Watch out for the beat to make this decision. Here is an example with the sound cloud at 50.0 to 50.1 excluded:

French

Something that is especially challenging when timing French lyrics is to not have syllables spanning spaces as there are so many contractions in French. You also have to frequently decide where to put letters that are not pronounced on top of that. It might sound impossible if you do not speak French yourself but with a little practice, your ear will adapt to the French sounds and syllable separation will get a lot easier.

Especially lines with y, but also with et, le and est can commonly be challenging to hear like this one:

For syllable separation in general, there are strict rules in French that are basically like this:

2 vowels > separate the vowels
vowel - consonant - vowel > consonant goes to the second syllable
vowel - 2x consonant - vowel > 1 consonant to the first syllable, 1 to the second – this also applies for double consonants
if the 2nd consonant in the case above is r or l, the consonants both go into the 2nd syllable

You can read about the French syllable separation more extensively here.

Some more things to note in French:

ch sounds also produce a sound cloud that is better excluded from the syllable, in the example it can be seen at 34.1

in French, a space is placed before !, ? and : – That is not a mistake. That is correct! However, Aegisub will always think those marks are syllables then, so make sure to always erase the syllable separation there.
the common word rien can often be heard as 2 syllables. The first one usually is very short then. Whether you separate it or not is often a matter of personal taste. However, we recommend separating if possible.

the final consonant of a word is often linked to a following word that starts with a vowel. Don’t let yourself be confused by that and separate according to the space like this, although the second to last syllable sounds like so:

for contractions like je > j’, it can make sense to time the sound cloud of j as a separate syllable like this:
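Regarding the note above on the space before !, ? and :, the clean-up can be pictured as merging every punctuation-only syllable into the syllable before it. The sketch below works on a hypothetical list of (duration in cs, text) pairs and is not an Aegisub function. In Aegisub itself you simply remove the syllable separation by hand.

```python
PUNCTUATION = set("!?:")


def merge_punctuation(syllables):
    """Merge syllables that consist only of !, ? or : (plus spaces)
    into the preceding syllable, adding up the durations."""
    merged = []
    for duration, text in syllables:
        stripped = text.strip()
        only_punctuation = bool(stripped) and all(
            ch in PUNCTUATION for ch in stripped
        )
        if merged and only_punctuation:
            prev_duration, prev_text = merged[-1]
            merged[-1] = (prev_duration + duration, prev_text + text)
        else:
            merged.append((duration, text))
    return merged


# Made-up timing for "Pourquoi ?": the stray "?" syllable is merged away.
cleaned = merge_punctuation([(30, "Pour"), (25, "quoi "), (5, "?")])
```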

German

Like in English, syllable separation can be challenging in German. However, there is an official syllable separation in an orthographical sense, documented for example in the Duden, that can give you orientation. This might not always suit the song best, though. Here you can see the portion of the website that shows the orthographic syllable separation for irgendwie.

One common exception to this is words that have ck but not at the end. Here it usually looks a bit better to separate c|k like here in Soc|ken.

Another exception is words where a single vowel would be the first syllable. Officially, those are not separated, but for karaoke separating them is better. Examples are o|der or ü|ber.

Some more things to note for timing German lyrics:

ch sounds usually have a sound cloud that is better excluded from the syllable, similar to s sounds

v is usually spoken as f and therefore produces the same kind of sound cloud

Words ending in -en are very commonly contracted like this: gehen > geh’n. These contractions are usually not marked in lyrics. Please note that in these cases the word is reduced by one syllable. You can mark the contraction to better support the singer but it is also common to not have it marked in timed lyrics.
If you have a verb in dictionary form without contraction, the last syllable usually looks like nothing is spoken in the spectrum similar to m, n and l syllables.