Thimbleweed Park Italian fan dub Project - Official Thread(TM)

Alright, back to business. :grinning:

I’ve created a hacked version of Rhubarb that I’d like you to try. For now, I’m using this forum thread to discuss it. If anybody has a better suggestion, let me know!

First, download the .exe file and use it to replace the original file. If you’re on OS X or Linux, creating a native binary might take a bit.

The basic idea of my hack is this: Normally, Rhubarb tries to recognize whole words (and phrases). Since Rhubarb only knows English, it has a hard time finding English words and phrases that resemble the Italian dialog. That’s why the results are often rather inaccurate.

My hack simply works at a finer granularity. Instead of looking for whole English words and phrases, it now looks for English phones and syllables. So the underlying language model is still English, but the chances that a given Italian phone is similar to an existing English phone are pretty good. And the chances that a given Italian syllable has a matching English syllable are still not bad.

There is, however, still some fine-tuning to be done. If Rhubarb only worked at the syllable level, there would still be many Italian syllables it couldn’t match. As a result, the animation would look wrong in those places. Worse, the mouth could even stop moving for a moment if Rhubarb really couldn’t find a suitable match.

The obvious solution would be not to work at the syllable level, but only at the phone level. Most Italian phones are also present in the English language. The problem here is fluttering. If the voice actor is holding a long phone that sits exactly between two known English phones, Rhubarb might first recognize phone A, then phone B, then A again and so on, while the speaker is actually still producing the same sound. As a result, the animated mouth might flutter between several shapes during a single phone, which looks quite bad.

The solution, then, is to blend the two approaches. I’ve temporarily added an additional (mandatory) command-line argument modelWeight. If you specify a high value (such as 2.0), Rhubarb will try to recognize whole syllables, leading to imprecise or freezing animation. If you specify a low value (such as 0.1), Rhubarb will try to recognize individual phones, leading to fluttering. I found that the value 1.0 seems to work well, balancing the advantages of both approaches. But I haven’t tried any other values between the two extremes, so something like 0.8 or 1.3 could work even better.

Also, I’ve only tried the new approach with a short one-minute dialog containing Italian, Spanish, French, and German. Trying it out on a larger body of recordings may give additional insights.
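To make the trade-off a bit more concrete, here is a toy sketch in Python. It is not Rhubarb's code, and all names and numbers in it are made up for illustration. It assumes that the weight acts roughly like a penalty on switching from one recognized unit to the next: a small decoder picks, frame by frame, between two hypothetical phones "A" and "B", and `model_weight` scales how strongly it prefers to stay on the same phone. The weight values are not on the same scale as Rhubarb's modelWeight.

```python
import math

# Hypothetical per-frame log-scores for two candidate phones "A" and "B".
# A held sound that sits between A and B: the better match alternates slightly.
FLUTTER_FRAMES = [
    {"A": -1.0, "B": -1.2} if i % 2 == 0 else {"A": -1.2, "B": -1.0}
    for i in range(9)
]
# A genuine change: five frames that clearly match A, then three that match B.
CHANGE_FRAMES = [{"A": -0.5, "B": -3.0}] * 5 + [{"A": -3.0, "B": -0.5}] * 3

STAY_LOGP = math.log(0.9)    # assumed log-probability of keeping the same phone
SWITCH_LOGP = math.log(0.1)  # assumed log-probability of changing phone


def decode(frames, model_weight):
    """Return the best-scoring phone sequence (Viterbi search).

    The total score is the sum of the per-frame scores plus
    model_weight times the stay/switch log-probabilities.
    """
    labels = list(frames[0])
    best = {label: frames[0][label] for label in labels}
    backpointers = []
    for scores in frames[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            candidates = {
                prev: best[prev]
                + model_weight * (STAY_LOGP if prev == cur else SWITCH_LOGP)
                for prev in labels
            }
            prev_best = max(candidates, key=candidates.get)
            new_best[cur] = candidates[prev_best] + scores[cur]
            pointers[cur] = prev_best
        best = new_best
        backpointers.append(pointers)
    # Walk the back-pointers from the best final phone to the first frame.
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


def count_switches(path):
    """Number of frame-to-frame changes of the recognized phone."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


if __name__ == "__main__":
    for weight in (0.0, 1.0, 5.0):
        held = decode(FLUTTER_FRAMES, weight)
        change = decode(CHANGE_FRAMES, weight)
        print(f"weight={weight}: held sound {''.join(held)}, "
              f"real change {''.join(change)}")
```

In this toy setup, a weight of 0.0 follows every tiny fluctuation of the held sound (fluttering), a very large weight ignores even the genuine change from A to B (freezing), and a value in between keeps the held sound stable while still following the real change.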

My plan is to settle for a fixed model weight before the release. Then I’ll add a new command-line option to switch between the original, word-based recognition (which looks best for English) and the phonetic recognition (which will hopefully work better for non-English dialog).

Let me know what you think! I’m grateful for any feedback. And if you found a modelWeight value that seems to work better than 1.0, let me know.
