Official Thimbleweed Park Forums

Text Version of Podcasts


I love you guys!
And also I love AI :blush:


Great work! I will try to download the subtitles later (after work :)).


So, I should stop my manual work, I guess? Or perhaps finish one and see if some things can be improved in the auto-extract?
Note that I’m adding some annotations in the transcript, which no tool could do.


Wow. That is pretty bad (edit: for such an expensive tool), imo. Without proper punctuation and cleaning of speech patterns, stuttering etc., it is quite unreadable unless used as a subtitle.
It may look like actual text at first, but it is very tiring to make sense out of it.


Well, this process is just auto “captioning” for the video format of the podcast. I would see this as an “aid” to proper transcription, doing most of the word entry, but the output should be edited to include speaker names and clean up the occasional (now rarer) transcription errors.

If you are near completion with one or more podcasts, I would finish that work, and then we can compare the effort with someone working to clean up one of the auto-created transcripts, and decide the best way to go forward.


Great idea. I’m not a fast typist, so any base to start from will be useful. For the Friday Questions, I would make sure everyone’s unpronounceable nickname is written correctly by going back to the posts (and copy and paste the questions while I’m at it, of course).


Yes, it’s not bad for an automatic tool, but it has lots of limitations. I needed to double check which post you were replying to, and I see it was the Sonix version. The Google automatic captions are actually much better than the Sonix output, but they don’t do speaker identification, punctuation, etc.


If Sonix can write timestamps in the output, we could merge the speaker information from Sonix and the transcription from YouTube.
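
A rough sketch of how such a merge could work, assuming both sources can be reduced to timestamps in seconds: each YouTube caption cue gets the Sonix speaker whose segment overlaps it the most. The data layout and the names (`overlap`, `assign_speakers`, the tuples) are invented for illustration; neither service's real export format is modeled here.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(cues, segments):
    """cues: [(start, end, text)], segments: [(start, end, speaker)].

    Returns [(speaker_or_None, text)], choosing for each cue the speaker
    segment with the largest time overlap.
    """
    labeled = []
    for c_start, c_end, text in cues:
        best_speaker, best_overlap = None, 0.0
        for s_start, s_end, speaker in segments:
            shared = overlap(c_start, c_end, s_start, s_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled


# Made-up example data, times in seconds.
cues = [
    (9.65, 11.61, "hi I'm Ron Gilbert"),
    (11.61, 14.13, "I'm Gary winning and this is our first"),
]
segments = [(9.0, 11.5, "Ron"), (11.5, 15.0, "Gary")]
print(assign_speakers(cues, segments))
```

Cues that fall in a gap between speaker segments come back with `None`, so they can be labeled by hand afterwards.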


It’s working. :wink:

youtube-dl is able to download all subtitles from all videos in the playlist. The results are WebVTT files that look like this:

Kind: captions
Language: en
::cue(c.colorCCCCCC) { color: rgb(204,204,204);
::cue(c.colorE5E5E5) { color: rgb(229,229,229);

00:00:06.550 --> 00:00:11.610 align:start position:19%

00:00:09.650 --> 00:00:14.130 align:start position:19%
hi<c.colorCCCCCC><00:00:10.650><c> I'm</c><00:00:10.889><c> Ron</c><00:00:11.099><c> Gilbert</c></c>

00:00:11.610 --> 00:00:16.949 align:start position:19%
I'm<c.colorE5E5E5><00:00:11.880><c> Gary</c><00:00:12.150><c> winning</c><00:00:12.480><c> and</c><00:00:12.780><c> this</c><00:00:13.530><c> is</c><00:00:13.650><c> our</c><00:00:13.799><c> first</c></c>

00:00:14.130 --> 00:00:18.449 align:start position:19%
stand-up<c.colorE5E5E5><00:00:15.000><c> meeting</c><00:00:15.360><c> podcast</c><00:00:15.929><c> and</c><00:00:16.350><c> a</c><00:00:16.529><c> stand-up</c></c>

I can convert these files into the SRT format that looks like this:

00:00:06,550 --> 00:00:11,610

00:00:09,650 --> 00:00:14,130
hi I'm Ron Gilbert

00:00:11,610 --> 00:00:16,949
I'm Gary winning and this is our first

00:00:14,130 --> 00:00:18,449
stand-up meeting podcast and a stand-up

00:00:16,949 --> 00:00:21,720
meeting is the thing a project usually

00:00:18,449 --> 00:00:24,810
has in the mornings where everybody on

00:00:21,720 --> 00:00:27,029
the team gets together and talks very
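
For anyone who wants to reproduce the VTT to SRT step without a dedicated tool, here is a minimal Python sketch. It assumes the YouTube auto-caption layout shown above (inline `<c>` and `<00:00:10.650>` tags, cue settings after the end timestamp). The name `vtt_to_srt` is made up for illustration. Two choices are mine: it adds the cue numbers that SRT files normally carry (the sample above shows none), and it skips cues that have no text.

```python
import re

# "00:00:09.650 --> 00:00:14.130", optionally followed by cue settings.
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})")
# Inline tags such as <c>, </c>, <c.colorE5E5E5> and <00:00:10.650>.
TAG = re.compile(r"<[^>]+>")


def vtt_to_srt(vtt_text):
    """Convert YouTube-style WebVTT text to numbered SRT cues."""
    cues = []          # list of (timing_line, caption_text)
    timing = None      # timing line of the cue being read
    text_lines = []    # caption lines of the cue being read

    for line in vtt_text.splitlines():
        match = TIMING.match(line)
        if match:
            if timing and text_lines:
                cues.append((timing, "\n".join(text_lines)))
            # SRT uses a comma before the milliseconds and no cue settings.
            timing = "{},{} --> {},{}".format(*match.groups())
            text_lines = []
        elif timing is not None:
            cleaned = TAG.sub("", line).strip()
            if cleaned:
                text_lines.append(cleaned)

    if timing and text_lines:
        cues.append((timing, "\n".join(text_lines)))

    blocks = [f"{n}\n{t}\n{txt}" for n, (t, txt) in enumerate(cues, 1)]
    return "\n\n".join(blocks) + "\n"
```

Header lines such as `Kind:` and the `::cue` style rules are ignored because nothing is collected before the first timing line.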

A little sed command converts this to plain TXT:

hi I'm Ron Gilbert
I'm Gary winning and this is our first
stand-up meeting podcast and a stand-up
meeting is the thing a project usually
has in the mornings where everybody on
the team gets together and talks very
very briefly about what's going on in
the project what happened what's gonna
happen today and what happened yesterday
usually the meetings are stand up just
so they're quick and they don't drag on
so we're gonna do the stand-up meeting
podcasts they're probably gonna last
less than five minutes and we'll
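
In case sed isn't at hand, roughly the same result can be had in Python. This is a sketch of what such a filter presumably does (drop timestamp lines, cue numbers, and blank lines), not the actual command used here; `srt_to_text` is a hypothetical name.

```python
import re

# An SRT timing line, e.g. "00:00:09,650 --> 00:00:14,130".
TIMING_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt_text):
    """Keep only the caption text lines of an SRT file."""
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line:                 # blank separator between cues
            continue
        if line.isdigit():           # cue sequence number, if present
            continue
        if TIMING_LINE.match(line):  # timing line
            continue
        kept.append(line)
    return "\n".join(kept)
```

One known limitation: a caption line that consists only of digits would be mistaken for a cue number and dropped.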

Which file format do we need? All? :slight_smile:

There are several tools to edit and translate SRT files. So theoretically someone could translate the podcast into, let’s say, Italian and then we could re-import the subtitles into the video. Then everyone could listen to the English podcast with Italian subtitles. But that would be a huge amount of work of course. :slight_smile:


I always knew it. While Gary doesn’t show himself too much publicly, he secretly does all of the winning. So much winning!


So, you would pay for the Sonix service?

I thought the same. Although I liked the PocketSphinx interpretation more, for example: “brenda our red phantom asks how much of the boot park was playboy wire frame before the real art was added.”


No, I forgot about the costs. Sorry.


I hope so.


Can we quantify the Sonix costs? How many hours of podcasts are there?


I like the idea of this. Once the captioning issues are figured out, I’d be happy to put them on the official YouTube channel if that’s desired.


There are approximately 21 hours of audio in the podcasts. At $8/hr that’s $168. The service also costs $15/month on top of that. It would require 1 month at least. The service has a nice editing tool as well as the transcription. However, it’s clear the transcription quality is substantially worse than Google’s, and I don’t know how easy it would be to merge data. Also, looking at the one example I posted earlier, it missed speaker transitions quite a bit, especially with more than 2 speakers.
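
Spelling out that arithmetic as a quick sanity check, using only the figures quoted in this post and assuming a single month of subscription (`sonix_cost` is just an illustrative name, and the free trial hour is not deducted):

```python
def sonix_cost(audio_hours, rate_per_hour=8, monthly_fee=15, months=1):
    """Total cost in dollars: per-hour transcription plus the subscription."""
    return audio_hours * rate_per_hour + monthly_fee * months


print(sonix_cost(21))            # 21 * 8 + 15 = 183
print(sonix_cost(21, months=2))  # 21 * 8 + 30 = 198
```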

The service offers a free trial with one hour of audio transcription. If someone wanted to, they could run one of the podcasts shorter than an hour (Podcast #1, for example, is only about 14 minutes I think) through the free trial and see how well it works, along with the online editing. Theoretically, we could import the transcription made by Sonix into the YouTube video captions, assuming the data is exported with timestamps.


21 hours? This can’t be right, you are missing half an hour!


:slight_smile: Good point! I’ll add these tonight, I’m curious how the Google transcription will work…


I think I have found the name of what we are discussing: Speaker diarisation.

There are some open-source speaker recognition tools out there. Though, I’m not sure about their functionality.

Here is a link to one tool which seems to provide diarization. The “Tutorial for LIA_SpkSeg — Top-down Speaker Segmenting and Clustering System” may be interesting for us.


Ahah, thank you for your mention, but I don’t think I deserve it. Maybe as a “bonus file” :grinning:

…with an English pronunciation with a heavy Italian accent. It’s a tough challenge for the AI!