Text Version of Podcasts

I love you guys!
And also I love AI :blush:

Great work! I will try to download the subtitles later (after work :)).

So, I should stop my manual work, I guess? Or perhaps finish one and see if some things can be improved in the auto-extract?
Note that I'm adding some annotations in the transcript, which no tool could do.

Wow. That is pretty bad (edit: for such an expensive tool), imo. Without proper punctuation and cleaning of speech patterns, stuttering etc., it is quite unreadable unless used as a subtitle.
It may look like actual text at first, but it is very tiring to make sense out of it.

Well, this process is just auto "captioning" for the video format of the podcast. I would see this as an "aid" to proper transcription, doing most of the word entry, but the output would still need editing to add speaker names and clean up the occasional (now rarer) transcription errors.

If you are near completion with one or more podcasts, I would finish that work, and then we can compare the effort with someone working to clean up one of the auto-created transcripts, and decide the best way to go forward.


Great idea. I'm not a fast typist, so any base to start from will be useful. For the Friday Questions, I would make sure everyone's unpronounceable nickname is written correctly by going back to the posts (and copy and paste the questions while I'm at it, of course).

Yes, it's not bad for an automatic tool, but it has lots of limitations. I needed to double check which post you were replying to, and I see it was the Sonix version. The Google automatic captions are actually much better than the Sonix output, but they don't do speaker identification, punctuation, etc.

If Sonix can write timestamps in the output, we could merge the speaker information from Sonix with the transcription from YouTube.
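To make the merge idea a bit more concrete, here is a minimal Python sketch. It assumes we already have speaker segments with start/end times (from Sonix or any diarization tool) and caption cues with start times (from YouTube). The `SpeakerSegment` and `Cue` types and the `label_cues` function are made up for illustration; neither tool exports data in exactly this shape. Each cue simply gets the speaker whose segment contains the cue's start time.

```python
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    start: float  # seconds from the beginning of the audio
    end: float
    speaker: str


@dataclass
class Cue:
    start: float  # seconds from the beginning of the audio
    text: str


def label_cues(cues, segments, unknown="UNKNOWN"):
    """Attach a speaker name to each cue, based on the cue's start time.

    A cue that starts outside every speaker segment is labeled `unknown`.
    """
    labeled = []
    for cue in cues:
        speaker = unknown
        for seg in segments:
            if seg.start <= cue.start < seg.end:
                speaker = seg.speaker
                break
        labeled.append((speaker, cue.text))
    return labeled


if __name__ == "__main__":
    # Hypothetical speaker segments; the cue times come from the captions.
    segments = [SpeakerSegment(9.0, 11.5, "Ron"), SpeakerSegment(11.5, 30.0, "Gary")]
    cues = [Cue(9.65, "hi I'm Ron Gilbert"), Cue(11.61, "I'm Gary winning and this is our first")]
    for speaker, text in label_cues(cues, segments):
        print(f"{speaker}: {text}")
```

Using only the start time is deliberate: the YouTube cues overlap each other, so matching on the whole display window would often pick the next speaker.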


It's working. :wink:

youtube-dl is able to download all subtitles from all videos in the playlist. The results are WebVTT files that look like this:

WebVTT
WEBVTT
Kind: captions
Language: en
Style:
::cue(c.colorCCCCCC) { color: rgb(204,204,204);
 }
::cue(c.colorE5E5E5) { color: rgb(229,229,229);
 }
##

00:00:06.550 --> 00:00:11.610 align:start position:19%
[Music]

00:00:09.650 --> 00:00:14.130 align:start position:19%
hi<c.colorCCCCCC><00:00:10.650><c> I'm</c><00:00:10.889><c> Ron</c><00:00:11.099><c> Gilbert</c></c>

00:00:11.610 --> 00:00:16.949 align:start position:19%
I'm<c.colorE5E5E5><00:00:11.880><c> Gary</c><00:00:12.150><c> winning</c><00:00:12.480><c> and</c><00:00:12.780><c> this</c><00:00:13.530><c> is</c><00:00:13.650><c> our</c><00:00:13.799><c> first</c></c>

00:00:14.130 --> 00:00:18.449 align:start position:19%
stand-up<c.colorE5E5E5><00:00:15.000><c> meeting</c><00:00:15.360><c> podcast</c><00:00:15.929><c> and</c><00:00:16.350><c> a</c><00:00:16.529><c> stand-up</c></c>
...

I can convert these files into the SRT format that looks like this:

SRT
1
00:00:06,550 --> 00:00:11,610
[Music]

2
00:00:09,650 --> 00:00:14,130
hi I'm Ron Gilbert

3
00:00:11,610 --> 00:00:16,949
I'm Gary winning and this is our first

4
00:00:14,130 --> 00:00:18,449
stand-up meeting podcast and a stand-up

5
00:00:16,949 --> 00:00:21,720
meeting is the thing a project usually

6
00:00:18,449 --> 00:00:24,810
has in the mornings where everybody on

7
00:00:21,720 --> 00:00:27,029
the team gets together and talks very
...
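For anyone who wants to reproduce the WebVTT-to-SRT step, here is a rough Python sketch of one way to do it. This is not necessarily the tool used above; `vtt_to_srt` is a made-up name, and it only handles the simple layout shown in the sample (a timing line followed by text lines, with inline `<c>` and timestamp tags).

```python
import re

# A WebVTT timing line, e.g. "00:00:06.550 --> 00:00:11.610 align:start position:19%"
VTT_TIMING = re.compile(r"^(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})")
# Inline markup such as <c.colorCCCCCC>, <00:00:10.650>, <c>, </c>
VTT_TAG = re.compile(r"<[^>]+>")


def vtt_to_srt(vtt_text):
    """Convert simple WebVTT captions to SRT text.

    Header and style lines are skipped because they do not match the
    timing pattern; cue settings after the end time are dropped.
    """
    lines = vtt_text.splitlines()
    cues = []
    i = 0
    while i < len(lines):
        match = VTT_TIMING.match(lines[i])
        if not match:
            i += 1
            continue
        # SRT uses a comma instead of a dot before the milliseconds.
        timing = f"{match[1]},{match[2]} --> {match[3]},{match[4]}"
        i += 1
        text = []
        # The cue text runs until the next blank line.
        while i < len(lines) and lines[i].strip():
            text.append(VTT_TAG.sub("", lines[i]).strip())
            i += 1
        cues.append((timing, text))

    blocks = []
    for number, (timing, text) in enumerate(cues, start=1):
        blocks.append(f"{number}\n{timing}\n" + "\n".join(text) + "\n")
    return "\n".join(blocks)
```

Calling `vtt_to_srt(open("podcast.en.vtt", encoding="utf-8").read())` (the file name is just an example) returns the SRT text as a string.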

A little sed command converts this to plain TXT:

Summary
[Music]
hi I'm Ron Gilbert
I'm Gary winning and this is our first
stand-up meeting podcast and a stand-up
meeting is the thing a project usually
has in the mornings where everybody on
the team gets together and talks very
very briefly about what's going on in
the project what happened what's gonna
happen today and what happened yesterday
usually the meetings are stand up just
so they're quick and they don't drag on
so we're gonna do the stand-up meeting
podcasts they're probably gonna last
less than five minutes and we'll
...
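The exact sed command isn't shown here, so as a stand-in, this is a small Python sketch that should give the same kind of result: it drops the cue numbers, the timing lines, and the blank lines, and keeps everything else. The name `srt_to_text` is my own.

```python
import re

# An SRT timing line, e.g. "00:00:06,550 --> 00:00:11,610"
SRT_TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt_text):
    """Strip cue numbers, timing lines, and blank lines from SRT text.

    Caveat: a caption line consisting only of digits is indistinguishable
    from a cue number here and would be dropped as well.
    """
    kept = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit() or SRT_TIMING.match(stripped):
            continue
        kept.append(stripped)
    return "\n".join(kept)
```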

Which file format do we need? All? :slight_smile:

There are several tools to edit and translate SRT files. So theoretically someone could translate the podcast into, let's say, Italian, and then we could re-import the subtitles into the video, so everyone could listen to the English podcast with Italian subtitles. But that would be a huge amount of work, of course. :slight_smile:


I always knew it. While Gary doesn't show himself too much publicly, he secretly does all of the winning. So much winning!

So, you would pay the Sonix service?

I thought the same. Although I liked the PocketSphinx interpretation more, for example: "brenda our red phantom asks how much of the boot park was playboy wire frame before the real art was added."


No, I forgot about the costs. Sorry.

I hope so.


Can we quantify the Sonix costs? How many hours of podcasts are there?

I like the idea of this. Once the captioning issues are figured out, I'd be happy to put them on the official YouTube channel if that's desired.


There are approximately 21 hours of audio in the podcasts. At $8/hr that's $168. The service also costs $15/month on top of that. It would require 1 month at least. The service has a nice editing tool as well as the transcription. However, it's clear the transcription quality is substantially worse than Google's, and I don't know how easy it would be to merge data. Also, looking at the one example I posted earlier, it missed speaker transitions quite a bit, especially with more than 2 speakers.
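Putting those numbers together as a quick Python sketch, assuming the prices quoted above and exactly one month of subscription (the `sonix_cost` function is just for illustration):

```python
def sonix_cost(hours, rate_per_hour=8, monthly_fee=15, months=1):
    """Estimated total in dollars: per-hour transcription plus the subscription."""
    return hours * rate_per_hour + monthly_fee * months


# 21 hours at $8/hr is $168, plus one month at $15.
print(sonix_cost(21))  # 183
```

Every extra month of editing time would add another $15.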

The service offers a free trial with one hour of audio transcription. If someone wanted to, they could take one of the podcasts under an hour (Podcast #1, for example, is only about 14 minutes, I think), run it through the free trial, and see how well it works, along with the online editing. Theoretically, we could import the Sonix transcription into the YouTube video captions, assuming the data includes timestamps.

21 hours? This can't be right, you are missing half an hour!

:slight_smile: Good point! I'll add these tonight, I'm curious how the Google transcription will work…

I think I have found the name of what we are discussing: Speaker diarisation.

There are some open-source speaker recognition tools out there, though I'm not sure about their functionality.

Here is a link to one tool which seems to provide diarization. The "Tutorial for LIA_SpkSeg - Top-down Speaker Segmenting and Clustering System" may be interesting for us.


Ahah, thank you for your mention, but I don't think I deserve it. Maybe as a "bonus file" :grinning:

…with an English pronunciation with a heavy Italian accent. It's a tough challenge for the AI!
