Text Version of Podcasts

BigRedButton · October 28, 2017, 11:25pm

That’s impressive indeed. This might suffice, I think, so no one would need to do this by hand.
Developing the algorithm must have been costly - even though I wonder why its dictionary doesn’t contain the term Thimbleweed Park.

By the way, you can “open” the transcript there, in order to copy & paste it and edit it in a text editor.

Someone · October 28, 2017, 11:40pm

Yes, that looks very good. Is it possible for you to upload all other podcasts to YouTube? Even if the lines aren’t always correct, the subtitle system helps deaf people and/or people with hearing problems.

The problem isn’t the algorithm, it’s more the phonetic database. But all Google and especially Android users are helping Google constantly to improve the system.

besmaller · October 29, 2017, 12:32am

First, I’d want to make sure @RonGilbert didn’t have any objections to that. This would mean in essence redistributing his podcast. For this experiment, I kept the file in the “Unlisted” state so it wouldn’t be findable via a search engine, but a direct link like the one I put in that previous post works.

In addition, I’d want to think of a way to streamline that process. There were several manual steps the first time, including a bunch of fiddling around in the program I used to encode the JPEG image file and MP3 podcast file into an MP4 video. The video conversion took around an hour, I think, and the upload to YouTube and their initial processing took quite a while as well.

I’m sure I could find ways to script and streamline the process.

Another option is to just upload them, and pull out the transcript data. Note, this method doesn’t identify the speakers, so we would probably ultimately want to edit those transcripts and clean them up.

I guess we have to ask ourselves if there is value in the video format, where you can see the transcript while the audio is playing, or if we really just want the text.

ZakPhoenixMcKracken · October 29, 2017, 3:09am

My answer is that it’s fine, double fine (oops!), triple fine this way! The live transcript while the audio is playing can really improve your (=my) comprehension of the American English.

nihilquest · October 29, 2017, 3:13am

Some time ago I encountered a site for uploading mp3’s straight to YT, but I can’t find a link right now. It would save you some work.

LowLevel · October 29, 2017, 3:27am

They usually are and in some cases they are even better, for example if the system has more information about the context. In other videos I have seen the sound “Thimbleweed Park” being recognized without issues.

It was. Speech-to-text was one of the first problems addressed by Google’s neural networks, which required many years of research. Also the quantity of data the system continuously learns on is huge and that helps a lot.

I think that a certain amount of cleaning would be necessary anyway, regardless of what software is used. Maybe it would make sense to choose the solution that produces results requiring fewer manual fixes.

There are a lot of videos about Thimbleweed Park on YouTube and I assume that putting the podcasts there (with a link to the original blog post) might be interesting to other fans.

It seems like a lot of work, though. Let’s see what the developers think about it.

Ema · October 29, 2017, 9:46am

That’s a very good thing. I second that, if @RonGilbert, Gary Winnik and @David agree, it would be important to offer a subtitled version of the podcasts, for the reasons given by @Someone and @ZakPhoenixMcKracken.

Anyway, what @besmaller implies is true: having a text version is different from having a subtitled podcast. I’m a fast reader, and I can easily get through a text in a fraction of the time needed to listen to the whole thing, with a less distracting experience and better comprehension. It’s true that I’ll improve my English by listening to the subtitled podcasts, but still, I think a lot of people would prefer plain text. I’m sure that if exporting the subtitles into text is fast and easy (according to @BigRedButton), somebody will do it immediately.

BigRedButton · October 29, 2017, 10:42am

It’s quite simple. You just have to play the video on the YouTube page by clicking on the hyperlinked title of the embedded video above. Then you click on the three dots beside “share” (above “Subscribe”) and select “open transcript”. Then it will be displayed on the right and you can select the text.

Someone · October 29, 2017, 11:02am

Of course, but it has to be feasible for you, too. If it’s too much work to upload the videos to YouTube, then we don’t have to ask Ron at all.

Try FFmpeg: You can convert the audio and merge it with a video/picture in one single step/command line.
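As a rough sketch of that single-step conversion, the Python snippet below builds one FFmpeg command that loops a still image over an audio file and stops when the audio ends. The file names and the helper names `build_ffmpeg_command` and `encode_all` are made up for illustration, and the codec and bitrate choices are only one reasonable assumption, not settings anyone in this thread confirmed.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(image: Path, audio: Path, output: Path) -> List[str]:
    """Return an FFmpeg argument list that pairs one still image with one audio file."""
    return [
        "ffmpeg",
        "-loop", "1",            # repeat the single image as the video stream
        "-i", str(image),
        "-i", str(audio),
        "-c:v", "libx264",
        "-tune", "stillimage",   # x264 tuning for static pictures
        "-c:a", "aac",           # re-encode the audio as AAC for the mp4 container
        "-b:a", "192k",          # assumed bitrate, adjust to taste
        "-pix_fmt", "yuv420p",   # widely compatible pixel format
        "-shortest",             # stop encoding when the audio ends
        str(output),
    ]


def encode_all(audio_dir: Path, image: Path, out_dir: Path) -> None:
    """Encode every mp3 in audio_dir to an mp4 of the same name (needs ffmpeg on PATH)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for audio in sorted(audio_dir.glob("*.mp3")):
        output = out_dir / (audio.stem + ".mp4")
        subprocess.run(build_ffmpeg_command(image, audio, output), check=True)
```

Calling `encode_all(Path("podcasts"), Path("cover.jpg"), Path("videos"))` would then process a whole folder in one go, which is probably where most of the time saving comes from.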

AFAIR you can extract the subtitles from a YouTube video with some tool, so you get the whole text. (Isn’t youtube-dl able to do that…?)

Ema · October 29, 2017, 11:41am

Maybe I’m doing something wrong: I can open the transcript, but I can’t select the text.
Or maybe @Someone is right, and you need a third party utility…

BigRedButton · October 29, 2017, 11:44am

That’s strange. I just did it successfully.
Maybe it’s due to the browser. I am using Firefox with some very restrictive settings.
It’s a pity that I cannot share .txt files here.

Someone · October 29, 2017, 12:06pm

If you can see the text in your browser you can extract it at least with the developer tools in your browser, for example in the source code view.

But you have to do this with each podcast, so the help of a tool would be nice.

BigRedButton · October 29, 2017, 12:32pm

The extraction of the transcripts of 67 videos on YouTube might be the least time-consuming step of such an endeavour. It took me less than a minute for this video.
Separating the speakers and correcting all mistakes is way more time-consuming.
In order to separate the speakers, I would programme a tool which would allow me to create time-stamps in a text file whenever the speaker changes. Just a tool that has a button for each speaker, on which I can click manually. Though, I would still have to listen to every podcast in real-time then. After that I would use a Python script which would add this information to the transcript, taking the time-stamps from YouTube into account, in order to split the text in the right places.
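A minimal sketch of what such a merging script might look like, assuming the transcript has already been parsed into (start time, text) pairs and the button tool has written (time, speaker) pairs. Both formats and the function name `label_transcript` are assumptions for illustration, not an existing tool.

```python
from bisect import bisect_right
from typing import List, Tuple

Cue = Tuple[float, str]          # (start time in seconds, caption text)
SpeakerMark = Tuple[float, str]  # (time in seconds, speaker name)


def label_transcript(cues: List[Cue], marks: List[SpeakerMark]) -> List[str]:
    """Group caption text into one line per speaker turn."""
    marks = sorted(marks)
    mark_times = [t for t, _ in marks]
    lines: List[str] = []
    current = None
    for start, text in sorted(cues):
        # The latest mark at or before this cue decides who is speaking.
        i = bisect_right(mark_times, start) - 1
        speaker = marks[i][1] if i >= 0 else "Unknown"
        if speaker != current:
            lines.append(f"{speaker}: {text}")
            current = speaker
        else:
            lines[-1] += " " + text
    return lines
```

Note that this only splits at caption boundaries, so a speaker change in the middle of a caption would still need fixing by hand.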
Also, you can search for the keyword “park” in order to correct all wrong transcriptions of “Thimbleweed Park”.
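That search could be sketched roughly as follows: flag every transcript line where “park” appears as a whole word without “Thimbleweed” directly in front of it. The function name `find_suspect_lines` is hypothetical, and the heuristic will also flag legitimate uses of the word, so it only narrows down where to look.

```python
import re
from typing import List, Tuple

# Matches "park" as a whole word, plus the word right before it, if any.
PARK = re.compile(r"(?:(\w+)\s+)?\bpark\b", re.IGNORECASE)


def find_suspect_lines(lines: List[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for lines where 'park' is not preceded by 'Thimbleweed'."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        for match in PARK.finditer(line):
            previous = (match.group(1) or "").lower()
            if previous != "thimbleweed":
                suspects.append((number, line))
                break  # one hit per line is enough to flag it
    return suspects
```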
I presume that we would have to listen to every podcast twice. Once for separating the speakers (real-time) and once in order to correct all mistakes (slower than real-time).

Someone · October 29, 2017, 3:44pm

1 minute * 67 videos = 67 minutes = at least 1 hour of work.

youtube-dl is able to save the subtitles to a file (see the section “Subtitle Options” in the documentation). youtube-dl is also able to process a playlist - so why not use the help of this little tool?
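For illustration, here is roughly how the subtitle download for a whole playlist could be driven from Python. The helper names, the output folder, and the playlist URL are placeholders I made up; the command-line flags are the ones youtube-dl documents for subtitles and playlists, but treat the exact combination as an untested assumption.

```python
import subprocess
from typing import List


def build_subtitle_command(url: str, out_dir: str = "subs", lang: str = "en") -> List[str]:
    """Return a youtube-dl argument list that saves auto-generated subtitles only."""
    return [
        "youtube-dl",
        "--skip-download",       # fetch subtitles only, not the video itself
        "--write-auto-sub",      # YouTube's automatically generated captions
        "--sub-lang", lang,
        "--sub-format", "vtt",
        "--yes-playlist",        # process every video if the URL is a playlist
        "-o", f"{out_dir}/%(playlist_index)s-%(title)s.%(ext)s",
        url,
    ]


def download_subtitles(url: str) -> None:
    """Run youtube-dl (must be installed and on PATH)."""
    subprocess.run(build_subtitle_command(url), check=True)
```

Something like `download_subtitles("https://www.youtube.com/playlist?list=PLAYLIST_ID")`, with the real playlist ID filled in, should then leave one subtitle file per video in the `subs` folder.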

BigRedButton · October 29, 2017, 5:18pm

That’s true. Though, I haven’t worked with youtube-dl yet. I presume that it would take a while to get it started. But, if you can save some time with it, why not!
Though, as I already mentioned, you might have to spend many hours with the corrections.

Someone · October 29, 2017, 5:30pm

I would do it/help if someone gives me the links to the videos. (I’ve worked with youtube-dl already.)

besmaller · October 29, 2017, 6:51pm

At this point there is just the one (linked in my earlier post)

besmaller · October 29, 2017, 8:28pm

Thank you! That is exactly the kind of tool I was looking for. And I found this page with advice on exactly how to do this sort of thing: Convert mp3 -> Video with static image ( ffmpeg/libav & BASH ) - Stack Overflow

I’ll try it out today.

Someone · October 29, 2017, 9:38pm

The official FFmpeg wiki is not up to date everywhere. If you need help, let me know.

besmaller · October 30, 2017, 4:46am

OK, I was able to use FFmpeg (thanks @Someone for the reference) to encode the video on the command line, and I’ve scripted that to encode all the podcasts. Then I found a way to upload the videos using the Google YouTube API from a Python script, and I made a small script to do that just for the first 9 podcasts. I also created a second channel on my YouTube account (so my personal name wasn’t used). So here’s the result.

One point of follow-up. I am going to keep this playlist “unlisted” so it isn’t found in searches on YouTube or Google until we get confirmation from Ron that it’s OK, but I’ll add the other podcasts soon. I’m still optimizing my scripts, and I need to add one more capability (to automatically add an uploaded video to an existing playlist), and also add a description to each video pointing to the original source in the Thimbleweed Park Development blog and crediting @RonGilbert, @David, and Gary.
