Text Version of Podcasts

Sushi · November 1, 2017, 1:39pm

Yeah, PM me all or the first ten (whatever is easier for you). That should keep me going for a couple of days.

Sushi · November 1, 2017, 1:44pm

@besmaller, could you try and see what happens with such time markers in the captions? Also single-ended time markers (as I am using right now, albeit formatted differently).

@someone, is the number before each time marker necessary? If I need to include that, it will make the pure transcript less useful.

Edit: I had a look at the SRT format, and I am afraid you can either add subtitles to YouTube in that format or let Google create its own (in which case all text will be displayed, at least for a very short time), so it’s either one or the other.
What I’d propose is to use a human-friendly format which can also be put on the forum here for fast searching or fast readers, and to postprocess it to remove all parts that should not appear in the closed captions (letting Google handle reinserting the time markers; it does a good enough job, IMO). Even manually merging the auto-captions with the cleaned-up transcript, so as to reuse the initial time markers, looks like a huge job to me and, unlike editing the text, not something I want to spend time on.
All in favor, say “aye” (or propose a better alternative)!

milanfahrnholz · November 1, 2017, 2:08pm

Right around the corner of Hogwarts.

Someone · November 1, 2017, 3:31pm

The SRT format was made as a compromise to be readable by humans and programs. But you can use it as a starting point. For example you can easily remove the numbered lines, but keep the timestamps. There are text processing programs like “sed” in Linux that can help you with that. (If you tell me what you need, I can pre-process the files for you.) And maybe you want to have a look at other subtitle formats: It should be easy to convert the SRT files to other formats.
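As a rough illustration of the "remove the numbered lines, keep the timestamps" idea, here is a minimal Python sketch (Python rather than sed, purely for readability). It assumes the usual SRT layout of a counter line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the caption text, and a blank line; the function name `strip_cue_numbers` is made up for this example.

```python
import re

# A standard SRT timestamp line, e.g. "00:00:01,000 --> 00:00:04,000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def strip_cue_numbers(srt_text: str) -> str:
    """Drop the counter line in front of each timestamp; keep everything else."""
    lines = srt_text.splitlines()
    kept = []
    for i, line in enumerate(lines):
        is_counter = line.strip().isdigit()
        next_is_timestamp = (
            i + 1 < len(lines) and TIMESTAMP.match(lines[i + 1].strip()) is not None
        )
        # Only skip a number when it directly precedes a timestamp, so that
        # spoken text that happens to be a bare number is left alone.
        if is_counter and next_is_timestamp:
            continue
        kept.append(line)
    return "\n".join(kept)
```

Checking the following line before dropping a number is a deliberate choice: a caption whose text is just a number would otherwise be deleted along with the counters.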

I’ll PM you.

besmaller · November 1, 2017, 3:34pm

I agree with this. Most of the work is in cleaning up the transcripts and adding speaker names. If we just use (any) consistent syntax for the ‘extras’ beyond just the spoken text, we can write and run a simple script to clean up as needed to allow for import into YouTube for captioning.

besmaller · November 1, 2017, 8:52pm

I like your thinking. Maybe it would make sense to create a new Topic just for “Development Blog Podcast Transcripts”, with a post for each podcast, and the posts could be Wiki posts, so that multiple individuals could edit them. It’s better to have a different post per podcast so people can edit different podcast transcripts separately without conflict. Discourse Wiki posts also maintain revision history, and I think it can be configured so all users can view that history. But we’d need Ron’s help with configuring the discourse forum for that. I think the option to make a Wiki post is available now for those at Moderator level.

Someone · November 1, 2017, 9:28pm

I like that idea. I would offer my help to maintain the corresponding threads and Wiki posts.

Is it possible in Discourse to create sub categories? If so, we could make “Podcasts” a sub category of “Development”.

Yes. @RonGilbert: What do you think?

besmaller · November 2, 2017, 2:11am

I don’t think we need Ron’s help at first. We can just create a new Topic under the Development/Design Category called Development Blog Podcasts, or just Podcasts, and then add a Post for Podcast #1 (making it a Wiki Post), another Post for Podcast #2 (Wiki Post), etc.

Ron’s help would only be needed if we felt we wanted to have a Topic (or thread, as you call it) for each Podcast, in which case we’d either want a new Top level Category for Podcasts, or a subcategory of “Development/Design” for Podcasts (yes, Discourse supports 2 levels of categories, but that is it). But I don’t think we need to do it that way.

Someone · November 2, 2017, 10:06am

In that case we would have to “cut” the whole thread into new topics later (= additional work; remember we have over 60 long podcasts), comments aren’t possible, the thread will be very long, and we can’t edit the podcast texts after a while.

So if the other method (one podcast = one topic in a new sub-category) is possible, why not use it immediately? Besides that: I don’t know if Ron would like to see all his podcasts transcribed in this forum or on the internet. So I would appreciate a comment from him before posting more texts.

BigRedButton · November 2, 2017, 1:04pm

I would prefer non-Wiki posts, in order to prevent chaos. It might be best if one guy keeps track of the process. We could, though, split the podcasts among different volunteers, so that no one would have to “supervise” the transcription process for all 67 podcasts.

In addition, there could be another thread, in which the initial post contains a table that gives an overview of the status for all podcasts - maybe with links to the respective development threads.

Nonetheless, it’s still a really really huge amount of work, if we do everything by hand…
And, I’m not sure how much time I would be able to put into this.

Someone · November 2, 2017, 2:20pm

The problem is that you can’t edit normal posts after a while. Wiki posts, on the other hand, can be edited by all forum members.

That’s another plus for the Wiki posts: We can put the auto-generated transcripts into single Wiki posts so everyone is able to correct them and work on them.

Sushi · November 2, 2017, 6:14pm

We had better make one post to keep track of who will work on which podcast, to avoid two or more people doing the same job in parallel. I will create two posts later today for the first two podcasts and then I’ll start on the third.

BigRedButton · November 2, 2017, 7:57pm

@Someone Okay, that’s true. I agree with that.

I’m not sure if everyone would always check out this post before starting to work. Maybe usual threads would be better than a Wiki for this reason, so that no one would be able to interfere with someone else’s work. If two people were working on the same paragraphs in the same transcript at the same time, they would at least not be able to mess it up.

Sushi · November 3, 2017, 1:06am

OK, so I created two new posts, each for a single podcast transcript.
As you can see the formatting tries to take both closed captions extraction and transcript text into account.
Unfortunately, some parts should be removed from one but not the other and vice versa. So currently, they contain the sum of both. Once we have smoothed out the extraction scripts AND worked out how to share the common source file, I will edit the transcript post to replace it with the extracted transcript (i.e. removing all the distracting text between < > and removing all curly brackets { } while preserving the text inside them).
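The transcript extraction described here could look roughly like the Python sketch below. It only covers the transcript side (drop `< >` spans with their contents, drop `{ }` characters but keep what is inside); the caption-side rules are not spelled out in this thread, so they are left out. The function name and the sample line in the usage note are invented for illustration.

```python
import re


def extract_transcript(source: str) -> str:
    """Turn the combined source text into the reader-friendly transcript."""
    # Remove every <...> span, brackets and contents together.
    text = re.sub(r"<[^<>]*>", "", source)
    # Remove the curly brackets themselves but keep the text inside them.
    text = text.replace("{", "").replace("}", "")
    # Collapse the runs of spaces left behind by the removals.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Trim each line without touching the line structure.
    return "\n".join(line.strip() for line in text.splitlines())
```

For example, a hypothetical source line such as `<00:12> {Ron:} Hello <laughs> there.` would come out as `Ron: Hello there.`.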

@besmaller, could you try and upload the captions once more to see if we have covered everything? If you want, I can write a small Linux script to extract what we need (especially since I have the original txt files, while a copy and paste from the forum might introduce some symbol characters, such as a single “…” character in place of three dots).
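If text does end up being copied from the forum, the typographic characters it introduces can be mapped back to plain ones before further processing. A small sketch, assuming that only the ellipsis and curly quotes need handling (the actual set of affected characters is a guess, and the names are made up):

```python
# Typographic characters mapped back to their plain ASCII forms.
# The list is an assumption; extend it if other symbols show up.
REPLACEMENTS = {
    "\u2026": "...",  # single-character ellipsis
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
}
TABLE = str.maketrans(REPLACEMENTS)


def to_plain_punctuation(text: str) -> str:
    """Replace forum-style typographic punctuation with plain ASCII."""
    return text.translate(TABLE)
```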

I’ll get cranking on the next podcasts in the meantime

Sushi · November 3, 2017, 1:24am

but that would mean someone still has to go and merge everything…

What I was rather suggesting is that 1 single podcast transcript is done by 1 single person. If the quality is good enough, minor corrections could be handled through regular post comments. The risk with a wiki is that it needs perpetual monitoring to avoid some joker messing it up. Note that we can still wiki-fy an existing post (or unwiki-fy if necessary).

So I’m also in favor of regular posts, but each with just 1 owner.
As a better alternative to a dedicated post to keep track of who’s transcribing what, I’d propose this way of working: when you commit yourself to transcribing a certain podcast, you create the new topic first (category: Development/Design, title: Transcript Podcast #n) with an initial entry “coming soon”, which you later replace.

How does that sound?

besmaller · November 3, 2017, 2:02am

Google has posted recommendations for how to format transcripts to upload and use for captioning (I didn’t find this until today): Tips for creating a transcript file - YouTube Help

If you can give me a version of the files like that I can upload them. Alternatively, I can write and run the cleanup script myself, but then we’d need a way to exchange the formatted text files, which as you point out, isn’t easily possible within the discourse forum. We could explore cloud shares or something as an option too.

Someone · November 3, 2017, 10:24am

I know who you mean, but as I wrote above: I would do that/maintain the texts if the work isn’t too much.

I don’t think that this is a problem. My version history is a wiki post and no one has messed it up - yet.

The problem is that you can’t edit the posts after a while. So what happens if we have to rewrite parts of a text? After the post is locked we would only be able to track the edits in the comments below. And …

… this could lead to the situation that the person can’t edit “coming soon” anymore. So the person has to keep “coming soon” in the first post and post the podcast text in the comments below. And shouldn’t we publish the auto-generated texts as a help?

BigRedButton · November 3, 2017, 11:00am

I think it would be feasible this way. There is no necessity to edit the first post, even though it would have been represented more clearly by doing it that way. We could just publish everything in the respective thread.

Our transcription threads here would be just for development purposes anyway. Ron mentioned that he would like to add the final transcriptions to the podcast entries in the dev blog, once they are done.

In my opinion, uploading the modified transcripts on YouTube would be nice to have (in addition to the text versions on Ron’s dev blog). Though, we could also create our own videos with a different software tool that would display the subtitles in a more attractive way - maybe with a TWP style font.

LowLevel · November 3, 2017, 10:18pm

I agree with this. People messing up wiki posts are not a real issue in my opinion, for several reasons:

- existing users of this forum have always been very respectful of wiki posts
- new users cannot edit wiki posts. Only users with Trust Level 1 can edit them, and this level can be changed in the Discourse configuration to make it even more restrictive
- the number of active users of this forum is steadily decreasing; there are fewer and fewer users who pay attention to what’s happening here
- even if somebody manages to mess with a wiki post, anyone with TL1 can easily revert it to its previous version

BigRedButton · November 4, 2017, 12:42am

Sadly.
I hope that the transcriptions will be nonetheless interesting for many readers.
