Yeah, PM me all or the first ten (whatever is easier for you). That should keep me going for a couple of days.
@besmaller, could you try and see what happens with such time markers in the captions? Also single-ended time markers (as I am using right now, albeit formatted differently).
@someone, is the number before each time marker necessary? If I need to include that, it will make the pure transcript less useful.
Edit: I had a look at the SRT format, and I am afraid you can either add subtitles to YouTube in that format or let Google create its own (in which case all text will be displayed, at least for a very short time), so it's one or the other.
What I'd propose is to use a human-friendly format that can also be posted on the forum here for fast searching or fast readers, and to postprocess it to remove all parts that should not appear in the closed captions (letting Google handle reinserting the time markers; it does a good enough job, IMO). Even manually merging the auto-captions with the cleaned-up transcript, so as to reuse the initial time markers, looks like a huge job to me and, unlike editing the text, not something I want to spend time on.
All in favor, say "aye" (or propose a better alternative)!
Right around the corner of Hogwarts.
The SRT format was made as a compromise to be readable by humans and programs. But you can use it as a starting point. For example, you can easily remove the numbered lines but keep the timestamps. There are text processing programs like "sed" in Linux that can help you with that. (If you tell me what you need, I can pre-process the files for you.) And maybe you want to have a look at other subtitle formats: it should be easy to convert the SRT files to other formats.
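For anyone without sed at hand, here is a minimal Python sketch of the same idea: dropping the numbered lines while keeping the timestamps. It assumes the usual SRT layout (a digits-only counter line directly followed by a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line); the function name `strip_srt_counters` is made up for illustration.

```python
import re

# Assumed SRT timestamp line, e.g. "00:00:01,000 --> 00:00:04,000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def strip_srt_counters(srt_text: str) -> str:
    """Remove the cue counter lines from SRT text, keeping timestamps and captions."""
    lines = srt_text.splitlines()
    kept = []
    for i, line in enumerate(lines):
        is_counter = line.strip().isdigit()
        next_is_timestamp = (
            i + 1 < len(lines) and TIMESTAMP.match(lines[i + 1].strip()) is not None
        )
        # Only drop a digits-only line when a timestamp follows it, so that a
        # caption consisting of just a number is left alone.
        if is_counter and next_is_timestamp:
            continue
        kept.append(line)
    return "\n".join(kept)
```

Checking the following line, rather than deleting every digits-only line, is a deliberate choice: a spoken number on its own caption line would otherwise be lost.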
I'll PM you.
I agree with this. Most of the work is in cleaning up the transcripts and adding speaker names. If we just use (any) consistent syntax for the "extras" beyond just the spoken text, we can write and run a simple script to clean up as needed to allow for import into YouTube for captioning.
I like your thinking. Maybe it would make sense to create a new Topic just for "Development Blog Podcast Transcripts", with a post for each podcast, and the posts could be Wiki posts, so that multiple individuals could edit them. It's better to have a different post per podcast so people can edit different podcast transcripts separately without conflict. Discourse Wiki posts also maintain revision history, and I think it can be configured so all users can view that history. But we'd need Ron's help with configuring the Discourse forum for that. I think the option to make a Wiki post is available now for those at Moderator level.
I like that idea. I would offer my help to maintain the corresponding threads and Wiki posts.
Is it possible in Discourse to create subcategories? If so, we could make "Podcasts" a subcategory of "Development".
Yes. @RonGilbert: What do you think?
I don't think we need Ron's help at first. We can just create a new Topic under the Development/Design Category called Development Blog Podcasts, or just Podcasts, and then add a Post for Podcast #1 (making it a Wiki Post), another Post for Podcast #2 (Wiki Post), etc.
Ron's help would only be needed if we felt we wanted to have a Topic (or thread, as you call it) for each Podcast, in which case we'd either want a new top-level Category for Podcasts, or a subcategory of "Development/Design" for Podcasts (yes, Discourse supports 2 levels of categories, but that is it). But I don't think we need to do it that way.
In that case we would have to "cut" the whole thread into new topics later (additional work; remember we have over 60 long podcasts), comments aren't possible, the thread will be very long, and we can't edit the podcast texts after a while.
So if the other method (one podcast = one topic in a new subcategory) is possible, why not use it immediately? Besides that, I don't know if Ron would like to see all his podcasts transcribed in this forum or on the internet. So I would appreciate a comment from him before posting more texts.
I would prefer non-Wiki posts, in order to prevent chaos. It might be best if one person keeps track of the process. We could, though, spread the podcasts across different volunteers, so that no one would have to "supervise" the transcription process for all 67 podcasts.
In addition, there could be another thread, in which the initial post contains a table that gives an overview of the status for all podcasts - maybe with links to the respective development threads.
Nonetheless, it's still a really, really huge amount of work if we do everything by hand...
And I'm not sure how much time I would be able to put into this.
The problem is that you can't edit normal posts after a while. Wiki posts, on the other hand, can be edited by all forum members.
That's another plus for the Wiki posts: we can put the auto-generated transcripts into single Wiki posts so everyone is able to correct them and work on them.
We'd better make one post to keep track of who is working on which podcast, to avoid two or more people doing the same job in parallel. I will create two posts later today for the first two podcasts, and then I'll start on the third.
@Someone Okay, that's true. I agree with that.
I'm not sure everyone would always check that post before starting to work. For this reason, maybe regular threads would be better than a Wiki, so that no one could interfere with someone else's work. If two people were working on the same paragraphs of the same transcript at the same time, they would at least not be able to mess it up.
OK, so I created two new posts, each for a single podcast transcript.
As you can see the formatting tries to take both closed captions extraction and transcript text into account.
Unfortunately, some parts should be removed from one but not the other and vice versa. So currently, they contain the sum of both. Once we smooth out some extraction scripts AND how to share the common source file, I will edit the transcript post to replace it with the extracted transcript (i.e. removing all the distracting text between < > and all curly brackets { } preserving text in between.)
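A minimal sketch of the transcript extraction described above, assuming the convention is exactly as stated: anything between `<` and `>` is dropped entirely, and curly brackets are removed while the text inside them is kept. The function name `extract_transcript` and the sample line in the usage note are made up for illustration.

```python
import re


def extract_transcript(source: str) -> str:
    """Turn the common source text into the plain transcript."""
    # Drop everything between < and >, including the brackets themselves.
    text = re.sub(r"<[^<>]*>", "", source)
    # Drop the curly brackets but keep the text they enclose.
    text = text.replace("{", "").replace("}", "")
    # Collapse runs of spaces or tabs left behind by the removals.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Trim each line so removed markers at line start or end leave no stray space.
    return "\n".join(line.strip() for line in text.splitlines())
```

For example, `extract_transcript("<00:12> {Ron:} Hello  and welcome.")` gives `Ron: Hello and welcome.`. The closed-captions extraction would be a second function with its own rules once we agree on them.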
@besmaller, could you try and upload the captions once more to see if we have covered everything? If you want, I can write a small Linux script to extract what we need (especially since I have the original txt files, while a copy and paste from the forum might introduce some symbol characters, like the single character the forum substitutes for three dots: (…) ).
I'll get cranking on the next podcasts in the meantime.
But that would mean someone still has to go and merge everything...
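If someone does have to work from text copied out of the forum, the substituted symbols can be mapped back to plain ASCII first. A minimal sketch, assuming the only substitutions are the ellipsis character and curly quotes (the list is a guess and may need extending); `normalize_punctuation` is a made-up name.

```python
# Assumed substitutions made by the forum software; extend as needed.
REPLACEMENTS = {
    "\u2026": "...",  # single ellipsis character -> three dots
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(REPLACEMENTS)


def normalize_punctuation(text: str) -> str:
    """Replace typographic punctuation with its plain ASCII equivalent."""
    return text.translate(_TABLE)
```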
What I was rather suggesting is that 1 single podcast transcript is done by 1 single person. If the quality is good enough, minor corrections could be handled through regular post comments. The risk with a wiki is that it needs perpetual monitoring to avoid some joker messing it up. Note that we can still wiki-fy an existing post (or unwiki-fy if necessary).
So I'm also in favor of regular posts, but each with just 1 owner.
As a better alternative to a dedicated post to keep track of who's transcribing what, I'd propose this way of working: when you commit to transcribing a certain podcast, you create the new topic first (category: Development/Design, title: Transcript Podcast #n) with an initial entry "coming soon", which you later replace.
How does that sound?
Google has posted recommendations for how to format transcripts to upload and use for captioning (I didn't find this until today): Tips for creating a transcript file - YouTube Help
If you can give me a version of the files like that, I can upload them. Alternatively, I can write and run the cleanup script myself, but then we'd need a way to exchange the formatted text files, which, as you point out, isn't easily possible within the Discourse forum. We could explore cloud shares or something similar as an option too.
I know who you mean, but as I wrote above: I would do that and maintain the texts if the work isn't too much.
I don't think that this is a problem. My version history is a wiki post and no one has messed it up - yet.
The problem is that you can't edit the posts after a while. So what happens if we have to rewrite parts of a text? After the post is locked we would only be able to track the edits in the comments below. And ...
... this could lead to the situation that the person can't edit "coming soon" anymore. So the person has to keep "coming soon" in the first post and post the podcast text in the comments below. And shouldn't we publish the auto-generated texts as a help?
I think it would be feasible this way. There is no necessity to edit the first post, even though it would have been represented more clearly by doing it that way. We could just publish everything in the respective thread.
Our transcription threads here would be just for development purposes anyway. Ron mentioned that he would like to add the final transcriptions to the podcast entries in the dev blog, once they are done.
In my opinion, uploading the modified transcripts to YouTube would be nice to have (in addition to the text versions on Ron's dev blog). We could also create our own videos with a different software tool that would display the subtitles in a more attractive way, maybe with a TWP-style font.
I agree with this. People messing up wiki posts are not a real issue in my opinion, for several reasons:
- existing users of this forum have always been very respectful of wiki posts
- new users cannot edit wiki posts. Only users with Trust Level 1 can edit them and this level can be changed in Discourse configuration to make it even more restrictive
- the number of active users of this forum is steadily decreasing; fewer and fewer users pay attention to what's happening here
- even if somebody manages to mess with a wiki post, anyone with TL1 can easily revert it to its previous version
Sadly.
I hope that the transcriptions will nonetheless be interesting for many readers.