Saney, 14 December 2011, 11:52:53 (UTC)

In order to get some sentences with interrelated context and audio, I plan to write a (Linux shell) script that takes a movie and two subtitle files as arguments. The idea is to link the sentences from the two subtitle files and have sox (an audio editing program) cut the matching audio out of the movie, using the timestamps given in the subtitle files. All this information would be written to a file that can be imported into Anki.
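A minimal sketch of that pipeline, written in Python rather than shell for brevity. It assumes both subtitle files are in SubRip (.srt) format and that the audio track has already been extracted from the movie into a WAV file. All function names, file names, and the 0.5 overlap threshold are hypothetical choices for illustration. Cues are paired by how much their time ranges overlap, the sox call is only built as an argument list (using sox's `trim` effect with a start and a length), and the output rows are tab-separated text with Anki's `[sound:...]` field syntax.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the beginning of the movie
    end: float
    text: str


# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,000".
_TIME = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)


def _seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(data):
    """Parse SRT text into a list of Cue objects (one per subtitle block)."""
    cues = []
    for block in re.split(r"\n\s*\n", data.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = _TIME.search(line)
            if match:
                g = match.groups()
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


def overlap(a, b):
    """Length in seconds of the time range shared by two cues."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align(src, dst, min_ratio=0.5):
    """Pair each source cue with the target cue that overlaps it most.

    A pair is kept only if the overlap covers at least min_ratio of the
    shorter cue; everything else is dropped as a probable mismatch.
    """
    pairs = []
    for a in src:
        best = max(dst, key=lambda b: overlap(a, b), default=None)
        if best is None:
            continue
        shorter = min(a.end - a.start, best.end - best.start)
        if shorter > 0 and overlap(a, best) / shorter >= min_ratio:
            pairs.append((a, best))
    return pairs


def sox_trim_command(audio_in, audio_out, cue):
    """Build (but do not run) a sox call that cuts one cue out of the audio."""
    length = cue.end - cue.start
    return ["sox", audio_in, audio_out, "trim", f"{cue.start:.3f}", f"{length:.3f}"]


def anki_rows(pairs, clip_prefix="clip"):
    """Tab-separated rows: source text, target text, sound field."""
    return [
        f"{a.text}\t{b.text}\t[sound:{clip_prefix}_{n:04d}.wav]"
        for n, (a, b) in enumerate(pairs, 1)
    ]
```

Each command list could be handed to `subprocess.run`, and the rows written to a text file for import. Pairing by timestamps alone says nothing about whether the two texts actually mean the same thing, so the output would still need a manual check.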

Now, couldn't this be used for the Tatoeba corpus as well (text only, not the audio, due to obvious licensing problems)? I guess there's some kind of bulk import on the admin side. I quickly checked the first subtitle site I could think of, opensubtitles.org, and found this on their homepage:

"As a name suggest OpenSubtitles.org is trying to be as open as possible. You can code application, script, utility or whatever you think is nice. First check which applications exists here [a link], then follow this [another link]."

What do you think? Or did someone have the idea already anyway?

sacredceltic, 14 December 2011, 12:37:25 (UTC)

Subtitles rarely match at the sentence level, since they have to fit a format that is readable within the timing of the scene. It's interesting to look at double subtitling in countries such as Belgium (where foreign films are sometimes subtitled in both French and Dutch). I have noticed many times that the two sets of subtitles didn't match. Sometimes they were not even close, and sometimes whole sentences were simply skipped for lack of reading time...

So I doubt this is interesting on the sentence level...

sysko, 14 December 2011, 13:06:48 (UTC)

Using http://www.universalsubtitles.org/en/database, it could still be worth having some kind of application that displays each subtitle and its translation in two columns and lets a user quickly "validate" the pairs that match at the sentence level. Could that be a good compromise?

As for OpenSubtitles, movie scripts are themselves protected by copyright. It is the same as with a book, where both the text itself and the publisher's choices (how the paragraphs are cut, the font, etc.) are protected. So although it is "open" in the sense of "they make reuse by others easy", I think they are in a grey area in terms of copyright.

sysko, 14 December 2011, 13:08:09 (UTC)

I should add that the Universal Subtitles script is what we use for the video shown on the Tatoeba main page to users who are not logged in.

jakov, 14 December 2011, 14:12:29 (UTC)

A feature to post a sentence from a video with an open license (!) would be great, especially if it could also add the sound (although I doubt the level of quality due to compression and ambient noise). Of course, such a tool should add the appropriate tags and comments for correct attribution of the source.

sacredceltic, 14 December 2011, 14:14:24 (UTC)

>although i doubt the level of quality due to compression and ambient noise

You watch too many action movies...

jakov, 14 December 2011, 14:38:59 (UTC)

:) No, I just wanted us to keep that in mind. Maybe it would be funny to have sentences like: « Je vais te tuer ! [pouff] » ("I'm going to kill you! [pouff]")

sacredceltic, 14 December 2011, 14:46:26 (UTC)

[fra] « Je vais te tuer ! [poum*] »

sacredceltic, 14 December 2011, 14:43:46 (UTC)

What's that you say? http://tatoeba.org/fre/sentences/show/1299152

Saney, 20 December 2011, 08:55:07 (UTC)

Sorry for pulling this up again, but I just wanted to say two things. First, thanks for all your comments! There are really some points worth considering. Also, I wasn't aware of universalsubtitles.org; I'll check that out sometime. Second, I guess I will eventually try to implement the proposed script, but it may not happen during the next few weeks. If I achieve something that may be of use for Tatoeba, I'll leave another comment. While I'm at it: thanks to everyone contributing here. This has become my one and only source of learning material. It's awesome to have native speaker audio and sentences that are batch-processable :-)