Tips

Here you can ask general questions like how to use Tatoeba, report bugs or strange behavior, or simply socialize with the rest of the community.

Before asking a question, make sure to read the FAQ.

Wall (678 threads)

  1. Hi everyone.

    I've been using Anki (spaced repetition software) for the longest time to learn German vocabulary. I've modified a script (based on one by SteveAW) that lets Anki users look up flash card content and open the selected word or sentence in the browser. It's here -> https://ankiweb.net/shared/info/443435286

    Let me know if you find it useful. If you need help to configure it for your language of choice, let me know.

  2. Hi,

    When searching on the annotation editing screen for Japanese sentences, is there a way to stop the search from trimming spaces? For example, if I search for '四 ' in the search box, it actually searches for '四'. This makes it difficult to find instances where '四' stands alone and has not been qualified.

    Thanks
    Ray
    • Yes, you have to actually type <space>. What you want to search for is: 四<space>
      • I’m afraid it doesn’t work.
      • This worked for me by typing the literal string '<space>', thank you!
      • Oops, spoke too soon. It works for search, but when I try to replace and preview, it doesn't find anything, for example when I search for '<space>四<space>' to replace it with '<space>四(よん)<space>'.
        • It's only when searching that you need to use <space>, and only for spaces at the beginning or the end.
          When replacing you would use a normal space.
          • I'm guessing the algorithm only works when <space> is at the beginning OR the end then? Because when it's both it doesn't seem to work. :(
            • Searching "<space>四<space>" works fine for me.
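The rule described in this thread (the input is trimmed, and a literal <space> token is honoured only at the beginning or the end of the search string) can be sketched as follows. This is a hypothetical illustration of that behaviour, not Tatoeba's actual code; the function name `expand_query` and the constant `SPACE_TOKEN` are made up for the example.

```python
SPACE_TOKEN = "<space>"  # the placeholder mentioned in the replies above


def expand_query(raw: str) -> str:
    """Trim the raw input, then turn leading/trailing <space> tokens into real spaces.

    Hypothetical sketch: it only models the behaviour reported in the thread.
    """
    query = raw.strip()  # plain spaces at the edges are lost at this point

    prefix = ""
    while query.startswith(SPACE_TOKEN):
        prefix += " "
        query = query[len(SPACE_TOKEN):]

    suffix = ""
    while query.endswith(SPACE_TOKEN):
        suffix += " "
        query = query[:-len(SPACE_TOKEN)]

    return prefix + query + suffix


# A typed trailing space is trimmed away; the token survives as a real space.
print(repr(expand_query("四 ")))
print(repr(expand_query("<space>四<space>")))
```

In a replacement string no trimming would be needed, which matches the advice above to use a normal space there.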
  3. ** Tatoeba update (August 16, 2014) **

    http://blog.tatoeba.org/2014/08...t-16-2014.html


    # New download URLs

    We have changed the URL of the download files[1], which contain the data that we redistribute. The files are now also compressed. The old URL is still available for the time being, but will no longer contain the latest data.

    # International targeting

    We've included the necessary HTML tags for Google to display the results in the relevant language[2], and not systematically in English.
    On a related topic, we still (and will always) need people to help us translate Tatoeba's interface into other languages. If you would like to help, check out the instructions here[3].

    # Donation news

    We'd like to thank our two latest donors, Dmitriy and Aleksandr. We've had 8 donations so far, which amount to a total of 295€. The top donation was 100€.

    -----

    [1] http://tatoeba.org/downloads
    [2] https://support.google.com/webm.../6059209?hl=en
    [3] http://en.wiki.tatoeba.org/arti...ce-translation
    • >We've included the necessary HTML tags for Google to display the results in the relevant language, and not systematically in English.

      :)
    • >On a related topic, we still (and will always) need people to help us translate Tatoeba's interface into other languages.

      Nobody ever validates the French translations on Launchpad, so the names of new languages remain in English forever...
    • Not to nitpick, but the links on

      http://tatoeba.org/eng/downloads

      are wrong as there happens to be no file

      http://downloads.tatoeba.org/ex...tailed.tar.bz2

      and no file links.tar.bz2;
      however, there is a file

      http://downloads.tatoeba.org/ex..._links.tar.bz2

      PS: The change successfully broke my automatic download system .... :)
      • Sorry about breaking your automatic download system, but thanks for reporting this. I wrote an issue ticket about this. We'll fix it soon.
      • Should be fixed ^^
        • Yes, I verified that it's working now. Thanks, Trang!
  4. I noticed that the tag count on this page: http://tatoeba.org/ita/tags/view_all doesn't increase when an existing tag (any) is added to a sentence.
    • Thanks. I added an issue ticket for this. I intend to go through all the places where Tatoeba displays a count (number of sentences per language, etc.) and make sure that all relevant operations change the count appropriately.
  5. Is there a JSON or YAML version of the sentence database anywhere, or any tools to make one from the daily downloads? I run this site: http://jgram.org and wanted to update our examples.

    Ideally something like:

    cname: uniqueId
    ex:
      en: sentence in english
      ja: same in japanese
    tags: [list, of, tags]

    or similar?
    • Have you seen the downloads section on this site: http://tatoeba.org/eng/downloads ?

      The format is very different from what you describe, though.
      • Yes, I have, but to match English and Japanese I'd need to write a script to pair sentences together using links.csv. I'm sure I could manage that, but I was just wondering what already exists. It would make the content much more accessible for people if that step weren't required.

        I came across this: https://github.com/allan-simon/tatodb
        but running my own graph DB written in C is a bit esoteric. A JSON version of the corpus that could go straight into MongoDB would be ideal.
        • You want a table format, while the Tatoeba corpus is a graph. See the difference here: http://blog.tatoeba.org/2010/02...eba.html#rule2

          A graph format cannot be converted to a table without losing information. For example, the sentence "Ul alma aşıy" may be translated as "He eats an apple" and "She eats an apple" or even "It eats an apple" (and also "He is eating an apple", "She is eating an apple"). In your format, it cannot be stored:
          {
          tat: "Ul alma aşıy.",
          yue: "佢食緊個蘋果。",
          eng: ???
          }

          Which variant should we choose? "He eats an apple", or "She eats an apple", or "He is eating an apple", or "She is eating an apple"?

          The format you are asking for just doesn’t work well for Tatoeba.

          (It can, however, work for a subset of Tatoeba data. If you just need Japanese examples, you can use only one translation, probably chosen at random, and ignore all the others.

          Of course, different people need different subsets, so Tatoeba can’t provide all the possible subsets. You need to generate your own subset yourself.)

          Also, you probably don’t need TatoDB.
    • Thanks for the replies!

      Is it a true graph DB in that the links are not always both ways?

      Otherwise, if it's just a question of multiple entries, you can easily do that with a JSON/YAML format or a document DB like MongoDB: you just add an array for the entries:

      {
      _id: 1234,
      tat: ["Ul alma aşıy."],
      eng: ["It eats an apple", "she eats an apple", "etc"]
      }

      Is there a script around to just build a simple table list with example pairs? For my purpose at JGram I just need one example (maybe the first is the most representative?)
      • No, the links are bidirectional. If X is a translation of Y, then Y is a translation of X.

        I don't know of any ready-made scripts for creating sentence pairs, but I think they shouldn't be difficult to write...


        Unfortunately, arrays won't work well because information is still lost:
        {
        _id: 1234,
        tat: ["Ul alma aşıy."],
        yue: ["佢食緊個蘋果。"],
        cmn: ["她在吃個蘋果。", "他在吃個蘋果。"]
        eng: ["It eats an apple.", "She eats an apple.", "He eats an apple."]
        }

        In such a format, there is no way to know that Mandarin "她在吃個蘋果。" means "She eats an apple." (and not "He eats an apple.").
        • Ok, got it. So each LINK is bidirectional, but the sentences aren't formed into clear-cut groups where all meanings for all sentences pair up.

          something like this:
          https://www.dropbox.com/s/3ajeb...word-graph.png

          Well, that's certainly an interesting format, and it solves a lot of the conceptual issues I was having with my Japanese grammar database too, in that some phrases in English relate to Japanese, but only in certain contexts, i.e. not a 1:1 translation that's always valid.

          When I get some simple scripts built, I'll share them back.

          thanks!
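A minimal sketch of the pairing script discussed in this thread, for anyone who wants a starting point. It assumes the exports are tab-separated (sentences as id, language code, text; links as sentence id, translation id, one row per direction); check the downloads page for the actual layout. The ids and the helper names `load_sentences` and `build_pairs` are made up for illustration, and the sample data reuses the sentences quoted above.

```python
import csv
import io
import json
from collections import defaultdict

# Tiny in-memory stand-ins for the exported files. Assumed layout:
#   sentences: id <TAB> language code <TAB> text
#   links:     sentence id <TAB> translation id (one row per direction)
# The ids below are invented for this example.
SENTENCES_TSV = "\n".join([
    "1\ttat\tUl alma aşıy.",
    "2\teng\tHe eats an apple.",
    "3\teng\tShe eats an apple.",
    "4\tcmn\t他在吃個蘋果。",
    "5\tcmn\t她在吃個蘋果。",
]) + "\n"
LINKS_TSV = "1\t2\n2\t1\n1\t3\n3\t1\n2\t4\n4\t2\n3\t5\n5\t3\n"


def load_sentences(handle):
    """Return {id: (lang, text)} from a tab-separated sentences file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): (row[1], row[2]) for row in reader}


def build_pairs(sentences, links_handle, source_lang, target_lang):
    """Group direct translations per source sentence, following links only."""
    targets = defaultdict(list)
    reader = csv.reader(links_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        a, b = int(row[0]), int(row[1])
        if a in sentences and b in sentences:
            if sentences[a][0] == source_lang and sentences[b][0] == target_lang:
                targets[a].append(sentences[b][1])
    return [
        {"id": a, source_lang: sentences[a][1], target_lang: texts}
        for a, texts in targets.items()
    ]


sentences = load_sentences(io.StringIO(SENTENCES_TSV))
pairs = build_pairs(sentences, io.StringIO(LINKS_TSV), "cmn", "eng")
print(json.dumps(pairs, ensure_ascii=False, indent=2))
```

Because each record is built from one source sentence and its direct links, each Mandarin sentence here keeps only its own English translation, which is the information the grouped-array format loses. For real files, replace the `io.StringIO` objects with `open(path, encoding="utf-8", newline="")`.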
  6. The links URL is dead. It gives a 404 error.

    http://blog.tatoeba.org/2014/08...t-16-2014.html

    wget http://downloads.tatoeba.org/exports/links.tar.bz2
    --2014-08-17 19:11:40-- http://downloads.tatoeba.org/exports/links.tar.bz2
    Resolving downloads.tatoeba.org (downloads.tatoeba.org)... 188.213.24.161
    Connecting to downloads.tatoeba.org (downloads.tatoeba.org)|188.213.24.161|:80... connected.
    HTTP request sent, awaiting response... 404 Not Found
    2014-08-17 19:11:41 ERROR 404: Not Found.
  7. Are there any good visualisations of Tatoeba content? Since it's an interesting graph of relations, I was thinking of something like http://bl.ocks.org/d3noob/5141278. D3 force-directed graphs like these could make it more attractive to explore the content and to see what has and has not been translated yet.
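As a rough starting point for such a visualisation, the sketch below turns sentence ids and translation links into a nodes/links JSON document, a shape commonly fed to D3 force layouts. The field names ("id", "lang", "translations", "source", "target"), the function name `to_force_graph`, and the sample ids are all assumptions; the D3 code would need to be written to match.

```python
import json


def to_force_graph(sentences, links):
    """Build a {"nodes": [...], "links": [...]} dict from sentences and links.

    sentences: {sentence_id: language_code}
    links: iterable of (sentence_id, translation_id); both directions may appear.
    """
    seen = set()
    edges = []
    for a, b in links:
        key = (min(a, b), max(a, b))  # treat A->B and B->A as one edge
        if a == b or key in seen:
            continue
        if a not in sentences or b not in sentences:
            continue  # skip links to sentences we did not load
        seen.add(key)
        edges.append({"source": key[0], "target": key[1]})

    # Count translations per sentence so untranslated ones stand out.
    degree = {sentence_id: 0 for sentence_id in sentences}
    for edge in edges:
        degree[edge["source"]] += 1
        degree[edge["target"]] += 1

    nodes = [
        {"id": sentence_id, "lang": lang, "translations": degree[sentence_id]}
        for sentence_id, lang in sentences.items()
    ]
    return {"nodes": nodes, "links": edges}


# Invented sample data: sentence 4 has no translations yet.
sample_sentences = {1: "tat", 2: "eng", 3: "eng", 4: "jpn"}
sample_links = [(1, 2), (2, 1), (1, 3), (3, 1)]
graph = to_force_graph(sample_sentences, sample_links)
print(json.dumps(graph, indent=2))
```

Nodes whose "translations" count is 0 are the sentences that still need translating, so they could be drawn in a different colour.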
    • AlanF_US, 5 day(s) ago - edited 5 day(s) ago
    It would be helpful if people could translate the sentence "The server is down" (see http://tatoeba.org/eng/sentences/show/3402624 -- note that I updated that link) into as many languages as possible so that we can use them in case we have another emergency. We have similar sentences, such as "The server is down again" ( http://tatoeba.org/eng/sentences/show/3261051 ), "The server was down" ( http://tatoeba.org/eng/sentences/show/54205 ), and "I can't check my mail. The server is down." ( http://tatoeba.org/eng/sentences/show/54206 ), which you can use as reference. I will eventually transfer this sentence to Launchpad as well.
    • Do you expect the server to go down again?
    • Do you need translations of the phrase "The server will never be down again any more"? :-D
      • Not at all! What we need are realistic sentences. ;-)
  8. There was a discussion about UIs in [[#3410618]], so I thought I could submit a wish.

    The current Tatoeba UI supports only one form of each language name, which is a problem when translating into languages where words have several forms. The word "French" in "French example sentence" and the word "French" in the language list may need different forms in other languages.

    Effectively this means that in the current translations of "French example sentence", Russian has a completely unnatural "Пример предложения, язык французский" (Example sentence, language is French), while Polish has an ungrammatical "Zdanie przykładowe francuski" with "francuski" having an incorrect gender.

    Surely we don’t want to have unnatural and ungrammatical sentences in a UI. How can we expect users to submit natural sentences if the site itself uses unnatural ones?
    • Good point.
      Each UI language element has to be separately identified in Launchpad, the translation tool that is being used to create the different Tatoeba locales.

      Some of them are actually used as variables in other UI language elements in that tool, in order to avoid multiple (and possibly conflicting) translations of the same element.
      But that was done without consideration for the declension problem you mention, because the people who did it had no idea that these words could be declined in languages other than the ones they knew.
      So although using variables in UI elements looks smart at first sight, it should be avoided, and each UI language element should be translatable separately.
      • I think ideally there should be a way to treat each variable as an array, and the number of elements in that array should be language-dependent.

        E.g., so that I can translate "French" as {[NOM] = "французский", [PREP] = "французском"}. And then so that it could be used in text as "%1 example sentence" => "Пример предложения на %1[PREP]".
        • That would indeed be an elegant solution. But I don't think Launchpad enables this. It's a one to one translation, alas...
    • Ideally, Tatoeba should be developed on a framework that handles internationalisation, with the management of locales, and not rely on an external tool for localisation.
    • Tatoeba uses gettext to localize text. The problem you describe can probably be solved by using so-called contexts [1]. These make it possible to have different translations of the same original string in different contexts. For instance, "French" in the context of a language list may have a different translation than "French" in the context of "French example sentences". Impersonator, do you think this would solve the problem?

      [1] https://www.gnu.org/software/ge.../Contexts.html
      • > Impersonator, do you think this would solve the problem?

        I don't believe so. Contexts would have to be language-dependent (e.g. languages X and Y may both have 2 cases, but language X uses case A in strings 1, 2, 3 and case B in strings 4, 5, 6, while language Y uses case A in strings 1, 3, 5 and case B in 2, 4, 6), and gettext doesn't allow this.

        Of course, we can create all possible contexts (i.e. 1, 2, 3, 4, 5, 6), but this would mean making the translation extremely difficult.
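The "variable as an array of forms" idea from earlier in this thread can be sketched as below. This is only an illustration of the proposal, not something gettext or Launchpad provides: the tables, the %1[CASE] placeholder handling, and the function name `render` are assumptions. The Russian forms are the ones quoted in the thread.

```python
import re

# Hypothetical per-locale tables: each language name maps case labels to forms.
# A locale lists only the forms its own grammar needs.
LANGUAGE_NAMES = {
    "ru": {"French": {"NOM": "французский", "PREP": "французском"}},
    "en": {"French": {"NOM": "French"}},
}

# Each locale's translation of "%1 example sentence" picks the case it needs.
TEMPLATES = {
    "ru": "Пример предложения на %1[PREP]",
    "en": "%1[NOM] example sentence",
}

PLACEHOLDER = re.compile(r"%1\[(\w+)\]")


def render(locale: str, language: str) -> str:
    """Fill the %1[CASE] placeholder with the matching form of the language name."""
    forms = LANGUAGE_NAMES[locale][language]
    return PLACEHOLDER.sub(lambda match: forms[match.group(1)], TEMPLATES[locale])


print(render("ru", "French"))
print(render("en", "French"))
```

The point of the design is that the choice of case lives in each locale's template, so translators of language X and language Y can use different cases for the same string without the source strings having to enumerate every possible context.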
  9. It seems like Tatoeba is having server problems again. What happened to the new server? Are we back on the old one?
    • It seems to be working fine right now. I went ahead and switched nginx's config from Unix sockets, which don't scale that well, to TCP sockets.