Wall (5,985 threads)
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
There are so many doubled phrases! Would there be a way to fix it?
There are a lot of near-duplicates, especially in the English / Japanese sentences. However, duplicates are only removed when they are exact duplicates. This is a policy decision based on the usefulness for language analysis tools, among other reasons.
In the long term it may be possible to tag near duplicates and optionally filter the data to not show them.
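To make the exact / near-duplicate distinction concrete, here is a minimal sketch of how near-duplicates could be grouped for tagging. The normalisation rules (case folding, dropping punctuation, collapsing whitespace) and the function names are my own assumptions for illustration; this is not the script Tatoeba actually runs.

```python
import unicodedata
from collections import defaultdict


def normalize(sentence: str) -> str:
    """Build a comparison key (assumed rules): NFKC, casefold, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", sentence).casefold()
    # Unicode general categories starting with "P" are punctuation.
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    return " ".join("".join(kept).split())


def near_duplicate_groups(sentences):
    """Return groups of sentences that share a key but are not all exact duplicates."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[normalize(sentence)].append(sentence)
    # A group containing only exact copies is left to the exact-duplicate merge.
    return [group for group in groups.values() if len(set(group)) > 1]


print(near_duplicate_groups(["I am here.", "I am here", "i am here!", "Something else."]))
```

Groups found this way could be tagged and optionally filtered, while the sentences themselves stay in the data.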
In a specific language? Normally we have a script that we run from time to time to merge duplicates; for resource reasons, we can't do a real-time check. But I don't think there are more than a hundred duplicates among the 500 000 sentences.
Someone might want to look into the mess that doanhuong made.
How do I add a new language on the menu?
If you mean requesting a language to be added to Tatoeba, then read about it here: http://tatoeba.org/eng/faq under "I'd like to request a new language."
Hope I helped...
Sometimes the system thinks a Chinese sentence is written in simplified characters, when it's actually in traditional. This causes the pinyin to get messed up. (See http://tatoeba.org/eng/sentences/show/400579 for example) Is there a way to manually tell it what kind of characters you're using?
[not needed anymore- removed by CK]
I haven't seen any such policy nor do I think we need one. This isn't standard in most languages and I think it'd be fine to leave it to the contributor.
For those interested:
I disagree. Of course there is a numbers & currencies display standard for each country, otherwise, writing checks would not be possible as everyone could interpret amounts as they wish...
Operating systems and Software applications such as Microsoft Excel do enforce these formats.
Anyway, I think there is another issue with amounts sometimes written in letters and sometimes in numbers. There should be a Tatoeba convention on this matter.
As Tatoeba is about language and sentences, I am in favour of writing in letters, so learners can actually learn how these numbers are written in the language.
> Operating systems and Software applications
> such as Microsoft Excel do enforce these formats.
They don’t. On the Format page you can choose any format you like. On the “Language and Regional settings” page in the Control Panel you can choose how these numbers should be displayed system-wide. E.g., in Russian Windows numbers are formatted with non-breaking spaces by default (1 000 = 1,000; 1,000 = 1.000).
All modern OSes have these settings, and all well-written applications do use them.
Of course you are able to choose. Microsoft cannot guess what currencies and countries you are dealing with. But there definitely are country defaults.
For instance, the French default for $1,000.00 is 1.000,00 $
And these standards reflect the national standards enforced by national financial authorities and applied by banks, financial institutions, and, eventually, individuals who write checks and don't want to be charged 100 times the amount they meant to write...
In Russian there seems to be no single standard. Usually we write 10$, but in financial literature one can find $10. Usually we write 1 000,0, but 1'000,0 can be used too.
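For anyone who wants to compare these conventions side by side, here is a small sketch that formats one amount with explicit separators. The separators are taken from the examples quoted in this thread, not from an authoritative locale database, and `format_amount` is a made-up helper name.

```python
def format_amount(value: float, group_sep: str, decimal_sep: str, decimals: int = 2) -> str:
    """Format a number with the given grouping and decimal separators."""
    # Python's format spec always uses "," for grouping and "." for decimals.
    text = f"{value:,.{decimals}f}"
    # Swap through a placeholder so the two replacements cannot collide.
    return text.replace(",", "\x00").replace(".", decimal_sep).replace("\x00", group_sep)


# Conventions as quoted by posters above (assumptions, not official standards):
print(format_amount(1000, ",", "."))          # 1,000.00
print(format_amount(1000, ".", ","))          # 1.000,00
print(format_amount(1000, "\u00a0", ",", 1))  # 1 000,0 with a non-breaking space
```

A real application would read these separators from the operating system's regional settings rather than hard-coding them.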
There must be a banking standard. Russia is a well enough administered state...
Yes, of course there is one. But sentences on Tatoeba should reflect how things are written in different situations, not only in official documents.
I'll refer to Demetrius' comments on the standardisation issue as they reflect my views as well.
As for writing numbers out in words, I've favoured this as well -- especially for inflected languages like Icelandic (where the number "2" can be pronounced: "tveir, tvo, tveim, tveggja"). As I've started adding more information in the comments, I've wondered whether it might not be wiser to add the reading there as well seeing how large numbers just look odd in words. There is some value in writing the sentences out how they're normally written.
This sort of information would fit well into an annotation field for sentences once(?) that feature is added. Until then, I think I'm moving to the "standard" of writing numbers the way I'd write them in context, adding readings in comments and tagging the sentence with "cmt on reading":
** Massive tag-rename: Request for comments **
I've gone through the quotes tags and transliterated (and translated where appropriate) the non-English ones. The results are to be found at
It's a pretty raw (and long) file but I'd appreciate it if people could run through it, check particularly tags they may have created and comment if they have any particular suggestions.
There are several merger suggestions towards the bottom. Most are just the same name or work. I'll keep these in my list for when we get the tag translation feature.
There was one work that I couldn't find what was in English: Белый ослик. Perhaps Demetrius or Dorenda can lend me a hand?
One thing I wasn't so sure what to do about is Latin variants. Take for example the Icelandic authors Halldór Laxness and Þórbergur Þráinsson. Since not everyone can enter accents on their machines, and very few would manage to look Þórbergur up even with autocompletion, I reckon we might want to go all the way and restrict ourselves to ASCII characters for names. The examples above would then become "Halldor Laxness" and "Thorbergur Thrainsson". While the latter isn't terribly attractive, it would make life easier for users. Seeing how it could upset even more tag-creators than I already may have, I left these be in the rename file for now.
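A rough sketch of what such an ASCII conversion could look like, for Latin-script names only. The mapping table and the name `ascii_tag_name` are assumptions made for illustration; other scripts (Cyrillic, for example) would need proper transliteration rules instead.

```python
import unicodedata

# Letters that have no accent to strip and need an explicit replacement (assumed table).
SPECIAL = {"Þ": "Th", "þ": "th", "Ð": "D", "ð": "d", "Æ": "Ae", "æ": "ae"}


def ascii_tag_name(name: str) -> str:
    """Replace special letters, then strip combining accents from a Latin-script name."""
    replaced = "".join(SPECIAL.get(ch, ch) for ch in name)
    # NFKD splits "ó" into "o" plus a combining acute accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_tag_name("Halldór Laxness"))      # Halldor Laxness
print(ascii_tag_name("Þórbergur Þráinsson"))  # Thorbergur Thrainsson
```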
Finally, the quotes are where the prefixes seem to have gained most ground. I think we should make some guidelines on these and suggest we use "by " and "for ". Tags would start with a lower case and be separated from the person or work by a space.
Once we iron out the issues, I'll format this and send to Trang to run on the database.
** Last call for comments **
Renames are available here:
Once that is through, I'll add the translations and transliterations to my categorised tags list at
and move on to merging the duplicate tags.
I still believe we shouldn’t do this, at least until Tatoeba has a system of internationalisation for tags.
This would render the system much less useful for non-English speakers.
The current approach (tagging in the language of the sentence) allows anyone reusing data to show the name of the author near the quote, in brackets.
The approach you suggest forces authors to use conversion tables... And these lists have to be compiled for every language on their own.
IMHO that renders By- and From- tags much less useful.
Hmmm ... and I'm for doing this now specifically because we don't have a system of internationalisation yet.
As there is no way to distinguish a quotation tag from other tags (standardised prefixes help, but are by no means a reliable indicator), reusing data isn't straightforward (another issue is the accuracy of these attributions).
The current approach would lead to a great proliferation, far beyond the nearly a thousand we're up to. Once we finally decide to organise these on a database level, it will be a vastly more complicated matter. This is the reason why I started sorting the tags.
The approach I suggest is to use a single tag for a single concept. Once the tag internationalisation feature is introduced we can see about adding a field whereby one can submit a translation for a tag if it doesn't have one in the language a sentence is being translated into.
To aid users in the meantime, I'm willing to maintain a database of translations. I'll see if I can set up a mock version this week.
For anyone searching for a particular sentence, the search feature is going to do the job far better than tags will. The tags are really only useful for someone looking for a particular tag or a general sentence.
In the former case, I will grant you that multiple language versions would be more useful. This is why I'm not for tossing out the tags, and think these should be stored and made accessible. They'll furthermore be hugely useful once we start translating/transliterating the tags.
In the latter case, where someone is looking for a general sentence, tags can be genuinely useful. Both a structured list of tags (as the one I've started) and a single node for all translations will, I believe aid in this task.
My suggestion for the final system is to have tags displayed in the language set on the interface, the ability to over-ride it on pages displaying lists of tags, and defaulting to some value (English, in this transition period). Until we have such a system in place, I suggest we aim for simplifying the system in order to better gauge how we want to use it.
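The fallback order proposed here (explicit override, then interface language, then a default) could be sketched roughly as below. The three-letter language codes, the `display_tag` name and the shape of the translations table are assumptions for illustration only.

```python
def display_tag(translations: dict, interface_lang: str, override_lang=None, default_lang: str = "eng") -> str:
    """Pick the tag name to show: override first, then interface language, then the default."""
    for lang in (override_lang, interface_lang, default_lang):
        if lang is not None and lang in translations:
            return translations[lang]
    raise KeyError("no translation available for this tag")


# Hypothetical translations table for a single tag.
anna = {"eng": "Anna Karenina", "rus": "Анна Каренина"}

print(display_tag(anna, "rus"))                       # Russian interface
print(display_tag(anna, "isl"))                       # no Icelandic entry, falls back to English
print(display_tag(anna, "rus", override_lang="eng"))  # override on a tag list page
```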
There are so many possible ways to use the tags. Standardisations such as having multiple wide-scope tags rather than single narrow ones ("second person", "plural" and "formal" rather than "second person plural/formal"), sentence cased names and particular prefixes are obviously useful. Whether other guidelines will be useful is something we can best gauge from how they're used. Doing so is exceedingly difficult in a sea of several thousand tags.
I started sorting the tags out just for myself as I was interested in seeing what sort of tags were out there. Both to get ideas from how others use them and to see what tags were out there so that I wouldn't be adding near-identical variants (e.g. "non-sentence" and "not a sentence", or "reproach", "reprimand", "scold", etc.). I don't see this happening once the problem grows much larger.
I like the idea of letting systems go and form themselves without people predicting how people will use them. At the same time, I think some oversight is useful. It, like everything else, is a balance of ideals. I think we'll do well to restrict the complexity of the system a bit so that we'll have a better idea of how we want to organise it ... or whether we want to do so at all!
Sorry for not replying earlier.
Actually, as far as I've understood the original idea, users were to add any tags they want, but the system will be organised by adding 2 variants of wording as equivalent.
What you offer is different from this idea.
But if there will be a database of translations... why not?
But this way, there will be only a defined set of tags, and adding new tags would be problematic. But well... if the tags will be established, this shouldn't be a problem (except for By- and From-, but I believe they should go into some other kind of sentence metadata one day, not into tags).
This way, we are preventing users from coining new tags. But, well... maybe they don't really need this. Actually, now I don't know.
But the translation DB will surely help.
> That's a long reply. Sorry...
I do learn a lot by reading well-written English like yours. :)
> but the system will be organised by adding 2 variants of wording as equivalent.
I don't quite understand what you mean by this. Could you explain it a bit better?
> But this way, there will be only a defined set of tags, and adding new tags would be problematic.
Not quite. I'm not suggesting that we stop adding tags. Just that we'll restrict ourselves to English ones on Tatoeba until the translation feature gets sorted out.
Translating tags rather than just having tags in each language allows users to understand tags on sentences in languages they're not proficient in.
Merging these later would be a bit of work but in this transition period we should also be thinking about how we want to use tags and how we'll present them for browsing.
However we'll do this, I think we'll be well served by keeping tags in one language despite the fact that they'll be less useful to some and we'll lose input on the topic from some users.
My static page and proposed database of translated terms would just be a crutch until the tag translation feature is up and can be populated with the translations that we've gathered.
Thank you for your explanation.
> but the system will be organised by
> adding 2 variants of wording as equivalent.
Sysko said there will be a way to tell that some tag is an alias of another one.
This system can also be used for internationalised tags.
Ah, I see. Yes, there is the internal name that is the one sent to the server (as opposed to the name on the tag or its ID).
As long as the database looks for sentences with that shared internal name, rather than the actual name or the tag ID, you'll get sentences tagged with all language versions.
This way you'd get sentences tagged with both "Анна Каренина" and "Anna Karenina". Still, this is only half a solution since tags on the Russian sentence would presumably be in Russian but not all users will understand all the tags.
Knowing which sentence is the most colloquial would be useful even to those who don't know the word "colloquial".
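The shared-internal-name lookup described above could be sketched like this. The class, the internal name "anna_karenina" and the sentence IDs are all hypothetical; nothing here reflects how the Tatoeba database is actually organised.

```python
from collections import defaultdict


class TagIndex:
    """Toy tag index in which several display names share one internal name."""

    def __init__(self):
        self._internal = {}                 # display name -> internal name
        self._sentences = defaultdict(set)  # internal name -> sentence IDs

    def add_alias(self, display_name: str, internal_name: str) -> None:
        # In this sketch aliases must be registered before sentences are tagged.
        self._internal[display_name] = internal_name

    def tag(self, sentence_id: int, display_name: str) -> None:
        internal = self._internal.get(display_name, display_name)
        self._sentences[internal].add(sentence_id)

    def find(self, display_name: str) -> set:
        internal = self._internal.get(display_name, display_name)
        return set(self._sentences.get(internal, set()))


index = TagIndex()
index.add_alias("Анна Каренина", "anna_karenina")
index.add_alias("Anna Karenina", "anna_karenina")
index.tag(1, "Анна Каренина")  # hypothetical Russian sentence
index.tag(2, "Anna Karenina")  # hypothetical English sentence
print(index.find("Anna Karenina"))  # both sentences, whichever language version was used
```

This covers the search side only; which name to display on each sentence is a separate question.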
> Still, this is only half a solution since
> tags on the Russian sentence would presumably
> be in Russian but not all users will
> understand all the tags.
They will be understandable to the majority of people to whom Russian sentences will be of use.
The system you propose will make it much harder for non-English speakers to add a tag. :o
> The system you propose will make it much
> harder for non-English spreakers to add a
> tag. :o
I think this is all going forwards FAR too fast. At present, all I would ask for is giving moderators the ability to globally merge, change and delete (empty) tags. (So I could change all 'delete' into '@delete', for example).
Tags are still officially in 'beta' and only available for trusted users, so there's no reason to rush on major changes.
I don't think anyone is suggesting any major changes. Monolingual tags were already discussed briefly and I thought generally agreed upon. This is essentially just a little cleanup.
> Monolingual tags were already discussed
> briefly and I thought generally agreed
I'm all for multi-lingual support in tags, but that's not the same as monolingual tags. Monolingual tags means everything has to be English before it can be displayed as anything else. You should be able to add a new tag in Icelandic without any hassle and have people translate it to other languages or merge into an existing multi-language tag group later.
I'm really proposing two systems. We agree that ultimately we'd like to have a nice, intuitive, multi-lingual system.
We should however use this transition period where the tags are still of limited use and only trusted users can add them, to discuss and design how best to use these tags and make the system as nice as possible. A proliferation of these in all different languages would make that task and the later task of matching translations unnecessarily difficult.
It's by no means a perfect solution. The problems can, however, be mitigated and I'm confident that it will lead to a better result.
Bloody hell! That's a long reply. Sorry...
Essentially; I think we can reduce the inconvenience caused to non-English speaking users by maintaining a separate translated/transliterated list of tags.
I’m not sure whether we should translate By- and From- tags into English. Doing this will immediately make them less useful for non-English applications re-using Tatoeba’s data [unless someone's willing to maintain a database of tag translations ;o].
On the other hand, they’re used relatively rarely (compared to other tags anyway), so the ability to view quotes of some person in all the languages shouldn't be very useful.
BTW, not all %NAME% tags should become By-%NAME%. E.g. Plato, Caesar (‘Platon is my friend, but...’ is said by Aristoteles, ‘Caesar non supra grammaticos’). I suggest these be removed from the auto-delete list and dealt with manually.
I believe By- and For- should be converted into DB fields describing a sentence and should be kept in the language of the sentence, since they're not really repeatable anyway.
Белый Ослик is ‘[A] little white donkey’, but I’m not sure whether it’s how this book(?) should be translated.
I think that translating the tags and managing these is a greater headache than the inconvenience of non-English speakers not being able to browse them. If anyone's interested, I could see about providing a translated version of my list.
Good catch on our dear Greek philosophers! Yes, these need to be looked at closer! Some tags should certainly exist as both "by Name" and "Name", as illustrated by this sentence that belongs in "Salvador Dalí" as well as "by Salvador Dalí":
I agree that it would be better if the tag table had a field for the tag type, but until we get that, we'll have to make do with a manual organisation. Still, I think it's fine to start out using tags without that field because it'll let us think about tags more from the ground-up; categorising them based on how we tag sentences, rather than tag sentences based on how we've categorised them.
My little page is my attempt to analyse that and hopefully get people to think more about what tags there are, what tags there should be and how we'd like this presented in the future. Hopefully these thoughts will be useful when the tags will be rethought.
I'm not quite sure what you mean by "not really repeatable", but if you're referring to the fact that not all tags should apply to all sentences it's linked to, I agree. Sometimes a tag will necessarily apply to all sentences linked together (as in the case with subjects and quotes). In other cases (parts of speech, grammar or possibly style) they won't match.
Still, I think that ultimately the language on tags should be in the language of the interface, rather than the sentence.
> Белый Ослик is ‘[A] little white donkey’
:-D I like that.
Hmmm ... Any idea whether the book may have been translated? If not, and we can find no reference of it being discussed, then we'll have to fall back on the original.
If it's never discussed in English, it stands to reason that there isn't an English word for it.
> Белый Ослик is ‘[A] little white donkey’, but I’m not
> sure whether it’s how this book(?) should be translated.
> Hmmm ... Any idea whether the book may have been
> translated? If not, and we can find no reference of
> it being discussed, then we'll have to fall back on
> the original.
It's a story that was published in one book together with other stories. (What's a сборник in English? :))
I can't find any hints that it may have been translated into another language.
> I’m not sure whether we should translate By- and
> From- tags into English.
Me too. Well, I'm fine with translating them, as long as those translations will not replace the By- and From- tags with the authors and titles in the original language of the work. Otherwise it might become hard to find what was the original title.
сборник = collection (as in, a short story collection)
> I'm not quite sure what you mean by "not really repeatable"
I meant that By- tags and (especially) From- tags are not really meant to be searched in the database.
They are for people to know who wrote the sentence, not to filter the sentences by this tag.
I doubt there will be another quote from Belyj oslik.
Yeah, I don't think either the "subject" or "quote" tags are terribly useful for what I consider the purpose of Tatoeba. I just don't see anyone coming here to find a sentence about an apple or from a work. I find it much more likely that people come to learn about the use of words in natural sentences.
Still, as long as people are interested in and willing to tag these, I'm happy to oblige. After all, as Paul has shown, there are lots of wild ways that one can use this thing.
However the information is ultimately stored in the database, it will definitely be influenced by how people start using the tags. Whether as special quote tags, or some other field in the database, they'll still be searchable -- even though I agree with you that this neither is nor should be the prime focus of Tatoeba.
I made it down to Confucius. Made a new category for "by Julius Caesar" and "by Plato" and removed the "Caesar" and "Plato" tags.
One question: Do people reckon one should tag stuff like "veni, vidi, vici" as being /about/ Cæsar as well as by him? I guess it did refer to him originally...
Oh, I'll think about that later. Have to catch a bus...
I just finished going through the tags. I created new "by Name" tags for the few that had sentences by and about the person and updated the rename list.
What does the tag @translate en-fr stand for? Does that replace the former list "English and French don't match ?"
Sorry I didn't see it before. When I cleaned the list "English and French don't match", a lot of en-fr pairs were unlinked. When I wasn't able to recreate the pairs, I tagged the sentence which was missing a good translation in the other language. But there were only 3 or 4, I think.
So what is the procedure now when English and French don't match through Japanese?
The list is empty, so you can create a new tag with the name you want, or go on with the list. I would prefer a tag, however.
Here I proposed @translations http://tatoeba.org/eng/wall/sho...0#message_2410
but nobody replied to me...
I would like to know what Tatoeba's policy is regarding the addition of long sentences from literary works. I wonder whether it makes any sense while Gutenberg and Wikisource are doing their job in this field.
Just let me know.
Examples cannot currently be over some number (500?) of characters. I would advise not adding examples with over 255 characters (for tradition's sake, and those with old software ;-).
There is an item on the 'To do list' to support long texts, so it's best to hold off till then if you want your latest book to be made available here.
Other than that you should feel free to add what you want to (as long as you can do so under the license terms of this site). The big difference between here and Gutenberg / Wikisource is the emphasis on translating the text. If you dump lots of text in one language only with no translations it might not be very welcome.
Scott states that old French can be keyed as normal French on Tatoeba. I think this is confusing without proper indications to the public. What is the current stance with Old English, Old German or...Old Chinese or Japanese ?
My view is that it is a different language altogether and should thus be defined as another language.
"Scott states that old French can be keyed as normal French on Tatoeba. I think this is confusing without proper indications to the public. " Not exactly. I said that Old French should be tagged. I think that it's important to identify what is modern and what is archaic, especially for language learners. I also don't object to using a different language for what is truly Old French (stuff from before the Middle Ages). Now, I don't want to get into the debate of defining what is Middle French, Old French and Modern French. It's certainly going to be difficult to draw clear lines between them.
And you're lucky with French because it's a relatively recent language, comparatively...
I will check with Trang, and we will answer you, but personally (not an official statement), I would say they should be considered different languages. As for delimiting what is old and what is "modern", we can rely on preexisting work on this (as it's sometimes taught at university, they should already have defined this).
Please delete this and all translations of this sentence.