Wall (5,912 threads)
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
Sometimes the system thinks a Chinese sentence is written in simplified characters, when it's actually in traditional. This causes the pinyin to get messed up. (See http://tatoeba.org/eng/sentences/show/400579 for example) Is there a way to manually tell it what kind of characters you're using?
[not needed anymore- removed by CK]
I haven't seen any such policy nor do I think we need one. This isn't standard in most languages and I think it'd be fine to leave it to the contributor.
For those interested:
I disagree. Of course there is a numbers & currencies display standard for each country, otherwise, writing checks would not be possible as everyone could interpret amounts as they wish...
Operating systems and Software applications such as Microsoft Excel do enforce these formats.
Anyway, I think there is another issue with amounts sometimes written in letters and sometimes in numbers. There should be a Tatoeba convention on this matter.
As Tatoeba is about language and sentences, I am in favour of writing in letters, so learners can actually learn how these numbers are written in the language.
> Operating systems and Software applications
> such as Microsoft Excel do enforce these formats.
They don’t. On the Format page you can choose any format you like. On the “Language and Regional settings” page in the Control Panel you can choose how these numbers should be displayed system-wide. E.g., in Russian Windows numbers are formatted with non-breaking spaces by default (1 000 = 1,000; 1,000 = 1.000).
All modern OS’es have these settings, and all well-written applications do use them.
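To make the conventions mentioned above concrete, here is a minimal Python sketch of formatting one number under different separator choices. The `format_number` helper and its parameters are hypothetical and written only for illustration; it does not reflect how any operating system actually stores or applies its regional settings.

```python
def format_number(value, group_sep=",", decimal_sep=".", decimals=0):
    """Format a number with caller-chosen grouping and decimal separators."""
    # Start from Python's built-in English-style formatting, e.g. "1,000.00".
    text = f"{value:,.{decimals}f}"
    # Swap the separators via a placeholder so they do not clobber each other.
    text = text.replace(",", "\x00").replace(".", decimal_sep)
    return text.replace("\x00", group_sep)


print(format_number(1000))                        # comma as group separator
print(format_number(1000, group_sep="\u00a0"))    # non-breaking space as group separator
print(format_number(1000, ".", ",", decimals=2))  # period groups, comma decimals
```

The point of the sketch is only that the separators are a presentation choice layered on top of the same value, which is why they can be made configurable.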
Of course you are able to choose. Microsoft cannot guess what currencies and countries you are dealing with. But there definitely are country defaults.
For instance, the French default for $1,000.00 is 1.000,00 $
And these standards reflect the national standards enforced by national financial authorities and applied by banks, financial institutions, and, eventually, individuals who write checks and don't want to be charged 100 times the amount they meant to write...
In Russian there seems to be no single standard. Usually we write 10$, but in financial literature one can find $10. Usually we write 1 000,0, but 1'000,0 can be used too.
There must be a banking standard. Russia is a well enough administered state...
Yes, of course there is one. But sentences on Tatoeba should reflect how things are written in different situations, not only in official documents.
I'll refer to Demetrius' comments on the standardisation issue as they reflect my views as well.
As for writing numbers out in words, I've favoured this as well -- especially for inflected languages like Icelandic (where the number "2" can be pronounced: "tveir, tvo, tveim, tveggja"). As I've started adding more information in the comments, I've wondered whether it might not be wiser to add the reading there as well seeing how large numbers just look odd in words. There is some value in writing the sentences out how they're normally written.
This sort of information would fit well into an annotation field for sentences once(?) that feature is added. Until then, I think I'm moving to the "standard" of writing numbers the way I'd write them in context, adding readings in comments and tagging the sentence with "cmt on reading":
** Massive tag-rename: Request for comments **
I've gone through the quotes tags and transliterated (and translated where appropriate) the non-English ones. The results are to be found at
It's a pretty raw (and long) file but I'd appreciate it if people could run through it, check particularly tags they may have created and comment if they have any particular suggestions.
There are several merger suggestions towards the bottom. Most are just the same name or work. I'll keep these in my list for when we get the tag translation feature.
There was one work whose English title I couldn't find: Белый ослик. Perhaps Demetrius or Dorenda can lend me a hand?
One thing I wasn't so sure what to do about is Latin variants. Take for example the Icelandic authors Halldór Laxness and Þórbergur Þráinsson. Since not everyone can enter accents on their machines and very few would manage looking Þórbergur up even with the autocompletion, I reckon we might want to go all the way and restrict ourselves to ASCII characters for names. The examples above would then be "Halldor Laxness" and "Thorbergur Thrainsson". While the latter isn't terribly attractive, it would make life easier for users. Seeing how it could upset even more tag-creators than I already may have, I left these be in the rename file for now.
Finally, the quotes are where the prefixes seem to have gained most ground. I think we should make some guidelines on these and suggest we use "by " and "for ". Tags would start with a lower case and be separated from the person or work by a space.
Once we iron out the issues, I'll format this and send to Trang to run on the database.
** Last call for comments **
Renames are available here:
Once that is through, I'll add the translations and transliterations to my categorised tags list at
and move on to merging the duplicate tags.
I still believe we shouldn’t do this, at least until Tatoeba has a system of internationalisation for tags.
This would render the system much less useful for non-English speakers.
The current approach (tagging in the language of the sentence) allows anyone reusing data to show the name of the author near the quote, in brackets.
The approach you suggest forces authors to use conversion tables... And these lists have to be compiled for every language on their own.
IMHO that renders By- and From- tags much less useful.
Hmmm ... and I'm for doing this now specifically because we don't have a system of internationalisation yet.
As there is no way to distinguish a quotation tag from other tags (standardised prefixes help, but are by no means a reliable indicator), reusing data isn't straightforward (another issue is the accuracy of these attributions).
The current approach would lead to a great proliferation of tags, far beyond the nearly one thousand we're up to. Once we finally decide to organise these on a database level, it will be a vastly more complicated matter. This is the reason why I started sorting the tags.
The approach I suggest is to use a single tag for a single concept. Once the tag internationalisation feature is introduced we can see about adding a field whereby one can submit a translation for a tag if it doesn't have one in the language a sentence is being translated to.
To aid users in the meantime, I'm willing to maintain a database of translations. I'll see if I can set up a mock version this week.
For anyone searching for a particular sentence, the search feature is going to do the job far better than tags will. The tags are really only useful for someone looking for a particular tag or a general sentence.
In the former case, I will grant you that multiple language versions would be more useful. This is why I'm not for tossing out the tags, and think these should be stored and made accessible. They'll furthermore be hugely useful once we start translating/transliterating the tags.
In the latter case, where someone is looking for a general sentence, tags can be genuinely useful. Both a structured list of tags (such as the one I've started) and a single node for all translations will, I believe, aid in this task.
My suggestion for the final system is to have tags displayed in the language set on the interface, the ability to over-ride it on pages displaying lists of tags, and defaulting to some value (English, in this transition period). Until we have such a system in place, I suggest we aim for simplifying the system in order to better gauge how we want to use it.
There are so many possible ways to use the tags. Standardisations such as having multiple wide-scope tags rather than single narrow ones ("second person", "plural" and "formal" rather than "second person plural/formal"), sentence cased names and particular prefixes are obviously useful. Whether other guidelines will be useful is something we can best gauge from how they're used. Doing so is exceedingly difficult in a sea of several thousand tags.
I started sorting the tags out just for myself as I was interested in seeing what sort of tags were out there. Both to get ideas from how others use them and to see what tags were out there so that I wouldn't be adding near-identical variants (e.g. "non-sentence" and "not a sentence", or "reproach", "reprimand", "scold", etc.). I don't see this happening once the problem grows much larger.
I like the idea of letting systems go and form themselves without people predicting how people will use them. At the same time, I think some oversight is useful. It, like everything else, is a balance of ideals. I think we'll do well to restrict the complexity of the system a bit so that we'll have a better idea of how we want to organise it ... or whether we want to do so at all!
Sorry for not replying earlier.
Actually, as far as I've understood the original idea, users were to add any tags they want, but the system will be organised by adding 2 variants of wording as equivalent.
What you offer is different from this idea.
But if there will be a database of translations... why not?
But this way, there will be only a defined set of tags, and adding new tags would be problematic. But well... if the tags will be established, this shouldn't be a problem (except for By- and From-, but I believe they should go into some other kind of sentence metadata one day, not into tags).
This way, we are preventing users from coining new tags. But, well... maybe they don't really need this. Actually, now I don't know.
But the translation DB will surely help.
> That's a long reply. Sorry...
I do learn a lot by reading well-written English like yours. :)
> but the system will be organised by adding 2 variants of wording as equivalent.
I don't quite understand what you mean by this. Could you explain it a bit better?
> But this way, there will be only a defined set of tags, and adding new tags would be problematic.
Not quite. I'm not suggesting that we stop adding tags. Just that we'll restrict ourselves to English ones on Tatoeba until the translation feature gets sorted out.
Translating tags rather than just having tags in each language allows users to understand tags on sentences in languages they're not proficient in.
Merging these later would be a bit of work but in this transition period we should also be thinking about how we want to use tags and how we'll present them for browsing.
However we'll do this, I think we'll be well served by keeping tags in one language despite the fact that they'll be less useful to some and we'll lose input on the topic from some users.
My static page and proposed database of translated terms would just be a crutch until the tag translation feature is up and can be populated with the translations that we've gathered.
Thank you for your explanation.
> but the system will be organised by
> adding 2 variants of wording as equivalent.
Sysko said there will be a way to tell that some tag is an alias of another one.
This system can also be used for internationalised tags.
Ah, I see. Yes, there is the internal name that is the one sent to the server (as opposed to the name on the tag or its ID).
As long as the database looks for sentences with that shared internal name, rather than the actual name or the tag ID, you'll get sentences tagged with all language versions.
This way you'd get sentences tagged with both "Анна Каренина" and "Anna Karenina". Still, this is only half a solution since tags on the Russian sentence would presumably be in Russian but not all users will understand all the tags.
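As a rough illustration of the alias idea described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `TAG_ALIASES` table, the internal name `anna_karenina`, the sample sentence IDs and the helper functions are hypothetical and say nothing about how Tatoeba's database is actually organised.

```python
# Hypothetical alias table: every displayed tag name maps to one internal name.
TAG_ALIASES = {
    "Анна Каренина": "anna_karenina",
    "Anna Karenina": "anna_karenina",
}

# Hypothetical sample data: sentence ID -> tag names as entered by users.
SENTENCE_TAGS = {
    101: ["Анна Каренина"],
    102: ["Anna Karenina"],
    103: ["proverb"],
}


def internal_name(tag):
    """Resolve a tag to its internal name; unknown tags stand for themselves."""
    return TAG_ALIASES.get(tag, tag)


def sentences_for(tag):
    """Return the IDs of sentences carrying the tag or any alias of it."""
    wanted = internal_name(tag)
    return sorted(
        sid
        for sid, tags in SENTENCE_TAGS.items()
        if any(internal_name(t) == wanted for t in tags)
    )


print(sentences_for("Anna Karenina"))
```

Searching by either language version returns the same sentences, but each sentence still displays the tag in the language it was entered in, which is the "half a solution" problem raised above.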
Knowing which sentence is the most colloquial would be useful even to those who don't know the word "colloquial".
> Still, this is only half a solution since
> tags on the Russian sentence would presumably
> be in Russian but not all users will
> understand all the tags.
They will be understandable to the majority of people to whom Russian sentences will be of use.
The system you propose will make it much harder for non-English speakers to add a tag. :o
> The system you propose will make it much
> harder for non-English speakers to add a
> tag. :o
I think this is all going forwards FAR too fast. At present, all I would ask for is giving moderators the ability to globally merge, change and delete (empty) tags. (So I could change all 'delete' into '@delete', for example).
Tags are still officially in 'beta' and only available for trusted users, so there's no reason to rush on major changes.
I don't think anyone is suggesting any major changes. Monolingual tags were already discussed briefly and I thought generally agreed upon. This is essentially just a little cleanup.
> Monolingual tags were already discussed
> briefly and I thought generally agreed
I'm all for multi-lingual support in tags, but that's not the same as monolingual tags. Monolingual tags means everything has to be English before it can be displayed as anything else. You should be able to add a new tag in Icelandic without any hassle and have people translate it to other languages or merge into an existing multi-language tag group later.
I'm really proposing two systems. We agree that ultimately we'd like to have a nice, intuitive, multi-lingual system.
We should however use this transition period where the tags are still of limited use and only trusted users can add them, to discuss and design how best to use these tags and make the system as nice as possible. A proliferation of these in all different languages would make that task and the later task of matching translations unnecessarily difficult.
It's by no means a perfect solution. The problems can, however, be mitigated and I'm confident that it will lead to a better result.
Bloody hell! That's a long reply. Sorry...
Essentially; I think we can reduce the inconvenience caused to non-English speaking users by maintaining a separate translated/transliterated list of tags.
I’m not sure whether we should translate By- and From- tags into English. Doing this will immediately make them less useful for non-English applications re-using Tatoeba’s data [unless someone's willing to maintain a database of tag translations ;o].
On the other hand, they’re used relatively rarely (compared to other tags anyway), so the ability to view quotes of some person in all the languages shouldn't be very useful.
BTW, not all %NAME% tags should become By-%NAME%. E.g. Plato, Caesar (‘Plato is my friend, but...’ was said by Aristotle, ‘Caesar non supra grammaticos’). I suggest these be removed from the auto-delete list and dealt with manually.
I believe By- and For- should be converted into DB fields describing a sentence and should be kept in the language of the sentence, since they're not really repeatable anyway.
Белый Ослик is ‘[A] little white donkey’, but I’m not sure whether it’s how this book(?) should be translated.
I think that translating the tags and managing these is a greater headache than the inconvenience of non-English speakers not being able to browse them. If anyone's interested, I could see about providing a translated version of my list.
Good catch on our dear Greek philosophers! Yes, these need to be looked at closer! Some tags should certainly exist as both "by Name" and "Name", as illustrated by this sentence that belongs in "Salvador Dalí" as well as "by Salvador Dalí":
I agree that it would be better if the tag table had a field for the tag type, but until we get that, we'll have to make do with a manual organisation. Still, I think it's fine to start out using tags without that field because it'll let us think about tags more from the ground-up; categorising them based on how we tag sentences, rather than tag sentences based on how we've categorised them.
My little page is my attempt to analyse that and hopefully get people to think more about what tags there are, what tags there should be and how we'd like this presented in the future. Hopefully these thoughts will be useful when the tags will be rethought.
I'm not quite sure what you mean by "not really repeatable", but if you're referring to the fact that not all tags should apply to all sentences it's linked to, I agree. Sometimes a tag will necessarily apply to all sentences linked together (as in the case with subjects and quotes). In other cases (parts of speech, grammar or possibly style) they won't match.
Still, I think that ultimately the language on tags should be in the language of the interface, rather than the sentence.
> Белый Ослик is ‘[A] little white donkey’
:-D I like that.
Hmmm ... Any idea whether the book may have been translated? If not, and we can find no reference of it being discussed, then we'll have to fall back on the original.
If it's never discussed in English, it stands to reason that there isn't an English word for it.
> Белый Ослик is ‘[A] little white donkey’, but I’m not
> sure whether it’s how this book(?) should be translated.
> Hmmm ... Any idea whether the book may have been
> translated? If not, and we can find no reference of
> it being discussed, then we'll have to fall back on
> the original.
It's a story that was published in one book together with other stories. (What's a сборник in English? :))
I can't find any hints that it may have been translated into another language.
> I’m not sure whether we should translate By- and
> From- tags into English.
Me neither. Well, I'm fine with translating them, as long as those translations do not replace the By- and From- tags with the authors and titles in the original language of the work. Otherwise it might become hard to find out what the original title was.
сборник = collection (as in, a short story collection)
> I'm not quite sure what you mean by "not really repeatable"
I meant that By- tags and (especially) From- tags are not really meant to be searched in the database.
They are for people to know who wrote the sentence, not to filter the sentences by this tag.
I doubt there will be another quote from Belyj oslik.
Yeah, I don't think either the "subject" or "quote" tags are terribly useful for what I consider the purpose of Tatoeba. I just don't see anyone coming here to find a sentence about an apple or from a work. I find it much more likely that people come to learn about the use of words in natural sentences.
Still, as long as people are interested in and willing to tag these, I'm happy to oblige. After all, as Paul has shown, there are lots of wild ways that one can use this thing.
How the information will ultimately be stored in the database will definitely be influenced by how people start using the tags. Whether as special quote tags, or some other field in the database, they'll still be searchable -- even though I agree with you that this neither is nor should be the prime focus of Tatoeba.
I made it down to Confucius. Made a new category for "by Julius Caesar" and "by Plato" and removed the "Caesar" and "Plato" tags.
One question: Do people reckon one should tag stuff like "veni, vidi, vici" as being /about/ Cæsar as well as by him? I guess it did refer to him originally...
Oh, I'll think about that later. Have to catch a bus...
I just finished going through the tags. I created new "by Name" tags for the few that had sentences by and about the person and updated the rename list.
What does the tag @translate en-fr stand for? Does that replace the former list "English and French don't match"?
Sorry, I didn't see it before. When I cleaned the list "English and French don't match", a lot of en-fr pairs were unlinked. When I wasn't able to recreate the pairs, I tagged the sentence that was missing a good translation in the other language. But there were only 3 or 4, I think.
So what is the procedure now when English and French don't match through Japanese?
The list is empty, so you can create a new tag with the name you want, or go on with the list. I would prefer a tag, however.
Here I proposed @translations http://tatoeba.org/eng/wall/sho...0#message_2410
but nobody replied to me...
I would like to know what Tatoeba's policy is regarding the addition of long sentences from literary works. I wonder whether it makes any sense while Gutenberg and Wikisource are doing their job in this field.
Just let me know.
Examples cannot currently be over some number (500?) of characters. I would advise not adding examples with over 255 characters (for tradition's sake, and those with old software ;-).
There is an item on the 'To do list' to support long texts, so it's best to hold off till then if you want your latest book to be made available here.
Other than that you should feel free to add what you want to (as long as you can do so under the license terms of this site). The big difference between here and Gutenberg / Wikisource is the emphasis on translating the text. If you dump lots of text in one language only with no translations it might not be very welcome.
Scott states that old French can be keyed as normal French on Tatoeba. I think this is confusing without proper indications to the public. What is the current stance with Old English, Old German or... Old Chinese or Japanese?
My view is that it is a different language altogether and should thus be defined as another language.
"Scott states that old French can be keyed as normal French on Tatoeba. I think this is confusing without proper indications to the public. " Not exactly. I said that Old French should be tagged. I think that it's important to identify what is modern and what is archaic, especially for language learners. I also don't object to using a different language for what is truly Old French (stuff from before the Middle Ages). Now, I don't want to get into the debate of defining what is Middle French, Old French and Modern French. It's certainly going to be difficult to draw clear lines between them.
And you're lucky with French because it's a relatively recent language, comparatively...
I will check with Trang, and we will answer you, but personally (not an official statement), I would say they should be considered different languages. As for delimiting what is old and what is "modern", we can rely on existing work on this (as it's sometimes taught at university, they should already have defined this).
Please delete this and all translations of this sentence.
Moving from comments because the sentence has been deleted as a copyright violation.
Only one of the two dictionary entries had 'no-one' as an also note. Also the BBC agrees with me, and they're always right ;-)
BBC Learning English
"No-one is written with a hyphen between the two 'o's."
Theoretical question. Is it correct to write noöne? :)
> Is it correct to write noöne? :)
Not in English. 'noone' isn't correct in English either.
Being serious for a moment, no-one is more common in British English than in American English (as far as those two things exist), and is still less common than 'no one'. I think the only people to denounce it as 'incorrect', though, are those who decry 'To boldly go' as a split-infinitive and correct people who use 'whom' instead of 'who'.
Definitely not incorrect, just uncommon. Hyphenation is an area of consternation even for native readers and writers of English. For example, just the other day, I was wondering whether it was "non-sense" or "nonsense". In the end, I just looked up my trusty OED, which said "nonsense".
> I just looked up my trusty OED
And by "my trusty OED", I meant that OED that comes bundled with Mac OS X, so I look it up by using the ever-so-convenient Dashboard widget. =)
> Not in English. 'noone' isn't correct in English either.
I distinctly remember my high school English teacher going over this, and saying that "noone" was technically the correct way to do it. And everyone was like "What?!" But yea, no one does it that way, really. She didn't make us do it, either.
:-) I'd never thought of that. I reckon that theoretically it could be used (as in naïve to split the digraph).
Searching reveals some use on-line but since Wiktionary calls it obsolete, there are hardly enough advocates to claim common use.
Maybe I'm the only one who finds the auto-completion feature on the tag more of a burden than a benefit... but could it be possible to implement a "turn-off autocomplete" option?
It often either suggests to me tags that were entered erroneously and then deleted, or a whole list of tags from which the one I want is somewhere in the middle. The problem is that hitting the return key doesn't implement the tag I've typed, but rather selects the first tag on the list.
Really, really annoying...
I happen to really love the autocomplete. It works perfectly for me (type, arrow down to the correct entry, enter, tab, enter), makes it easier to find the precise name of the tag (well, once we get rid of the empty or duplicate ones) and reduces the likelihood that people enter a new tag with a typo or in a slightly different form.
I haven't tried making "accidental" tags, but "proverbb" doesn't show up when entering "prov...". Is it possible that this is an issue with your browser, rather than the script?
If I were to make a suggestion for improving the tag form, I'd suggest a notice for creating tags. Something like when adding or deleting links, informing one that this particular tag currently doesn't exist and asking whether one really wants to create it.
> I haven't tried making "accidental" tags, but "proverbb" doesn't show up when entering "prov...". Is it possible that this is an issue with your browser, rather than the script.
That's because this was a hypothetical example. Try typing in 2nd, and see if you get "2nd personn" as the first choice.
But yes, I do also have an issue with my browser where it often won't let me scroll down with the arrow keys. That would be an individual problem.
"2nd p" brings "2nd personn", while "2nd P" brings normal variants.
It seems the problem is that tag autocompletion is case-sensitive, while tags aren’t.
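For illustration, a case-insensitive prefix match could look like the following minimal Python sketch. The `suggest` function and the sample tag list (including the "2nd Person" spelling) are hypothetical; this is not the site's actual autocompletion code, only one way to make matching ignore case.

```python
def suggest(prefix, tags):
    """Return the tags that start with the prefix, ignoring case."""
    needle = prefix.casefold()
    matches = [t for t in tags if t.casefold().startswith(needle)]
    # Sort case-insensitively so capitalisation does not affect the order.
    return sorted(matches, key=str.casefold)


# Hypothetical tag list containing one misspelled tag.
tags = ["2nd Person", "2nd personn", "proverb"]

print(suggest("2nd p", tags))
print(suggest("2nd P", tags))
```

With matching done this way, "2nd p" and "2nd P" give the same suggestions, so a misspelled lower-case tag no longer shows up on its own for one spelling of the prefix.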
I believe Sysko's going to make it non-case-sensitive at some point. But, yes, there are so many grammatical person tags:
The best way to fix your problem and make these tags more useful would be to merge them. I'll be done with the quotes tags soon, and was going to move on to this sort of stuff next.
The procedure for grammatical tags is pretty simple, IMO:
1) Replace all "(Language)" with "". (this one was my fault, as I didn't know you could sort tags by language at one point)
2) Decompose all the "[# Person] [formal/informal/plural/singular]" into two tags (one for the person, and one for the degree of formality, etc). I think that I'm prolly responsible for about 90% of the grammatical tags...
I agree with the re-tagging scheme but we'll have to hear from Trang or Sysko as to how best to implement this. With the number of sentences tagged, it will be best to run a database command on this.
It'll probably be something along the lines of:
1) tag every sentence in "X Y Z[/V] (W)" under "X Y", "Z" and "V"
2) remove the "X Y Z[/V] (W)" tag from its sentences and delete it.
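The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tag shape "X Y Z[/V] (W)" is taken from the post, while the regular expression, the `decompose` and `retag` helpers and the in-memory dictionary standing in for the database are all hypothetical. The real change would presumably be a database command written by Trang or Sysko.

```python
import re

# Matches names shaped like "X Y Z[/V] (W)",
# e.g. "2nd person plural/formal (French)". The "(W)" part is optional.
TAG_PATTERN = re.compile(
    r"^(?P<person>\S+ \S+) "
    r"(?P<first>[^/()\s]+)"
    r"(?:/(?P<second>[^/()\s]+))?"
    r"(?: \((?P<lang>[^)]+)\))?$"
)


def decompose(tag):
    """Split a compound tag into simple tags; return None if it has another shape."""
    m = TAG_PATTERN.match(tag)
    if m is None:
        return None
    parts = [m.group("person"), m.group("first")]
    if m.group("second"):
        parts.append(m.group("second"))
    return parts  # the "(W)" language suffix is dropped


def retag(sentence_tags, old_tag):
    """Add the simple tags to each sentence carrying old_tag, then drop old_tag."""
    new_tags = decompose(old_tag)
    if new_tags is None:
        return dict(sentence_tags)
    result = {}
    for sid, tags in sentence_tags.items():
        if old_tag in tags:
            kept = [t for t in tags if t != old_tag]
            tags = kept + [t for t in new_tags if t not in kept]
        result[sid] = tags
    return result


print(decompose("2nd person plural/formal (French)"))
```

Once no sentence carries the compound tag any more, the tag itself can be deleted, which corresponds to the second half of step 2.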
In fact, I've figured out there's a problem in the sorting: it should normally propose "proverb" as the first suggestion when you type "p", but for some strange reason it's not working. So I think that may be one reason why it's not so practical?
As for the enter key selecting the first suggestion rather than what you've typed, that was something of a "test" anyway. I don't know if you know, but instead of typing as fast as you can, you can just press "escape". If other people would also prefer to press the down arrow or click on the first suggestion and then press "enter", I will change it, because I added it to make users' lives easier anyway ^^
I think the bug with the erroneous tags should probably be looked into, as well. For example, if I misspell "proverb" as "proverbb", and then delete the tag and enter the correct one, the autocomplete remembers the incorrect one (even though no sentences have that tag). So next time I want to tag, I'll get "proverbb" as the first choice. I guess I'll try using escape more often.
Sorry, sysko, I know you put a lot of work into it.
I find it annoying sometimes, but I can live with it.
Firefox has a good auto-complete feature anyway, so it doesn't do much for me.
> Firefox has a good auto-complete feature anyway, so it doesn't do much for me.
You can get them to work together? Because my Firefox autocomplete is gone now... I kind of want it back.
In fact, I often find myself racing against the system, and typing the tag I want as quickly as I can so that the auto-complete doesn't have time to activate...
If you see content that may have been extracted from a recent book, movie or lyrics, or even if you're not sure, please tag it as "possible copyright violation", thanks :)
IANAL, but wouldn't a single sentence fall under the category of fair use/fair dealing?
No, because even for single sentences at school we used to study the following case:
In a tourist guide there was a paragraph about the Champagne region (where I'm from, really nice place, you should come ^^), talking about wines etc. A Champagne producer took one of these sentences and used it on his bottles, but soon the author of the guide brought suit. The judge stated that even a single sentence could be "copyrighted", because the sentence contained enough "style" to be considered something unique that had required work.
So yep, "I eat an apple" doesn't fall under this case, but I'm sure that people wanting to include a quote are more likely to do so for sentences "with style".
Moreover, the equivalent of "fair use" in French law is really precise and strict: it's only for educational/illustrative purposes, and only if we locate the quote in the original book, specify the author, etc.
So here I see 2 problems
1 - Will every single user know about this, and even if they know, will they take the time to add quotes the way the law says they should be added? Do moderators have time to check that?
2 - Tatoeba content is licensed under the CC-BY license, which means that commercial use is allowed, so does that break the "educational/illustrative" purpose?
To be honest, I really don't have the time to check all of this, and there aren't enough moderators to cover everything. And even if we had time, do we really need quotes from books? For sure some are interesting, etc., but does this deserve our taking the time to check the law AND watch how contributors add quotes (in the case it's legal)?
From my perspective, a sentence will benefit Tatoeba if it illustrates the use of a word or phrase in a natural sentence and is linked to one or more corresponding sentences in other languages. Content from published sources may be helpful but certainly isn't necessary.
The thing is that you need to be a French lawyer specialising in copyright.
I think even if they do fall under the French equivalent of fair use they have to have the source noted here _and_ in any third party re-use of the Tatoeba project data. AFAIK, at present there is not a good way to do that (maybe Tags would be enough?).