Wall (5,984 threads)
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
Am I doing something wrong or is there no such function?
I want to search for English sentences that are NOT translated into Russian (so I could add translations), and I don't see how to do it.
Click on the English flag on the top right corner of the main page. This will give you ALL the English sentences available. Then on the right side you'll see options to narrow down the list. You want to choose the one "Show Sentences Not Directly Translated Into...", and then "Russian". Voilà.
As a side note, you should probably select "Show Translations in...", then "All Languages". Sometimes there's already a good English-to-Russian translation, but it just hasn't been linked. For efficiency reasons, it might be better to just request that someone with linking privileges link the two.
Question Regarding Transliterations:
How much work, and what exact steps, are needed to set up a transliteration system for a specific language? I ask because we now have a number of languages that can be written in multiple alphabets. For many of these, the task seems very simple, as a one-to-one letter correspondence exists between the alternative alphabets (e.g. Serbian). For some, conversion is only possible in one direction but not the other (unless a dictionary is available), but again, the task should not be difficult, as the majority of the current entries are entered on the convertible side of the one-way correspondence (e.g. Uighur, Uzbek, and - I think, but Demetrius could confirm - Tatar).
So, what would one have to do to realize this?
It's easy, but I'm too lazy to do this. ^^ I've started working on Uzbek, maybe I'll finish it someday...
Also, sysko has to implement transliteration caching. Caching will make it acceptable for transliteration to be more time-intensive (dictionary searches...).
> one-to-one letter correspondence between
> the alternative alphabets exists (e.g. Serbian).
But injekcija = инјекциjа, not ињекциjа. So you need a dictionary when transcribing from Latin...
Latin > Arabic is the easiest (unless people omit ' and don't differentiate j/zh ^^).
Since we have no Latin Uyghur sentences, there is no rush.
Others require a LARGE dictionary of proper names, since Arabic has no capital letters. Cyrillic requires a dictionary of Russian loanwords.
> and - I think, but Demetrius could confirm - Tatar
No, this one is tricky in both directions. The hardest part is q and ğ. Usually к = k and г = g with front vowels, к = q and г = ğ with back ones.
But Arabic words break vowel harmony:
нигъмәт — niğmət ‘dish’ (ğ is marked as гъ), сәгать — səğət ‘clock’ (ğ is marked by changing vowel letter, the vowel quality is marked by the soft sign), сәгатем — səğətem ‘my clock’ (ğ is marked by a vowel letter w/out a soft sign)
Russian loanwords break vowel harmony in another way: they force K and G even near back vowels.
Also, there are W and V:
В = V (вагон — vagon ‘carriage’), W (авыл — awıl ‘village’)
У = U (су — su ‘water’), W (тау = taw ‘mountain’)
Ү = Ü (күрү — kürü ‘see’), W (Мəскəү — Məskəw ‘Moscow’)
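The examples above make the problem concrete: a pure letter-to-letter map cannot decide between V and W (or U and W, Ü and W), so some word-level data has to override it. Here is a toy sketch using only the words cited above; the word list is illustrative, not a real Tatar transliterator, and a real system would fall back to letter rules for unknown words.

```python
# Toy illustration of why Tatar Cyrillic -> Latin needs word-level data:
# В, У, Ү each map to two different Latin letters depending on the word.
WORD_DICT = {
    "вагон": "vagon",    # В = V
    "авыл": "awıl",      # В = W
    "су": "su",          # У = U
    "тау": "taw",        # У = W
    "күрү": "kürü",      # Ү = Ü
    "мəскəү": "məskəw",  # Ү = W
}

def transliterate(word: str) -> str:
    # In a real system a letter-level fallback would run here; in this
    # sketch only the dictionary decides.
    try:
        return WORD_DICT[word.lower()]
    except KeyError:
        raise NotImplementedError("letter-level fallback needed")
```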
> But injekcija = инјекциjа, not ињекциjа. So you need a dictionary when transcribing from Latin...
Good point. But transliteration can't handle letter pair --> single letter correspondence? That would be much easier than a dictionary. Unless there are instances of "nj" that are нј and not њ, but this is never the case. Anyway, there are nearly no Latin Serbian submissions up to now, and so a one-way from the Cyrillic (if that's what it comes down to), is perfectly okay, IMO.
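For what it's worth, the "letter pair → single letter" handling discussed above is straightforward: match digraphs before single letters, and keep a small exception list for words like injekcija. A minimal sketch (the letter tables are an illustrative subset, not a complete Serbian mapping):

```python
# Sketch of Latin -> Cyrillic Serbian transliteration. Digraphs are
# matched before single letters; a small exception list handles words
# like "injekcija", where "nj" spans a morpheme boundary and must
# stay н + ј instead of becoming њ.

DIGRAPHS = {"nj": "њ", "lj": "љ", "dž": "џ"}
SINGLES = {"a": "а", "b": "б", "c": "ц", "d": "д", "e": "е", "i": "и",
           "j": "ј", "k": "к", "n": "н", "v": "в", "z": "з"}
EXCEPTIONS = {"injekcija": "инјекција"}  # morpheme boundary: keep н + ј

def to_cyrillic(word: str) -> str:
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:              # longest match first
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(SINGLES.get(word[i], word[i]))
            i += 1
    return "".join(out)

print(to_cyrillic("injekcija"))  # инјекција (via the exception list)
print(to_cyrillic("njiva"))      # њива (regular digraph rule)
```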
I disagree here. You'd need a big dictionary for Latin to Arabic. One particular example is n+g and ng (نگ and ڭ). The Latin "ng" could be transliterated as either. I think there are other cases, as well (personally, I don't much like the Latin Uighur...)
Arabic to Latin WITHOUT proper names is perfectly all right, in my opinion. It's not perfect, but it would still make a world of difference and people would usually be able to figure out what should be capitalized anyway.
> Good point. But transliteration can't handle
> letter pair --> single letter correspondence?
Of course it can. I mean that usually nj = њ, but in the word injekcija there is a morpheme boundary, so нј should be retained.
> Unless there are instances of
> "nj" that are нј and not њ,
> but this is never the case.
> I disagree here. You'd need a big dictionary
> for Latin to Arabic.
In fact, you don't need a dictionary at all for Latin>Arabic. Only for zh/j and ' if people omit these.
For all the other directions you need a dictionary.
> The Latin "ng" could be transliterated as either.
Not as far as I know. Latin spelling requires breaking these apart with ': ng vs. n'g.
> I think there are other cases,
> as well (personally, I don't
> much like the Latin Uighur...)
But it's indeed very easy to process.
I should have said "this is never the case, with the exception of a few foreign words". A small dictionary could be made for those, but the loss isn't great if the transliteration doesn't handle them properly.
> No, as far as I know. Latin requires breaking these with ': ng and n'g.
In that case, it's fine.
> I disagree here. You'd need a big dictionary
> for Latin to Arabic.
I meant Arabic to Latin.
I think what I can do is this: have an automatic transliteration tool for each language. Each time a sentence is added or updated (or its transliteration is missing), we call the tool and store the result in a dedicated database table. If it already exists, we just retrieve it from the database.
And maybe add a special page for trusted users, giving them the possibility to edit the stored transliterations.
This way:
Your dictionary will not need to handle eeeeevery case; as soon as it handles the most common cases, we can put it live, even if some particular sentences will need a manual edit. That way we can complete the dictionary step by step and make the feature available sooner.
And when a transliteration is manually edited, it will be flagged, so that if for some reason we update the tool and regenerate the transliterations, we won't erase the manually edited ones (as we can suppose they're right).
I think this is the best thing we can do, since for a lot of languages, a 100% correct transliteration tool is a dream anyway (even with long-running tools like MeCab and Adso we only reach about 90%, I think).
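The caching-plus-edit-flag scheme described above can be sketched in a few lines. This is only an illustration of the design, not Tatoeba's code; the table layout, function names, and the stand-in transliteration tool are all made up.

```python
# Sketch of the proposed transliteration cache: store results in a table,
# serve cached rows on hit, and mark manual edits with an "edited" flag
# so that a bulk regeneration never clobbers them.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE transliterations (
    sentence_id INTEGER PRIMARY KEY,
    text TEXT,
    edited INTEGER DEFAULT 0)""")

def auto_transliterate(sentence: str) -> str:
    return sentence.upper()  # stand-in for the real per-language tool

def get_transliteration(sentence_id: int, sentence: str) -> str:
    row = db.execute("SELECT text FROM transliterations WHERE sentence_id=?",
                     (sentence_id,)).fetchone()
    if row:                                 # cache hit: just retrieve it
        return row[0]
    text = auto_transliterate(sentence)     # miss: call the tool, store result
    db.execute("INSERT INTO transliterations (sentence_id, text) VALUES (?, ?)",
               (sentence_id, text))
    return text

def manual_edit(sentence_id: int, text: str) -> None:
    db.execute("UPDATE transliterations SET text=?, edited=1 WHERE sentence_id=?",
               (text, sentence_id))

def regenerate_all(sentences: dict) -> None:
    # Regenerate only the rows that were NOT manually edited.
    for sid, sentence in sentences.items():
        db.execute("""UPDATE transliterations SET text=?
                      WHERE sentence_id=? AND edited=0""",
                   (auto_transliterate(sentence), sid))
```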
OK, that is a good idea. I'll send my (imperfect) Uzbek transliterator on Monday.
For some of them, Demetrius has already started working on it; you can check with him what he has already done, what is hardly doable, etc.
For the others, give me a letter-to-letter / word-to-word transliteration file (like origin[tab]transliteration) and I think it will not take me long to integrate it into Tatoeba.
I've been seeing if I'll just grow accustomed to the new contribute setup but so far I haven't.
The random sentences had the drawback that occasionally you hit sentences you'd seen before. Getting the latest sentences has the advantage that you more easily spot multiple sentences that can be linked to your translation, but increases the times one bumps into sentences one's already seen (and not translated for whatever reason -- the exclude direct translations feature is great, by the way).
Another advantage of getting sentences in the order they were contributed is that you translate whole batches of similar sentences, making search results more well-rounded even if they cover fewer topics.
How about flipping the order on the contribute page around (or adding the option), starting with the oldest sentences? That way one can work one's way through the batch, filtering out already-translated sentences and starting after the ones one has chosen not to translate.
As a bonus, these are generally orphaned sentences and often need more attention than the ones active contributors are adding.
You can get orphan sentences by setting "translated into none" :)
For the "several random sentences" page, I will bring it back soon. For the moment this page is slow as hell and was really badly designed internally, but I've figured out how to make it fast, so it will soon be as before.
The page which shows all the sentences in a language shows the newest first because this favours collaboration: if I add a sentence, it will have a better chance of being translated than an old one, because it will appear on the first page for a while (depending on that language's activity - not sure "a while" will be very long for Russian or Esperanto :p). But you can get to the oldest ones by clicking on "last" :p (more seriously, yep, we can add a reverse-order button).
Not quite sure what you mean by "translated into none". By "orphaned" I mean the sentences that aren't owned by anyone.
Going to the last page is certainly a workaround, but showing the oldest first would make browsing and finding where one left off much easier (as the reference point wouldn't be moving constantly).
The collaboration argument is a very good one, but it would still be nice to be able to choose to work on the back-log. :-)
Not in this case - because the only difference is whether one word is in hiragana or in kanji.
Duplicate removal script not reassigning 'meaning' fields?
It looks like that when duplicate sentences are removed by the duplicate removal script, the Japanese sentence a removed one was attached to does not automatically have the link to the old sentence location removed. It also doesn't automatically have the meaning field changed to point to the new sentence.
Is this something recent, or could it have happened during an older run of the duplicate removal script?
It must have happened since last Saturday (because I check every Saturday and Thursday).
I would check for links that go to sentences that don't exist.
OK, I will check what goes wrong in the script.
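The check suggested above ("look for links that go to sentences that don't exist") is easy to express. A minimal sketch; the data shapes are guesses based on this discussion, not Tatoeba's actual schema:

```python
# Find links that reference sentence IDs which no longer exist,
# e.g. because the duplicate removal script deleted one endpoint
# without reassigning the link.
def dangling_links(existing_ids: set, links: list) -> list:
    return [(a, b) for a, b in links
            if a not in existing_ids or b not in existing_ids]

# Example: sentence 2 was removed as a duplicate of 3, but a link
# still points at it.
existing = {1, 3}
links = [(1, 3), (1, 2)]
print(dangling_links(existing, links))  # [(1, 2)]
```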
Some preventive measures?
timsa (http://tatoeba.org/eng/user/profile/timsa) has added lots of sentences that are either not translations (but may look like them to a learner), or too impolite/vulgar, or too colloquial...
I think we need some rules regarding the user behaviour.
I brought up a similar point before, but this is different. When a user productively contributes bad sentences, it is a major problem (it's not always easy to detect, and "volunteer" contributors don't have time to police each other), and so I agree with Demetrius that some sort of system needs to be in place.
I would agree with what Trang has said before, in that banning users isn't the solution (not unless Tatoeba makes strict application procedures, which would probably do more harm than good).
The only "smart", "automatic" way for these kinds of problems to be handled, IMO, is to set up some sort of quota system, where users can contribute more sentences as their "trustworthiness" increases (based on feedback from other users). I have also proposed this before as an automatic means of regulating sentence quality. But it's probably a pipe dream, as it may be hard to code, and probably couldn't be implemented for a long, long while...
> The only "smart", "automatic" way for these kinds of
> problems to be handled, IMO, is to set up some sort of
> quota system, where users can contribute more sentences
> as their "trustworthiness" increases (based on feedback
> from other users).
I'm not keen on that approach. There are a number of technical and policy problems. On the technical side, there are users who are the only posters in a certain language (like Hindi). Nobody would be able to tell whether they were trustworthy or not, and there would be no feedback to give them a larger quota! I also don't want to discourage new users from being enthusiastic when they start out.
You're right. Those are also the reasons why I'm a bit conflicted about it. My only comments:
1) On keen users
Yes, I think a lot of us have been there. I would have hated it if people told me I could, say, only contribute 10 sentences/day when I started. At the same time, when you have bad keen users, you do run into the double-edged problem. Perhaps moderators could play a key role here, and manually raise the quota for users who request it (under the condition that it be lowered back if the trust is betrayed).
2) On "exotic" languages
Similar comment. Perhaps a mod could raise the quota and let the user contribute, then lower it if doubt arises for whatever reason. Also the same double-edged problem, too. What if a user contributes completely bogus sentences in some exotic language? Well, that's less likely, but still...
I think we really just need a few more moderators - particularly moderators that can 'cover' a language that isn't well covered now. I can only really do a proper job of moderation in English and Japanese, for example.
Just out of curiosity, do you have a particular moderator:user ratio in mind?
I'll just say that given the amount of moderator stuff that goes on now, for the amount of moderators and users we have now, then I think we need about 50% more moderators.
My personal opinion is the same as blay_paul's: we need to find more people to be moderators, and we need to provide them with a set of pages/tools to ease their task. Maybe also add something which may be seen as controversial: the possibility for a moderator (maybe requiring two moderators) to remove a user's ability to add sentences for a short period of time, if someone goes crazy and begins to flood.
This way it's not banning, but it will in some way force the person to get in contact with us and to solve the problem diplomatically.
As for exotic languages, I think the problem is still the same, though it's even harder to find moderators. But I think - maybe I'm just dreaming - that the more exotic a language is, the more probable it is that people contribute with a good reason in mind, not for flooding or so; maybe with errors etc., but at least not intentionally.
When it comes to regulating behaviors, the community plays a very important role.
You can find most of the "rules" on how to behave properly in Tatoeba in the articles here:
If you have read the 3 articles, you know pretty much what is a good and a bad behavior, what is acceptable or not. When people don't follow the guidelines, all you can do is be patient and try to teach them.
There will always be people who don't understand the system, or who have a different point of view of the project. They will not behave the way we may expect them to, they will not be using Tatoeba as we may expect them to. It's impossible to prevent people from breaking the system (whether it's intentionally or not).
What we can (and will) do though is try to design a system that can detect unwanted behavior as early as possible. Then as sysko said, we just need to contact the person. Most people are cooperative and re-adjust their contributions when we simply and kindly ask them to. Those who don't cooperate are usually people who add a few sentences and simply never come back, and in such cases, we have moderators. Although, as blay_paul said, we need more moderators to cover all the important languages... If you think you can take this responsibility for the Russian part, let me know :)
Gruzilkin (http://tatoeba.org/eng/sentence...ser/Gruzilkin) has added some Bulgarian sentences marked as Russian. It's either an auto-detection failure or an intended action (he brought the number of Bulgarian sentences up to 500 this way).
Can we reassign the language of these in a batch?
Everything I added was in Russian, I think auto-detection doesn't work well with Russian/Bulgarian
Please excuse me for my suspicion. ^^
BTW, he didn't add any genuinely Bulgarian sentences. ^_^
I can reassign them in a batch if you tell me the criteria - are all his contributions in Russian in fact Bulgarian ones?
[not needed anymore- removed by CK]
I'm guessing those two sets would probably merge through the German sentence...
As they're not strictly equal, they will not be merged automatically. But yep, a moderator can link them :)
Finding new sentences:
When there is a new vocabulary I want to learn, I look up example sentences in Tatoeba. I translate the sentences into my native language to get used to its collocations and nuances, and then I add some sentences to my SRS system (anki in my case).
But when I do not find a sentence here, I have great difficulty finding meaningful sentences on the web.
Even if I use the option "in the text of the page", I mostly get meaningless results such as headlines or single words, but not nice sentences. Does anyone have any ideas about how to find sentences in a more efficient way?
The above problem led me to the idea that users should be able to ask for sentences containing certain words, because native speakers might come up with sentences more easily.
Finding good example sentences in text is notoriously difficult.
A corpus-query system I use, called the Sketch Engine, has an option called GDEX ("good examples") which can automatically promote sentences to the top of the result list depending on how much they look like good examples. It uses a couple of heuristic measures such as sentence length, the presence or absence of punctuation, the presence or absence of anaphoric pronouns, and so on. It is far from perfect, but it is a good start.
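The heuristics mentioned can be sketched as a simple scoring function. This is a toy imitation of the idea, not GDEX itself; the thresholds, weights, and pronoun list are invented for illustration:

```python
# Toy GDEX-style scorer: prefer sentences of reasonable length, with
# sentence-final punctuation, and without anaphoric pronouns that
# would need outside context to resolve.
import re

ANAPHORIC = {"he", "she", "it", "they", "this", "that", "these", "those"}

def gdex_score(sentence: str) -> float:
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    score = 1.0
    if not 5 <= len(words) <= 20:
        score -= 0.4                        # too short or too long
    if not sentence.rstrip().endswith((".", "!", "?")):
        score -= 0.3                        # likely a headline or fragment
    if ANAPHORIC & set(words):
        score -= 0.2                        # needs surrounding context
    return score

candidates = ["Weather update latest news",
              "The children washed their hands before dinner."]
best = max(candidates, key=gdex_score)
print(best)  # the full sentence outranks the headline
```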
This sounds nice. Actually I used a similar system for looking up English sentences, namely the BYU system for browsing the British National Corpus, freely available for students:
But does such a system exist for Japanese corpora?
I mean, besides the search facility, the corpora are the other essential ingredient. That's why I tried to use Google, because they do have the data.
I signed up for the 30 day free trial on Sketch Engine, but I couldn't find the GDEX option. I think it may not exist for the Japanese corpus. It also looks like the Japanese data is not very well filtered to eliminate spam sites and such. It's basically all taken off the web so it isn't much better than Google in that respect. If you search on a spam-worthy word like ヘルス you'll see how bad it can get (NOTE! Search result may not be work friendly.)
The GDEX option is well hidden in the Sketch Engine. When you're looking at a concordance, click "View Options" at the top left and then you'll see a checkbox titled "Sort good dictionary examples". This has to be checked for GDEX to be used.
As far as I know this option is available for every corpus in every language but it may not work so well on languages other than English because -- no surprise -- it was originally built for English.
Found it. It does actually work quite well.
For sure it's something we have wanted for a long time; we will make it "good" and smart in the next BIG release (not the next one, the next BIG one).
But maybe I can try to make something quick and dirty.
I will see, talk with Trang, and give you an answer.
But in the end it will be done.
> Does anyone have any ideas about how to find sentences
> in a more efficient way?
This is something I do a lot. Unfortunately I don't have any magic solutions. Here's one approach that might work for you, though.
1. Start with a word (e.g. 潔癖)
2. Look it up in the sort of dictionary that gives example sentences. (Don't trust dictionary example sentences completely!)
3. Use the context given in the example sentence to narrow down the Google search (e.g. search on "潔癖" + "手を洗う"). This will usually get you much better results than just searching on the word alone.
Here are a few I found ...
I am not sure whether it is a bug or a curiosity ("feature"), but when you want to search for multiple Japanese words, the spaces have to be entered as ordinary (half-width) spaces. Typing a space while using the Japanese IME leads to different results (always no results?).
Hmmm, I think it's because the search engine handles only the "normal" space as a word separator. Can you give me this space (in an answer to this message, between brackets)? This way I will be able to convert it before sending the request to the search engine.
Sure. And here it is:[ ]. Lol, do you see the difference?
yep it's a full width space :)
Please do the same for non-breaking space [ ]. I sometimes use it to prevent dashes (—) from being moved to the next line; it should be treated in the same way as the ordinary space for search purposes.
There are lots of other spaces, but I’m not sure anyone has ever used these on Tatoeba: en quad [ ], em quad [ ], en space [ ], em space [ ], 3-per-em space [ ], 4-per-em space [ ], 6-per-em space [ ], figure space [ ], medium mathematical space [ ], punctuation space [ ], thin space [ ], hair space [ ], zero-width space 
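Converting all of those before handing the query to the search engine is a one-liner with a regular expression. A sketch of the normalization step being discussed (not the actual Tatoeba code): in Python, `\s` already matches the Unicode whitespace characters, including the full-width space (U+3000) and the no-break space (U+00A0); the zero-width space (U+200B) is not whitespace-class, so it is added explicitly.

```python
# Normalize every Unicode space variant in a search query to a single
# ordinary ASCII space before querying the search engine.
import re

SPACE_RE = re.compile(r"[\s\u200b]+")

def normalize_query(query: str) -> str:
    return SPACE_RE.sub(" ", query).strip()

print(normalize_query("潔癖\u3000手を洗う"))  # full-width space -> ASCII space
print(normalize_query("foo\u00a0bar"))        # no-break space  -> ASCII space
```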
OK, thanks. It will be present in the next release.