Wall (7277 threads)
Tips
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
Strange errors on 413554 and 413553... (could the mods check?)
The view became messed up once the sentences were tagged.
Happens with any tag. It looks like there's a bug in the latest (just installed) version.
http://tatoeba.org/eng/tags/sho...cit_%28Uyghur_
Note that there are also tags that were left when the sentence they were attached to was deleted.
Sorry about that ^^ It's fixed.
http://tatoeba.org/eng/sentence...e+commence+pas
Thought I'd bring it up here. The English and Japanese examples are slightly off compared with the other translations. blay_paul said he'd fix it, but there's no sign of a change yet.
Actually I changed the English so it matches the Japanese better and added two new alternative translations of the Japanese.
I don't know the other languages so I'll have to take your word about them.
*Paging Sysko*
I've decided it's a good idea to simplify the indexing a little by dropping the |1, |2, etc. notation. I basically had it for redundancy and now that the indexing has settled down it isn't as important as it was.
Could you do a global search and replace in the Japanese Index field of "|?(" with "(" ?
e.g. 為る|1(する) becomes 為る|1(する), but ちゃう|2 does not become ちゃう.
There are too many entries to change for the search feature on the Sentence Annotations page to work.
Thanks in advance, Paul
done, I've replaced all |?( by (
Great! You even read my mind for what I really meant to say, instead of what I actually wrote. ;-)
e.g. 為る|1(する) becomes 為る(する), but ちゃう|2 does not become ちゃう.
(Oops)
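For anyone who wants to run the same kind of replacement on their own copy of the index, here is a minimal sketch in Python. It is only an illustration, not the query that was actually run on the database. The function name `drop_index_numbers` is made up, and it assumes the number after `|` is written in ASCII digits and the reading is wrapped in ASCII parentheses.

```python
import re

# "|" followed by digits, but only when "(" comes immediately after.
# The lookahead keeps the "(" itself in place.
_NUMBER_BEFORE_READING = re.compile(r"\|\d+(?=\()")


def drop_index_numbers(index_field: str) -> str:
    """Remove the |1, |2, ... marker when it sits right before a (reading)."""
    return _NUMBER_BEFORE_READING.sub("", index_field)


if __name__ == "__main__":
    for entry in ["為る|1(する)", "ちゃう|2"]:
        print(entry, "->", drop_index_numbers(entry))
```

An entry such as ちゃう|2 has no parenthesis after the number, so it is left alone, which matches the corrected example above.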
Sentences.csv export.
As before, the sentences.csv export includes spurious \ symbols accompanying line breaks. Can you please filter out tabs and line breaks in the sentence text?
Sentence 400602 included a tab character.
4197 seems to have contained two line breaks.
I can't tell whether it's possible to remove them by editing the text as a moderator because they don't display in the first place.
There are about 59 sentences with two line breaks in them.
I've removed all line breaks, tabs, and multiple spaces from the sentences in the database.
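If you are working with an older export that still has these characters, a clean-up along the same lines can be sketched in Python. This is an assumption about what such a filter might look like, not the script used on the server; `clean_sentence` is a hypothetical name, and the sketch only touches ASCII spaces, tabs, and line breaks.

```python
import re

# Runs of ASCII spaces, tabs, carriage returns, and newlines.
# Other Unicode spaces (e.g. the ideographic space) are deliberately left alone.
_WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def clean_sentence(text: str) -> str:
    """Collapse tabs, line breaks, and repeated spaces into single spaces."""
    return _WHITESPACE_RUN.sub(" ", text).strip(" ")


if __name__ == "__main__":
    print(repr(clean_sentence("First line.\n\nSecond\tline.  ")))
```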
I just want to advertise a bit this list that I created for Japanese sentences to be checked by a native. Anyone can add sentences to it and of course natives are welcome to come and check the sentences:
http://tatoeba.org/eng/sentences_lists/show/131
Hey gang.
Registered to-day and have no idea as to my purpose as yet but will learn.
Welcome. If you know two languages, you can simply start translating sentences or adding your own.
Welcome to tatoeba.
Just some remarks on Scott's general introduction:
When adding new sentences, please also consider copyright issues.
For a general introduction you can watch the video on drumbeat.
If you have any specific questions, feel free to ask here on the wall for general questions, or at the specific sentence if you have issues with a sentence, its translations, doubt about whether a sentence in a foreign language is correct, etc. This is a community-based project after all. Please also fill out your profile so that we know which languages you speak, where you are from, etc.
Welcome :)
Even if you know only one language and want to learn another, just add sentences in your language (check with the search bar first that they don't already exist :) ) and ask in the comments how to translate them into language YYY.
(btw I think we should add the video somewhere directly here)
Could we have the duplicate removal script run?
Done.
I still haven't had time to take care of the [F] and [M] inline tags though...
The [F] and [M] tags have all been removed (and also the trailing spaces).
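For people cleaning their own copies of the data, the tag removal can be approximated with a short Python sketch. This is a guess at the shape of the clean-up, not the actual script; `strip_inline_tags` is a hypothetical name, and it assumes the tags appear literally as `[F]` or `[M]`.

```python
import re

# A literal [F] or [M], together with any whitespace just before it.
_INLINE_TAG = re.compile(r"\s*\[[FM]\]")


def strip_inline_tags(text: str) -> str:
    """Remove [F] / [M] inline tags and any leftover leading or trailing space."""
    return _INLINE_TAG.sub("", text).strip()


if __name__ == "__main__":
    print(repr(strip_inline_tags("I am tired. [F] ")))
```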
For those of you keeping track of such things.
Since last I checked (last Saturday) WWWJDIC has had 8 new sentence pairs added and 30 removed. The low rate of sentence addition is largely because of the time required to do the indexing.
Just curious, could the output from this parser be used somehow to automate indexing: http://www.jdictionary.com/parser ?
Short answer, no.
Longer answer, it might be of some use but it would also take a lot of time setting it up for the best results. It's probably not worth the effort. There are things that could be done to speed things up but they require volunteer(s) with the right know-how and quite a bit of free time.
It's too bad that it takes so long. がんばって!
The parser saeb mentioned uses MeCab (as does Tatoeba). The original indices were generated using ChaSen (very similar to MeCab), but they have been massively massaged since then. The trouble with using MeCab is that it's too fine-grained, and will break up compound nouns, expressions, etc., which for indexing should be kept whole. A better tool would be WWWJDIC's text glosser. See:
http://www.csse.monash.edu.au/~...81%A7%E3%81%AB
for an example. Its output is more aligned to dictionary entries.
How should sentences like the following sentence be tagged:
http://tatoeba.org/deu/sentences/show/34979
It's not a "proverb" as the part to which the tag should refer is only "land of milk and honey".
"expression"? "phrase"? "saying" ?
idiom
thank you, I'll use idiom =).
I don't think there is an official answer yet, but you could always tag it as "uses idiom" or "includes proverb" or something.
I generally tag such sentences with the "idiom" tag. The only problem is that the idiom itself is not marked within the sentence. But I guess, most people can figure it out for themselves with the hint that it actually is an idiom.
At least for the more obscure idioms, one can specify the "pure" form of the idiom in a comment.
BTW, Russian has a subtler distinction:
* пословица is the proverb that is a full sentence
* поговорка is the proverb that is not a full sentence and is used in context
Belarusian and Ukrainian have this as well.
IMHO we should mark the distinction somehow when tagging, otherwise it'll be impossible to translate tags once they become translatable.
[not needed - removed by CK]
IMHO a more general search with tags would be more useful.
E.g. find sentences tagged "OK" and not tagged "easy" in Chinese ("OK -easy"?).
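To make the "OK -easy" idea concrete, here is a small Python sketch of how such a query could be interpreted. The syntax and the names `parse_tag_query` and `matches` are my own assumptions, not an existing Tatoeba feature, and this simple version cannot handle tags that contain spaces.

```python
def parse_tag_query(query: str):
    """Split a query like "OK -easy" into required and excluded tag sets."""
    required, excluded = set(), set()
    for term in query.split():
        if term.startswith("-") and len(term) > 1:
            excluded.add(term[1:])   # "-easy" means: must NOT have "easy"
        else:
            required.add(term)       # plain term means: must have this tag
    return required, excluded


def matches(sentence_tags, query: str) -> bool:
    """True if the sentence has every required tag and no excluded tag."""
    required, excluded = parse_tag_query(query)
    tags = set(sentence_tags)
    return required <= tags and not (excluded & tags)


if __name__ == "__main__":
    print(matches({"OK"}, "OK -easy"))
    print(matches({"OK", "easy"}, "OK -easy"))
```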
Sure, it's planned, but it's not on my priority list.
Yep, in fact this has been considered for a long time because, as Scott said, this is the major criticism of the Tanaka corpus, and so of Tatoeba too.
The major problem is "how" to do this. Now that we have tags, I can make the "ok" tag a special tag, which only trusted users who are not the sentence owner (no CK, I'm not talking about you :P) can set,
and maybe add a "show only proofread" checkbox here and there.
> which only trusted user who are not the sentence owner
> can set
Not to be picky, but any trusted user could
* Un-own one of their sentences.
* Mark it OK
* Re-own it.
What I would suggest is that two different trusted users vet it as OK (that could include the sentence owner). One marks it "Checked by [name]" then anybody who is not [name] can change that to OK.
I'm even more picky ("never trust your users, even trusted users :P").
In the database we have a table which keeps logs of actions on sentences, so I will check the guy who "added" it, not the current owner :) (otherwise you would not have been able to correct a Tanaka corpus sentence and set it to OK).
In fact, to get your "needs two users" rule, we can consider that "owning" a sentence is the first proofread (and then when you adopt a sentence you will not be able to set the OK tag, just like the guy who contributed the sentence).
And then if the sentence is owned and you're neither the current owner nor the guy who added it, you will be able to set the OK tag.
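The rule described above can be written down as a short Python sketch. This is only a reading of the proposal, not existing Tatoeba code: the `Sentence` class, its fields, and `can_set_ok` are hypothetical names, and `added_by` stands in for whatever the action log records as the original contributor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    added_by: str          # original contributor, taken from the action log
    owner: Optional[str]   # current owner, or None if the sentence is unowned


def can_set_ok(user: str, is_trusted: bool, sentence: Sentence) -> bool:
    """Apply the proposed rule for who may set the OK tag."""
    if not is_trusted:
        return False
    if sentence.owner is None:
        # Ownership counts as the first proofread, so unowned sentences
        # cannot be marked OK yet.
        return False
    # The second check must come from someone who neither owns
    # nor originally added the sentence.
    return user != sentence.owner and user != sentence.added_by


if __name__ == "__main__":
    s = Sentence(added_by="alice", owner="bob")
    for name in ("alice", "bob", "carol"):
        print(name, can_set_ok(name, True, s))
```

Because the check uses the original contributor rather than only the current owner, un-owning and re-owning a sentence does not help the person who added it.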
Your idea seems to be about some system of verification or vetting. I think it's a good idea. I don't know if this is planned but it would be great to have a system where trusted users can verify sentences as correct. Eventually you would end up with a bank of sentences guaranteed to be good and this could address one of the major criticisms directed at the Tanaka corpus as being unreliable and containing unnatural sentences.
At least, I plan to do the following things very soon:
* filter by language on a tag page (show only ok sentences in French etc.)
* have a nice page to list only sentences in XXXX not translated in YYYY