Wall (6,959 threads)
Hey language masters!
Two hundred and six sentences have been added containing the word "ovo", in Portuguese. You can access them by clicking on the following link:
http://tatoeba.org/eng/sentence...nd/indifferent
Hi, thanks for this. Just one question I'm curious about: how do you choose these words? (You already did this for other words some time ago.)
I usually pick very common words. But I have a question, sysko: how do you generate the lists of missing words? I've seen the lists for French, and it would be really helpful to have lists for Portuguese too (so that we could add sentences containing the missing words, obviously).
It's all explained on my blog :) http://blog.sysko.fr/ I needed a list of the most frequent words in French, and I wrote a simple "search and count" script that runs against the French sentences in Tatoeba. If you need more explanation after reading it, just ask me :)
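For readers who want a feel for what a "search and count" pass could look like, here is a minimal sketch in Python. It is not sysko's actual script (that one is described on his blog); the function name `count_words` and the three sample sentences are made up for illustration, and tokenization is simply "runs of letters and digits".

```python
import re
from collections import Counter

def count_words(sentences, wanted):
    """Count how often each word in `wanted` occurs in `sentences`."""
    wanted = {w.lower() for w in wanted}
    # Start every wanted word at zero so absent words still show up.
    counts = Counter({w: 0 for w in wanted})
    for sentence in sentences:
        for token in re.findall(r"\w+", sentence.lower()):
            if token in wanted:
                counts[token] += 1
    return counts

# Toy data, not taken from the Tatoeba corpus.
sentences = ["Le chat dort.", "Le chien mange.", "Un chat noir."]
counts = count_words(sentences, ["le", "chat", "oiseau"])
```

A word whose count stays at zero is a candidate "missing word" for that language.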
Well, I'm not a computer master, but I promise I'll read it and try my best to publish something like that. Excellent blog and content, sysko!
If you find me a frequency list, I should be able to take care of it. The process is quite long, but now that I've documented it (first of all for myself :P), it shouldn't cost me much "brain time".
Sure, I'll take care of it. Sorry for my ignorance, but all you need is a .txt with one word per line?
Yes. The best would be a list that also includes the frequency, so that we can generate "missing words among the 10,000 most frequent". In the meantime I'm also developing a predictive input method for alphabet-based languages, so a list with frequencies could be used in both projects.
Well, I've found some. I like this one:
http://www.mediafire.com/?tmzd9mylnvt
This one is also good, but the diacritics are missing...
http://alcor.concordia.ca/~vjor...e-Palavras.txt
I've read on a blog that these contain most of the words in our language, but I was not able to open them.
http://natura.di.uminho.pt/down...ies/wordlists/
I cannot find frequency lists...
Oh, by the way, they need to be under a free licence; public domain is best (if you can find that information for the given lists). Sorry for not having specified it.
I've just created a frequency list using texts by Machado de Assis, the most popular Brazilian writer. All his texts are in the public domain and can be accessed here:
http://machado.mec.gov.br/index...eta&Itemid=123
The frequency list can be downloaded here:
http://www.2shared.com/document...hado_freq.html
The non-frequency list can be downloaded here:
http://www.2shared.com/document...do_nofreq.html
I'm counting on you, sysko! Thank you :)
Hi, I know nothing about Portuguese beyond a very quick look at the Wikipedia page, but your list doesn't seem to contain characters with diacritics such as ã. Is that because the software you're using removes such diacritics?
Yes.
ã, á, à, â became a
ê, é became e
í became i
õ, ó, ô became o
ú, ü became u
ç became c
We have no trouble understanding words without these diacritics; I've already searched for words here in Tatoeba without them and they're listed normally...
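The mapping listed above can be reproduced with Unicode decomposition. The following is only a sketch of one common way to do it in Python, not the tool that produced the list; the helper name `strip_diacritics` is an assumption.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks, so that 'ã' becomes 'a' and 'ç' becomes 'c'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_diacritics("coração")
```

Applying the same function to both the frequency list and the corpus keeps the two sides comparable.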
http://static.tatoeba.org/missi..._por.html#stat :) I've written it in English as I don't speak Portuguese, shame on me ^^
The method used this time is slightly different (though it relies on the same basis) and is explained here http://blog.sysko.fr/post/11
Good job, sysko!
The first words are probably names (since these words were taken from books), but the list is just perfect. So is the tutorial you've written. I'll do my best to use these missing words, so THANK YOU AGAIN!
Yep, the search engine also removes diacritics. I asked because my script doesn't use the search engine, as I run it on my laptop, so I wanted to know whether I also need to add a "normalize" step to the script. :)
Thank you, I will produce the corresponding missing-words page tomorrow.
Well, it won't be that easy... Look, I can create one. There is software that can do that. It's simple: you give the software some texts (huge texts, by the way) and it generates frequency lists... It's legal!
By the way, this offer is open to everyone, for any language: just contact me with the list and tell me which language it is for.
I have phrase dictionaries (three lines per entry: Chinese/Japanese, phonetic transcription, Russian translation).
How can I transfer them to Tatoeba? Free of charge.
These are phrasebooks and grammar references in Excel format.
Yours sincerely, Oleg Filippovich
----
<Answer>
Right now only the site administrators can import large amounts of data, so you can send your files to team@tatoeba.org for them to be imported.
However, copyright also matters. Did you compile these dictionaries yourself? If not, permission from the person who compiled them is necessary.
Yours sincerely,
Zmicier Kushnarou
Hey, guys!
For those who study Portuguese, I've added 120 sentences containing the word "caixa", which means "box", in English. If you're interested in translating them, just click on the following link and keep clicking on "next".
http://tatoeba.org/pt_BR/sentences/show/976298
Missing words in French in Tatoeba / creation of my developer blog
I've generated an HTML page here http://blog.sysko.fr/media/stat...nquant_fr.html listing the words missing in French. (Note: if "mot" is already in the database but "mots" is not, "mots" will still appear as a missing word. I will try to improve this in the future.)
For people willing to generate this kind of file for other languages, I've described how to do it here http://blog.sysko.fr/post/8
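As a rough illustration of the idea (the real procedure is the one in the blog post linked above), the missing-word list boils down to: take the N most frequent words of a reference list and keep those that never occur in the corpus. In this Python sketch the function name `missing_words`, the `top_n` parameter and the toy frequencies are all assumptions.

```python
import re

def missing_words(frequencies, sentences, top_n=10000):
    """Return (word, frequency) pairs among the top_n words absent from sentences."""
    seen = set()
    for sentence in sentences:
        seen.update(re.findall(r"\w+", sentence.lower()))
    # Rank the reference list from most to least frequent.
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [(word, freq) for word, freq in ranked[:top_n] if word not in seen]

# Made-up frequencies, for illustration only.
frequencies = {"mot": 0.05, "mots": 0.03, "chat": 0.01, "prélat": 0.0001}
sentences = ["Ce mot est court.", "Le chat dort."]
missing = missing_words(frequencies, sentences, top_n=3)
```

Here "mots" is reported as missing although "mot" is present, because exact forms are compared, and a rare word such as "prélat" is not reported at all because it falls outside the `top_n` cut-off.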
So, as you've seen these days, I've started a blog to share all the scripts and tips I use or have found when dealing with Tatoeba.
Soon I will also use it to share how I'm developing the new version of Tatoeba, how it's architected, the ideas I have for future development, etc.
The blog is powered by cppcms, more precisely by cppcmsblog (Artyom, the guy behind cppcms, provides the code), so it runs on the same framework that will be used for the new version (for those who think that C++ is hazardous for a web application).
The content of the blog is under CC-BY and I allow reuse in Tatoeba, so feel free to correct my broken English and to import parts of posts into the database if you think they're useful. (As I'm not a native speaker, I don't feel confident importing them myself.)
Updated again:
Only 936 missing words left now. The list is still available at http://static.tatoeba.org/mots_..._fra.html#stat
I'm going to add a sentence with the 'word' "iii", which seems to be very frequent at 0.00381588% ;D but I think it must be a person, like Louis XIV.
"iii" is actually a footnote marker. I think the list in question takes every "word" as it appears in the source works, whether or not it is part of a sentence, so it produces a few absurdities, such as "cf", which is the abbreviation of "confer"...
Ahh, I was only guessing, because I had to write something about this funny thing.
Actually "iii" is the Roman numeral for 3, in lowercase...
Ahhhhh, very good deduction! I was wondering where the iii could come from. It seems obvious now! (Since the list is indeed all in lowercase.)
We're making progress, progress, progress...
There are a few mysteries in your list. For example, the word "prélat" isn't in it, even though it isn't in Tatoeba either...
And is it among the 10,000 most frequent? If so, I'll look into it.
Ah no, probably not! I had understood that it was all the words...
I've updated it; there are now only 1013 words missing (1111 last week).
I still can't access your page :(
Requested Range Not Satisfiable
None of the range-specifier values in the Range request-header field overlap the current extent of the selected resource.
Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_fastcgi/2.4.6 PHP/5.2.6-1+lenny9 with Suhosin-Patch mod_wsgi/2.5 Python/2.5.2 Phusion_Passenger/3.0.1 Server at blog.sysko.fr Port 80
In the meantime I've also put the page here http://static.tatoeba.org/mots_manquants_fra.html
Ah great, thanks, it works!
And which reference list of French words are you comparing against?
http://www.lexique.org/listes/liste_mots.php I use the first list.
According to the site, this list is based on their "Lexique 2" database.
http://www.lexique.org/outils/M...m#_Toc78013468
That page explains how their corpus was built.
If you're interested, there are also many other lists on the site.
Hmmm, now that's more interesting. I'm in a taxi right now; I'll dig into it when I get home. On top of that we're in luck: the person who makes cppcms (which I use for the blog and soon for Tatoeba) is now on Tatoeba.
You have small problems with captcha... Fix URL rewriting or something like that.
BTW, Tatoeba needs a link to Google Translate for those who do not speak a dozen languages :-)
And welcome aboard, glad to have you there.
Actually it is a great place to find linguistic examples...
This is what I had been looking for, for a long time, for Boost.Locale's tutorials.
OK, I've seen your answer on your blog. Tomorrow I will update my local copy and try to fix the problem with the captcha.
Your link is broken...
Strange, here it works perfectly, whether I go there from my Android phone, my work PC or my home PC. How does it fail? Error 503, 404?
This webpage is not found
No webpage was found for the web address: http://blog.sysko.fr/media/stat...nquant_fr.html
Error 6 (net::ERR_FILE_NOT_FOUND): The file or directory could not be found.
Hmmm, strange. This morning one of my students had the same problem, also with Google Chrome. After searching on the eponymous search engine, it seems several people have similar problems. My page is purely static; there isn't a single line of JavaScript in it. Do the other pages display correctly? What about with another browser?
(With Chromium I don't have this problem.)
Actually I managed to access it the first time, and not since...
I must be getting filtered by the Great Wall...
Unfortunately it's hosted in France, unless one first has to show some "bravitude" to access it, bravitude which, as everyone knows, is earned by climbing the said Wall...
I never use emoticons, but I'll make an exception :D
But I can access the root fine: http://blog.sysko.fr/media/static/
I can see it...
It works fine with Chrome and Firefox...
If these are words that are missing from Tatoeba, why don't they have a % equal to zero?
Sorry, I forgot to mention it: it's their frequency in the French language :)
OK, that makes more sense :)
Hello everyone, I'm new to this website and I'd like to be useful in the future; for now I need to learn how this website works. merci. danke. thank you. 谢谢. yes. I love languages and I love people and cultures.
Welcome! You can take a look at the Contributor Guide to get started. It explains very well how we are building the corpus of sentences:
https://docs.google.com/documen...kE&hl=en&pli=1
And you can ask questions on this wall, if you wish, even in Spanish.
Hello,
Can somebody who knows Japanese tell me whether this sentence (868189)
生きるか死ぬか、それが問題だ。
is split correctly/acceptably into words (done by software):
"生", "きるか", "死", "ぬか", "それが", "問題", "だ".
Is it good/good-enough/not-good-at-all?
Artyom
I'm not sure I understand what you want.
But if I had to split this sentence word by word, I'd say:
生きる (to live) - か (or) - 死ぬ (to die) - か - それ (that) - が (particle) - 問題 (question/problem) - だ.
Thanks, so it seems that the software that did the text analysis
didn't do it well at all.
Hello,
can somebody tell me what has happened in this sentence?
http://tatoeba.org/eng/sentences/show/868539
According to the logs, the sentence has a different author and says a different thing in a different language... It's quite weird...
It's... weird indeed.
Look at the logs for this one.
=> http://tatoeba.org/sentences/show/868540
I guess that happened because the two sentences were added almost at the same time, and somehow the logs for sentence 868539 and sentence 868540 have been swapped...
Could a moderator delete my sentence Number 970141, please ?
It's a duplicate.
Thank you.
It looks like this has already been taken care of, but in the future, don't worry about it. Duplicates are merged periodically by a script. You just have to wait awhile.
And welcome to Tatoeba! :-)
Thanks :D !
Hi everyone
Underneath Japanese sentences there is usually a second version where the kanji and digits have furigana (pronunciation) on top. It seems the second version was generated by a program, because when it comes to digits (numbers) I see a few mistakes.
Heinz
Can you give an example of these mistakes?
Hi!
First I must say, I really appreciate the automatic generation of furigana (pronunciation) for Japanese text. It also works very well most of the time.
It must have taken a lot of programming effort to do this.
But most numbers are treated as if they were phone numbers.
That is, each digit is treated as a one digit number.
That is how actual phone numbers are read, but it is wrong for multi-digit quantities.
In English and in Japanese we are supposed to give the weight for each digit by indicating if they are tens, hundreds or thousands.
For example in English we say: my phone extension is 2468 (two, four, six, eight),
but we say: this is $2468 (two thousand four hundred and sixty-eight dollars).
The Japanese do the same.
As an example I tried two sentences.
Sentence nº971299
私の電話番号は2468です。(に、よん、ろく、はち)
Sentence nº971300
これは2468ドールです。
The reading should be (にせん よんひゃく ろくじゅう はち), but it isn't.
In the old days this was written as 二千四百六十八, which makes the proper pronunciation obvious.
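To make the difference between the two readings concrete, here is a small Python sketch. It is only an illustration, not what Tatoeba or MeCab does internally: the names `read_number` and `read_digits` are made up, only 1 to 9999 is handled, and where a digit has several readings (4, 7, 9) one common reading is assumed.

```python
DIGITS = ["", "いち", "に", "さん", "よん", "ご", "ろく", "なな", "はち", "きゅう"]
# Irregular forms caused by sound changes; other digits combine regularly.
HUNDREDS = {1: "ひゃく", 3: "さんびゃく", 6: "ろっぴゃく", 8: "はっぴゃく"}
THOUSANDS = {1: "せん", 3: "さんぜん", 8: "はっせん"}

def read_number(n):
    """Hiragana reading of an integer from 1 to 9999 read as a quantity."""
    if not 1 <= n <= 9999:
        raise ValueError("only 1..9999 is handled in this sketch")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if thousands:
        parts.append(THOUSANDS.get(thousands, DIGITS[thousands] + "せん"))
    if hundreds:
        parts.append(HUNDREDS.get(hundreds, DIGITS[hundreds] + "ひゃく"))
    if tens:
        # 10 is just じゅう, without a leading いち.
        parts.append(("" if tens == 1 else DIGITS[tens]) + "じゅう")
    if ones:
        parts.append(DIGITS[ones])
    return "".join(parts)

def read_digits(digits):
    """Digit-by-digit reading, as used for phone numbers."""
    names = ["ゼロ"] + DIGITS[1:]
    return "".join(names[int(d)] for d in digits)

as_quantity = read_number(2468)
as_phone = read_digits("2468")
```

With these tables, `read_number(2468)` gives にせんよんひゃくろくじゅうはち while `read_digits("2468")` gives によんろくはち, which is the contrast described above. Dates such as 1日 (ついたち) need their own table and are not covered.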
Here is another example.
Sentence nº127002
地下鉄運賃が2OO1年4月1日からほぼ11%値上げになります。
Here the year is wrong, 2OO1年(にOOいちねん)
the month is right with 4月(しがつ),
but the date is wrong 1日(いちにち).
The いちにち pronunciation means “a whole day”.
Correct would be 1日(ついたち) which means “the 1st of the month”.
Let me know if you understand or disagree.
Thank you.
Cheers
Heinz
Yeah, you're indeed right, but it might actually be quite hard for a program to generate good furigana.
Especially for counters (although I doubt they are written with digits, as in 一つ>1つ or 二人>2人).
The best we can do for now is to ask for the furigana to be corrected :/
I've commented about this before, so a few quick comments:
(a) the fully automatic generation of furigana, currently done by the MeCab morphological analyzer, is about as good as you are going to get without some manual intervention.
(b) it could be improved by changing lexicons - from the IPADIC to UniDic. This gives a finer-grained analysis and from my testing will eliminate some, but not all of the errors.
(c) the best fix will be to continue with the MeCab approach, but allow for an override, where people can add kanji/etc.->furigana sequences such as (2468,にせんよんひゃくろくじゅうはち) which will take precedence over the MeCab version.
I'm not sure this is a high priority, but I think if we keep on with the furigana, it's worth making sure it is correct, or can be corrected.
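The override idea in (c) can be sketched as a simple lookup that is consulted before the automatic analysis. This Python sketch is an assumption about how such a layer might look, not Tatoeba's implementation: `furigana_for`, `overrides` and `fake_analyzer` are hypothetical names, and `fake_analyzer` merely stands in for the MeCab-based step by reading digits one at a time.

```python
def furigana_for(token, overrides, automatic):
    """Return the manual reading if one was entered, else the automatic one."""
    if token in overrides:
        return overrides[token]
    return automatic(token)

def fake_analyzer(token):
    """Stand-in for the automatic step: reads digits one by one, as reported."""
    names = {"2": "に", "4": "よん", "6": "ろく", "8": "はち"}
    return "".join(names.get(ch, ch) for ch in token)

# Manual corrections entered by contributors take precedence.
overrides = {"2468": "にせんよんひゃくろくじゅうはち"}

corrected = furigana_for("2468", overrides, fake_analyzer)
untouched = furigana_for("24", overrides, fake_analyzer)
```

A real table would probably be keyed per sentence as well, since the same digits can need different readings in different sentences (a phone number versus a price).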
I wonder, does either UniDic or IPADIC make use of EDICT?
Short answer: no. (They are morpheme dictionaries, whereas EDICT et al. deal in lexemes. There is some intersection between the two.)
Longer answer: EDICT material was used a bit in the later revisions of IPADIC. I don't know in the case of UniDic; probably not.
[not needed anymore- removed by CK]
In a way, why use furigana in tatoeba if you have these tools ?
If you're on another computer, it may be good to be able to see the furigana without installing anything.
I wonder, what if the creator of the sentence could change the furigana himself?
And by the way, it may "force" people to adopt Japanese sentences; a bunch of them are orphans ^^
I just tested how the Japanese language tool "IME" of Microsoft Word copes with numbers.
Using the test sentence: これは2468ドールです。
IME can convert this sentence into: これは二千四百六十八ドールです。
And if I ask for Hiragana it can convert the second one into:
これはにせんよんひゃくろくじゅうはちドールです。
I am really impressed.
Heinz
Tatoeba Racer has been updated: http://race.braulio.net.br/
Cool, Bráulio. It's even more interesting with the translations!