Wall (6,959 threads)

Tips

Before asking a question, make sure to read the FAQ.

We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.

alexmarcelo, July 6, 2011 at 8:09:48 PM UTC

Hey language masters!
Two hundred and six sentences have been added containing the word "ovo", in Portuguese. You can access them by clicking on the following link:
http://tatoeba.org/eng/sentence...nd/indifferent

sysko, July 6, 2011 at 8:44:10 PM UTC

Hi, thanks for this. Just one question, out of curiosity: how do you choose these words? (You already did this for another word some time ago.)

alexmarcelo, July 6, 2011 at 8:55:31 PM UTC

I usually take very common words, but I have a question, sysko: how do you generate lists of missing words? I've seen lists in French, and it would be really helpful if we could have lists in Portuguese, too (so that we could add sentences with the missing words, obviously).

sysko, July 6, 2011 at 9:00:23 PM UTC

It's all explained on my blog :) http://blog.sysko.fr/ I needed a list of the most frequent words in French, and I made a simple "search and count" script that runs against the French sentences in Tatoeba. Once you've read it, if you need more explanation, just ask me :)

alexmarcelo, July 6, 2011 at 9:07:39 PM UTC

Well, I'm not a computer master, but I promise I'll read it and try my best to publish something like that. Excellent blog and content, sysko!

sysko, July 6, 2011 at 9:09:14 PM UTC

If you find me a frequency list, I should be able to take care of it. The process is quite long, but now that I've documented it (first for myself :P), it shouldn't take me so long in "brain time".

alexmarcelo, July 6, 2011 at 9:13:53 PM UTC

Sure, I'll take care of it. Sorry for my ignorance, but is all you need a .txt file with one word per line?

sysko, July 6, 2011 at 9:18:45 PM UTC

Yes. The best would be a list that also has the frequencies, so that we can generate "missing words among the 10,000 most frequent". Actually, in the meantime I'm developing a predictive input method for alphabet-based languages, so with frequencies the list could be used in both projects.
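
A minimal sketch of the "missing words among the N most frequent" idea, in Python. It is not sysko's actual script (that one is described on his blog); the function names, the tokenizer, and the sample data are assumptions made for illustration.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its word tokens (runs of letters)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def missing_words(freq_list, sentences, top_n=10000):
    """Return the top_n most frequent words that occur in no sentence.

    freq_list: iterable of (word, frequency) pairs, in any order.
    sentences: iterable of sentence strings from the corpus.
    """
    seen = set()
    for sentence in sentences:
        seen.update(tokenize(sentence))
    # Rank by frequency, highest first, and keep only the top_n entries.
    ranked = sorted(freq_list, key=lambda pair: pair[1], reverse=True)[:top_n]
    return [(word, freq) for word, freq in ranked if word.lower() not in seen]


# Hypothetical sample data, not taken from any real frequency list.
freq_list = [("ovo", 120), ("caixa", 95), ("casa", 300), ("prelado", 2)]
sentences = ["Eu comprei um ovo.", "A casa é grande."]
print(missing_words(freq_list, sentences, top_n=3))
```

Because the comparison is on exact word forms, a plural such as "mots" would still be reported as missing when only "mot" is present.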

alexmarcelo, July 6, 2011 at 9:34:10 PM UTC

Well, I've found some. I like this one:
http://www.mediafire.com/?tmzd9mylnvt

This one is also good, but the diacritics are missing...
http://alcor.concordia.ca/~vjor...e-Palavras.txt

I've read on a blog that these contain most of the words in our language, but I was not able to open them.
http://natura.di.uminho.pt/down...ies/wordlists/

I cannot find frequency lists...

sysko, July 6, 2011 at 9:40:42 PM UTC

Oh, by the way, the lists need to be under a free licence; public domain is best (if you can find this information for the given lists). Sorry for not having specified it.

alexmarcelo, July 6, 2011 at 11:31:09 PM UTC

I've just created a frequency list using texts by Machado de Assis, the most popular Brazilian writer. All his texts are in the public domain and can be accessed here:
http://machado.mec.gov.br/index...eta&Itemid=123

The frequency list can be downloaded here:
http://www.2shared.com/document...hado_freq.html

The non-frequency list can be downloaded here:
http://www.2shared.com/document...do_nofreq.html

I'm counting on you, sysko! Thank you :)

sysko, July 8, 2011 at 2:51:59 PM UTC

Hi, I know nothing about Portuguese beyond a very quick look at the Wikipedia page, but your list doesn't seem to contain characters with diacritics such as ã. Is that a result of the software you're using, which removes such diacritics?

alexmarcelo, July 8, 2011 at 5:17:08 PM UTC

Yes.
ã, á, à, â became a
ê, é became e
í became i
õ, ó, ô became o
ú, ü became u
ç became c

We have no problem understanding words without these diacritics; I've already searched for words here in Tatoeba without them and they're listed normally...
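
The mapping listed above can be reproduced with Unicode normalization. This is a minimal Python sketch, assuming the goal is simply to drop combining marks; it is not the software alexmarcelo used, and the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks such as accents, tildes and cedillas."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


for word in ["coração", "pingüim", "avô", "não"]:
    print(word, "->", strip_diacritics(word))
```

Applying the same function to both the frequency list and the corpus sentences before comparing them would keep a script consistent with a search engine that ignores diacritics.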

sysko, July 8, 2011 at 11:39:00 PM UTC

http://static.tatoeba.org/missi..._por.html#stat :) I've written it in English as I don't speak Portuguese, shame on me ^^

sysko, July 9, 2011 at 12:18:58 AM UTC

The method used this time is slightly different (though it relies on the same basis) and is explained here: http://blog.sysko.fr/post/11

alexmarcelo, July 9, 2011 at 12:46:51 AM UTC

Good job, sysko!
The first words are probably names (since these words were taken from books), but the list is just perfect. So is the tutorial you've written. I'll do my best to use these missing words, so THANK YOU AGAIN!

sysko, July 8, 2011 at 6:10:21 PM UTC

Yep, the search engine also removes diacritics. I asked because my script does not use the search engine, as I run it on my laptop, so I wanted to know whether I also need to add a "normalize" step to the script. :)

sysko, July 6, 2011 at 11:52:16 PM UTC

Thank you. I will produce the corresponding missing-words page tomorrow.

alexmarcelo, July 6, 2011 at 10:00:05 PM UTC

Well, it won't be that easy... Look, I can create one. There is software that can do that. It's simple: you give the software texts (huge texts, by the way) and it generates frequency lists... It's legal!
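
Generating a frequency list from raw text, as described here, can be sketched in a few lines of Python. The tokenizer (runs of letters, lowercased) and the function name are assumptions; real tools may treat hyphens, apostrophes, and proper names differently.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count, relative_frequency) tuples, most frequent first."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return [(word, count, count / total) for word, count in counts.most_common()]


# Tiny sample text; a usable list needs a very large corpus.
sample = "O ovo e a caixa. O ovo!"
for word, count, share in frequency_list(sample):
    print(f"{word}\t{count}\t{share:.4f}")
```

Writing one such line per word yields the kind of plain-text "word plus frequency" file requested earlier in the thread.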

sysko, July 6, 2011 at 9:10:14 PM UTC

By the way, this offer is open to everyone, for any language: just contact me with the list and tell me which language it is.

Tolkach, July 2, 2011 at 2:14:51 AM UTC

(Translated from Russian.)
I have phrase dictionaries (three lines per entry: Chinese/Japanese, phonetic transcription, Russian translation).
How can I transfer them to Tatoeba? Free of charge.
They are phrasebooks and grammar references in Excel format.

Respectfully, Oleg Filippovich

Demetrius, July 6, 2011 at 11:36:31 AM UTC

<Translation for those who can't read Russian>
I have phrase dictionaries (three lines per entry: Chinese/Japanese, phonetic transcription, Russian translation).

How can I move them to Tatoeba? Free of charge.

These are phrasebooks and grammar references in Excel format.

Yours sincerely, Oleg Filippovich

----

<Answer>
Right now only the administrators can import large amounts of data, so you can send your dictionaries to team@tatoeba.org for them to be imported.

However, there is also the question of copyright. Did you compile these dictionaries yourself? If not, permission from the person who compiled them is necessary.

Yours sincerely,
Zmicier Kushnarou

alexmarcelo, July 5, 2011 at 3:03:13 AM UTC

Hey, guys!
For those who study Portuguese, I've added 120 sentences containing the word "caixa", which means "box", in English. If you're interested in translating them, just click on the following link and keep clicking on "next".

http://tatoeba.org/pt_BR/sentences/show/976298

sysko, June 26, 2011 at 3:24:12 AM UTC

Missing words in French in Tatoeba / creation of my developer blog

I've generated an HTML page here http://blog.sysko.fr/media/stat...nquant_fr.html listing the words missing in French. (Note: if "mot" is already in the database but "mots" is not, "mots" will appear as a missing word. I will try to improve this in the future.)

For people willing to generate this kind of file for other languages, I've described how to do it here: http://blog.sysko.fr/post/8

As you've seen these days, I've started a blog to share all the scripts and tips I use or find when dealing with Tatoeba.

Soon I will also use it to share how I'm developing the new version of Tatoeba, how it's architected, what ideas I have for future development, etc.

The blog is powered by cppcms, actually by cppcmsblog (Artyom, the guy behind cppcms, provides the code), so it runs on the same framework that will be used for the new version (for those who think that C++ is hazardous for a web application).

The content of the blog is under CC-BY and I allow reuse in Tatoeba, so feel free to correct my broken English and to import parts of posts into the database if you think they're useful. (As I'm not a native speaker, I don't feel confident importing them myself.)

sysko, July 5, 2011 at 12:26:43 AM UTC

Updated again:

Now only 936 missing words remain. The list is still available at http://static.tatoeba.org/mots_..._fra.html#stat

jakov, July 6, 2011 at 5:00:10 PM UTC

I'm going to add a sentence with the 'word' "iii", which seems to be very frequent at 0.00381588% ;D but I think it must come from a person's name, like Louis XIV.

sacredceltic, July 6, 2011 at 5:11:45 PM UTC

"iii" is actually a footnote marker. I think the list in question takes every "word" as it appears in the source works, whether or not it is part of a sentence, which leads to a few absurdities, such as "cf", the abbreviation of "confer"...

jakov, July 6, 2011 at 5:34:10 PM UTC

Ahh, I was only guessing, because I just had to write about this funny thing.

sacredceltic, July 6, 2011 at 5:56:40 PM UTC

Actually, "iii" is the Roman numeral for 3 in lowercase...

sysko, July 6, 2011 at 5:10:05 PM UTC

Ahhhhh, very good deduction. I was wondering where the "iii" could come from. Exactly, it seems obvious now! (Since the list is indeed all in lowercase.)

sacredceltic, July 5, 2011 at 1:01:24 AM UTC

We're making progress, progress, progress...

sacredceltic, July 2, 2011 at 10:07:10 AM UTC

There are a few mysteries in your list. For example, the word "prélat" isn't in it, even though it isn't in Tatoeba either...

sysko, July 2, 2011 at 11:55:29 AM UTC

And is it among the 10,000 most frequent? If so, I'll look into it.

sacredceltic, July 2, 2011 at 12:13:56 PM UTC

Ah no, probably not! I had understood that it covered all words...

sysko, June 28, 2011 at 11:45:38 AM UTC

I've updated it; now there are only 1013 words missing (1111 last week).

sacredceltic, June 28, 2011 at 12:10:24 PM UTC

I still can't access your page :(

Requested Range Not Satisfiable

None of the range-specifier values in the Range request-header field overlap the current extent of the selected resource.

Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_fastcgi/2.4.6 PHP/5.2.6-1+lenny9 with Suhosin-Patch mod_wsgi/2.5 Python/2.5.2 Phusion_Passenger/3.0.1 Server at blog.sysko.fr Port 80

sysko, June 30, 2011 at 3:49:53 AM UTC

In the meantime I've also put the page here: http://static.tatoeba.org/mots_manquants_fra.html

sacredceltic, June 30, 2011 at 8:55:10 AM UTC

Ah great, thanks, it works!
And which reference list of French words are you comparing against?

sysko, June 30, 2011 at 9:43:39 AM UTC

http://www.lexique.org/listes/liste_mots.php I use the first list.

According to the site, this list is based on their "Lexique 2" database:
http://www.lexique.org/outils/M...m#_Toc78013468

That page explains how their corpus was built.

If you're interested, there are many other lists on the site as well.

sysko, June 28, 2011 at 1:18:31 PM UTC

Hmmm, now that's more interesting. I'm in a taxi right now; I'll dig into it when I get home. On top of that, we're in luck: the person who makes cppcms (which I use for the blog and soon for Tatoeba) is now on Tatoeba.

artyom, June 26, 2011 at 9:15:09 PM UTC

You have small problems with captcha... Fix URL rewriting or something like that.

BTW, Tatoeba needs a link to Google Translate for those who do not speak a dozen languages :-)

sysko, June 26, 2011 at 9:46:23 PM UTC

And welcome aboard, glad to have you here.

artyom, June 26, 2011 at 9:49:58 PM UTC

Actually it is a great place to find linguistic examples...

This is what I had been looking for, for a long time, for Boost.Locale's tutorials.

sysko, June 26, 2011 at 9:45:50 PM UTC

OK, I've seen your answer on your blog. Tomorrow I will update my local copy and try to fix the problem with captcha.

sacredceltic, June 26, 2011 at 5:54:18 PM UTC

Your link is broken...

sysko, June 26, 2011 at 6:54:43 PM UTC

Strange, here it works perfectly whether I go there from my Android phone or from my PC at work or at home. How does it fail? Error 503, 404?

sacredceltic, June 26, 2011 at 6:57:31 PM UTC

This webpage is not found
No webpage was found for the web address: http://blog.sysko.fr/media/stat...nquant_fr.html
Error 6 (net::ERR_FILE_NOT_FOUND): The file or directory could not be found.

sysko, June 26, 2011 at 7:23:57 PM UTC

Hmmm, strange. This morning one of my students had the same problem, also with Google Chrome. After searching on the eponymous search engine, it seems that several people have similar problems. And my page is purely static; there isn't a single line of JavaScript in it. Do the other pages display correctly? What about with another browser?

(With Chromium I don't have this problem.)

sacredceltic, June 26, 2011 at 7:30:24 PM UTC

Actually I managed to access it the first time and not since...
I must be getting filtered by the Great Wall...

sysko, June 26, 2011 at 7:39:56 PM UTC

Unfortunately it's hosted in France, unless one first has to show "bravitude" to access it, bravitude which, as everyone knows, is earned by climbing the said Wall...

sacredceltic, June 26, 2011 at 7:44:09 PM UTC

I never use emoticons, but I'll make an exception :D

sacredceltic, June 26, 2011 at 7:31:09 PM UTC

But I can access the root http://blog.sysko.fr/media/static/ just fine.

alexmarcelo, June 26, 2011 at 7:24:13 PM UTC

I can see it...

alexmarcelo, June 26, 2011 at 7:26:58 PM UTC

It works fine with Chrome and Firefox...

Quazel, June 26, 2011 at 11:31:12 AM UTC

If these are words missing from Tatoeba, why don't they have a percentage equal to zero?

sysko, June 26, 2011 at 3:37:36 PM UTC

Sorry, I forgot to mention it: that's their frequency in the French language :)

Quazel, June 26, 2011 at 3:42:49 PM UTC

OK, I understand better now :)

MecklyBver, July 4, 2011 at 2:41:42 AM UTC

Hello everyone, I'm new to this website and I'd like to be useful in the future; for now I have to learn how this website works. merci. danke. thank you. 谢谢. yes. I love languages and I love people and cultures.

brauliobezerra, July 4, 2011 at 4:22:34 AM UTC

Welcome! You can take a look at the Contributor Guide to get started. It explains very well how we are building the corpus of sentences:

https://docs.google.com/documen...kE&hl=en&pli=1

And you can ask questions on this wall, if you wish, even in Spanish.

artyom, July 3, 2011 at 8:29:57 AM UTC

Hello,

Can somebody who knows Japanese tell me whether this sentence (868189)

生きるか死ぬか、それが問題だ。

is split correctly/acceptably into words (done by software):

"生", "きるか", "死", "ぬか", "それが", "問題", "だ".

Is it good/good-enough/not-good-at-all?

Artyom

Quazel, July 3, 2011 at 11:09:10 PM UTC

I'm not sure I understand what you want.
But if I had to split this sentence into words, I'd say:

生きる( alive ) - か ( or ) - 死ぬ ( dead ) - か - それ ( this ) - が ( particle ) - 問題 ( question/ problem ) - だ.

artyom, July 4, 2011 at 9:22:44 AM UTC

Thanks, so it seems that the software that did the text analysis
didn't do it well at all.
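
One way to put a number on "didn't do it well" is to compare word boundaries between the two segmentations quoted in this exchange. The Python sketch below is an illustration only: the function names are hypothetical, punctuation and Quazel's English glosses are dropped, and Quazel's split is treated as the reference.

```python
def boundaries(tokens):
    """Return the character offsets at which one token ends and the next begins."""
    offsets, position = set(), 0
    for token in tokens[:-1]:
        position += len(token)
        offsets.add(position)
    return offsets


def boundary_agreement(candidate, reference):
    """Return (precision, recall) of the candidate's boundaries against the reference."""
    if "".join(candidate) != "".join(reference):
        raise ValueError("segmentations cover different text")
    cand, ref = boundaries(candidate), boundaries(reference)
    shared = cand & ref
    precision = len(shared) / len(cand) if cand else 1.0
    recall = len(shared) / len(ref) if ref else 1.0
    return precision, recall


software = ["生", "きるか", "死", "ぬか", "それが", "問題", "だ"]
human = ["生きる", "か", "死ぬ", "か", "それ", "が", "問題", "だ"]
precision, recall = boundary_agreement(software, human)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Four of the software's six internal boundaries coincide with the seven in the human split, so both scores are well below 1.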

Shishir, July 3, 2011 at 7:11:30 PM UTC

Hello,

can somebody tell me what has happened in this sentence?

http://tatoeba.org/eng/sentences/show/868539

According to the logs, the sentence has a different author and says a different thing in a different language...It's quite weird...

TRANG, July 3, 2011 at 10:18:07 PM UTC

It's... weird indeed.

Look at the logs for this one.
=> http://tatoeba.org/sentences/show/868540

I guess that happened because the two sentences were added almost at the same time, and somehow the logs for sentence 868539 and sentence 868540 have been swapped...

NomadSoul, July 2, 2011 at 12:11:22 AM UTC

Could a moderator delete my sentence Number 970141, please ?
It's a duplicate.
Thank you.

Zifre, July 2, 2011 at 3:28:02 AM UTC

It looks like this has already been taken care of, but in the future, don't worry about it. Duplicates are merged periodically by a script. You just have to wait awhile.

And welcome to Tatoeba! :-)

NomadSoul, July 2, 2011 at 3:29:25 AM UTC

Thanks :D !

Heinz, July 1, 2011 at 10:39:38 AM UTC

Hi everyone
Underneath Japanese sentences there is usually a second version where the kanji and digits have furigana (pronunciation) on top. It seems the second version was generated by a program, because when it comes to digits (numbers) I see a few mistakes.
Heinz

Quazel, July 2, 2011 at 1:42:22 AM UTC

An example of these mistakes?

Heinz, July 2, 2011 at 12:31:25 PM UTC

Hi!
First I must say, I really appreciate the automatic generation of furigana (pronunciation) of Japanese text. Also it works most of the time very well.
It must have taken a lot of programming effort to do this.

But most numbers are treated as if they were phone numbers.
That is, each digit is treated as a one digit number.
Which is used for actual phone numbers, but wrong for numbers with multiple digits.
In English and in Japanese we are supposed to give the weight for each digit by indicating if they are tens, hundreds or thousands.

For example in English we say: my phone extension is 2468 (two, four, six, eight),
but we say: this is $2468 (two thousand four hundred and sixty eight Dollars).
The Japanese do the same.

As an example I tried two sentences.
Sentence nº971299
私の電話番号は2468です。(に、よん、ろく、はち)
Sentence nº971300
これは2468ドールです。
Should be but isn’t(にせん よんひゃく ろくじゅう はち)
In the old days this was written as: 二千四百六十八which makes the proper pronunciation obvious.

Here is another example.
Sentence nº127002
地下鉄運賃が2OO1年4月1日からほぼ11%値上げになります。
Here the year is wrong, 2OO1年(にOOいちねん)
the month is right with 4月(しがつ),
but the date is wrong 1日(いちにち).
The いちにち pronunciation means “a whole day”.
Correct would be 1日(ついたち) which means “the 1st of the month”.

Let me know if you understand or disagree.
Thank you.

Cheers
Heinz
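
The distinction Heinz describes, digit-by-digit reading versus positional reading, can be sketched in Python. This is an illustration limited to 1 through 9999 with the common readings and their sound changes; it is not how Tatoeba generates furigana, and all names are hypothetical.

```python
# Common readings of the digits 0-9 (4, 7 and 9 also have alternative readings).
DIGITS = ["ゼロ", "いち", "に", "さん", "よん", "ご", "ろく", "なな", "はち", "きゅう"]
# Irregular forms where the sound changes; regular ones are digit + suffix.
HUNDREDS = {1: "ひゃく", 3: "さんびゃく", 6: "ろっぴゃく", 8: "はっぴゃく"}
THOUSANDS = {1: "せん", 3: "さんぜん", 8: "はっせん"}


def read_digit_by_digit(number):
    """Read a number the way a phone number is read: one digit at a time."""
    return "、".join(DIGITS[int(d)] for d in str(number))


def read_positional(number):
    """Read a number from 1 to 9999 with thousands, hundreds and tens."""
    if not 1 <= number <= 9999:
        raise ValueError("this sketch only covers 1..9999")
    thousands, rest = divmod(number, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if thousands:
        parts.append(THOUSANDS.get(thousands, DIGITS[thousands] + "せん"))
    if hundreds:
        parts.append(HUNDREDS.get(hundreds, DIGITS[hundreds] + "ひゃく"))
    if tens:
        parts.append("じゅう" if tens == 1 else DIGITS[tens] + "じゅう")
    if ones:
        parts.append(DIGITS[ones])
    return "".join(parts)


print(read_digit_by_digit(2468))  # phone-number style
print(read_positional(2468))      # amount style
```

Dates such as 1日 (ついたち) are lexical exceptions that a rule like this cannot cover; they need a dictionary entry.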

Quazel, July 2, 2011 at 6:47:32 PM UTC

Yeah, you're right indeed, but it might actually be quite hard for a program to generate good furigana.
Especially for counters (although I doubt they use their own numbers like 一つ>1つ or 二人>2人).

The best we can do now is ask for a correction of the furigana :/

JimBreen, July 3, 2011 at 4:11:03 AM UTC

I've commented about this before, so a few quick comments:
(a) the fully automatic generation of furigana, currently done by the MeCab morphological analyzer, is about as good as you are going to get without some manual intervention.
(b) it could be improved by changing lexicons - from the IPADIC to UniDic. This gives a finer-grained analysis and from my testing will eliminate some, but not all of the errors.
(c) the best fix will be to continue with the MeCab approach, but allow for an override, where people can add kanji/etc.->furigana sequences such as (2468,にせんよんひゃくろくじゅうはち) which will take precedence over the MeCab version.
I'm not sure this is a high priority, but I think if we keep on with the furigana, it's worth making sure it is correct, or can be corrected.
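
The override idea in (c) can be sketched in Python as a lookup that takes precedence over the automatic reading. The function names are hypothetical, and the fallback below is only a stand-in that reads digits one by one; it does not call MeCab.

```python
def reading_with_overrides(surface, overrides, automatic_reading):
    """Prefer a human-supplied reading; otherwise use the automatic one."""
    if surface in overrides:
        return overrides[surface]
    return automatic_reading(surface)


def digit_by_digit(surface):
    """Stand-in for the analyzer: reads each digit separately, as in the thread."""
    names = {"0": "ゼロ", "1": "いち", "2": "に", "3": "さん", "4": "よん",
             "5": "ご", "6": "ろく", "7": "なな", "8": "はち", "9": "きゅう"}
    return "".join(names.get(ch, ch) for ch in surface)


# Human-entered corrections; the key is the surface text, the value its reading.
overrides = {"2468": "にせんよんひゃくろくじゅうはち"}

print(reading_with_overrides("2468", overrides, digit_by_digit))  # override wins
print(reading_with_overrides("2469", overrides, digit_by_digit))  # falls back
```

Keeping the overrides in a separate table means the automatic output can be regenerated (for example after a lexicon change) without losing manual corrections.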

Scott, July 3, 2011 at 9:39:51 PM UTC

I wonder, does either UniDic or IPADIC make use of EDICT?

JimBreen, July 4, 2011 at 1:53:53 AM UTC

Short answer: no. (They are morpheme dictionaries, whereas EDICT et al. deal in lexemes. There is some intersection between the two.)

Longer answer: EDICT material was used a bit in the later revisions of IPADIC. I don't know in the case of UniDic; probably not.

CK, July 3, 2011 at 7:06:57 AM UTC, edited October 30, 2019 at 5:23:17 AM UTC

[not needed anymore- removed by CK]

Quazel, July 3, 2011 at 11:13:22 PM UTC

In a way, why use furigana in Tatoeba if you have these tools?
Well, if you're on another computer, it's good to be able to see furigana without installing something.

I wonder: what if the creator of a sentence could change the furigana himself?
And by the way, that might "force" people to adopt Japanese sentences; a bunch of them are orphans ^^

Heinz, July 6, 2011 at 11:00:21 AM UTC

I just tested the Japanese language tool “IME” of Microsoft Word to see how it copes with numbers.
Using the test sentence: これは2468ドールです。
IME can convert this sentence into: これは二千四百六十八ドールです。
And if I ask for Hiragana it can convert the second one into:
これはにせんよんひゃくろくじゅうはちドールです。
I am really impressed.
Heinz

brauliobezerra, July 1, 2011 at 5:18:05 AM UTC

Tatoeba Racer has been updated: http://race.braulio.net.br/

alexmarcelo, July 1, 2011 at 10:05:28 PM UTC

Cool, Bráulio. It's even more interesting with the translations!