Wall (6,960 threads)
I'm going to run some maintenance operations on the database, so Tatoeba will be down for a few minutes. It shouldn't take long.
If the number of sentences decreases over the next few hours, that's normal: the duplicates are being removed.
Awesome, indeed. Noticed a pretty deep dip. Do you know how many duplicates it merged?
around 13 000.
Wonderful, just very wonderful :-(((
Sysko, you spend your time removing 13 000 duplicates, but it seems that so far you have not tackled the cause of all these duplicates. Some time ago you wrote to me that you wanted to write a program so that all translations linked via other languages would be linked directly, but I still can't see that. I check every week or so.
It is a pity that people translated 13 000 sentences - just to see them thrown away.
I am waiting. I would like to promote Tatoeba, but I won't write anything good about it while people translating sentences on Tatoeba see them thrown away afterwards.
Writing that new script actually took me 20 minutes of my own personal time. Moreover, the duplicates were already there, so handling them didn't delay the new version.
The main reason for the delay is that my personal life is __really__ busy. I'm really sorry that I can't work on Tatoeba as much as I would like, but I have job obligations too :(
A gigantic waste of time
If it is two minutes per translation, then it's 26 000 minutes for 13 000 translations, or 433 hours of work by Tatoeba contributors thrown away, more than two months' work.
How about thinking about a rapid solution of the problem?
> A gigantic waste of time
I would like to suggest another point of view about this :)
Imagine we would do this instead: whenever a user adds a translation or new sentence, we check whether it already exists or not, and if it exists, instead of adding it, we just add the necessary links. Would you feel this is as much of a waste of time?
I'm going to guess not so much (at least I would feel that way). But well, in both cases, people still have to spend time typing their translations... The duplicate removal script leads to the same kind of result as what I described, except it does it with some delay.
You also have to take into account that many (...most? ...all?) members in Tatoeba translate for practice and/or for learning. I can't speak in the name of everyone, but I can say at least that as far as I'm concerned, I wouldn't feel I've wasted my time if 10% or more of my contributions happened to be duplicates and were "merged". Because when I contribute... or rather contributed, I also got to practice and to learn a lot, and it was much less boring than doing typical translation homework for my language classes.
Of course, if others don't feel that way, I can perfectly understand, and they are free to stop contributing (they even should), until we release the new version. I know it's been a while since we've talked about it, but it's a pretty big task and it's not surprising that it's taking that long.
Now, regarding your question of a quick solution. Well, the duplicate removal script was the quick solution.
Another "meantime" but less quick solution is to get more people to link sentences... but I think people have much more fun translating than linking.
Someone could also try making lists of "sentences that could potentially be linked but are not linked". All the information needed to make such a list is here:
http://tatoeba.org/eng/download...mple_sentences
And from there we could probably do something to display the lists in some way and get people to create links more quickly.
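A minimal sketch of how such a list could be built, assuming the link data has already been loaded as pairs of sentence ids (the ids and the `candidate_links` helper are made up for illustration; the real export format is not shown here). It reports pairs that share a translation but are not linked to each other:

```python
from collections import defaultdict
from itertools import combinations

def candidate_links(links):
    """Return sorted pairs of sentence ids that share a translation
    but are not directly linked to each other."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    candidates = set()
    for linked in neighbours.values():
        # Any two sentences linked to the same pivot are candidates,
        # unless a direct link between them already exists.
        for x, y in combinations(sorted(linked), 2):
            if y not in neighbours[x]:
                candidates.add((x, y))
    return sorted(candidates)

# Hypothetical ids: 1 is linked to 2 and 3, and 3 is linked to 4.
print(candidate_links([(1, 2), (1, 3), (3, 4)]))  # [(1, 4), (2, 3)]
```

A real list would probably also need to drop pairs in the same language and would still need a human to confirm each link, since sharing a translation does not guarantee that two sentences mean the same thing.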
And we still need a feature that allows multi-link translations, so that those who speak more than two languages can add a translation that links to sentences in more than one language.
Finally, regarding "How could we help you with your personal life?", we simply (or not so simply) need to get more people involved in the project as a whole. As Swift mentioned, I started writing about the matter this weekend: http://blog.tatoeba.org/2011/04...s-to-help.html
And I will be writing more...
Anyway, I hope that gives you a better picture :)
Out of curiosity, how much time would it take to do something as brute as run a similar_text() comparison between a new translation and all the existing sentences already in the database? Really, really long, I'm guessing?
I did it for you. Here is the raw query (so not even counting the processing of the result):
select 1 from sentences where text = "I love you."
1 row in set (1,24 sec)
So with 2000 contributions a day, you'd lose about an hour a day purely on checking. Moreover, it does not solve the problem that you still need to type the sentence before finding out that it already exists.
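As a rough back-of-the-envelope check of that estimate, here is the arithmetic using only the two figures quoted above (1.24 seconds per lookup, 2000 contributions a day). The function name is just for illustration:

```python
QUERY_SECONDS = 1.24          # timing quoted above for one exact-match lookup
CONTRIBUTIONS_PER_DAY = 2000  # daily volume quoted above

def daily_check_minutes(query_seconds, contributions):
    """Total server time per day spent on one lookup per contribution."""
    return query_seconds * contributions / 60

minutes = daily_check_minutes(QUERY_SECONDS, CONTRIBUTIONS_PER_DAY)
print(f"{minutes:.0f} minutes per day")  # 41 minutes per day
```

That is somewhat under an hour, but in the same ballpark, and it only covers the exact-match query, not a fuzzy comparison such as similar_text().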
Could you add a column to the table with an MD5 hash? It would take a long time to generate for all existing sentences but it would greatly speed up searching for duplicates.
Yup, we can. Actually MD4 would be enough, as we're not looking for a secure hash. It's just that I decided to use the little time I have to develop the new version, and I didn't realize a few months ago that there was an inherent, tricky bug in MySQL when the previous script was run under high load and concurrency.
Now, with this script, which will be run at least once a week and does not require modifications to the current PHP code, I think the result will be more or less the same as doing the check in real time with a hash (as I said to FeuDRenais, this kind of solution still requires the user to type and validate the sentence).
But if Trang or someone else wants to code it, sure.
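For anyone who wants to try, here is a minimal sketch of the hash-column idea. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the table layout, the `lang` column, the index name and the helper functions are all assumptions made for the example, not Tatoeba's actual schema or code:

```python
import hashlib
import sqlite3

def text_hash(text):
    """MD5 digest of the sentence text; used for fast lookup, not security."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences ("
    "id INTEGER PRIMARY KEY, lang TEXT, text TEXT, text_hash TEXT)"
)
# The index lets the lookup jump to matching hashes instead of scanning all rows.
conn.execute("CREATE INDEX idx_sentences_hash ON sentences (lang, text_hash)")

def find_duplicate(conn, lang, text):
    """Return the id of an identical sentence in the same language, or None."""
    row = conn.execute(
        "SELECT id FROM sentences WHERE lang = ? AND text_hash = ? AND text = ?",
        (lang, text_hash(text), text),
    ).fetchone()
    return row[0] if row else None

def add_sentence(conn, lang, text):
    """Insert the sentence unless it already exists; return its id either way."""
    existing = find_duplicate(conn, lang, text)
    if existing is not None:
        return existing
    cur = conn.execute(
        "INSERT INTO sentences (lang, text, text_hash) VALUES (?, ?, ?)",
        (lang, text, text_hash(text)),
    )
    return cur.lastrowid

first = add_sentence(conn, "eng", "I love you.")
again = add_sentence(conn, "eng", "I love you.")
print(first == again)  # True: the second call reuses the existing row
```

The query still compares the full text after the hash match, so a hash collision cannot merge two different sentences. As noted above, this does not save the contributor from typing the sentence first.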
That's good speed though. If you partitioned by language and string length it probably would go fast enough not to bother a regular user too much.
But I agree that it doesn't really make sense if the duplicate script is going to do the job anyway.
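To make the partitioning idea concrete, here is a small sketch that buckets sentences by language and length, so a fuzzy comparison only runs against sentences of similar size. It uses Python's difflib.SequenceMatcher as a stand-in for PHP's similar_text(); the `slack` and `threshold` values are arbitrary assumptions:

```python
from collections import defaultdict
from difflib import SequenceMatcher

buckets = defaultdict(list)  # (lang, length) -> sentences of that length

def add(lang, text):
    buckets[(lang, len(text))].append(text)

def similar(lang, text, slack=2, threshold=0.9):
    """Return stored sentences in the same language whose length is within
    `slack` characters and whose similarity ratio reaches `threshold`."""
    hits = []
    for length in range(len(text) - slack, len(text) + slack + 1):
        for other in buckets.get((lang, length), []):
            if SequenceMatcher(None, text, other).ratio() >= threshold:
                hits.append(other)
    return hits

add("eng", "I love you.")
add("eng", "I love you!")
add("eng", "This sentence is far too long to be worth comparing.")
add("fra", "Je t'aime.")
print(similar("eng", "I love you."))  # ['I love you.', 'I love you!']
```

The long English sentence and the French one are never compared at all, which is where the time saving would come from.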
The problem is that it does not affect only you but everyone. And on a server running at more than 100% of its capacity, it means one hour less for doing other stuff.
Because it does not mean "I and only I am going to wait 1.24 seconds more" but "for 1.24 s the server will do nothing except look for my sentence".
> A gigantic waste of time
Are you joking, or are you seriously criticizing sysko? Because I don't think this is a fair argument to criticize by.
You could also say things like "if everyone gave a dollar a day to such and such charity, starvation in such and such country would not be a problem". And it'd be true (maybe), and you could make similar arguments for lots of situations. But sysko isn't a machine, and I don't think he thought of optimizing Tatoeba to within 1% global accuracy when he coded it.
In short: Thanks for all the work, sysko. I don't think 13,000 duplicates is as catastrophic as some people are making them out to be. We waste many more minutes each day on far less productive things. "Work thrown away"? Not at all.
I also think that if we have, as is once again the case, a good deduplication procedure, then creating duplicates is not so serious, and it creates new links. I create duplicates myself for that purpose.
If contributors are upset that their sentences disappear, it is their responsibility to check that their sentences do not already exist in the database before inserting them. The contributor's guide states this clearly.
If a later version makes it possible to detect a duplicate as soon as it is created, that will be a plus, but it won't prevent the creator of a duplicate from having typed it for nothing anyway, since it has to exist in order to be detected... So the time spent creating it would be lost in any case.
The procedure of checking whether a sentence has already been translated into a given language is rather difficult. I have done it from time to time, but it takes a while and it's complicated. Has sacredceltic already done it a few times for certain sentences?
For me the solution is simple: I am not translating for the moment. And I no longer send messages about Tatoeba to the Esperanto mailing lists.
Besides, many of the sentences concerned are in Esperanto. I remember a list of more than 3000 duplicates in Esperanto. Plus, probably, a large part of the 13 000 sentences thrown away a few days ago. FeuDRenais's proportion should take the languages into account. How many Esperanto sentences have been thrown away so far?
> FeuDRenais's proportion should take the languages into account.
"All languages are equal on Tatoeba."
> Has sacredceltic already done it a few times for certain sentences?
I am not authorized to link translations, but when I had that ability, I made very extensive use of it between English, French, German, Spanish and... Esperanto!
Now I often create duplicates, because it is my only way to link sentences so that they don't end up in my lists of untranslated sentences, which annoys me more than anything...
I would rather have duplicates that will be merged later than keep finding the same stupid sentences in my to-translate lists forever...
Besides, to make a duplicate, you just copy and paste... It doesn't take me much more time, it's just a bit weirder...
+1
Also, (840000-13000)/840000 = 98.5% efficiency. I don't know what everyone's standards are, but that's pretty damn good, IMO.
(or is the total number of duplicates over all time >> 13000?)
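For reference, the figure above can be reproduced in a couple of lines, using the 840000 and 13000 from the post (the function name is just for illustration):

```python
def kept_share(total, merged):
    """Fraction of sentences that were not merged away as duplicates."""
    return (total - merged) / total

print(f"{kept_share(840_000, 13_000):.1%}")  # 98.5%
```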
>(or is the total number of duplicates over all time >> 13000?)
No, it's just the number since the last time a deduplication procedure was run, a few months ago...
I wonder what the worst case number is.
Let's just go nuts, and say that it's, I dunno, 80,000. So, roughly 10% of your contributions are duplicates.
I would say that *even then*, if you told me that 1 out of every 10 of my translations was a link instead of a brand new thing, I wouldn't throw a fit and start complaining about mass inefficiency. It's really not that big of a deal...
So the question would be: How could we help you with your personal life? :-) Or with your job? Or, more realistically: it's now at least six months since I began to write messages on the wall and to you and to Trang about the problem. If you don't have time to solve that, is there no other solution? Someone else who could do it?
To constantly throw away users' contributions is just not very kind to them. Do you and Trang really want to continue that periodically? Do you want me to put a message on the wall about the subject every week or so? What solution do you have besides waiting for Godot?
Well, there are two issues here: The scope of the problem and the solutions to it.
Regarding the scope, I don't think these 13 000 translations took an average of a minute to translate. Generally the duplicates are the sort of simple sentences that are likely to be added twice to the corpus. Something like "I'm waiting for Godot" rather than the longer sentences.
From the logs, I reckon it's closer to a few seconds. Still a considerable amount of time and I don't think anyone is terribly happy with the situation.
At the same time, as Muiriel and I have pointed out, should the same sentence be entered twice as A and A' and then subsequently translated to different languages B and C, then merging A and A' will leave a single sentence A, with links to both B and C where nothing is lost.
It's also possible that B and C are entered in different languages and each of these is translated into the same language with sentences A and A'. When A and A' are merged, the end result is again a sentence A with links to B and C, indirectly linking the two latter. The inefficiency only arises if it takes longer to translate a sentence than to find, verify and link two sentences that don't share a translation. I think that's rather uncommon.
The final case is when sentence A is translated into B, which is translated into C and then back into the original language with sentence A'. Again here, there is value in that last translation, as it effectively provides a link between C and A. With the current Tatoeba system, I think making a quick duplicate is actually the simplest way to link such sentences (even with Zifre's awesome Greasemonkey script).
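The cases above can be illustrated with a small sketch of what a merge does to the links. This is not the actual deduplication script; the set-of-pairs representation and the `merge_duplicates` helper are assumptions made for the example:

```python
def merge_duplicates(links, keep, drop):
    """Re-point every link of `drop` to `keep`.

    `links` is a set of frozenset pairs of sentence ids. Because the result
    is a set, a link that both duplicates already had is stored only once.
    """
    merged = set()
    for link in links:
        pair = frozenset(keep if s == drop else s for s in link)
        if len(pair) == 2:  # skip a keep/drop link that would become a self-link
            merged.add(pair)
    return merged

# First case: A was translated into B, and the duplicate A' into C.
links = {frozenset({"A", "B"}), frozenset({"A'", "C"})}
result = merge_duplicates(links, keep="A", drop="A'")
print(sorted(sorted(pair) for pair in result))  # [['A', 'B'], ['A', 'C']]
```

After the merge, A carries the links to both B and C, so the translation typed for A' is not lost, only its duplicate text.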
So, the problem isn't so great, in my mind, but it would still be nice to do something about it. The duplicate script currently takes a long time to run and slowed the server down considerably yesterday.
As Trang mentioned in her post from yesterday: http://blog.tatoeba.org/2011/04...s-to-help.html , there's still a while until collaborative work on the new version can begin. At the same time, anyone is free to take part in maintaining the current version or developing new features for it. One idea would be to create a "safe" contribution feature that would first check for related translations. This would perfectly fit translators who spend more time on translations.
But as Trang noted, there are loads of things related to Tatoeba that people can do, and the more people actively join the various "departments", the more pressure there will be on sysko to use his days off to code in some dingy little office rather than prance around the streets or hills in the fresh spring air. ;-)
Ough... I think I was going to conclude this by pulling these threads together into some nice conclusion, but this has been way too long and I need to get to my work so that I can enjoy some of the Icelandic spring (did I mention we had snow yesterday...?).
Great. Thanks for taking care of this!
Currently Tatoeba is awfully slow. Is this due to the duplication removal operation, or is there another issue?
Yup, the script is currently updating the logs. Sorry for the inconvenience. Next time it should be a lot faster, as this time the sentences hadn't been deduplicated for months.
Awesome!
How does this handle comments, links, etc.? Will links made by the system show up in the logs as if they were made by whoever made the link on the duplicate sentence that is being removed?
At the end (when I post the message saying it's finished :P), to put it simply, everything will be moved to the sentence that is kept.
So if you post a comment on a duplicate, it will be moved too; same for links. And in the history on the right, it will be shown as if all the links were made to only this sentence.
bug: the kept sentence appears twice in the link list
Not a bug; the script is not finished yet.
we will regenerate the link list/sentence list just after :)
Awesome job so far. Just wondering: is there a way to know which of the duplicate sentences one owned?
I just got a spam message from this user:
http://tatoeba.org/eng/user/profile/angel
Thanks for reporting it on the Wall. I also received it, and already told Trang about it :)
** A quick note on derivative works **
The Creative Commons licenses[1] popular with the kids these days all have an attribution clause which I understand we cannot honour. In the case of content from Wikipedia, one could link to the article and author list to satisfy the attribution clause[2] in the comments to the sentence in question, as well as each translation of it.
That would, however, only work for the corpus as it's accessible through the tatoeba.org interface; it wouldn't allow us to distribute these sentences (including the translations) in the downloadable CSV files in their current form.
Now, this isn't really such a big deal -- in particular if compared with the mountain of pain of tracking all the different licenses across the database. It just means that we have to come up with our own original sentences. We do that every day, so it shouldn't be too difficult. If it is, one can always use one's time to translate existing sentences.
Bottom line: Don't add anything with any sort of license. Just make your own sentences or translate others.
Now, most people reading this are probably well aware of the issue, but everyone should take a little bit of care when they come across a sentence which looks like it may be from a copyrighted source. It's a lot better if it gets caught soon, before anyone wastes time on translations that we're going to have to delete in the end.
For more on licensing issues, see Trang's blog post from January this year: http://blog.tatoeba.org/2011/01...d-content.html
[1] http://creativecommons.org/licenses/
[2] http://wikimediafoundation.org/wiki/Terms_of_Use
Swift, I've translated it into Spanish, because I think this is an important issue :)
Traducción en español del comentario de Swift:
La licencia Creative Commons [1] que es tan popular entre los niños de hoy en día tiene una cláusula de atribución que a mi entender no podemos cumplir. En el caso de contenido de la Wikipedia, se podría indicar la dirección del artículo y el autor para satisfacer dicha cláusula [2] dejando un comentario en la frase en cuestión, al igual que en cada traducción de dicha frase.
Sin embargo, solo funcionaría en el corpus siempre que se acceda a él por medio del interfaz de tatoeba.org; no permitiría distribuir estas oraciones (incluyendo sus traducciones) en los archivos CSV descargables en su forma actual.
En este momento no es un gran problema – especialmente si lo comparamos con el problema que sería tener que rastrear todas las diferentes licencias por toda la base de datos. Sólo significa que tenemos que crear nuestras propias frases originales. Lo hacemos cada día, de modo que no debería ser muy difícil. Si lo es, siempre se puede emplear el tiempo traduciendo frases que ya existan.
Por último: No añadáis nada que tenga alguna clase de licencia. Sólo cread vuestras propias frases o traducid otras que ya formen parte del corpus.
Ahora, la mayor parte de la gente que haya leído esto serán conscientes de este problema, pero todo el mundo debería tener un poco de cuidado cuando se crucen con una frase que parezca que pueda provenir de una fuente con derechos de autor (copyright). Es mejor que se encuentre cuanto antes, antes de que alguien pierda el tiempo haciendo traducciones que vamos a tener que acabar eliminando.
Para más información acerca de temas de licencia, véase la entrada de enero de este año en el blog de Trang:
http://blog.tatoeba.org/2011/01...d-content.html
Tatoeba could have a special way of inserting CC sentences with an attribution clause by including a field where one would enter the attribution.
I don't know if the Tatoeba admins are interested in doing this, but it could be possible.
I cannot tag my sentence. There is no input field in the "Tag" block. Is it because I'm a "user" and have no right to do so?
Yes, you have to be a so-called "trusted user". See this FAQ entry: http://tatoeba.org/eng/faq#add-tag
Thank you!!
Hi, there are already sentences in "asturianu" uploaded by Duernu http://tatoeba.org/spa/sentences/of_user/Duernu
If "asturianu" could be added to the list of languages available on Tatoeba... :)
http://upload.wikimedia.org/wik...turias.svg.png
Thanks in advance
http://tatoeba.org/eng/wall/sho...3#message_5873
http://tatoeba.org/eng/faq#new-language
Please follow the steps illustrated there :)
I already made the flag when Duerno asked for it.
In any case, you can keep adding sentences in Asturian if you put them in a list (you can create one here http://tatoeba.org/eng/sentences_lists/index).
Great. Thanks.
Hi! More questions over here...
I see that we have a main sentence with the translations below it, so at first I assumed the main one was the original. However, if I open one of those translations, the one that was the main sentence before becomes just another translation.
How can I tell which one is the original? In other words, if I see, for example, that two sentences don't match exactly, how do I know which one is mistranslated?
I hope I've explained myself...
Hi! You can tell first of all by looking at the number of the sentence you are viewing and of the one you saw before or will see afterwards. Also, the history on the right can help you find out which sentences were written first.
True, I hadn't noticed that they are numbered. Thank you very much!
How about adding IPA as a language?
Perhaps, in a far-flung postapocalyptic future, we can offer IPA transcriptions of our audio recordings.
Seeing that transcribing into IPA is a rather tedious task, I don't think adding IPA would clutter up the system too much. It would be more of a task for the linguists amongst us anyway. As far as different language varieties go, I don't see them as a problem. To my mind, having transcriptions in multiple varieties is an asset rather than a burden. On the other hand, there's no need for completeness. One transcription is still better than no transcription at all.
I think this might be better implemented as a separate layer, similar to, and possibly as an extension to, the current audio examples, but with some information about the dialect.
I agree.
Hi. I'm new here! I've been browsing the web for a few hours and I'm really impressed.
I come from meneame.net too (as a matter of fact, I was the one who uploaded the "meneo" XD).
Congratulations to everyone. I'm sure you will go very far :)
Is that a reddit-like site?
Yep, but in Spanish. http://www.meneame.net/
Actually, it's like Digg.
I came from meneame too.
Me too! (^_^)
> as a matter of fact, I was the one who uploaded
> the "meneo" XD
Oh, so it was you who almost made our server crash :P Just kidding, thank you!! :D
And welcome to all the new members who found us via meneame.net and who are reading this. I hope you guys will stick around :)
First of all, congratulations: I think this is an impressive project, and you have my admiration and applause. I'm a programmer and I've been thinking about a system like this for years. I reached the following conclusions, which I hope will be of some use:
1) Besides letting people enter sentences, it needs some automatic system to index the web, capture sentences, and propose them to users for translation (the most frequent ones, for example).
2) It needs an easy environment for translating long texts, which in the end will become a set of new sentences entered into the system.
3) Some kind of "metasentences", formal language or simple regular expressions, so that one can specify "I am a friend of <proper noun>".
4) Word categories. For example, the category <color>, which can be replaced by any of the colors.
5) Integration of a word translator that offers suggestions, and a dictionary for the terms.
6) Some metalanguage: a text obtained from the transcript or subtitles of a film is not the same as a sentence from a book. It would be very powerful if Tatoeba were the system used to translate subtitles.
7) Establish the concept of "Tatoeba ready": any text that has all its sentences described in at least one language and can immediately go through an automatic translation process. Is my tweet translatable into Russian or Japanese? Is a Wikipedia entry Tatoeba ready?
8) A system that lets you see, as you write, whether the system validates the text in one (or several) given languages. E.g.: I want my text to be translatable into Arabic but I don't know Arabic, and I accept fitting my sentences so that they are translatable into Arabic. This presupposes that in time Tatoeba will have millions of sentences.
9) Use case "feedback scheme": someone writes a text (newspaper, blog, etc.); once it is finished they enter it into Tatoeba and the system tells them which sentences already exist and which are missing for a complete translation (into English, for example). The user then enters the missing sentences, making the text translatable by Tatoeba. Afterwards they can publish the text on their website with a link to the translation that Tatoeba generates.
10) The same case can be used by a translator of a minority language: a translator wants to translate a text from the New York Times into Galician and only has to enter the missing sentences; in the end, anyone who accesses Tatoeba from the original text will enjoy the translation in Galician.
11) Some analysis of the origin of sentences. Sentences have their context. It would be interesting for famous quotes to show the author. If a sentence appears in a book by José Saramago it would be interesting to know it, as well as whether that sentence is used in South America but not in Spain, or is from the 15th century, or belongs to a song. Tags that identify colloquial, formal, polite language, etc.
12) Based on all this, Tatoeba could finally analyze a text and show some knowledge about it: "It is colloquial language, the sentences are very frequent, the terms are rarely used, etc."
13) Tatoeba as a container for the translation tables commonly used in programs, so that if I want to translate my program into Arabic I can take them from Tatoeba. Typical phrases like "Log out" or "Go to help" or "Restart program". In the end my program would only need to take an index table stored in Tatoeba to work in another language. Gnome as a system: Tatoeba READY?
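A rough sketch of what the "feedback scheme" in point 9 (and the "Tatoeba ready" check in point 7) might look like. Everything here is hypothetical: the naive sentence splitter, the `coverage` helper, and the idea that the already-translated sentences are available as a plain set of strings.

```python
import re

def split_sentences(text):
    """Very naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def coverage(text, translated):
    """Return (ready, missing): whether every sentence of `text` is in the
    `translated` set, and the list of sentences that are not."""
    missing = [s for s in split_sentences(text) if s not in translated]
    return len(missing) == 0, missing

# Hypothetical set of sentences that already have a translation in the target language.
translated = {"Hello.", "How are you?"}
ready, missing = coverage("Hello. How are you? I am waiting for Godot.", translated)
print(ready, missing)  # False ['I am waiting for Godot.']
```

The author would then only have to add translations for the sentences in `missing`. Real text would need a much better splitter and some normalisation of punctuation and spacing.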
Thanks for sharing your ideas, oscarpi :) I don't have much time right now, but I'll try to answer you this week.
I wrote to you in Spanish, since it said one could write in any language... Thanks for paying attention to me :-)
Good luck with your project.
(I've google translated it, shame on me), let's make it the first wall message "Tatoeba ready" :)
(really sorry that I don't speak a single word of Spanish)
I think your contribution is excellent. I hope at least some of your ideas can be implemented soon.
Hi! I have a question about creating sentences; I hope I'm asking it the right way and in the right place:
I have created three sentences in total; the first two are translations and the third is my own. Well, the last one doesn't appear anywhere except when I look at my profile. Can new sentences only be contributed after first contributing several translations? Or did I do something wrong?
Your third sentence ("Si fuera invisible no tendría que vestirme.") will appear in searches for, e.g., "vestirme", "invisible", etc. after a little while. The database is updated once a week, on Saturdays if I remember correctly; the update is not automatic yet :)
Ahh! Understood. Thank you very much!