Wall (7,120 threads)

✹✹ Stats & Graphs ✹✹
Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/

Tatoeba website has been running very slowly since yesterday's shutdown. Do any of you have similar experience?

It is slow for me too.

Yes, it's been slow for me too.

Not just slow. Mostly I get the message „Tatoeba is currently unavailable. We are sorry for the inconvenience. You can check our blog or Twitter for more information.”
But neither the blog nor Twitter gives any information. Yesterday I barely managed 10% of the work I usually do in a day.

There was a little problem indeed! Tatoeba should run smoothly now.

It's running nice and snappy now 🤭. Thank you.

I think adopting and unadopting sentences doesn't really work.
One has to beg for one's unadopted sentences to be adopted.
(I suppose that's why the sentences aren't checked.)
And I think unadopted sentences simply remain largely unchecked.
(As for me, I don't like adopting sentences added by others.)

I don't understand, Maaster. I regularly adopt and correct sentences in „my languages”. If an author unadopts sentences, there is no need for anyone to beg to have them changed. You can use a simple link to find them all:
https://tatoeba.org/eo/activiti..._sentences/epo
(change "epo" to the code of your language).

Some sentences have this info "This sentence is original and was not derived from translation."
Is this information anywhere in the downloadable data?
Thank you!

It's in the sentences_base file.

You're right. It's right there. Sorry, I just didn't see it.
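
For anyone else looking for this in the downloads, here is a minimal sketch of reading that file. It assumes (this is not confirmed anywhere in this thread) that sentences_base is a two-column, tab-separated file of sentence id and base, where a base of 0 marks an original sentence, a positive number is the id of the sentence it was translated from, and \N means unknown. The function names are made up for illustration.

```python
import csv

def parse_bases(lines):
    """Map sentence id -> base value from tab-separated 'id<TAB>base' lines.

    Assumed meaning of the base column (not confirmed in this thread):
    0 = original sentence, positive = id of the source sentence, \\N = unknown.
    """
    bases = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        sentence_id, base = row
        bases[int(sentence_id)] = None if base == r"\N" else int(base)
    return bases

def is_original(bases, sentence_id):
    """True only when the file explicitly marks the sentence as original."""
    return bases.get(sentence_id) == 0
```

To read the downloaded file itself, pass an open file handle, e.g. `parse_bases(open(path, encoding="utf-8", newline=""))`, where `path` is wherever you saved it.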

Is the format of the transcriptions (Japanese, if that makes any difference) explained anywhere? (Nothing in the Wiki, afaik.)
I found three different cases (there may be more):
A: [Kanji|Reading], which makes sense
B: [Kanji1Kanji2|Reading1|Reading2], which is probably short for [Kanji1|Reading1][Kanji2|Reading2]
C: [Kanji1Kanji2|Reading], which probably means the two kanji combined have this reading
Is this correct?
And can I expect to always find something that fits A, B, or C?
That is, can I expect to *never* find something like [Kanji1Kanji2Kanji3|Reading1|reading2], i.e. unequal numbers of kanji and readings? (In that case, how would I know whether Reading1 belongs to Kanji1Kanji2 or just Kanji1?)
I hope my ad-hoc syntax makes sense.

I assume you're asking this question because you want to transform the data programmatically (otherwise you could just handle edge cases whenever you encounter them). If my assumption is correct, it might be easiest to look at Tatoeba's own code for Japanese transcriptions. (Note that Tatoeba is AGPL-licensed, in case that's an issue for you.)
The validation code for user-provided furigana is here: https://github.com/Tatoeba/tato...ption.php#L220 but I think it might not apply to those that are generated automatically using MeCab.
The test cases might also be helpful: https://github.com/Tatoeba/tato...onTest.php#L27
If you just want to display furigana using HTML <ruby> tags, our code for that is here: https://github.com/Tatoeba/tato...naTrait.php#L9 To be honest, it's not written in an easily readable manner, but I think it basically assumes, without validation, that there are at least as many kanji as there are readings; if a kanji has no reading (|| or end of list), it is merged with the preceding kanji until the numbers are equal.
So [Kanji1Kanji2Kanji3|Reading1|reading2] would be equivalent to [Kanji1|Reading1][Kanji2Kanji3|reading2], I think.
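
To make the merging rule concrete, here is a rough Python sketch of the behaviour as described above. It is not a port of Tatoeba's PHP code, and it may well differ from it in edge cases; the names `split_group` and `to_ruby` and the regular expression are my own. It treats every character before the first | as one kanji, and it rejects groups with more readings than kanji, which the description above does not cover.

```python
import re
from html import escape

# One bracketed group such as [漢字|かん|じ]. The pattern follows the ad-hoc
# syntax used in this thread; it is an assumption, not Tatoeba's own grammar.
GROUP = re.compile(r"\[([^\[\]|]+)\|([^\[\]]*)\]")

def split_group(kanji, readings):
    """Pair each kanji with a reading, merging kanji that have none.

    A kanji whose reading is empty (|| or past the end of the list) is
    appended to the preceding kanji, as described in the post above.
    """
    if len(readings) > len(kanji):
        raise ValueError("more readings than kanji")
    padded = readings + [""] * (len(kanji) - len(readings))
    pairs = []
    for char, reading in zip(kanji, padded):
        if reading == "" and pairs:
            base, previous_reading = pairs[-1]
            pairs[-1] = (base + char, previous_reading)
        else:
            pairs.append((char, reading))
    return pairs

def to_ruby(text):
    """Replace every bracketed group with HTML <ruby> markup.

    Text outside the brackets is passed through unchanged (and unescaped).
    """
    def render(match):
        pairs = split_group(match.group(1), match.group(2).split("|"))
        return "".join(
            f"<ruby>{escape(base)}<rt>{escape(reading)}</rt></ruby>"
            for base, reading in pairs
        )
    return GROUP.sub(render, text)
```

With this sketch, `split_group("ABC", ["x", "y"])` gives `[("A", "x"), ("BC", "y")]`, which matches the equivalence stated above, and `to_ruby("[今日|きょう]は")` gives `<ruby>今日<rt>きょう</rt></ruby>は`.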

Yes, ruby is a good example. This looks good, thanks!
I will take a look at the code, especially the one where it handles unequal numbers of Kanji and readings.

When you search on Tatoeba.org, it only shows 1000 results. That is, it shows a maximum of 10 pages. It says the total number of results, but it only shows 1000. How can I fix this?

This is a technical limitation to avoid overloading the server.

I also wonder if this limit of 1000 sentences is too low.
I use this feature to find recently added sentences (in German) and sometimes the last 1000 sentences don't even cover one day.
The limit doesn't apply to the sentences of specific users. Some of them own a huge number of sentences (> 700,000).
Currently, displaying or even re-sorting these is reasonably fast.


While searching for a solution to my problem with text on Tatoeba being displayed in italics, I tried downloading another browser. From what Google offered me, I chose Brave. To my great surprise, the site displays normally in it; the italics are gone.
It seems that something went wrong with the settings of my usual browsers (Edge, Google Chrome, even Firefox), resulting in the issue.

There is a shield in the Brave browser. If it is on, the text shows normally. However, if I disable it, the italics appear there too. Very strange.

https://support.brave.com/hc/en...while-browsing indicates that the shield combines various blocking features that you can also toggle individually using the advanced controls. My guess is that you have a non-standard system font that shows up as italics and the font fingerprinting protection in Brave, when enabled, is preventing the browser from loading it.
In Firefox, by right-clicking the italic text and selecting "Inspect", you should be able to open a panel with three columns, the rightmost of which shows "Layout" initially, but one of the other options is "Fonts", which should show you which font is being used.

Thank you for your insight, Yorwba. I checked that and found out that the Noto Sans Italic font is used. If I disable it, the site displays normally. However, it works only temporarily, until the next launch. I just need to figure out how to change it permanently.
I'm glad to understand the problem and for now, I'm ok with running Tatoeba on Brave.

Hi,
I'm reaching out to inquire about importing thousands of bilingual English-Santali sentences into the Tatoeba database. I have a large collection of sentences in two languages that I'd like to contribute to the platform. Could you please provide guidance on the recommended format for preparing the sentence files, the process for uploading them to the database, and any specific requirements or guidelines for ensuring data quality and consistency? I'd greatly appreciate any assistance or documentation to help me import my sentence collection efficiently.
Thanks
Prasanta Hembram

Hello, this sounds awesome, but Tatoeba does not support mass import of sentences just yet. This is because we lack the resources to implement a proper import system. If you know how to program, you are welcome to contribute such a system. If you know anybody who is willing to implement an import system, you can ask them. If you want to get notified about any progress on that matter, you can mention your interest on this GitHub issue thread https://github.com/Tatoeba/tatoeba2/issues/1762
As for importing sentences in general, you should care about the license of the data you want to contribute. It should be legal to re-use the data, as Tatoeba will publish it under Creative Commons CC-BY.
As for the data quality, the sentences should follow these rules https://en.wiki.tatoeba.org/art...h-explanations There are no particular expectations in terms of consistency, because Tatoeba already receives contributions from various people who are not really following any consistency guidelines.
As for the data format, since we don’t have the tool to import just yet, there is no requirement yet, but I think CSV or TSV should be okay.
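
If it helps, here is a small sketch of how sentence pairs could be written out as TSV in the meantime. Since there is no import tool and no required format, everything about the layout is an assumption: the function name, the two columns, and the header row using the language codes eng and sat are only one possible choice.

```python
import csv
import io

def write_pairs_tsv(pairs, out):
    """Write (English, Santali) sentence pairs as tab-separated lines.

    The two-column layout and the 'eng'/'sat' header are hypothetical;
    Tatoeba has not specified any import format.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["eng", "sat"])
    for english, santali in pairs:
        # Strip stray whitespace so each cell holds exactly one sentence.
        writer.writerow([english.strip(), santali.strip()])

# Example with placeholder text instead of real Santali sentences.
buffer = io.StringIO()
write_pairs_tsv([("Hello. ", "<Santali translation>")], buffer)
```

Keeping one pair per line and one sentence per cell should make it easy to convert the file later, whatever format an import system ends up needing.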

Hi, @gillux. Thank you for the information. I have some basic programming knowledge, but I'm not confident in my ability to contribute to the development of an import system. Will refer someone. I think for now, only admins can do mass import and is used rarely ?? and only way to contribute right now is to add/translate sentences one by one.

> I have some basic programming knowledge, but I'm not confident in my ability to contribute to the development of an import system.
I think that creating an import system is a complex task, too. Not only on the technical level, but also on the social level, as one can see from the discussions on the GitHub issue page. I think that such an import system needs to be designed collaboratively, so you are more than welcome to share your ideas.
> I think for now, only admins can do mass import and is used rarely ??
Admins used to be able to do some kind of basic mass import, but, for technical reasons, not anymore.
> and only way to contribute right now is to add/translate sentences one by one.
That is correct.
