Wall (7,389 threads)
I'm having issues with search: my usual searches return little or nothing newer than about a week old, even when sorting by last created.
For example, this search is for my sentences that are in the "Translated by Tatoebans" list. The list was updated on Saturday, and I had sentences added to it (I verified this with the list interface), but only sentences from over a week ago show up in search.
https://tatoeba.org/en/sentence...rd_count_min=1
Hi rul, thanks for reporting this issue, I just had a look into it. I confirm there is a problem, I see a difference between what the search returns and what’s actually in the corpus.
It was actually a rare and temporary problem. We may fix the root cause in the future, but for now I have simply reindexed the affected sentences on the "Translated by Tatoebans" list. Your search should now return the latest sentences.
Thanks, but you seem to have made the problem worse. Now even the list view itself only shows sentences that are 9+ days old, and the actual list was supposed to have been updated 4 days ago. The sentences I had that were supposed to show up in search don't even show up on the view for the list at all anymore.
The list view is correct. I think it was not updated correctly this time.
I have compared the web server logs from when the list was updated on May 16th and May 9th. The exact same set of sentences was added and removed on those two days. In other words, the update on May 16th ran fine, but it had no effect because the sentences it added were already on the list and the sentences it removed were already not on the list.
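To illustrate why a second run with the same input changes nothing, here is a minimal sketch that models the list as a plain set of sentence IDs. The function name `apply_update` and the IDs are hypothetical; this is not Tatoeba's actual update code, only a toy model of the behavior described above.

```python
def apply_update(current, to_add, to_remove):
    """Apply one list update and report what actually changed.

    All arguments are sets of sentence IDs (hypothetical values).
    """
    added = to_add - current        # additions not already on the list
    removed = to_remove & current   # removals that are actually on the list
    new_list = (current | to_add) - to_remove
    return new_list, added, removed


# Toy data standing in for the list before the May 9th update.
list_before = {1, 2, 3}
to_add = {4, 5}
to_remove = {1}

# First run (May 9th in this toy model): the list changes.
after_first, added_1, removed_1 = apply_update(list_before, to_add, to_remove)

# Second run with the exact same input (May 16th): nothing changes.
after_second, added_2, removed_2 = apply_update(after_first, to_add, to_remove)

print(added_1, removed_1)   # effective changes of the first run
print(added_2, removed_2)   # both empty on the second run
```

Under these assumptions, the second run completes without error but has no effect, which matches what the logs showed.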
And the root cause is probably that the weekly exports did not run last week. o_O
EDIT: As it turns out, I didn’t pay attention as I was patching the kernel against the latest CVE, and I rebooted the server right when the weekly export started… Let’s just try again next Saturday and sorry for the inconvenience.
Alright, thanks. I seem to remember having seen more recent sentences in the list view, but I have no way of proving it, and I probably just misunderstood.
One of the items I keep on https://tatoeba.org/es/vocabulary/of/Seael is "hantavirus", but it says there are 0 sentences with it in Spanish even though I created #13902897, which contains it, two days ago.
Same with other words like "cataplasma"... OK, so let's wait until Saturday, then... Thanks!
@Seael Having to wait for Saturday is only required to get the list "Translated by Tatoebans" properly updated.
There is actually a bigger problem that’s currently preventing many sentences from showing up in search results. Even the sentences that I manually reindexed yesterday for @rul are not showing up any more. I am investigating the issue. https://github.com/Tatoeba/tatoeba2/issues/3291
https://tatoeba.org/zh-tw/sente...rd_count_min=1
The writing system currently used in the Minnan corpus is mixed. Pure Pe̍h-ōe-jī (Latin) sentences and pure Chinese-character sentences each make up about half of the corpus. I’ve been wondering whether it would make sense to split Minnan into two separate entries: one using only Pe̍h-ōe-jī, and the other using only Chinese characters. Another possibility would be to set either Pe̍h-ōe-jī or Chinese characters as an alternative script.
However, the problem is that there are currently no tools capable of converting Chinese-character Minnan into Pe̍h-ōe-jī, or vice versa. In addition, Minnan pronunciation is complicated, and most characters have multiple readings, which makes automatic conversion very difficult.
Another issue is that, in actual usage among the public, mixed writing combining Chinese characters and Pe̍h-ōe-jī is very common. I’m not sure how this situation should be handled, or which writing system for Minnan Tatoeba should ultimately include in its corpus.
Without knowing the language: would any problems arise if both scripts could be used on their own or mixed, given that the language itself does this?
The main drawback of mixing both writing systems is that it is difficult to search for words. You need to make two searches, one for each script, in order to find all potential sentences. And you need to know how to write the word in both scripts.
Good question. It could be interesting to know what led to the decision of using POJ only on Minnan Wikipedia https://zh-min-nan.wikipedia.org/wiki/Th%C3%A2u-ia%CC%8Dh
I guess we can have a transcription system where users manually enter alternative scripts (without autogeneration), as long as there is some way to validate the consistency between the sentence and the alternative script, as well as the consistency of the alternative script on its own.
As for mixing both scripts within a single sentence, I am not sure if we want to explicitly allow it. On the one hand, if it reflects real world usage of the language, it should be on Tatoeba. On the other hand, it makes it difficult to classify and search. Can you clarify if a typical sentence would include a majority of Chinese characters and just a few POJ, or if it's 50/50, or the opposite?
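For what it's worth, the proportion question could be measured on the existing corpus. Below is a rough sketch that counts Chinese characters and Latin letters in a sentence using Python's standard `unicodedata` module. The function names `script_counts` and `classify` are made up for this example, the sample strings are illustrative and not taken from the corpus, and counting letters is only a crude proxy for counting words.

```python
import unicodedata


def script_counts(sentence):
    """Count Han characters and Latin letters; everything else is ignored."""
    counts = {"han": 0, "latin": 0}
    for ch in sentence:
        name = unicodedata.name(ch, "")
        if name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")):
            counts["han"] += 1
        elif name.startswith("LATIN"):
            # Combining tone marks, hyphens and spaces are not counted.
            counts["latin"] += 1
    return counts


def classify(sentence):
    """Label a sentence as 'han', 'latin', 'mixed' or 'other'."""
    counts = script_counts(sentence)
    if counts["han"] and counts["latin"]:
        return "mixed"
    if counts["han"]:
        return "han"
    if counts["latin"]:
        return "latin"
    return "other"


# Illustrative strings only.
samples = ["我欲去台北", "Guá beh khì Tâi-pak", "我 beh 去台北"]
for s in samples:
    print(classify(s), script_counts(s), s)
```

Running something like this over the exported sentences would at least show how many are currently pure Chinese characters, pure POJ, or mixed.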
For your information, Tatoeba already has quite a few languages that use two scripts without automatic conversion. Just to name a few:
- Arabic/Latin:
https://tatoeba.org/sentences/s...ta&sort=random
https://tatoeba.org/sentences/s...hg&sort=random
- Cyrillic/Latin
https://tatoeba.org/sentences/s...fn&sort=random
https://tatoeba.org/sentences/s...rp&sort=random
Some contributors add both scripts and link both sentences to one another. Not ideal, and not officially recommended, but it helps with searching.
The Minnan Wikipedia was founded in 2004, at a time when there was no unified standard for writing Minnan in Chinese characters — the situation was fairly fragmented, with no academic consensus. Pe̍h-ōe-jī, developed by Western missionaries, offered a consistent and well-documented alternative, which is likely the main reason the Hokkien Wikipedia adopted it as its sole writing system.
The drawback, however, is significant: the vast majority of Minnan speakers have never learned POJ, and writing in Chinese characters has been the traditional practice for centuries.
In 2009, Taiwan's Ministry of Education published an official recommended character set for Taiwanese Minnan, which remains in use by government institutions today. That said, it has attracted considerable debate in academic circles, and many are reluctant to adopt it. This is partly because Minnan speakers are not exclusively Taiwanese — speakers in other regions have no particular reason to accept a Taiwan-specific standard — and even within Taiwan itself, the standard remains contested.
There have been proposals to add Chinese-character articles to the Minnan Wikipedia, but nothing has come of them, largely for the reasons above.
As for how mixed writing works in practice: most core Minnan vocabulary is of Sinitic origin, and for these words the character spellings are generally uncontroversial. The disputed cases — function words, sentence-final particles, and loanwords — are where some writers switch to POJ.
The proportion varies considerably and is hard to generalize: a given sentence might consist almost entirely of Sinitic vocabulary, or it might be dominated by particles and loanwords, in which case the romanized portion would be much larger.
https://tatoeba.org/zh-tw/sente...f_user/tsunhua
BTW, 472 of 482 Minnan sentences are by tsunhua. And it seems that they add one sentence in Chinese characters and then add the same sentence in POJ for every sentence they add.
Copied from the Tatoeba group on Telegram (because I think that no admin is still active on Telegram):
"When if ever Tatoeba will be able to capture audio directly on site?"
@gillux is really the only person who can answer that question directly. However, I just want to insert my opinion that even if someday we are able to capture audio on the site, it shouldn't be made audible until it has been reviewed. Audio differs from text in two ways:
(1) A native speaker can spot errors in text virtually immediately, but listening to audio takes time.
(2) Audio can suffer from background noise and lack of clarity, which is not a factor with text.
I can easily imagine a large number of submissions of poor quality damaging the usefulness of the site if they "went live" immediately.
> "When if ever Tatoeba will be able to capture audio directly on site?"
This year hopefully!
I think I was waiting to get something working before making any announcement here, but since you brought up the topic, let me explain what has been going on behind the scenes.
I have been in contact with Hugo from Lingua Libre for some time on a different channel. Hugo was part of the Shtooka project back in the day, along with its creator Nicolas. Later, they worked on a cloud version, which was rebranded as Lingua Libre. So essentially Lingua Libre and Shtooka are just the same piece of software at two different points in time.
A quick introduction to Lingua Libre. Lingua Libre [1] is the name of a recording tool, but also a Wikimedia France project used to gather recordings. It uses Wikimedia Commons for audio storage, and members of the Wikimedia community help connect with native speakers to have their voice recorded. The recordings are used by other Wikimedia projects such as Wiktionary or Wikipedia, so they mainly focus on recording words.
I connected with Hugo in early 2025. Hugo was actually astonished to learn that Tatoeba audio contributors still rely on the good old Shtooka. We quickly figured out there was room for collaboration. Lingua Libre has a strong recorder and is starting to support sentences in addition to words, but it needs open text content, so it could benefit from Tatoeba’s linguistically diverse corpus. Tatoeba lacks an easy-to-use recorder and audio support is rather basic, so it could benefit from Lingua Libre’s tooling and Wikimedia Foundation’s infrastructure and "aura".
In 2025, Hugo worked on a new version of the recorder that is easier for other projects to reuse. This new version has been in "beta test" for some time now and I think it will become their new official recorder soon.
At some point in late 2025, Hugo and I tried to apply for a Microsoft grant [2] to kickstart collaboration between Tatoeba and Lingua Libre, but our application got rejected. This means it will take more time to get things done but we will get there eventually.
After discussing with Trang and CK, I drafted an initial technical plan to allow using Lingua Libre to record Tatoeba sentences, and you can follow the progress on GitHub [3]. Basically, it is much harder to make two pieces of software collaborate than to develop everything in-house, but I believe it will pay off in the long run. Generally speaking, Tatoeba and Lingua Libre share common goals of creating open and diverse linguistic resources and preserving endangered languages, so I believe our projects and communities should be more connected and aware of what the other party is doing. I wish the recorder could be the first step in that direction.
[1] https://lingualibre.org/
[2] https://www.microsoft.com/en-us...-voices-in-ai/
[3] https://github.com/Tatoeba/tatoeba2/issues/3183
Wow! Great to hear that!
https://tatoeba.org/en/sentences/show/13898325
My sentence should be deleted. Entered in wrong place. Sorry.
Can't see how to delete it myself.
If you want your own sentence to be deleted, you can edit it and replace the text with DELETE.
I unlinked it from the Turkish. Now just copy and paste it to the right place.
I found that transcriptions can be edited on the All My Sentences page. However, there is no option to edit transcriptions on the individual sentence page.
Could you please consider adding a transcription editing option to the sentence page?
Currently, only advanced contributors can edit transcriptions.
https://zh-tw.wiki.tatoeba.org/...d-contributors
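I'm not quite sure what you mean about being able to edit transcriptions on "All My Sentences". Generally speaking, regular contributors cannot edit transcriptions.
Sorry, my English may have been unclear. I already know the rule that only advanced contributors can edit transcriptions. But I found that I can view my own sentences from my profile page, and there I can edit the transcription, the traditional/simplified conversion, and the furigana. I also looked at other people's pages, and I cannot edit the transcriptions of their sentences.
If regular contributors are not supposed to be able to edit transcriptions, does this count as a bug?
For example, I edited the furigana of this sentence.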
https://tatoeba.org/zh-cn/sentences/show/13880609
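That does seem to be the case.
If you are willing to describe the situation in more detail, I can report the issue on GitHub, or you can report it yourself: https://github.com/Tatoeba/tatoeba2/issues
However, I just checked, and this rule only states that advanced contributors can edit transcriptions of other members' sentences that have not yet been reviewed. It does not specify what permissions regular contributors have for editing the transcriptions of their own sentences. Personally, I think Tatoeba lets users edit the transcriptions of their own sentences by default.
So I think the feature itself is normal and reasonable. The entry point is just too hidden: you have to go to the sentences page from your profile to edit, and I only found it by accident.
So I stand by my original suggestion: I hope the platform can give this feature a more visible, easier-to-find entry point, or add clearer instructions (perhaps I have not yet read the existing documentation carefully), so that users at least do not have to find the transcription editing feature by trial and error.
Could you explain how a regular contributor can edit the transcriptions of their own sentences? I just tested it and did not find a way to edit transcriptions.
Go to your profile page → on the right side, above the "Send a message to xxx" option, there is the username → click it to expand → choose the "Sentences" option → if a sentence has a transcription (e.g. Chinese or Japanese), a pencil icon appears next to the transcription → click it to edit the transcription.
Other people's pages can be opened and viewed the same way, and indeed, as the platform rules state, I have no permission to edit the transcriptions of other people's sentences. On other people's pages the pencil icon is a lighter color, which indicates that editing is not possible.
I really cannot reproduce this. Whether it is a suggestion or a bug, it is best to report it on GitHub, where someone will handle system issues. You can report it yourself, or I can report it on your behalf: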
https://github.com/Tatoeba/tatoeba2/issues
New feature? word type specification
As far as I can see, there's no way to do this with the current adv search settings. Pls let me know if there is.
I'm working from Turkish to English and I want to search for "resmi" meaning official.
However, "resmi" can also mean painting or picture so those are being picked up too.
I just want the adjective, hence the feature idea.
Thx 🌸
PS I just realised doing the search in reverse finds only "official" but interestingly it also produces ENG sentences with no TUR equivalents.
This is not what you asked for, but you can search English sentences containing "official" with Turkish translations this way:
https://tatoeba.org/zh-tw/sente...rd_count_min=1
By setting "limit to" and specifying "Language:" in the Translation field, you can limit the search results to only show sentences with translation(s) in that language.
New feature? I cannot find this anywhere.
It would be handy to be able to save the search criterion.
See the "create search template" button in the advanced search.
Can CC BY-NC-ND 4.0 sentences be added to Tatoeba? If so, is there any admin here who would be willing to bulk add some sentences with this license to the database for me, if I provide a high-quality data file?
For context, I am looking at the following Palauan-English dictionary:
https://scholarspace.manoa.hawa...d3e754/content
I have been in contact with one of the maintainers of the website https://tekinged.com and (with his permission) have been transferring many of the site's volunteer-written Palauan sentences to Tatoeba. The site also has many sentences taken from this dictionary, and out of caution I have held off from adding those sentences to Tatoeba for now, but it would be cool if they could be added as well.
No, CC BY-NC-ND 4.0 content cannot be added to the Tatoeba Corpus.
Our content is re-distributed with a less restrictive license.
Note that even CC-BY content cannot be used because users of our downloadable data cannot properly give the required "BY" credit.