Wall (5,773 threads)
Before asking a question, please read the FAQ.
We strive to maintain a healthy atmosphere for civilized discussion. Please read our rules against bad behavior.
9 hours ago
10 hours ago
I have a suggestion. Since native speakers are tagged, sentences added in a language by non-natives could automatically be put on a list that willing native speakers could review. What do you think of my idea?
What's New on Tatoeba? - Your weekly recap n°11
※ The Activity Timeline page has been reworked. Besides sentences added, it now also displays the number of sentences deleted, links added and links removed. Have a look :) https://tatoeba.org/fra/contrib...meline/2020/04 Thanks to all who were involved in the discussion and Aiji for implementing this.
※ Transcriptions have been added to the new design. Thanks to TRANG for her work.
※ The "Contact user" button has been moved out of the side menu of profile pages. Those pages will be reworked, but until then, that should help us contact each other more easily. Thanks to Ricardo14 for the idea and gillux for the modification.
※ AndiPersti continued his work to properly sanitize characters in Tatoeba's contributions, solving some issues due to special characters (or more precisely, preventing these issues from happening). Thanks to him for his work.
※ A bug where a reply to a message on the wall wasn't nested to the correct level until we refreshed the page has been corrected. Thanks to AlanF_US for reporting and Aiji for solving the issue.
※ And as usual, thanks to all our developers who added corrections, optimizations and all the little things that we don't see but that matter :)
ON THE WALL
※ Thanuir asked for example sentences using the words "good-man" and "good-wife" and discussed how compound words are handled on Tatoeba https://tatoeba.org/fra/wall/show_message/34707
※ Ibdx discussed adding more sentences containing words whose meanings are often looked up and built a list of words that are often searched for https://tatoeba.org/fra/wall/show_message/34645
CONTRIBUTIONS AND LANGUAGES
※ 16 048 sentences have been added this week.
※ On Julien_PDC's request, Old French has been added to Tatoeba, bringing the number of languages to 356! https://en.wikipedia.org/wiki/Old_French
As usual, thanks to Ricardo14 and gillux for coordinating this.
※ And as usual, thanks to all the members who helped translate the website.
If you'd like to help with the development of Tatoeba, report issues, or are just curious, have a look at the GitHub repository: https://github.com/Tatoeba/tatoeba2
If you want to help us translate the website to your language, you can join us on Transifex: https://www.transifex.com/tatoe...ite/dashboard/ and check this article on the wiki https://en.wiki.tatoeba.org/art...ce-translation
Fun fact: JoMo, Esperantist and polyglot, set the first Guinness record for a multilingual concert, where he sang 22 songs in 22 languages.
Last week recap: https://tatoeba.org/fra/wall/show_message/34686
See this recap on the blog: https://blog.tatoeba.org/2020/0...kly-recap.html
Take a look at our new detailed timeline.
This is the full month of March.
The following isn't as colorful, but I put the same data in a sortable table that you can play around with.
Also, just for fun, I merged all the timeline pages together.
You can see from March 31, 2020 back to September 9, 2007 all on this page.
To be relevant, a corpus of example sentences should cover the words that are of most interest in each language.
However, even in English, Tatoeba is still far from achieving this. For example, the word 'compliance' is one of the top 50 most searched for words on the Linguee online dictionary but only has 6 sentences on Tatoeba. These sentences are only translated into seven languages and none of them into languages as important as Spanish, Portuguese or French. At the same time, 'dog' is only the 9357th most searched word on Linguee but has more than 5000 sentences on Tatoeba.
So it seems that we contributors spend too much time creating and translating sentences that will never be read by non-contributing Tatoeba users. On the other hand, we probably don't pay enough attention to the words that are the most requested, such as 'relevant', 'scope', 'enhance' or 'furthermore'.
Perhaps a system that better balances the supply and demand of sentences could be set up. Starting from the most searched words on Linguee or Tatoeba, we could suggest the most underused words to contributors who would like to maximize their impact on the corpus enhancement. For example, for a given language, we could highlight the 100 most requested words that do not have at least 20 sentences translated in the ten most important languages of the corpus.
Has a similar proposal ever been debated? Do you think it would be worth implementing?
Data source: https://www.linguee.com/english...ish/1-200.html
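The selection rule proposed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `underused_words` and its parameters are hypothetical, the threshold is applied to a plain sentence count (not to "sentences translated into the ten most important languages"), and the `dog` count is a placeholder for "more than 5000".

```python
def underused_words(search_rank, sentence_counts, min_sentences=20, limit=100):
    """Return (rank, word, count) for the most requested words that have
    fewer than `min_sentences` sentences, best-ranked first."""
    result = []
    # Walk the words from most searched (rank 1) to least searched.
    for word, rank in sorted(search_rank.items(), key=lambda item: item[1]):
        count = sentence_counts.get(word, 0)
        if count < min_sentences:
            result.append((rank, word, count))
        if len(result) == limit:
            break
    return result


# Ranks and counts for "scope" and "compliance" are taken from the thread;
# the count for "dog" is a placeholder for "more than 5000".
search_rank = {"scope": 17, "compliance": 23, "dog": 9357}
sentence_counts = {"scope": 19, "compliance": 6, "dog": 5000}

for rank, word, count in underused_words(search_rank, sentence_counts):
    print(rank, word, count)
```

Raising or lowering `min_sentences` is the only knob here; counting translations per target language would need extra input that this sketch does not model.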
I think that it would be quite worthwhile to find a way to cover frequently requested but underused words. However, I believe that it would be best to do this through coordinated communication rather than by trying to change the site's software. It takes a long time to agree upon and implement software changes. Furthermore, if this initiative is successful, it will remove some of its own basis for existence, another reason we wouldn't want it "baked into" our software.
If you want to try searching for these 200 words and phrases, I've set up a page that makes it easy.
You can get additional links on the page by entering your own native language's code at the end of the following URL, instead of "fra".
By scraping tatoeba.org, I identified the 20 most highly searched words on Linguee that are underused on Tatoeba (fewer than 20 sentences including the given word).
linguee_rank word nb_sentences
17 scope 19
23 compliance 6
44 leverage 12
54 furthermore 11
85 default 14
103 retention 2
107 facilitate 10
141 procurement 16
150 pending 13
205 stakeholder 2
207 vendor 16
259 invoice 17
266 alignment 11
268 incentive 15
277 framework 17
281 align 11
282 equity 7
290 gauge 19
294 venue 7
295 mitigate 12
Maybe CK could build a table from this data so that users can easily add new sentences and translations?
If it gains traction, we could update this list on a weekly basis.
Note that it's possible to get all sentences as a single file from the downloads page https://tatoeba.org/eng/downloads and thanks to rumpelstilzchen's work it's now also easy to only download sentences in a certain language.
"Scraping tatoeba.org" sounds like you wrote a script, so using that file might be a more comfortable way to collect frequency statistics.
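As a rough sketch of the file-based approach, the snippet below counts the sentences in one language that contain a given word as a whole word. It assumes the exported file is tab-separated with three columns (sentence id, language code, text); the function name and the sample rows are made up for illustration, and a real run would pass an open file handle instead of the in-memory sample.

```python
import csv
import io
import re


def count_sentences_with_word(lines, lang, word):
    """Count rows in `lang` whose text contains `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    count = 0
    # QUOTE_NONE: quotation marks inside sentences are ordinary characters.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip malformed rows
        _sentence_id, row_lang, text = row
        if row_lang == lang and pattern.search(text):
            count += 1
    return count


# Made-up sample rows in the assumed id<TAB>lang<TAB>text layout.
sample = (
    "1\teng\tThe scope of the project is unclear.\n"
    "2\teng\tHe looked through the telescope.\n"
    "3\tfra\tLe scope est en panne.\n"
)

print(count_sentences_with_word(io.StringIO(sample), "eng", "scope"))
```

Note that this is an exact whole-word match, so it does not reproduce the stemming that the site's advanced search applies.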
Good point. On the other hand, by scraping the data, you benefit from the Tatoeba advanced search based on text stemming right away.
You'll notice that "scope" actually has more than 19 sentences.
You forgot to include the orphans.
Orphans are excluded automatically on the assumption that they are less trustworthy than owned sentences, even though some are likely better than some non-native owned sentences.
Orphans are included in the exported data.
Would some English speaker like to adopt, perhaps after editing, the rest of these orphans too?
A few things to consider:
(1) While words with many different senses, like "set", need a wide variety of sentences, even ten sentences is pretty good for getting an idea of how to use a word like "mitigate", as long as the sentences are not near-duplicates of each other. Of course, there are a lot of near-duplicates in our sentences, and the number keeps increasing.
(2) The stemmer misses some correspondences. Thus, a search for "retention" misses "retain", and a search for "retain" finds 36 hits.
(1) It is true that it is difficult to define a threshold suitable for both monosemic and polysemic words. It may be preferable to start with a low threshold at 10 and then raise it to 20 in a second round.
(2) If the stemmer is imperfect, it is probably better to use the exact match. This is the option chosen on this page: https://tatoeba.org/eng/Vocabul..._sentences/eng . Another advantage is that it makes it easier to process the data in the script.
With these parameters, we get the following list for English:
23 compliance 7
44 leverage 9
103 retention 1
107 facilitate 7
141 procurement 1
205 stakeholder 1
206 comprise 9
228 evaluation 9
266 alignment 4
268 incentive 9
281 align 4
282 equity 8
294 venue 5
295 mitigate 2
299 liability 9
314 preliminary 6
321 hub 9
337 offset 2
343 amend 4
344 retrieve 9
I think Linguee and Tatoeba are too different as projects to be directly compared.
You are talking about "words that are of most interest". But to whom? It depends on the person, on your mother tongue and culture, on what you are looking for etc. Does having more people querying a word really make that word more important? (open question)
You are talking about "languages as important as Spanish, Portuguese or French", but what makes them so important? Again, it depends on the person, on your mother tongue and culture, on what you are looking for.
Linguee is good at answering the demand of the market of translations, but that is a specific and restricted view on languages. For example, Linguee will help me a lot if I have to translate legal papers, but it won’t help me if my job is to create subtitles. Tatoeba has a broader approach that makes it, by design, less efficient on specific fields.
If I am a translator working with English, Spanish or French, I’d rather use Linguee than Tatoeba to help me with a translation job. In contrast, if I am a beginner learning Esperanto, or even Portuguese, I’d rather use Tatoeba. Or if I’m looking for sentence pairs I can legally re-use in my online courses or learning app, I’d rather use Tatoeba.
Note that I am not trying to denigrate your comment. On the contrary, I think you are raising important questions. I just want to show that you (as well as everyone, including me) have a bias about what is "relevant" or "important" or "should be on Tatoeba".
You are right, Linguee is different from Tatoeba. I used Linguee's rankings for lack of better ones. Of course, if Tatoeba were mainly used to translate Estonian subtitles into Esperanto, then more Estonian sentences would have to be created. ^^
By the way, it would also be interesting to know which words are the most searched for on Tatoeba. Is such data compiled somewhere?
In any case, business vocabulary is not well covered by the contributors, and this is detrimental to the completeness of the corpus. I am pretty confident that for a developer "looking for sentence pairs I can legally re-use in my online courses or learning app", exhaustiveness or at least large diversity is not a detail.
We have logs of all queries made to the search engine up to one year ago, but it’s just raw data and it’s not publicly available. I can anonymize this data and put it online if you want to play around with it.
Having access to this data would be great. It would enable me to update my table with even more pertinent data.
I compiled the search query logs into a CSV file: https://downloads.tatoeba.org/s...ueries.csv.bz2
Note that for technical reasons each query is reported twice, so you want to divide numbers by two.
I didn’t include empty searches. I also only included queries on page 1 (that is to say, browsing page 2, 3, 4 etc. of the search results is not counted as actual queries).
Here are all those queries counted.
You'll have to divide the number by 2 yourself.
https://all.imfast.io/queries_c...umber_by_2.zip (13 MB file)
I sorted them by counts, then by language.
count + tab + language_code + tab + query
Also, here are files with only English.
Only the English queries, with the actual counts (already divided by 2)
https://all.imfast.io/eng-queries.zip (1.7 MB)
Only the English queries without spaces, with the actual counts (already divided by 2)
https://all.imfast.io/eng-queri...out-spaces.zip (4 MB)
Note that some lines with "eng" were not actually English.
I also converted all queries to lowercase before making the counts.
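For anyone who wants to process the combined file themselves, here is a minimal sketch of reading the "count + tab + language_code + tab + query" layout described above and halving the doubled counts. The function name `load_query_counts` and the sample lines are hypothetical; the sketch assumes every raw count is even because each query was logged twice.

```python
from collections import Counter


def load_query_counts(lines, lang=None):
    """Parse 'count<TAB>lang<TAB>query' lines into a Counter keyed by
    (lang, lowercased query), with the doubled counts divided by two."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed lines
        raw_count, code, query = parts
        if lang is not None and code != lang:
            continue
        # Each query was reported twice, hence the integer division by 2.
        counts[(code, query.lower())] += int(raw_count) // 2
    return counts


# Made-up sample lines in the described layout.
sample = ["8\teng\tCompliance\n", "4\teng\tcompliance\n", "2\tfra\tchien\n"]

for (code, query), count in load_query_counts(sample).most_common():
    print(count, code, query)
```

Lowercasing before counting merges "Compliance" and "compliance" into one entry, which mirrors the lowercase conversion mentioned above.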
Thank you, gillux, for responding so quickly. I have started to analyze the data and I can already confirm that the ranking of the English queries is very different from the Linguee one. Here is the top 10:
There is still a lot of data cleaning to be done to be able to find the queries that don't get sentences on Tatoeba, but I think we will come up with something usable. I'll keep you posted.
From an English learner's point of view, lists such as the NGSL are likely much more useful.
Here is an easy way to browse sentences from the Tatoeba Corpus in 2014.
At the time I built this set of pages, List 907 had a little over 300,000 sentences and 123,278 of these had audio.
There are now over 750,000 sentences on List 907 and over 490,000 English sentences with audio. My plan is to rebuild these pages after I get to 500,000 English sentences with audio.
A different target audience, probably. The NGSL probably starts with the most common words (which are easy to learn from anywhere), whereas the source mentioned in the opening post seems to cover words whose meaning is unclear to people. Presumably the people interested in those have already gotten past the problems of beginner words.
I'm not exactly sure what you're saying since machine translation doesn't always correctly translate things, but perhaps you missed several of the other lists on that site, and the fact that my 2014 set of pages listed some of the upper-level word families numbered-sequentially after the main general list word families.
NEW ACADEMIC WORD LIST
TOEIC WORD LIST
BUSINESS SERVICE LIST
Briefly, for those who don't want to read the whole NGSL website...
A person mastering the 2,368 "word families" (not just single words) should be able to understand up to 92% of what they read.
A word family is something like this.
friend, friends, friendly, unfriendly, friendless, friendship
If you also can master the additional 960 word families on the NAWL, you should be able to handle college textbooks, academic journals, etc. About 92% coverage, since you will also need to learn vocabulary specific to your field of study.
The Business Service List is similar to the NAWL, except it's aimed at business English. An additional 1,700 "word families" over the 2,368 on the NGSL should give you about 97% coverage.
The main purpose of such lists is to help people studying English focus on the basic "word families" that are most frequently used, so people can get to the point where they can understand a very high percentage as quickly as possible.
(This is just a brief explanation. Read the website for more details.)
All right. I glanced at your site and it mostly showed words from the general list. The other lists do seem to be there too, but their vocabulary also looks fairly general, which is presumably their purpose.
And of course, you are welcome to add more sentences based on your idea of what words should be better covered on Tatoeba. ^^
That's what I started doing.
With this suggestion, I'm just trying to contribute by publishing landmarks for those who lack inspiration and would like to make sure that they create or translate sentences that people want to read.
Just two points:
- You could add vocabulary items for the words that you want better represented. Right now the vocabulary feature is difficult to use efficiently (but that will change!), but this way you can directly track the number of sentences for each word you're interested in and see which sentences exist. While your limit was 20, the limit for "sentences wanted" is 10, meaning that words appearing in more than ten sentences will not appear on the "wanted" page.
- Please do not add sentences from Linguee directly ^^ (except those whose source is compatible with Tatoeba)
I'm glad you mentioned this feature, and glad to learn that it will be improved soon. I'm already using it and I like it a lot. While browsing it for different languages, I noticed that it could perhaps be improved in two ways.
- Sometimes the requested expressions are formulated in such a way that the request cannot be fulfilled: spelling mistakes, use of the equals sign to specify the meaning of the word... If requests expired after a set time, these problematic expressions would drop off naturally.
- I also noticed that for some languages this feature is not used much. Maybe the lists could be filled automatically with underused expressions searched on Tatoeba at least x times during the last y days by different users. This option would have the advantage of better connecting ordinary users with the corpus contributors, and it would be applicable to all languages.
Thanks to your feedback, I have created a new tool that can be found at https://tatominer.imfast.io.
It will help contributors who want to diversify Tatoeba's vocabulary while paying special attention to expressions that are often searched for by users of bilingual dictionaries.
This is nice. Would it be possible to provide a drop-down to select the language? Most people will probably only be interested in one language at a time. While it's possible to sort by language and then page to the place where sentences in the desired language starts, it's a cumbersome process.
There is a search box on the top-right corner of the table that enables you to filter content. For example, typing 'eng' will filter all English sentences.
That works, but it's not the behavior I would have expected. For one thing, searching for "eng" also brings up non-English words containing "eng".
That's the default for the datatables jQuery plugin.
If you want to find English sentences, use the sort on the language column and then browse for "eng".
I was just experimenting with this plugin a couple of days ago to display words from the New Academic Word List (NAWL).
I moved it here.
New Academic Word List (NAWL) with Definitions in English and Japanese
I also created this one.
New General Service Word List (NGSL) with Definitions in English and Japanese
> That's the default for the datatables jQuery plugin.
Whether or not it's the default for the underlying software, expected behavior for a UI is for a search field with typed text to search for text that can vary freely, while a drop-down is used to search for constrained text. Not only does using the same search field for both violate my expectation, but as I said, it mixes up the results. If I search for "eng", I find, in addition to the English words, the French word "engendrer". Similarly, a search for "fra" brings up a Portuguese and a German term that happen to contain the string "fra". It's only a matter of luck that this particular vocabulary list and list of language identifiers don't overlap more.
Naturally, the tool can be used in its current form, and I appreciate the fact that Ibdx set it up for us. I'm just suggesting that separating the specification of the language from the specification of the entry would make the tool more usable.
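The difference between the two behaviors can be shown with a small Python sketch. This is not DataTables code: the row layout and the function names `global_search` and `filter_rows` are hypothetical, and only "engendrer" comes from the discussion above.

```python
# Hypothetical rows: a language code column plus a free-text phrase column.
rows = [
    {"lang": "eng", "phrase": "compliance"},
    {"lang": "eng", "phrase": "scope"},
    {"lang": "fra", "phrase": "engendrer"},
]


def global_search(rows, text):
    """Single search box: substring match against every column."""
    return [row for row in rows if any(text in value for value in row.values())]


def filter_rows(rows, lang=None, text=""):
    """Separate controls: exact match on the language, substring on the phrase."""
    return [
        row
        for row in rows
        if (lang is None or row["lang"] == lang) and text in row["phrase"]
    ]


print([row["phrase"] for row in global_search(rows, "eng")])
print([row["phrase"] for row in filter_rows(rows, lang="eng")])
```

With the single box, "eng" also matches the French entry "engendrer"; with a dedicated language selector, only rows whose language code is exactly "eng" remain.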
I may have a look at it from time to time. How often do you plan to update the data? (If you plan to at all)
That's a good tool to get some inspiration sometimes.
[EDIT] Also note that some of your entries are incorrect, but that's not an error on your side. For example, there are obvious ones like "linguee" or "so", and ones that are wrong because of Linguee, like "oeuvre", which should be "œuvre".
Updating this web page is not really an issue for me since the process is fully automated. As both Tatoeba download page and Linguee rankings are updated weekly, a new version every Saturday seems doable.
As for the few errors, you rightly noticed that they come from the Linguee data. Sadly, I don’t see any easy way to spot them programmatically for now.
I added sentences for 50 of the words.
Thank you, the quality of your sentences is of great benefit to Tatoeba.
Thanks! Your list is very helpful. I ended up writing a sentence for each phrase. I hope others will do the same.
One thing I note is that the phrase "portugues" (in Portuguese) is on the list. This is simply a misspelled version of "português". I would expect the correctly-spelled version of a language's name within that language to be fairly common, so I would be surprised if it would appear on the list. I didn't encounter any English misspellings, though I did run into the rare words "wether" and "therefor", which may be frequent in queries only as misspellings of the more common "whether" and "therefore".
I also noticed that there are too many misspelled words among the phrases with zero occurrences in the Tatoeba corpus. This is why I will cut all of them in the next version of the table.
I would like to have some example sentences using the words 'good-man' and 'good-wife'. Based on context the meaning might differ from that of 'a good man/wife'.
1. I added them to my vocabulary. However, the search engine seems to consider 'good man' and 'good-man' as identical expressions. Is there a way to add 'good-man' to my vocabulary?
Remark: In Finnish, when writing a compound word, if the first word ends and the last word begins with the same vowel, we write it thusly: 'ala-aste', 'linja-auto'. Writing 'ala aste' would mean two separate words (and typically be a mistake), though there might be cases where it would be correct language and would have a different meaning. As such, it would be nice if the search engine did not confuse these kinds of expressions, which mean different things.
2. I would appreciate examples of 'good-wife' and 'good-man' in the corpus. Examples illustrating whether these are the same as 'a good wife' (or man) would be particularly appreciated.
As you discovered, the search engine currently makes no distinction between 'good-man' and 'good man'. If the hyphen were treated like a normal character, sentences containing 'good-man' would no longer show up when searching for 'good', 'man' or any word sharing the same stem.
That said, our search engine (Manticore) has a feature that I believe is exactly what we need: blended characters. This feature would allow a sentence containing 'good-man' to be found by searching for 'good-man' as well as 'good' or 'man'.
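To make the idea concrete, here is a toy Python model of what treating the hyphen as a blended character means for indexing. It is not Manticore's implementation and does not use its configuration (as far as I know, the relevant index setting is called `blend_chars`); the function `index_tokens` is hypothetical and ignores stemming entirely.

```python
import re


def index_tokens(text, blend_chars=""):
    """Toy tokenizer: a blended character acts both as part of a token
    and as a separator, so both forms end up in the index."""
    text = text.lower()
    if not blend_chars:
        # No blended characters: the hyphen is a plain separator.
        return set(re.findall(r"\w+", text))
    char_class = re.escape(blend_chars)
    tokens = set()
    for chunk in re.findall(rf"[\w{char_class}]+", text):
        whole = chunk.strip(blend_chars)
        if not whole:
            continue
        tokens.add(whole)  # the token with the hyphen kept, e.g. "good-man"
        # ...and the parts obtained by treating the hyphen as a separator.
        tokens.update(part for part in re.split(rf"[{char_class}]+", whole) if part)
    return tokens


print(sorted(index_tokens("He is a good-man.")))
print(sorted(index_tokens("He is a good-man.", blend_chars="-")))
```

In the second call the index contains "good-man" in addition to "good" and "man", which is why a query for any of the three would find the sentence.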
Now my question is: should we treat the hyphen character as a blended character in all languages by default, or only in Finnish? I feel like other languages such as French or English could benefit from it, but I wonder if it could cause any harm in other languages I don’t know.
By the way, in Finnish, isn’t the colon character also used in a similar way, when you want to decline an abbreviation or something? I’m talking about this: https://en.wikipedia.org/wiki/C...ffix_separator
The Wikipedia article seems correct. For the definitive source, see https://www.kielikello.fi/-/kaksoispiste- . Note that the colon is not used to combine separate words. (The colon also has other uses, such as with quotes.)
Example of the genitive:
metri – metrin
m – m:n
Norwegian (bokmål) uses some kind of dash to the same effect, so that the definite singular of tv is tv-en.
The same kind of dash is used to bind together words in some compound words, it looks like: https://no.wikipedia.org/wiki/Bindestrek
I think that blended characters would not help me here, but it might otherwise be a fine idea. I am not sufficiently knowledgeable about Manticore to have a strong opinion here.
On an unrelated note: Gmail indicated the email notification of your response as suspect. It was sent from the address firstname.lastname@example.org .
from: noreply <email@example.com>
date: 3 Apr 2020, 09:20
subject: Tatoeba - gillux has replied to you on the Wall
mailed-by: gmail.com
signed-by: gmail.com
security: Standard encryption (TLS)
>On an unrelated note: Gmail indicated the email notification of your response as suspect. It was sent from the address firstname.lastname@example.org .
That happened to me too.
I recorded the issue: https://github.com/Tatoeba/tatoeba2/issues/2255
** Stats - 2020-04-04 - Native Speakers with Contributions **
Or include the 3-letter code for your native language like this for more links.
Here is the same data without the translation links, but with links to the members' profiles instead.
This uses the DataTables jQuery plugin, so you can easily search members' usernames.
** Stats: Members Manually Editing Transcriptions **
** 2020 Daily Contributions Counts by Language **
Date Range of the Data: 2020-01-01 00:00:00 UTC to 2020-04-03 23:59:59 UTC
Here is a variation on the above.
This lists the language that had the most contributions of each day between 2020-01-01 and 2020-04-03.
Hi there !
Where I am at the moment, my connection is unreliable, so I will not necessarily be able to respond to comments from those who ask me for corrections. The lockdown does not allow me to get my computer repaired, and I am keeping my mobile for emergency communications. I wish everyone the best.
** Stats: 2020-04-04-and-2019-04-04.png **
For those who wonder what’s with the contributions from FixHashesCommand, first I would like to apologize for polluting the "latest contributions" table. This is just a one-time maintenance operation that affected 2960 sentences.
It’s the result of rumpelstilzchen’s work to fix duplicate merging issues with some sentences. FixHashesCommand is a bot that edited the sentences in order to normalize the text that otherwise would have been stored in different but visually identical ways.
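As an illustration of why "visually identical" text can defeat duplicate detection, here is a small Python sketch. It assumes Unicode NFC normalization and a SHA-1 hash purely for the sake of the example; the posts above do not say which normalization or hash function Tatoeba actually uses, and `sentence_hash` is a hypothetical helper.

```python
import hashlib
import unicodedata


def sentence_hash(text, normalize=True):
    """Hash a sentence, optionally normalizing it to NFC first.
    NFC and SHA-1 are assumptions made for this illustration."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


composed = "caf\u00e9"     # "é" as a single precomposed code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# Both render the same way but are different code point sequences.
print(composed == decomposed)
print(sentence_hash(composed, normalize=False) == sentence_hash(decomposed, normalize=False))
print(sentence_hash(composed) == sentence_hash(decomposed))
```

Without normalization the two spellings hash differently and would not be merged as duplicates; after normalization they hash identically.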
The dev team doesn't stop for anything! Thanks, everyone!
The content of this message goes against our rules and is therefore hidden. It is visible only to administrators and the author of the message.