gillux, September 21, 2014 at 10:13:12 PM UTC, edited September 21, 2014 at 10:13:48 PM UTC

Hello,

I need help from people who know the Tibetan language. I’m trying to make the search function work for Tibetan. To better explain my request, I’ll first explain a little about how the search function works.

Let’s say you have the sentence “This costs: 10$.” The search engine needs to extract the words “this”, “costs” and “10” so that people can find this sentence by searching for any of these keywords. It does this using a list of characters that count as parts of words. This list includes a to z but not punctuation or currency symbols. Suppose we added the colon to that list. Then searching for “costs” wouldn’t return that sentence; only “costs:” would.
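
To make the idea concrete, here is a minimal Python sketch of that kind of keyword extraction. It is only an illustration, not the actual search engine code: the names `WORD_CHARS` and `extract_keywords` are made up for this example, and the character list is assumed to hold just a to z and the digits.

```python
import string

# Hypothetical list of "word characters": a to z and 0 to 9 only.
WORD_CHARS = set(string.ascii_lowercase + string.digits)


def extract_keywords(sentence, word_chars=WORD_CHARS):
    """Split a sentence into keywords.

    Every character that is not in word_chars acts as a separator.
    """
    keywords = []
    current = []
    for ch in sentence.lower():
        if ch in word_chars:
            current.append(ch)
        elif current:
            # A separator ends the keyword collected so far.
            keywords.append("".join(current))
            current = []
    if current:
        keywords.append("".join(current))
    return keywords


# With the default list, the colon is a separator.
print(extract_keywords("This costs: 10$."))
# With the colon added to the list, it sticks to the word before it.
print(extract_keywords("This costs: 10$.", WORD_CHARS | {":"}))
```

The first call yields “this”, “costs” and “10”; the second yields “costs:” instead of “costs”, which is why a search for “costs” would no longer match.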

For languages like Tibetan it’s actually a little bit more complex since it doesn’t have word boundaries, but the idea is the same. I need someone who knows which characters are parts of real words in Tibetan to review this chart [1] and give me the character codes (numbers like 0F0A, 0F2E…). Thank you.

[1] http://www.unicode.org/charts/PDF/U0F00.pdf

Objectivesea, September 22, 2014 at 3:25:35 AM UTC, edited September 22, 2014 at 3:46:52 AM UTC

I think this might be a more difficult challenge than it seems at first. I don't know much about Tibetan, but as in other Indic scripts like Devanagari, Bengali, etc., the basic Tibetan glyph stands for a consonant plus the inherent vowel 'a', and it is modified by a vowel sign when the vowel is something other than 'a'. The printed form also changes when the consonant is preceded or followed by another consonant to form a cluster.

However, if it will help you, the basic consonants are:
0F40 0F41 0F42 0F43 0F44
0F45 0F46 0F47 0F49
0F4A 0F4B 0F4C 0F4D 0F4E
0F4F 0F50 0F51 0F52 0F53
0F54 0F55 0F56 0F57 0F58
0F59 0F5A 0F5B 0F5C 0F5D
0F5E 0F5F 0F60 0F61 0F62
0F63 0F64 0F65 0F66 0F67
0F68 0F69 0F6A 0F6B 0F6C

And the vowels:
0F71 0F72 0F73 0F74 0F75
0F76 0F77 0F78 0F79 0F7A
0F7B 0F7C 0F7D 0F7E 0F7F
0F80 0F81 0F82 0F83 0F84

We then have the subjoined consonants:
0F90 0F91 0F92 0F93 0F94
0F95 0F96 0F97 0F99
0F9A 0F9B 0F9C 0F9D 0F9E
0F9F 0FA0 0FA1 0FA2 0FA3
0FA4 0FA5 0FA6 0FA7 0FA8
0FA9 0FAA 0FAB 0FAC 0FAD
0FAE 0FAF 0FB0 0FB1 0FB2
0FB3 0FB4 0FB5 0FB6 0FB7
0FB8 0FB9 0FBA 0FBB 0FBC

It may help to know that a regular consonant can be converted to its subjoined form by adding 0050 hex (80 decimal) to its code. For example, 0F40 becomes 0F90.
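
As a rough sketch of that offset in Python (the function name `to_subjoined` is made up here): it simply adds 0x50 to the code of a basic consonant. I have stopped it at 0F69 on the assumption that the last few codes in the block may not pair up as neatly, so please check that cut-off against the chart before relying on it.

```python
SUBJOINED_OFFSET = 0x50  # 80 decimal


def to_subjoined(consonant):
    """Return the subjoined form of a basic Tibetan consonant.

    Assumption: the +0x50 offset is applied only to 0F40..0F69,
    skipping 0F48, which does not appear in the consonant list.
    """
    code = ord(consonant)
    if not (0x0F40 <= code <= 0x0F69) or code == 0x0F48:
        raise ValueError(f"not a basic consonant in this sketch: {code:04X}")
    return chr(code + SUBJOINED_OFFSET)


# Print the first basic consonant next to the code of its subjoined form.
print(f"{ord(chr(0x0F40)):04X} -> {ord(to_subjoined(chr(0x0F40))):04X}")
```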

You may want the numbers as well:
0F20 0F21 0F22 0F23 0F24
0F25 0F26 0F27 0F28 0F29
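
If it is useful, the four lists above can be written as ranges and merged into one set of word characters. This is just a sketch in Python with made-up names (`TIBETAN_RANGES`, `build_word_chars`); it covers only the codes I listed, so anything I have missed would still be treated as a separator.

```python
# Inclusive code ranges taken from the lists in this post.
TIBETAN_RANGES = {
    "consonants": [(0x0F40, 0x0F47), (0x0F49, 0x0F6C)],
    "vowels": [(0x0F71, 0x0F84)],
    "subjoined": [(0x0F90, 0x0F97), (0x0F99, 0x0FBC)],
    "numbers": [(0x0F20, 0x0F29)],
}


def build_word_chars(ranges=TIBETAN_RANGES):
    """Expand the inclusive ranges into a set of characters."""
    chars = set()
    for spans in ranges.values():
        for first, last in spans:
            chars.update(chr(code) for code in range(first, last + 1))
    return chars


word_chars = build_word_chars()
print(len(word_chars), "characters would count as parts of words")
print("0F40 included:", chr(0x0F40) in word_chars)
print("0F0B included:", chr(0x0F0B) in word_chars)
```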