Saturday 27 March 2010

Q: When is A not A? - A: When it's an aabafili.

I received a spreadsheet from the police with a list of their offence codes in Dhivehi, for loading into the database.  All of the lists are structured as a code (hidden from the user) and English and Dhivehi descriptions.  The description field displayed in any drop down list depends on which language the application is set to.  The user can switch betwen English and Dhivehi, although free text fields must be completed in English.  We did at one stage consider allowing either language but it gets impossibly messy when the Faruma font doesn't display latin characters and Dhivehi runs right to left.

Anyway, when I started to look at this spreadsheet in some detail, I noticed a problem.  The cell contents displayed as divehi but in the edit bar, i.e. what was actually stored in the cell, they were in western (latin) characters.



The contents of cell A10 are displayed as Dhivehi but input and stored as latin characters.

It's taken a few weeks to fully understand the what, why and how of this, which I think I now do.  There are a number of strands to this.

Firstly is the keyboard mapping, that is the correspondance between the characters (and ASCII Codes) physical keyboard and the Dhivehi characters they produce when that language option is selected.

Next, the font we are using, P_Faruma, along with a  number of other dhivehi fonts will display the unicode Thaana character set and the ascii character set as the equivalent Thaana characters.  A consequence of this is that you can't dispay a latin font based language, such as english with these fonts. Some of the standard Windows TrueType fonts such as Arial, will display only the unicode character set (as far as I can see), so that ASCII 64 is always A, whereas Faruma displays ASCII 64 as the equivalent Thaana character on the dhivehi phonetic keyboard layout (selected through Regional Settings), which is ާ or Aabaafili, the double a sound and unicode character 07A7.

Dhivehi Phonetic (soft) keyboard layout

As I've already noted, this mapping of latin to Thaana characters isn't fixed and depends on the keyboard mapping.  My first introduction to this was on a blog by Jawish Hameed, who published the java source for a transcoder he has developed to convert between latin and thaana.  When I used his mappings in a VB program to do a similar job, I only got about an 80% match between the latin and Thanna in the spreadsheet data, with quite a few of the filis (vowels) being transposed.  My first suspicion was that maybe this was a java  thing - perhaps linux keyboards may use a different mapping.  Then again it may have nothing to do with the OS but relate to mapping of the latin characters on the particular physical keyboard to the dhivehi ones.  For instance my keyboard is a UK QWERTY layout, which for the main alphabet characters will be the sames as US QWERTY but would be different to say a German or French keyboard.  The mapping I describe below isvalid for a US/UK QWERTY keyboard and Windows Dhivehi Phonetic (as far as I know!)

At this point I set about documenting the mapping for Dhivehi phonetic on a qwerty keyboard and once I had modified my VB code for this, I got 100% match between the latin based dhivehi displayed in the spreadsheet cells and the true unicode dhivehi.  The mapping is shown below.  There are a few special characters to deal with.  Dhivehi reads right to left and the Dhivehi equivalents of question mark, comma and semi-colon are mirrored, as are open and closed bracket.  In common with arabic, numbers run left to right whilst dates are read right to left and so formatted as yyyy mmm dd. 

A big thanks must go to Jaa.  The code below is mine not his but the initial mapping in his code set me on the right track. 

Some other useful resources are:

http://tlt.its.psu.edu/suggestions/international/bylanguage/thaanachart.html

http://en.wikipedia.org/wiki/T%C4%81na

http://unicode.org/charts/PDF/U0780.pdf


The following is good in VB6, VB.net, Access Basic.  This version handles thaana strings without any numbers, whose digits will get reversed.  I have another version which does a more complete job, though one thing I have yet to crack is coding for /- used in the representation of currency values.

Function ThaanaAsciiToUnicode(ByVal strIn As String) As String


'copyright (C) 2010 Tony Bennett

'You are free to use the code for personal or commercial purposes as long as you retain the copyright notice  

'The pheonetic keyboard layout mappings are:
'h' -> '1920','S' -> '1921', 'n' -> '1922', 'r' -> '1923', 'b' -> '1924', 'L' -> '1925',
'k' -> '1926', 'w' -> '1927','v' -> '1928', 'm' -> '1929',
'f' -> '1930', 'd' -> '1931', 't' -> '1932', 'l' -> '1933',
'g' -> '1934', 'N' -> '1935', 's' -> '1936', 'D' -> '1937', 'z' -> '1938', 'T' -> '1939',
'y' -> '1940', 'p' -> '1941', 'j' -> '1942', 'c' -> '1943',
'X' -> '1944', 'H' -> '1945','K' -> '1946', 'J' -> '1947',
'R' -> '1948', 'C' -> '1949', 'B' -> '1950', 'M' -> '1951',
'Y' -> '1952', 'Z' -> '1953', 'W' -> '1954', 'G' -> '1955',
'Q' -> '1956', 'V' -> '1957','a' -> '1958', 'A' -> '1959',
'i' -> '1960', 'I' -> '1961', 'u' -> '1962', 'U' -> '1963',
'e' -> '1964', 'E' -> '1965', 'o' -> '1966', 'O' -> '1967',
'q' -> '1968',

'plus the special characters
',' -> '1548', ';' -> '1563', '?' -> '1567', ')' -> '0041',
'(' -> '0040', 'Q' -> '65010'

'note that ? , ; ( ) are mirrored in thaana

Dim strOut As String
Dim c As String
Dim i As Integer
Dim j As Integer

'Dhivehi Phonetic <-> QWERTY mappings
Const AsciiChars1 = "hSnrbLkwvmfdtlgNsDzTypjcXHKJRCBMYZWGQVaAiIuUeEoOq"

'The special characters
Const AsciiChars2 = ",;?)(Q"

strOut = ""

If Len(strIn) > 0 Then

'check if any chars in unicode.
'if they are, assume this is unicode dhivehi and skip the conversion
For i = 1 To Len(strIn)
  If AscW(Left(Trim(strIn), 1)) > 255 Then
    AsciiToUnicode = strIn
    Exit Function
  End If
Next i

For i = Len(strIn) To 1 Step -1

  c = Mid$(strIn, i, 1)
  
  'need to do a case senstive instr  j = InStr(1, AsciiChars1, c, vbBinaryCompare)


  If j > 0 Then
    strOut = strOut + ChrW(1919 + j)
  Else
    j = InStr(1, AsciiChars2, c, vbBinaryCompare)
    If j > 0 Then
      Select Case j
        Case 1 ',
          strOut = strOut + ChrW(1548)
        Case 2 ';
          strOut = strOut + ChrW(1563)
        Case 3 '?
          strOut = strOut + ChrW(1567)
        Case 4 ')
          strOut = strOut + ChrW(41)
        Case 5 '(
          strOut = strOut + ChrW(40)
        Case 6 'Q
          strOut = strOut + ChrW(65010)
        End Select
      Else
        strOut = strOut + Mid$(strIn, i, 1)
      End If
    End If
  Next i
End If

AsciiToUnicode = strOut

End Function

No comments:

Post a Comment

Note: only a member of this blog may post a comment.