Saturday 27 March 2010

Q: When is A not A? - A: When it's an aabafili.

I received a spreadsheet from the police with a list of their offence codes in Dhivehi, for loading into the database.  All of the lists are structured as a code (hidden from the user) and English and Dhivehi descriptions.  The description field displayed in any drop down list depends on which language the application is set to.  The user can switch betwen English and Dhivehi, although free text fields must be completed in English.  We did at one stage consider allowing either language but it gets impossibly messy when the Faruma font doesn't display latin characters and Dhivehi runs right to left.

Anyway, when I started to look at this spreadsheet in some detail, I noticed a problem.  The cell contents displayed as divehi but in the edit bar, i.e. what was actually stored in the cell, they were in western (latin) characters.



The contents of cell A10 are displayed as Dhivehi but input and stored as latin characters.

It's taken a few weeks to fully understand the what, why and how of this, which I think I now do.  There are a number of strands to this.

Firstly is the keyboard mapping, that is the correspondance between the characters (and ASCII Codes) physical keyboard and the Dhivehi characters they produce when that language option is selected.

Next, the font we are using, P_Faruma, along with a  number of other dhivehi fonts will display the unicode Thaana character set and the ascii character set as the equivalent Thaana characters.  A consequence of this is that you can't dispay a latin font based language, such as english with these fonts. Some of the standard Windows TrueType fonts such as Arial, will display only the unicode character set (as far as I can see), so that ASCII 64 is always A, whereas Faruma displays ASCII 64 as the equivalent Thaana character on the dhivehi phonetic keyboard layout (selected through Regional Settings), which is ާ or Aabaafili, the double a sound and unicode character 07A7.

Dhivehi Phonetic (soft) keyboard layout

As I've already noted, this mapping of latin to Thaana characters isn't fixed and depends on the keyboard mapping.  My first introduction to this was on a blog by Jawish Hameed, who published the java source for a transcoder he has developed to convert between latin and thaana.  When I used his mappings in a VB program to do a similar job, I only got about an 80% match between the latin and Thanna in the spreadsheet data, with quite a few of the filis (vowels) being transposed.  My first suspicion was that maybe this was a java  thing - perhaps linux keyboards may use a different mapping.  Then again it may have nothing to do with the OS but relate to mapping of the latin characters on the particular physical keyboard to the dhivehi ones.  For instance my keyboard is a UK QWERTY layout, which for the main alphabet characters will be the sames as US QWERTY but would be different to say a German or French keyboard.  The mapping I describe below isvalid for a US/UK QWERTY keyboard and Windows Dhivehi Phonetic (as far as I know!)

At this point I set about documenting the mapping for Dhivehi phonetic on a qwerty keyboard and once I had modified my VB code for this, I got 100% match between the latin based dhivehi displayed in the spreadsheet cells and the true unicode dhivehi.  The mapping is shown below.  There are a few special characters to deal with.  Dhivehi reads right to left and the Dhivehi equivalents of question mark, comma and semi-colon are mirrored, as are open and closed bracket.  In common with arabic, numbers run left to right whilst dates are read right to left and so formatted as yyyy mmm dd. 

A big thanks must go to Jaa.  The code below is mine not his but the initial mapping in his code set me on the right track. 

Some other useful resources are:

http://tlt.its.psu.edu/suggestions/international/bylanguage/thaanachart.html

http://en.wikipedia.org/wiki/T%C4%81na

http://unicode.org/charts/PDF/U0780.pdf


The following is good in VB6, VB.net, Access Basic.  This version handles thaana strings without any numbers, whose digits will get reversed.  I have another version which does a more complete job, though one thing I have yet to crack is coding for /- used in the representation of currency values.

Function ThaanaAsciiToUnicode(ByVal strIn As String) As String


'copyright (C) 2010 Tony Bennett

'You are free to use the code for personal or commercial purposes as long as you retain the copyright notice  

'The pheonetic keyboard layout mappings are:
'h' -> '1920','S' -> '1921', 'n' -> '1922', 'r' -> '1923', 'b' -> '1924', 'L' -> '1925',
'k' -> '1926', 'w' -> '1927','v' -> '1928', 'm' -> '1929',
'f' -> '1930', 'd' -> '1931', 't' -> '1932', 'l' -> '1933',
'g' -> '1934', 'N' -> '1935', 's' -> '1936', 'D' -> '1937', 'z' -> '1938', 'T' -> '1939',
'y' -> '1940', 'p' -> '1941', 'j' -> '1942', 'c' -> '1943',
'X' -> '1944', 'H' -> '1945','K' -> '1946', 'J' -> '1947',
'R' -> '1948', 'C' -> '1949', 'B' -> '1950', 'M' -> '1951',
'Y' -> '1952', 'Z' -> '1953', 'W' -> '1954', 'G' -> '1955',
'Q' -> '1956', 'V' -> '1957','a' -> '1958', 'A' -> '1959',
'i' -> '1960', 'I' -> '1961', 'u' -> '1962', 'U' -> '1963',
'e' -> '1964', 'E' -> '1965', 'o' -> '1966', 'O' -> '1967',
'q' -> '1968',

'plus the special characters
',' -> '1548', ';' -> '1563', '?' -> '1567', ')' -> '0041',
'(' -> '0040', 'Q' -> '65010'

'note that ? , ; ( ) are mirrored in thaana

Dim strOut As String
Dim c As String
Dim i As Integer
Dim j As Integer

'Dhivehi Phonetic <-> QWERTY mappings
Const AsciiChars1 = "hSnrbLkwvmfdtlgNsDzTypjcXHKJRCBMYZWGQVaAiIuUeEoOq"

'The special characters
Const AsciiChars2 = ",;?)(Q"

strOut = ""

If Len(strIn) > 0 Then

'check if any chars in unicode.
'if they are, assume this is unicode dhivehi and skip the conversion
For i = 1 To Len(strIn)
  If AscW(Left(Trim(strIn), 1)) > 255 Then
    AsciiToUnicode = strIn
    Exit Function
  End If
Next i

For i = Len(strIn) To 1 Step -1

  c = Mid$(strIn, i, 1)
  
  'need to do a case senstive instr  j = InStr(1, AsciiChars1, c, vbBinaryCompare)


  If j > 0 Then
    strOut = strOut + ChrW(1919 + j)
  Else
    j = InStr(1, AsciiChars2, c, vbBinaryCompare)
    If j > 0 Then
      Select Case j
        Case 1 ',
          strOut = strOut + ChrW(1548)
        Case 2 ';
          strOut = strOut + ChrW(1563)
        Case 3 '?
          strOut = strOut + ChrW(1567)
        Case 4 ')
          strOut = strOut + ChrW(41)
        Case 5 '(
          strOut = strOut + ChrW(40)
        Case 6 'Q
          strOut = strOut + ChrW(65010)
        End Select
      Else
        strOut = strOut + Mid$(strIn, i, 1)
      End If
    End If
  Next i
End If

AsciiToUnicode = strOut

End Function

Wednesday 24 March 2010

SQL Collation for Dhivehi

When I originally set up the database back last September, I used the default latin collation for SQL Server, Latin1_General_CI_AI.  I knew at the time I would need to consider a different collation for the dhivehi content but then forgot about this issue until the other day when I tried a search against fields holding dhivehi as unicode.  My first search returned all records in the table, irrespective of the filter string.  For instance searching the dhivehi firstname field for the dhivehi Bart, ބާރޓް
in a table containing records for {Bart, Homer, Lisa, Maggie, Marge} as { ބާރޓް, ހޯމަރ , މެގީ , މާރޖް , ލިސާ } using

SELECT * FROM tblPeople WHERE D_FirstName LIKE N'%ބާރޓް%'

records were returned for all the Simpson family.  I was sort of expecting something like this might happen, though if I'd had to put money on it, I would have guessed I'd get no matching records.

By changing the collation on this field to one of the dhivehi set supported by SQL Server 2008, only the record for bart was returned.


The choice of collation took some consideration.  The set of divehi collations in SQL Server 2008 numbers over 30 variations:

We can reduce this list by half by using the 100s in preference to the 90s. According to sql 2008 books online:

SQL Server 2008 has introduced new collations that are in full alignment with collations that Windows Server 2008 provides. These 80 new collations are denoted by *_100 version references. They provide users with the most up-to-date and linguistically accurate cultural sorting conventions.


We don't want to use the BIN collations.

Dhivehi doesn't have cases.  A blog at http://www.jawish.org/blog/plugin/tag/how+to, referring to MySQL suggests a collation of utf8_unicode_ci, which isn't divehi specific but does indicate that case insensitive is good enough. Also, a blog at http://blogs.msdn.com/michkap/archive/2005/05/11/416293.aspx suggests that case sensitive can lead to problems.  So case insensitive seems a good option.

The filis are stored as separate characters rather than accented versions of consonants.  The choice between accents sensitive and insensitive probably makes no difference but accent insensitive seems a safe bet.

Kanatype indicates whether the data in the dataset is kanatype sensitive, which distinguishes between the two types of Japanese kana characters: Hiragana and Katakana. This is of no relevance to Dhivehi, so insensitive sounds a good option.

Width isn't an issue but insensitive seems the safest option.

This brings the choice down to

Divehi_100_CI_AI Divehi-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive

To test this I created the table

CREATE TABLE [dbo].[tblPeople](
[PersonId] [int] IDENTITY(1,1) NOT NULL,
[E_Firstname] [nvarchar](40) NULL,
[E_Lastname] [nvarchar](40) NULL,
[D_FirstName] [nvarchar](40) COLLATE Divehi_100_CI_AINULL ,
[D_LastName] [nvarchar](40) COLLATE Divehi_100_CI_AI NULL)
GO

populated d_firestname with the simpsons o=in dhivehi and repeated the earlier query

SELECT * FROM tblPeople WHERE D_FirstName LIKE N'%ބާރޓް%'

which now just returns Bart's record.

Note, it is important to prefix the search string with N to indicate a unicode string.

A ship the size of an island

The P &O liner Aurora arrived off Male' last night.  At almost 900' long, it's bigger than the next door island of Funadhoo.



The P & O Aurora moored opposite the President's Jetty on Bodu Thakurufaanu Magu, Male'

It's not a pidgin

There's a road close to the hotel called Filigas Magu (Filigas Street).  At the end of it is the only petrol (gas) station on Male'.  Initially, I wondered if it was pidgin but it turns out to be just an amusing coincidence.  Filigas is a kind of tree.  Fili is also the name given to the 'squiggles' above and below the main characters in dhivehi, which represent the vowel sounds.

Not pidgin


Pigeon