Wednesday 24 March 2010

SQL Collation for Dhivehi

When I originally set up the database back last September, I used the default latin collation for SQL Server, Latin1_General_CI_AI.  I knew at the time I would need to consider a different collation for the dhivehi content but then forgot about this issue until the other day when I tried a search against fields holding dhivehi as unicode.  My first search returned all records in the table, irrespective of the filter string.  For instance searching the dhivehi firstname field for the dhivehi Bart, ބާރޓް
in a table containing records for {Bart, Homer, Lisa, Maggie, Marge} as { ބާރޓް, ހޯމަރ , މެގީ , މާރޖް , ލިސާ } using

SELECT * FROM tblPeople WHERE D_FirstName LIKE N'%ބާރޓް%'

records were returned for all the Simpson family.  I was sort of expecting something like this might happen, though if I'd had to put money on it, I would have guessed I'd get no matching records.

By changing the collation on this field to one of the dhivehi set supported by SQL Server 2008, only the record for bart was returned.


The choice of collation took some consideration.  The set of divehi collations in SQL Server 2008 numbers over 30 variations:

We can reduce this list by half by using the 100s in preference to the 90s. According to sql 2008 books online:

SQL Server 2008 has introduced new collations that are in full alignment with collations that Windows Server 2008 provides. These 80 new collations are denoted by *_100 version references. They provide users with the most up-to-date and linguistically accurate cultural sorting conventions.


We don't want to use the BIN collations.

Dhivehi doesn't have cases.  A blog at http://www.jawish.org/blog/plugin/tag/how+to, referring to MySQL suggests a collation of utf8_unicode_ci, which isn't divehi specific but does indicate that case insensitive is good enough. Also, a blog at http://blogs.msdn.com/michkap/archive/2005/05/11/416293.aspx suggests that case sensitive can lead to problems.  So case insensitive seems a good option.

The filis are stored as separate characters rather than accented versions of consonants.  The choice between accents sensitive and insensitive probably makes no difference but accent insensitive seems a safe bet.

Kanatype indicates whether the data in the dataset is kanatype sensitive, which distinguishes between the two types of Japanese kana characters: Hiragana and Katakana. This is of no relevance to Dhivehi, so insensitive sounds a good option.

Width isn't an issue but insensitive seems the safest option.

This brings the choice down to

Divehi_100_CI_AI Divehi-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive

To test this I created the table

CREATE TABLE [dbo].[tblPeople](
[PersonId] [int] IDENTITY(1,1) NOT NULL,
[E_Firstname] [nvarchar](40) NULL,
[E_Lastname] [nvarchar](40) NULL,
[D_FirstName] [nvarchar](40) COLLATE Divehi_100_CI_AINULL ,
[D_LastName] [nvarchar](40) COLLATE Divehi_100_CI_AI NULL)
GO

populated d_firestname with the simpsons o=in dhivehi and repeated the earlier query

SELECT * FROM tblPeople WHERE D_FirstName LIKE N'%ބާރޓް%'

which now just returns Bart's record.

Note, it is important to prefix the search string with N to indicate a unicode string.

No comments:

Post a Comment

Note: only a member of this blog may post a comment.