Sunday, 1 November 2009

A searching problem

In any database system which records details about people, it's important to be able to search against a person's name as the primary means of pulling back their record. A large database may contain records for several people with the same name, in which case you would then refine the search with other details such as date of birth.
In English and most western languages there are few variations to the spelling of a name.


In the Dhivehi language there can be subtle variations to the spelling of a name, which can make searching for an exact match a problem. To begin with, words in Dhivehi can be written across five lines, with the consonants in the middle, vowels above and modifiers or accents above or below. A particular problem, which was described to me this week by one of the police IT guys, is the use of the dot. The dot was introduced into written Dhivehi about 30 years ago and is used to subtly modify the sound of a letter; think of the grave and acute accents in French, but more so.

If they apprehend someone and take down his name incorrectly by missing a dot, they may not find him in the database. For instance, if they apprehend John Smi.th but search for the equally common name Smith, either he will not appear in the results list or they will match on the wrong person. John Smith may have no previous convictions, whereas John Smi.th may be a serial offender. At the very least the police end up creating two records for the same person, and in the worst case he gets a lesser sentence than he otherwise might.

This problem isn't unfamiliar in databases of English names, and a way round it is to do wildcard searches, where you include a character which matches against any single character or string of characters. So a search for, say, Sm*th or even Sm* would match on Smith and Smi.th. Clearly, the less specific the search, the greater the number of false matches. If we just searched for * we would find everyone in the database!
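To make the trade-off concrete, here is a minimal sketch in Python using the standard library's shell-style matcher. The list of names and the `wildcard_search` helper are made up for illustration, and "Smi.th" is just the stand-in used above for a name containing a dotted letter; a real system would run the wildcard inside the database rather than in application code.

```python
from fnmatch import fnmatchcase

# Hypothetical sample data; "Smi.th" stands in for a name with a dotted letter.
NAMES = ["Smith", "Smi.th", "Smyth", "Jones"]


def wildcard_search(pattern, names=NAMES):
    """Return the names matching a shell-style pattern (* = any run of characters)."""
    return [name for name in names if fnmatchcase(name, pattern)]


exact = wildcard_search("Smith")      # no wildcard: misses the dotted spelling
narrow = wildcard_search("Smi*th")    # picks up both spellings
broad = wildcard_search("Sm*")        # also drags in Smyth, a false match
everything = wildcard_search("*")     # matches every name in the list

print(exact, narrow, broad, everything, sep="\n")
```

The narrower pattern finds both spellings without the false match, while the bare * returns the whole list.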

The problem is further compounded in Dhivehi when the first and last names may be transposed, although once again this isn't entirely unfamiliar in UK databases which store the names of, in particular, people of Asian descent. It is normal to store first, middle and last names in separate fields in a database, in which case searches have to match on, say:

firstname = 'Smith' OR lastname = 'Smith'

If we have to include middle name(s) in the search, then the search becomes correspondingly more complex and will take longer.
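As a rough sketch of how that condition grows, the following Python uses an in-memory SQLite database. The `person` table, its column names, the sample rows and the `find_by_any_name` helper are all assumptions for illustration, not the schema of any of the systems discussed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, firstname TEXT, middlename TEXT, lastname TEXT)"
)
# Hypothetical rows: record 2 has the names transposed, record 3 has
# 'Smith' as a middle name.
conn.executemany(
    "INSERT INTO person (firstname, middlename, lastname) VALUES (?, ?, ?)",
    [
        ("John", None, "Smith"),
        ("Smith", None, "John"),
        ("Ali", "Smith", "Hassan"),
        ("Ali", None, "Hassan"),
    ],
)


def find_by_any_name(conn, name, include_middle=False):
    """Return ids of people whose first, last (and optionally middle) name equals `name`."""
    columns = ["firstname", "lastname"]
    if include_middle:
        columns.insert(1, "middlename")
    # One equality test per field, joined with OR; every extra field adds a clause.
    where = " OR ".join(f"{column} = ?" for column in columns)
    sql = f"SELECT id FROM person WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, [name] * len(columns))]


first_or_last = find_by_any_name(conn, "Smith")
with_middle = find_by_any_name(conn, "Smith", include_middle=True)
print(first_or_last, with_middle)
```

Searching first and last name catches the transposed record; the person with Smith as a middle name only turns up once the third clause is added.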




Pigeon and Pig.eon

For this reason, the police system stores the full name in a single field, and they have suggested we do the same, an idea that I'm not entirely comfortable with. In a large database, doing wildcard searches instead of exact match searches is going to impose a performance overhead. Then again, we are only looking at a dataset of around 1000 cases a year since the early 1990s.
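One way to see where the overhead comes from is to ask a database how it would run each kind of search. The sketch below uses Python with an in-memory SQLite database purely as an illustration; the `person` table, the index name and the sample rows are hypothetical, other database engines behave differently, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("CREATE INDEX idx_person_fullname ON person (fullname)")
# Hypothetical rows with the full name held in a single field.
conn.executemany(
    "INSERT INTO person (fullname) VALUES (?)",
    [("John Smith",), ("John Smi.th",), ("Ali Hassan",)],
)


def query_plan(sql, params):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


exact_sql = "SELECT id FROM person WHERE fullname = ?"
wildcard_sql = "SELECT id FROM person WHERE fullname LIKE ?"

# An exact match can look the name up through the index.
exact_plan = query_plan(exact_sql, ("John Smith",))
# A pattern with a leading wildcard has to examine every row.
wildcard_plan = query_plan(wildcard_sql, ("%Smi%th",))

exact_ids = [r[0] for r in conn.execute(exact_sql + " ORDER BY id", ("John Smith",))]
wildcard_ids = [r[0] for r in conn.execute(wildcard_sql + " ORDER BY id", ("%Smi%th",))]

print(exact_plan)
print(wildcard_plan)
print(exact_ids, wildcard_ids)
```

On a few tens of thousands of rows a full scan is unlikely to be noticeable, which fits the point about the size of the dataset.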

Every Maldivian has a national ID number, assigned at birth, and carries an ID card which includes their full name in Dhivehi and English, their 'permanent address' (the address at which they were first registered), a photo and, more recently, fingerprint data. The national ID database stores names as first, middle, last and common name, the latter being the name the individual commonly uses. So the police approach of a single name field is also at odds with the national ID database.

The final consideration in determining a solution to this problem relates to the merging of data from the police and social worker databases, which I intend to blog about separately.
