Monday 16 November 2009

Bothersome Datasets

Of the half dozen agencies who will use the CPD, there are two who will provide the bulk of the information to be stored in it. Both already have databases which record information relating to child protection, delinquency and other behavioural issues, and gender-based violence. One of the agencies plans to replace its existing database with this new one, whereas for the other, the CP database will run in parallel with its own, holding more detailed information about Child Protection cases.

Both databases have been in existence since the early 1990s, and the suggestion is that about 75% of the cases in the one database are also recorded in the other. The remaining 25% of the former would be cases of, say, domestic violence or bad behaviour which were only investigated by one agency. It seems possible, though it has yet to be confirmed, that there may also be cases in the second agency's system which should be migrated to the CP database and which have no equivalent record in the first.

Merging datasets is a challenge at the best of times; where fields overlap you have to decide which data to use, based on factors such as quality or completeness. The task which lies ahead would be daunting enough if that were the only issue. We have two additional challenges to overcome:

1) One dataset is recorded in Dhivehi, whilst the other is in English

2) There is no common reference. Each database has its own case reference number for uniquely identifying a case, so the same case has a different reference number in each database and neither holds a cross reference to the other. The people in the case, or at least some of them, will be the same. However, while one system has links to the National Registration database and records the National Id of case subjects, the other system doesn't, so we can't use that to match people across the two datasets. We could attempt a match on names, although variations in spelling are not uncommon, and one would expect the start dates recorded for the case to be at least similar. Nevertheless, it looks like this is going to require a large amount of matching cases by eye unless there are documents or some other information providing a common unique identifier.
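
To give a feel for what matching on names and start dates might involve, here is a minimal sketch in Python. It is not what either system does today. The `CaseRecord` fields, the similarity threshold and the date window are all assumptions made for illustration, and it assumes the Dhivehi names have already been transliterated into Latin script, which is a problem in its own right. It only produces a shortlist of candidate pairs; a person would still have to confirm each one by eye.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class CaseRecord:
    # Hypothetical record shape; real field names will differ per system.
    case_ref: str       # legacy case reference from the source system
    subject_name: str   # assumed to be already transliterated to Latin script
    start_date: date


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and extra spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def candidate_matches(left, right, min_similarity=0.85, max_days_apart=30):
    """Return (left_ref, right_ref, score) tuples worth checking by eye.

    The threshold and the date window are guesses and would need tuning
    against a sample of cases that have been matched manually.
    """
    candidates = []
    for l in left:
        for r in right:
            # Skip pairs whose start dates are too far apart to be one case.
            if abs((l.start_date - r.start_date).days) > max_days_apart:
                continue
            score = name_similarity(l.subject_name, r.subject_name)
            if score >= min_similarity:
                candidates.append((l.case_ref, r.case_ref, round(score, 2)))
    return candidates
```

Comparing every record with every other is quadratic, so for two databases going back to the early 1990s the records would probably need grouping first, for example by year of the start date.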


Currently I think the best option is to import the two datasets independently and store them with their legacy case identifiers, without making any attempt to match cases. It's probably more important that we can match people at some stage, as this will allow users to see what cases a person has previously been a subject of.
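
As a rough illustration of that option, the sketch below uses Python's built-in `sqlite3` module to hold cases from both sources side by side, each keyed by its source system and legacy reference, with people stored separately so that they can be linked to cases later. The table names, column names and the `agency_a` / `agency_b` labels are placeholders invented for this example, not the actual CP database design.

```python
import sqlite3

# Hypothetical schema: cases keep their legacy reference and source system,
# and no attempt is made to merge cases across the two sources.
SCHEMA = """
CREATE TABLE person (
    person_id   INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL,
    national_id TEXT              -- only one source system records this
);
CREATE TABLE imported_case (
    case_id         INTEGER PRIMARY KEY,
    source_system   TEXT NOT NULL,   -- placeholder labels, e.g. 'agency_a'
    legacy_case_ref TEXT NOT NULL,
    UNIQUE (source_system, legacy_case_ref)
);
CREATE TABLE case_subject (
    case_id   INTEGER NOT NULL REFERENCES imported_case(case_id),
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    PRIMARY KEY (case_id, person_id)
);
"""


def open_store():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def cases_for_person(conn, person_id):
    """List (source_system, legacy_case_ref) for every case a person is in."""
    rows = conn.execute(
        "SELECT c.source_system, c.legacy_case_ref "
        "FROM imported_case c "
        "JOIN case_subject s ON s.case_id = c.case_id "
        "WHERE s.person_id = ? "
        "ORDER BY c.source_system, c.legacy_case_ref",
        (person_id,),
    )
    return rows.fetchall()
```

The unique constraint on source system plus legacy reference means the same reference number can safely appear in both sources. Once two person records are judged to be the same individual, their `case_subject` rows can be pointed at a single `person_id`, and that person's case history then spans both legacy systems without the cases themselves ever having been merged.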






The 2007 workshop on Bandos
