2011-02-04

Quality, Data and Otherwise

As our enterprise incorporates new data sets to help us detect emerging threats to human health, we encounter a recurring set of issues in accessing an organization's data. One of them is a reservation, usually unspoken, that the organization holds when it works with us. Put simply, everyone's data is a mess: missing values, entry errors, "idiosyncratic" spelling, and cryptic abbreviations, among other issues.

It's sort of like inviting someone to view your bedroom closet. If it's like mine, there are shoes piled in a corner, a top shelf with hats and boots that are rarely used, sweaters and sweatshirts mixed up on other shelves, a rack of ties of ascending ugliness, and shirts, slacks, and jackets in no particular order. Everything is there and I can find it, but a stranger would need time to figure out where everything is and more time to fit things together into useful outfits. Finally, there are vestigial items. (I have never been willing to get rid of the suit I wore on my wedding day even though it has not fit in decades and is wildly out of fashion. What lapels!)

We have been exploring a massive data set that covers our entire state, with millions of records, and includes both normalized data and free text. An expert team from our SAS partnership has investigated the quality issues across the data and is helping develop categorical extractions from the free text so we can apply our analytics model to the data to look for anomalies. Those anomalies may be evidence of issues of concern to human health, but we apply both analytics and subject matter expertise to make that determination. The results are signals that can guide public officials' actions and understanding of an incident.
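For readers who want a concrete picture of what "categorical extraction from free text" can look like, here is a minimal sketch in Python. It is not the method our SAS partners use; the category names, the keyword patterns, and the sample notes are all invented for illustration, and a real mapping would come from subject matter experts.

```python
import re
from collections import Counter

# Hypothetical keyword patterns (assumption): they tolerate a few common
# misspellings and abbreviations of the kind found in messy free text.
CATEGORY_PATTERNS = {
    "respiratory": re.compile(r"\b(cough|sob|short(ness)? of breath|wheez\w*)\b", re.I),
    "gastrointestinal": re.compile(r"\b(vomit\w*|diarr?h?ea|nausea|n/v)\b", re.I),
    "fever": re.compile(r"\b(fever|febrile|temp)\b", re.I),
}


def extract_categories(note):
    """Return the sorted list of categories whose pattern matches the note."""
    if not note:
        return []  # a missing value yields no categories instead of an error
    return sorted(name for name, pat in CATEGORY_PATTERNS.items() if pat.search(note))


# Invented sample notes, including a blank entry and misspellings.
notes = [
    "Pt c/o COUGH and fever x3 days",
    "vomitting, diarhea since am",
    "",
    "SOB, wheezing",
]

# Tally how often each category appears across all notes.
counts = Counter(cat for note in notes for cat in extract_categories(note))
```

Once free text has been reduced to category counts like these, the counts can be compared over time or across regions, which is where the search for anomalies begins.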

The process of developing methods to identify those signals automatically involves sophisticated mathematics and domain expertise to develop rules to apply to the data. The rules allow the system to infer meaning from the data, but that meaning is tempered by a number of factors, such as the characteristics of the data, the number of data points, and connections to similar data that may add enough meaning to promote an anomaly to a signal. (Sort of like enlisting expert help in choosing the right tie, shirt, and jacket combination for those of us with aesthetic challenges.)
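To make the idea of "promoting an anomaly to a signal" a little more tangible, here is a toy rule in Python. It is only a sketch under stated assumptions: the `Anomaly` fields, the z-score test, and every threshold are hypothetical stand-ins, not the rules our experts have developed.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    category: str
    observed: int               # count in the current time window
    baseline_mean: float        # historical average for the same window
    baseline_std: float         # historical standard deviation
    n_baseline_points: int      # how many historical windows back the baseline
    corroborating_sources: int  # similar data sets showing the same pattern


def is_signal(a, z_threshold=3.0, min_points=8, min_corroboration=1):
    """Promote an anomaly to a signal only if every hypothetical rule passes."""
    # Too little history (or no variation) means the baseline cannot be trusted.
    if a.n_baseline_points < min_points or a.baseline_std <= 0:
        return False
    # How unusual is the observed count relative to history?
    z = (a.observed - a.baseline_mean) / a.baseline_std
    # Require both a large deviation and support from related data.
    return z >= z_threshold and a.corroborating_sources >= min_corroboration


strong = Anomaly("respiratory", 40, 20.0, 5.0, 30, 2)
thin_history = Anomaly("respiratory", 40, 20.0, 5.0, 3, 2)
uncorroborated = Anomaly("respiratory", 40, 20.0, 5.0, 30, 0)
```

In this sketch only `strong` would be promoted: the other two show the same spike but lack either enough history or support from similar data. In practice, subject matter experts would still review whatever such a rule flags.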

What has impressed me in the process of understanding the data is the variety of experts we have involved. Their focus on developing these rules from the data available exemplifies a major kind of creativity in our project. While based on a variety of sciences, it is an art form, and these are masters.
