Thursday, February 10, 2011

Curating Open Health Data with Google Refine

In a previous post, I briefly discussed the meaning and implications of open, linked data. Today I will discuss some work I did at a recent Health 2.0 Hackathon with a particular data set.

The Tools

CHSI
I decided to start with the Community Health Status Indicators (CHSI) from HHS. I was familiar with this data set, having written a brief developer's guide for the first Health 2.0 Hackathon last fall. The data set comes from HealthData.gov, part of the government's ongoing "open government" initiative under President Obama and national CTO Aneesh Chopra.

Freebase
Freebase is an open semantic web database. This is the "linked data" part of our exercise. LinkedData.org explains what linked data is, so we won't cover it in depth here. Our only concern is making connections between the open data released by HHS and real-world data in the semantic web.

Google Refine
Google Refine (formerly GridWorks) is a tool for curating, reducing, and linking data using Freebase. Using Google Refine we can take an ordinary spreadsheet, correlate it with semantic data sets in Freebase, and create sets of triples for import into Freebase itself. For this exercise, I created a "base" or domain of data in Freebase called CHSI. However, translating the tabular data into triples turned out to be too large a challenge to address in the time allotted for the first session.
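To make the idea of "tabular data into triples" concrete, here is a minimal Python sketch. It is not how Google Refine or Freebase does it. The column names and values are placeholders I made up for illustration, not the real CHSI schema, and the subject is simply built by joining the key columns.

```python
import csv
import io

# Illustrative rows only: the column names and values are placeholders,
# not the real CHSI schema or real CHSI figures.
SAMPLE_CSV = """state,county,indicator_a,indicator_b
Oregon,Washington County,1,2
Maine,Washington County,3,4
"""

def rows_to_triples(csv_text, key_columns):
    """Turn every non-key cell into a (subject, predicate, object) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The subject identifies the row; here it is built from the key columns.
        subject = "/".join(row[k] for k in key_columns)
        for column, value in row.items():
            if column not in key_columns:
                # The column header acts as the predicate, the cell as the object.
                triples.append((subject, column, value))
    return triples

triples = rows_to_triples(SAMPLE_CSV, ["state", "county"])
for triple in triples:
    print(triple)
```

Each row fans out into one triple per non-key column. A real import would replace the joined-string subject with the identifier of the reconciled entity.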

The Process
The first step is to take a set of data in CSV format and import it into Google Refine as a new project.

This is easy enough and produces a spreadsheet in the familiar fashion.

Now, creating a spreadsheet is just the first step. The real magic happens when we link data in this spreadsheet to semantic data in Freebase. The act of linking data to the real world is called reification, and in Freebase this is done through the "reconcile" function. By clicking on the menu (arrow) icon on a column header, we see a number of menu options, one of which is "Start reconciling..."


The first thing to reconcile is the state. This is easy for Freebase to reason through, as state names are unique and easily recognized. After reconciling, we see each state name is now hyperlinked. We can follow the hyperlink to the Freebase entry for that state.

Next, we want to reconcile counties. The CHSI data is arranged by county, so we can get a fine-grained view of the nation's health data geographically. To reconcile county, we go through the same process.


In the next illustration, you see that Freebase has recognized the county name and offers US County as the default semantic data type for that column. Reconciling on the name alone is hit-or-miss, because many county names are shared across states, so we want to give Freebase a little more information about this data element. In this case, we can include another column as an extra hint. For our additional column we select state name and start typing in the relationship "contained by." As you start typing, Freebase auto-completes the relationship.
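The effect of the extra hint can be sketched in a few lines of Python. This is a toy model of the idea, not the reconciliation service itself: the candidate list, the ids, and the reconcile function are all hypothetical, and real reconciliation also does fuzzy matching and scoring.

```python
# Toy candidate entities. The ids are made up for illustration and are
# not real Freebase identifiers.
CANDIDATES = [
    {"id": "example/washington_county_or", "name": "Washington County", "contained_by": "Oregon"},
    {"id": "example/washington_county_me", "name": "Washington County", "contained_by": "Maine"},
    {"id": "example/multnomah_county_or", "name": "Multnomah County", "contained_by": "Oregon"},
]

def reconcile(name, state=None):
    """Return ids of candidates matching the name, optionally narrowed by state."""
    matches = [c for c in CANDIDATES if c["name"].lower() == name.lower()]
    if state is not None:
        # The state column plays the role of the "contained by" hint.
        matches = [c for c in matches if c["contained_by"].lower() == state.lower()]
    return [c["id"] for c in matches]

name_only = reconcile("Washington County")
with_hint = reconcile("Washington County", state="Maine")
print(name_only)
print(with_hint)
```

With the name alone there are two candidates and no way to choose between them. Adding the state narrows the match to a single entity.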



After going through this process, we have hyperlinks in the state and county name columns. These link directly to Freebase, so each cell is now semantically linked to its respective entity. Now we can add more columns based on data in Freebase. If you go to the Freebase entry for a county, you will see a number of data elements listed, such as GDP, population, pollution levels, household income, adjoining counties, geographical features (the "contained in" relationship), and many others. All of these can be added as additional columns in your spreadsheet.
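Conceptually, adding columns from linked entities is a lookup keyed on the reconciled identifier. The Python sketch below assumes a small in-memory property store. The ids, property names, and values are placeholders of mine, not data pulled from Freebase.

```python
# Hypothetical property store keyed by entity id. Values are placeholders.
ENTITY_PROPERTIES = {
    "example/county_a": {"population": 1000, "adjoins": ["example/county_b"]},
    "example/county_b": {"population": 2000, "adjoins": ["example/county_a"]},
}

def add_columns(rows, properties, id_column="county_id"):
    """Return copies of the rows extended with the requested entity properties."""
    extended = []
    for row in rows:
        entity = ENTITY_PROPERTIES.get(row[id_column], {})
        new_row = dict(row)  # copy so the original rows stay unchanged
        for prop in properties:
            # Unreconciled or unknown entities get None for the new column.
            new_row[prop] = entity.get(prop)
        extended.append(new_row)
    return extended

rows = [
    {"county": "County A", "county_id": "example/county_a"},
    {"county": "County C", "county_id": "example/county_c"},
]
extended = add_columns(rows, ["population"])
for row in extended:
    print(row)
```

Rows whose entity is missing from the store keep an empty value in the new column, which is roughly what you would expect for cells that failed to reconcile.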

In my next post, I will discuss visualizing this data.

For more information on using Google Refine, see Jeni's blog post Using Freebase Gridworks to Create Linked Data.







