23 January 2013 - 0:09Layers of “data science”?
The field of “data science” covers a lot of areas, it feels like there’s a continuum of layers that can be considered and lumping them all as “data science” is perhaps less helpful than it could be. Maybe by sharing my list you can help me with further insight. In terms of unlocking value in the underlying data I see the least to most valuable being:
- Storing data
- Making it searchable/accessible
- Augmenting it to fashion new data and insights
- Understanding what drives the trends in the data
- Predicting the future
Storing a “large” amount of data has always been feasible (data warehouses of the 90s don’t sound all that different to our current Big Data processing needs). If you’re dealing with daily Terabyte dumps from telecomms, astro arrays or LHCs then storing it might not be economical but it feels that more companies can easily store more data this decade than in previous decades.
Making the data instantly accessible is harder, this used to be the domain of commercial software and now we have the likes of postgres, mongodb and solr which scale rather well (though there will always be room for higher-spec solutions that deal with things like fsync down to the platter level reliably regardless of power supply and modeling less usual data structures like graphs efficiently). Since CPUs are cheap building a cluster of commodity high-spec machines is no longer a heavy task.
Augmenting our data can makes it more valuable. By example – applying sentiment analysis to a public tweet stream and adding private demographic information gives YouGov’s SoMA (disclosure – I’m working on this via AdaptiveLab) an edge in the brand-analysis game. Once you start joining datasets you have to start dealing with the thorny problems – how do we deal with missing data? If the tools only work with some languages (e.g. English), how do we deal with other languages (e.g. the variants of Spanish) to offer a similarly good product? How do we accurately disambiguate a mention of “apple” between a fruit and a company?
Modeling textual data is somewhat mainstream (witness the availability of Sentiment, NER and categorisation tools). Doing the same for photographs (e.g. Instagram photos) is in the quite-hard domain (have you ever seen a food-identifier classifier for photos that actually works?). We rarely see any augmentations for video. For audio we have song identification and speech recognition, I don’t recall coming across dog-bark/aeroplane/giggling classifiers (which you might find in YouTube videos). Graph network analysis tools are at an interesting stage, we’re only just witnessing them scale to large data amounts of data on commodity PCs and tieing this data to social networks or geographic networks still feels like the domain of commercial tools.
Understanding the trends and communicating them – combining different views on the data to understand what’s really occurring is hard, it still seems to involve a fair bit of art and experience. Visualisations seems to take us a long way to intuitively understanding what’s happening. I’ve started to play with a few for tweets, social graphs and email (unpublished as yet). Visualising many dimensions in 2 or 3D plots is rather tricky, doubly so when your data set contains >millions of points.
Predicting the future – in ecommerce this would be the pinacle – understanding the underlying trends well enough to be able to predict future outcomes from hypothesised actions. Here we need mathematical models that are strong enough to stand up to some rigorous testing (financial prediction is obviously an example, another would be inventory planning). This requires serious model building and thought and is solidly the realm of the statistician.
Currently we just talk about “data science” and often we should be specifying more clearing which sub-domain we’re involved with. Personally I sit somewhere in the middle of this stack, with a goal to move towards the statistical end. I’m not sure one how to define the names for these layers, I’d welcome insight.
This is probably too simple a way of thinking about the field – if you have thoughts I’d be most happy to receive them.
Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.