Big Data is, as Wikipedia puts it, “data sets that grow so large that they become awkward to work with using on-hand database management tools.” Think every Twitter post ever made or 10,000 people’s DNA sequences.
Big data, I think, has a big brother which I’ll call Small Data.
Small Data consists of data sets which are manageable using a spreadsheet yet hard to obtain, hard to process into meaningful results or non-obvious in how to visualize and share. I’ll get to an example soon, but first let’s put this in a table. I’m looking at it from the point of view that Big Data is used by a data scientist and Small Data should be usable by someone with basic spreadsheet skills.
| |Big Data|Small Data|
|---|---|---|
|Data size|Billions of rows|Hundreds / Thousands of rows|
|Ease to obtain|Hard (cost to host)|Hard (cost to find)|
|Ease to process|Hard (hardware costs)|Hard (time costs)|
|Ease to visualize|Easy (data scientist thinks about this all day)|Hard (Joe User knows how to plot a graph and that’s it)|
|Ease to analyze|Hard (hardware costs)|Hard (don’t know which questions to ask)|
I’m thinking out loud here but maybe the only difference is really dataset size and everything is still “hard” in some way.
Let’s jump to my example. Which books on my to-read list can I find cheaply, or for free at the library as a book, audio CD or library ebook? My library will let me read books on a Nook or Kindle. In fact they will even let me listen to many books on an MP3 device but I excluded that.
I want to be very clear – this should be an easy question to answer.
If you don’t value your time it is in fact an easy question to answer: go through your list and search for every book in your local library’s catalog. But of course your list could be in one of 100 formats and your library has its own formatting.
Luckily my wish list is on Amazon. At 350-odd books it’s too long for me to check by hand in a reasonable amount of time. My library is the excellent KCLS and the most trafficked library in the US, or something.
We’ve excluded the time-consuming manual route. As someone with programming skills I should be able to write a script to connect the two and spit things out, right? Not really. First, there is no lists API on Amazon, supposedly because it saw little use, but if I had to guess it’s because of the potential for revenue loss if I import my list to the competition. Searching around, there are solutions that involve putting your list in a simplified view and scraping the data from it into a Google spreadsheet.
Sadly none of them worked. Even if they had, my wish list is multi-page and I’d have to mess around scraping each page. What I learnt though is that you can put your list into a simplified view by adding ?layout=compact on the end of its URL (example).
Just copying and pasting the table into Excel resulted in data. I did that four times, once for each page of my list, and voila I have some data!
Crappy data, but data. Out of the table all I got was the book title and the price. The price was missing in cases where the book was only available second hand. I could probably spend an infinite amount of time playing with the copy & paste to extract the URL of the book or write a script to do that. To be clear though I value my time and my judgement was that I could get lost in the scripting route and spend hours iterating over data and fixing tiny bugs but I wanted the most efficient way to answer my question. I also wanted to assume the role of Joe User with some mild spreadsheet skills.
There’s a great way to use the crowd to answer questions about data: Mechanical Turk. With it I could spend some small amount of money to ask people to do things with my spreadsheet. I could upload the sheet and then spend a penny or so per row and the Turk system would allow any of its users to go do something like find out whether book X was available in format Y at my library.
Again I could go scrape my library website with a script or hope that they had an API. Again my guess was Turking it would be cheaper and quicker.
Off to Mturk.com we go. The first thing you have to do is design your task. Mine looked like this:
Thus an individual turker would see one or many of these tasks. Each would have the book name and ask for the price.
Uploading the sheet was non-trivial. First you need to put a heading on each column so I did ‘name’ and ‘price’. Then you need to export it to CSV. So I did, and sent it up to the cloud. MTurk failed saying there were invalid characters so I went into the CSV with a text editor and replaced a stray Unicode umlaut with its ASCII cousin. This is all a bunch of gobbledygook which roughly translates as “you need 3-8 years of computer science skills to do all this easily”.
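For the scripting-inclined, the umlaut fix can be automated rather than done by hand in a text editor. This is only a minimal sketch using Python's standard library, not what I actually did; the function name `to_ascii` is made up for illustration, and it assumes that dropping characters with no ASCII equivalent (curly quotes, for example) is acceptable.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Replace accented letters with their closest ASCII base letter."""
    # NFKD decomposition splits "ö" into "o" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks, and also
    # silently drops any character that has no ASCII decomposition at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("Gödel, Escher, Bach"))
```

Running every cell of the CSV through a function like this before uploading would sidestep the "invalid characters" error, at the cost of losing the original spelling.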
I paid one cent per task. That adds up to $3.50 for all the books, then some Amazon fees bump it to $5 or so. Now I have a list of books, prices and the “best price” which I took to mean the 2nd hand price.
I’m format agnostic so I don’t care if I have a new book, 2nd hand, audio book, Kindle, library, ebook from the library or what. It’s all the same content just in different packaging. Of course some are more accessible than others so I can spend $10 to get a book instantly on Kindle or I can wait some number of hours or days to get it from the library. Thing is, my list is big enough that assuming the books are equally interesting (and they have to be since I haven’t read them to have data on which are more interesting than others) I can always get something interesting in any format I want.
This is important since right now choosing a book to read at random implies the format. So I pick a book from my list and maybe I have to pay for it, wait for it at the library or whatever. Instead, I want to be able to say “let’s get the book on my list in CD Audio that’s been on the list for the longest time” or something like that when I no longer have any CDs in the car for the commute.
I have that data now (title, price, best price) only in theory. I have the first two in one spreadsheet. The third is in another sheet that MTurk will export to me as a CSV. Connecting the two requires some spreadsheet skills. So I put the two in different sheets in the same workbook. Then I use the title in my original sheet to look up the best price in the second sheet using the VLOOKUP function. The second sheet that Amazon exports adds all kinds of data like the time the task was done, the ID of the Turker and stuff I don’t care about.
So I go locate the columns I do care about, which are 28 columns over. I spend a bunch of time trying to use the LOOKUP function in Excel to find the data I want before figuring out that LOOKUP does some kind of odd interpolation thing and expects the data to be sorted. Amazon Mechanical Turk doesn’t return the data in the same order I sent it so I spend time playing around with VLOOKUP which has a different syntax to LOOKUP. Finally I get some data which looks like 350 or so rows of (title, price, best price).
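The same join can be sketched outside of Excel. The snippet below is an illustration of what VLOOKUP is doing here, not the method I used: it builds a title-to-answer lookup from the results file, so row order doesn't matter. The column names `Input.name` and `Answer.best_price` are assumptions about how the export labels the uploaded column and the answer field; check the header row of your own file.

```python
import csv
import io


def join_best_price(wishlist_csv: str, results_csv: str) -> list[dict]:
    """Attach each title's best price from the results to the wish list rows."""
    # Build a title -> best price lookup. A dict is keyed by exact match,
    # so the results can arrive in any order (unlike Excel's LOOKUP).
    results = csv.DictReader(io.StringIO(results_csv))
    best_by_title = {
        row["Input.name"].strip(): row["Answer.best_price"].strip()
        for row in results
    }

    joined = []
    for row in csv.DictReader(io.StringIO(wishlist_csv)):
        title = row["name"].strip()
        joined.append({
            "name": title,
            "price": row["price"],
            # An empty string marks titles with no answer yet.
            "best_price": best_by_title.get(title, ""),
        })
    return joined
```

Exact-match lookup is the important design choice: it is the equivalent of passing FALSE as the last argument to VLOOKUP.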
We’re not done yet; we have quite some time to go. Now that I have the pricing, I still don’t know the availability at the library. So I go back and modify my Turk task and instead of asking people to find each book on Amazon I ask them to go to my library and find if it’s available in paper, CD or ebook. I do these as three separate tasks and each link is slightly different. You can ask the KCLS catalog to search for the type of asset and also if it’s available (since I don’t care if they have a book but they don’t have copies any more). So I do some magic with the URL parameters so that the Turker doesn’t have to do that.
At each step I’m trying to make it a simple Turk task without spending an infinite amount of time making it too simple. For example I could encode the name of the book using urlencode() or something and then they wouldn’t have to copy/paste each book title. Economics becomes useful. Since the floor, the lowest I can pay per book title, is one penny all I have to do is not make it difficult enough that the Turker wants 2 pennies in compensation. So, I encode the book type (cd, ebook, paper) in the URL so they don’t have to interact with the search form. Then I add #available to the start of every book title in the turk task. This is a magic string which tells the KCLS catalog to only show available books. I could have put it in a URL parameter but I was bored and elegance was not the goal.
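For reference, the encoding step I skipped would look roughly like this. It is a sketch only: the catalog address and the parameter names `q` and `format` are placeholders rather than the real KCLS URL scheme, and the space after `#available` is a guess.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; the real catalog URL will differ.
CATALOG_SEARCH_URL = "https://catalog.example.org/search"


def build_search_url(title: str, fmt: str) -> str:
    """Build a ready-to-click search link for one book in one format."""
    # Prefix the title with the magic string that limits results to
    # available copies, as described above.
    query = f"#available {title}"
    # urlencode percent-escapes the "#" and turns spaces into "+", so the
    # whole query survives being pasted into a link.
    return CATALOG_SEARCH_URL + "?" + urlencode({"q": query, "format": fmt})


if __name__ == "__main__":
    print(build_search_url("Moby Dick", "cd"))
```

With a link like that in each task, the Turker would only have to click and answer, with no copy/paste at all.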
Here’s the resulting Turk task example:
Note that I also changed the text box input to a yes/no option. I didn’t do that to make it easier for the turker, it makes it easier for me. If I allowed them to type “yes” then I would end up with lots of variations like “Yes”, ” yes”, “YES” and so on that I am uninterested in processing.
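To give a sense of the cleanup a free-text box would have required, here is a minimal sketch. The function name and the accepted spellings are my own illustration; real answers would surely include variations not covered here.

```python
def normalize_yes_no(answer: str):
    """Map free-text answers to True/False, or None if unrecognized."""
    # Strip stray whitespace and ignore case: " yes", "YES" -> "yes".
    cleaned = answer.strip().lower()
    if cleaned in {"yes", "y"}:
        return True
    if cleaned in {"no", "n"}:
        return False
    # Anything else ("maybe", "", "yse") needs a human to look at it.
    return None
```

A yes/no option in the task makes all of this unnecessary, which is the whole point.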
Now I have an additional three spreadsheets with the title plus one of paper availability, CD availability or ebook availability.
Except I don’t. In the hour after I submitted each task about 98% of the results were in so I have partial results. This was still quicker than I expected but it also included some bad data. I double checked a few of the results and found one primary turker who had entered “yes” to 77 books without bothering to actually check the KCLS website. I banned him, didn’t find any more mass bad data and left it there.
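I found the bad Turker by spot-checking, but the pattern (lots of tasks, always the same answer) is easy to flag automatically. The sketch below assumes each result row has a `WorkerId` and an `answer` key, and the threshold of 20 tasks is arbitrary; a flagged worker is only a candidate for manual checking, not proof of cheating.

```python
from collections import Counter, defaultdict


def flag_uniform_workers(rows, min_tasks=20):
    """Return (worker, task_count) for workers who always gave one answer."""
    answers_by_worker = defaultdict(Counter)
    for row in rows:
        answers_by_worker[row["WorkerId"]][row["answer"]] += 1

    flagged = []
    for worker, counts in answers_by_worker.items():
        total = sum(counts.values())
        # Only flag prolific workers whose answers never varied.
        if total >= min_tasks and len(counts) == 1:
            flagged.append((worker, total))
    # Most prolific first, since they affect the most rows.
    return sorted(flagged, key=lambda item: -item[1])
```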
I have a bunch of options here. I could ask multiple independent turkers to check the results. I could batch them up so one task required checking 3 formats not 3 tasks each checking one format each. I could pay more and attract more reliable turkers. But for now I’m happy with the results.
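The first option, asking several independent Turkers the same question, needs a rule for combining their answers. A simple majority vote is one possibility, sketched here as an illustration; I did not run this, and treating ties as "unknown" is my own choice.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common answer, or None on a tie or empty input."""
    if not answers:
        return None
    ranked = Counter(answers).most_common()
    # If the top two answers have the same count there is no majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

Asking an odd number of Turkers per yes/no question avoids ties entirely, at the price of tripling the cost.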
Long story short here is my final result:
I’ve used some conditional formatting to color the availability and price of the books. Plus I have one derived column looking for which books have the best 2nd hand to brand new price ratio.
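The derived column is just the second-hand price divided by the new price, where a lower number means a better bargain. As a sketch (the key names `name`, `price` and `best_price` mirror my spreadsheet headings, and prices are assumed to be already parsed into numbers, with `None` where a price is missing):

```python
def used_to_new_ratio(new_price, used_price):
    """Second-hand price as a fraction of the new price, or None if unknown."""
    # Books only available second hand have no new price to compare with.
    if new_price is None or used_price is None or new_price <= 0:
        return None
    return used_price / new_price


def rank_by_bargain(books):
    """List (title, ratio) pairs, best second-hand bargain first."""
    scored = [
        (used_to_new_ratio(b["price"], b["best_price"]), b["name"])
        for b in books
    ]
    # Drop books without a ratio, then sort ascending by ratio.
    ranked = sorted((r, n) for r, n in scored if r is not None)
    return [(name, round(ratio, 2)) for ratio, name in ranked]
```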
Now, a few hours and $20 later I can actually ask some questions about the data. The problem is that Joe User wouldn’t get this far and it’s 2012. We should be able to do this kind of thing. For the curious, here is the complete data.
Back to theory
Big Data is great but clearly we can’t even tackle simple Small Data problems. The data collection is hard, the analysis is hard and the skill sets required are far beyond where they need to be.
There are a number of approaches happening today to try and help solve some of these problems and they go down approximately two routes. On one side there are those who believe “if only all the data in the world was all in some universal format” then things would magically be better. On the other “if only we had strong AI” then things would magically be better.
The latter may happen and clearly would be able to solve my questions. The question is at what cost? Optimistic singularity predictions are still decades out.
I’m pretty skeptical about the semantic web / linked data model whereby merely linking everything together or putting it all in one schema, or some combination, will help anyone. One reason is that it’s been done. Freebase is still ahead of its time, it embodies the “huge graph of data in the sky” and it plugs some gaps. “But what if everybody put their data in there!?” I hear you cry. Well, what if everybody stopped smoking, had their vitamins and didn’t read tabloids? It’s not going to happen.
There is some argument to be made that if we merely all used the same format that would help. And it would a little. But remember we tried that with XML. JSON is cute but again the value proposition for us all to move to JSON is not there yet. Either way it wouldn’t help me with my library book questions. My guess is that even if Amazon spit out JSON for my wish list and my library had a JSON API then I’m still waiting on the logic to tie them together. Maybe it will solve my problem, maybe not.
The halfway houses abound. Siri and Wolfram Alpha leap out as combining a sprinkling of data with a soupçon of machine intelligence. Look how brilliantly they do! The domains they service may be tight but they offer us a peek of how things will work.
My guess is that the future looks like a munging together of Small Data, Big Data, automatic processing and human intelligence used as and when appropriate. Today we have some wild stabs in the dark at each of these but nothing like the coherent platforms of the future we could wave our hands to describe. It’s going to be fun to see it happen.