
Small Data

Big Data is, as Wikipedia puts it, “data sets that grow so large that they become awkward to work with using on-hand database management tools.” Think of every Twitter post ever made, or 10,000 people’s DNA sequences.

Big data, I think, has a big brother which I’ll call Small Data.

Small Data consists of data sets which are manageable using a spreadsheet yet are hard to obtain, hard to process into meaningful results, or non-obvious in how to visualize and share. I’ll get to an example soon, but first let’s put this in a table. I’m looking at it from the point of view that Big Data is used by a data scientist and Small Data should be usable by someone with basic spreadsheet skills.

                   | Big Data                                         | Small Data
Data size          | Billions of rows                                 | Hundreds / thousands of rows
Ease to obtain     | Hard (cost to host)                              | Hard (cost to find)
Ease to process    | Hard (hardware costs)                            | Hard (time costs)
Ease to visualize  | Easy (data scientist thinks about this all day)  | Hard (Joe User knows how to plot a graph and that’s it)
Ease to analyze    | Hard (hardware costs)                            | Hard (don’t know which questions to ask)

I’m thinking out loud here but maybe the only difference is really dataset size and everything is still “hard” in some way.

Long example

Let’s jump to my example. What books are on my list to read that I can find cheaply or for free at the library as a book, audio CD or library ebook? My library will let me read books on a nook or kindle. In fact they will even let me listen on an MP3 device to many books but I excluded that.

I want to be very clear – this should be an easy question to answer.

If you don’t value your time it is in fact an easy question to answer: go through your list and search for every book at your local library. But of course your list could be in one of 100 formats and your library has its own formatting.

Luckily my wish list is on Amazon. It’s too many for me to do by hand in a reasonable amount of time (350 odd books). My library is the excellent KCLS and the most trafficked library in the US, or something.

We’ve excluded the time-consuming manual route. As someone with programming skills I should be able to write a script to connect the two and spit things out, right? Not really. First, there is no lists API on Amazon, supposedly because it saw little use, but if I had to guess it is because of the potential for revenue loss if I import my list to the competition. Searching around, there are solutions that involve putting your list in a simplified view and scraping the data into a Google spreadsheet.

Sadly none of them worked. Even if they did, my wish list is multi-page and I’d have to mess around scraping each page. What I learnt though is that you can put your list in to a simplified view by adding ?layout=compact on the end of it (example).

Just copying and pasting the table in to Excel resulted in data. I did that four times, once for each page of my list and voila I have some data!

Crappy data, but data. Out of the table all I got was the book title and the price. The price was missing in cases where the book was only available second hand. I could probably spend an infinite amount of time playing with the copy & paste to extract the URL of the book or write a script to do that. To be clear though I value my time and my judgement was that I could get lost in the scripting route and spend hours iterating over data and fixing tiny bugs but I wanted the most efficient way to answer my question. I also wanted to assume the role of Joe User with some mild spreadsheet skills.

There’s a great way to use the crowd to answer questions about data: Mechanical Turk. With it I could spend some small amount of money to ask people to do things with my spreadsheet. I could upload the sheet and then spend a penny or so per row, and the Turk system would allow any of its users to go and do something like find out whether book X was available in format Y at my library.

Again I could go scrape my library website with a script or hope that they had an API. Again my guess was Turking it would be cheaper and quicker.

Off to Mechanical Turk we go. The first thing you have to do is design your task. Mine looked like this:

Thus an individual turker would see one or many of these tasks. Each would have the book name and ask for the price.

Uploading the sheet was non-trivial. First you need to put a heading on each column, so I used ‘name’ and ‘price’. Then you need to export it to CSV. So I did, and sent it up to the cloud. MTurk failed, saying there were invalid characters, so I went into the CSV with a text editor and replaced a stray Unicode umlaut with its ASCII cousin. This is all a bunch of gobbledygook which roughly translates as “you need 3-8 years of computer science skills to do all this easily”.
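For the curious, the hand edit described above could be scripted. This is only a minimal sketch, not what I actually did: it assumes the offending characters are accented Latin letters, and the function names `to_ascii` and `clean_csv_text` are made up for illustration.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Replace accented letters with their unaccented ASCII base letter.

    NFKD normalization splits a character such as "ö" into "o" plus a
    combining diaeresis; encoding to ASCII with errors="ignore" then drops
    the combining mark. Characters with no ASCII base (curly quotes, for
    example) are dropped entirely, which is an assumption you may not want.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def clean_csv_text(raw: str) -> str:
    """Apply to_ascii to every line of a CSV file's contents."""
    return "\n".join(to_ascii(line) for line in raw.splitlines())


if __name__ == "__main__":
    print(to_ascii("Möller"))
```

Note that this turns “ö” into “o”, not the German-style “oe”, so it is a blunt instrument.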

I paid one cent per task. That adds up to $3.50 for all the books, then some Amazon fees bump it to $5 or so. Now I have a list of books, prices and the “best price”, which I took to mean the 2nd hand price.

I’m format agnostic so I don’t care if I have a new book, 2nd hand, audio book, kindle, library, ebook from the library or what. It’s all the same content just in different packaging. Of course some are more accessible than others so I can spend $10 to get a book instantly on kindle or I can wait some number of hours or days to get it from the library. Thing is, my list is big enough that assuming the books are equally interesting (and they have to be since I haven’t read them to have data on which are more interesting than others) I can always get something interesting in any format I want.

This is important since right now choosing a book to read at random implies the format. So I pick a book from my list and maybe I have to pay for it, wait for it at the library or whatever. Instead, I want to be able to say “let’s get the book on my list in CD Audio that’s been on the list for the longest time” or something like that when I no longer have any CDs in the car for the commute.

I have that data now (title, price, best price) only in theory. I have the first two in one spreadsheet. The best price is in another sheet that MTurk will export to me as a CSV. Connecting the two requires some spreadsheet skills. So I put the two in different sheets in the same workbook. Then I use the title in my original sheet to look up the best price in the second sheet using the VLOOKUP function. The second sheet that Amazon exports adds all kinds of data like the time the task was done, the ID of the Turker and stuff I don’t care about.

So I go locate the columns I do care about, which are 28 columns over. I spend a bunch of time trying to use the LOOKUP function in Excel to find the data I want before figuring out that LOOKUP does some kind of odd interpolation thing and expects the data to be sorted. Amazon Mechanical Turk doesn’t return the data in the same order I sent it so I spend time playing around with VLOOKUP which has a different syntax to LOOKUP. Finally I get some data which looks like 350 or so rows of (title, price, best price).
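Outside of Excel, the same lookup is a join keyed on the title. The sketch below is an illustration under assumptions rather than what I ran: the column names `Input.name` and `Answer.best_price` are guesses at how the results file labels the uploaded column and the answer field, and `join_best_price` is a made-up name. Because it looks rows up by title, the order in which results come back doesn’t matter.

```python
import csv
import io


def join_best_price(wishlist_csv: str, results_csv: str):
    """Return (title, price, best_price) rows, matched on the book title.

    wishlist_csv has the columns 'name' and 'price'. results_csv is assumed
    to have 'Input.name' and 'Answer.best_price' among its many columns;
    those two names are assumptions for this sketch.
    """
    # Build a title -> best price lookup from the results file.
    lookup = {}
    for row in csv.DictReader(io.StringIO(results_csv)):
        lookup[row["Input.name"].strip()] = row["Answer.best_price"].strip()

    # Walk the original list in its own order and attach the looked-up value.
    joined = []
    for row in csv.DictReader(io.StringIO(wishlist_csv)):
        title = row["name"].strip()
        joined.append((title, row["price"], lookup.get(title, "")))
    return joined


if __name__ == "__main__":
    wishlist = "name,price\nBook A,10.00\nBook B,\n"
    results = (
        "HITId,WorkerId,Input.name,Answer.best_price\n"
        "h2,w1,Book B,3.50\n"
        "h1,w2,Book A,4.25\n"
    )
    for row in join_best_price(wishlist, results):
        print(row)
```

A title with no match gets an empty best price instead of an error, which is roughly what you want from VLOOKUP with an exact match.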


No, we have quite some time to go. Now I have the pricing but I don’t know the availability at the library. So I go back and modify my Turk task: instead of asking people to find each book on Amazon, I ask them to go to my library and find out if it’s available in paper, CD or ebook. I do these as three separate tasks and each link is slightly different. You can ask the KCLS catalog to search for the type of asset and also whether it’s available (since I don’t care if they list a book but don’t have copies any more). So I do some magic with the URL parameters so that the Turker doesn’t have to do that.

At each step I’m trying to make it a simple Turk task without spending an infinite amount of time making it too simple. For example I could encode the name of the book using urlencode() or something and then they wouldn’t have to copy/paste each book title. Economics becomes useful. Since the floor, the lowest I can pay per book title, is one penny all I have to do is not make it difficult enough that the Turker wants 2 pennies in compensation. So, I encode the book type (cd, ebook, paper) in the URL so they don’t have to interact with the search form. Then I add #available to the start of every book title in the turk task. This is a magic string which tells the KCLS catalog to only show available books. I could have put it in a URL parameter but I was bored and elegance was not the goal.
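Had I gone the extra step and encoded the title too, it might have looked something like this. It is a sketch only: the host `catalog.example.org` and the parameter names `q` and `format` are placeholders, not the real KCLS catalog URL scheme, and `build_search_url` is a made-up name. Only the `#available` prefix comes from what I actually did.

```python
from urllib.parse import urlencode

# Placeholder endpoint: NOT the real library catalog address.
BASE_URL = "https://catalog.example.org/search"


def build_search_url(title: str, book_format: str) -> str:
    """Build a search link so the Turker never has to type or paste anything.

    The '#available' prefix is the magic string that restricts results to
    copies that are available; 'q' and 'format' are hypothetical parameters.
    """
    params = {"q": f"#available {title}", "format": book_format}
    # urlencode percent-escapes '#' and turns spaces into '+'.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    for fmt in ("paper", "cd", "ebook"):
        print(build_search_url("Moby Dick", fmt))
```

The encoding matters mostly for the `#`, which a browser would otherwise treat as the start of a page fragment and never send to the server.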

Here’s the resulting Turk task example:

Note that I also changed the text box input to a yes/no option. I didn’t do that to make it easier for the turker, it makes it easier for me. If I allowed them to type “yes” then I would end up with lots of variations like “Yes”, ” yes”, “YES” and so on that I am uninterested in processing.

Now I have an additional three spreadsheets with the title plus one of paper availability, CD availability or ebook availability.

Except I don’t. In the hour after I submitted each task about 98% of the results were in so I have partial results. This was still quicker than I expected but it also included some bad data. I double checked a few of the results and found one primary turker who had entered “yes” to 77 books without bothering to actually check the KCLS website. I banned him, didn’t find any more mass bad data and left it there.

I have a bunch of options here. I could ask multiple independent Turkers to check the results. I could batch them up so that one task checks all 3 formats, rather than having 3 tasks each checking one format. I could pay more and attract more reliable Turkers. But for now I’m happy with the results.
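The first option, plus the kind of check that caught my 77-yes Turker, can be sketched in a few lines. This is an illustration under assumptions, not something I ran: the answers are taken to be (title, worker ID, "yes"/"no") tuples, and the function names and the `min_tasks` threshold are invented.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most common answer per title.

    answers is a list of (title, worker_id, answer) tuples. Asking an odd
    number of workers per title avoids ties in a yes/no question.
    """
    by_title = defaultdict(list)
    for title, _worker, answer in answers:
        by_title[title].append(answer)
    return {t: Counter(given).most_common(1)[0][0] for t, given in by_title.items()}


def suspicious_workers(answers, min_tasks=50):
    """List workers who did at least min_tasks tasks and never varied their answer."""
    by_worker = defaultdict(list)
    for _title, worker, answer in answers:
        by_worker[worker].append(answer)
    return sorted(
        worker
        for worker, given in by_worker.items()
        if len(given) >= min_tasks and len(set(given)) == 1
    )


if __name__ == "__main__":
    sample = [
        ("Book A", "w1", "yes"), ("Book A", "w2", "no"), ("Book A", "w3", "no"),
        ("Book B", "w1", "yes"), ("Book B", "w2", "yes"), ("Book B", "w3", "yes"),
    ]
    print(majority_vote(sample))
    print(suspicious_workers(sample, min_tasks=2))
```

A worker flagged this way still needs a manual spot check; someone could legitimately hit a long run of available books.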

Long story short here is my final result:

I’ve used some conditional formatting to color the availability and price of the books. Plus I have one derived column looking for which books have the best 2nd hand to brand new price ratio.
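The derived column is just a division, but for completeness here is a sketch of it outside the spreadsheet. It assumes the ratio is best (2nd hand) price divided by new price, with lower being a better bargain, and it skips books that are missing either price. The function names are made up.

```python
def second_hand_ratio(new_price, best_price):
    """Return best_price / new_price, or None when it can't be computed."""
    if new_price is None or best_price is None or new_price <= 0:
        return None
    return best_price / new_price


def rank_bargains(rows):
    """Sort (title, new_price, best_price) rows by ratio, best bargain first."""
    scored = [(title, second_hand_ratio(new, best)) for title, new, best in rows]
    scored = [(title, ratio) for title, ratio in scored if ratio is not None]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    rows = [("Book A", 20.0, 5.0), ("Book B", None, 3.0), ("Book C", 10.0, 1.0)]
    for title, ratio in rank_bargains(rows):
        print(f"{title}: {ratio:.2f}")
```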

Now, a few hours and $20 later I can actually ask some questions about the data. The problem is that Joe User wouldn’t get this far and it’s 2012. We should be able to do this kind of thing. For the curious, here is the complete data.

Back to theory

Big Data is great but clearly we can’t even tackle simple Small Data problems. The data collection is hard, the analysis is hard and the skill sets required are far beyond where they need to be.

There are a number of approaches happening today to try and help solve some of these problems and they go down approximately two routes. On one side there are those who believe “if only all the data in the world was all in some universal format” then things would magically be better. On the other “if only we had strong AI” then things would magically be better.

The latter may happen and clearly would be able to solve my questions. The question is at what cost? Optimistic singularity predictions are still decades out.

I’m pretty skeptical about the semantic web / linked data model whereby merely linking everything together or putting it all in one schema, or some combination, will help anyone. One reason is that it’s been done. Freebase is still ahead of its time, it embodies the “huge graph of data in the sky” and it plugs some gaps. “But what if everybody put their data in there!?” I hear you cry. Well, what if everybody stopped smoking, had their vitamins and didn’t read tabloids? It’s not going to happen.

There is some argument to be made that if we merely all used the same format that would help. And it would a little. But remember we tried that with XML. JSON is cute but again the value proposition for us all to move to JSON is not there yet. Either way it wouldn’t help me with my library book questions. My guess is that even if Amazon spit out JSON for my wish list and my library had a JSON API then I’m still waiting on the logic to tie them together. Maybe it will solve my problem, maybe not.

The halfway houses abound. Siri and Wolfram Alpha leap out as combining a sprinkling of data with a soupçon of machine intelligence. Look how brilliantly they do! The domains they service may be tight but they offer us a peek at how things will work.

My guess is that the future looks like a munging together of Small Data, Big Data, automatic processing and human intelligence used as and when appropriate. Today we have some wild stabs in the dark at each of these but nothing like the coherent platforms of the future we could wave our hands to describe. It’s going to be fun to see it happen.


I’ve built a site to create an open geocoding dataset over at

The premise I worked with is to change the way geocoders work. Today, a geocoder uses some chunk of import code to convert a large dataset from one format into another. Then the geocoder itself (which is a large piece of software) takes a string from the user like “london” and uses its imported dataset to eventually give you a bounding box. The client uses this bounding box to zoom and pan a map to the correct place.

What if you threw all that away and just linked the string “london” to a bounding box? Thus opengeocoder.

In previews the number one thing asked for was synonym support. That is, “AK” should spit out the same box as “Alaska” without having to add both strings and two bounding boxes. So, you can do that. There is an API which spits out JSON so you can hook your map project up to it.

OpenGeocoder starts with a blank database. Any geocodes that fail are saved so that anybody can fix them. Dumps of the data are available.
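To make the idea concrete, here is a toy sketch of the data model: strings linked straight to bounding boxes, synonyms pointing at an existing string, and failed lookups saved for someone to fix. It is not the OpenGeocoder code or its API; the class and method names are invented and the Alaska box is a rough illustrative value, not taken from the dataset.

```python
class TinyGeocoder:
    """Toy string -> bounding box lookup with synonyms and a failure log."""

    def __init__(self):
        self.boxes = {}      # normalized name -> (south, west, north, east)
        self.synonyms = {}   # normalized alias -> normalized name
        self.failed = []     # queries nobody has geocoded yet

    @staticmethod
    def _key(text):
        # Lowercase and collapse whitespace so "  London " matches "london".
        return " ".join(text.lower().split())

    def add(self, name, box):
        self.boxes[self._key(name)] = box

    def add_synonym(self, alias, name):
        self.synonyms[self._key(alias)] = self._key(name)

    def geocode(self, query):
        key = self._key(query)
        key = self.synonyms.get(key, key)  # follow a synonym if there is one
        box = self.boxes.get(key)
        if box is None and key not in self.failed:
            self.failed.append(key)        # save the miss so it can be fixed
        return box


if __name__ == "__main__":
    geocoder = TinyGeocoder()
    geocoder.add("Alaska", (51.0, -180.0, 71.5, -130.0))  # illustrative box only
    geocoder.add_synonym("AK", "Alaska")
    print(geocoder.geocode("AK"))
    print(geocoder.geocode("atlantis"), geocoder.failed)
```

The synonym stores a pointer to the canonical string, not a second copy of the box, so fixing the box for “Alaska” fixes “AK” too.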

There is much to add. Behind the scenes any data changes are wikified but not all of that functionality is exposed. It lacks the ability to point out which strings are not geocodable (things like “a”) and much more. But it’s a decent start at what a modern, crowd-sourced, geocoder might look like.

OpenStreetMapper Murdered

Ulf's killer?

Tragically, one of OpenStreetMap’s finest contributors is no longer with us:

“We are trying to find the people who killed our relative, Ulf Möller. On the evening of the 9th of January 2012, Ulf fell victim to a brutal robbery-murder in Eastern Germany. The people who attacked him apparently were from Eastern Europe, possibly from Poland or Lithuania. When they used Ulf’s bank cards to withdraw money, surveillance cameras captured clear pictures of one of them.” link to site about our loss.

What can you do?

WhereCampSF 2012

Meet awesome people

I’ve put up a registration page for WhereCamp 2012. It’s right before Where 2.0, on the 31st March & the 1st April (which also happens to be the OSM license change deadline). Details like venue are still being worked out. It’s free to attend (donations welcome), and it helps if people like you publicize the event so we can get awesome sponsors to pay for things like food. Also, feel free to get in touch if you’re so inclined.


I’m a New Radical

Holding some old maps

The Observer, a British Sunday newspaper and sister to The Guardian, has a very kind article about me and the ubiquitous OpenStreetMap today.

The photo shoot was the most fun. I’ve worked with Kaela at the excellent Serendipity before. On fairly short notice we found a second hand book store in Duval and bought up a dozen or so old paper maps for something like a dime each. Then we had some fun taking pictures outside in the rain and ruining each map before going inside and taking the picture you see.

My doctor’s wife goaded me to agree with her recently that paper maps from the ’60s are not worth a whole lot to kids doing school projects. Her better half had apparently donated several in such a cause.

It made me think about how people of my generation began to use scientific calculators extensively at school and could skip the fundamental knowledge of solving quadratic equations. Just as a generation earlier multiplying large numbers was expedited by simpler hand-held calculators. Later on, I was lucky enough to work at Wolfram Research as an intern before university and had ready access to Mathematica. That’s like giving toddlers access to thermonuclear weapons. Perhaps a relevant analogy would be giving 10 year-old primary school students in England access to various computational equipment from Bletchley Park in 1943.

Presumably computational algebra systems will trickle down to high school and then elementary school students with time just as the other technologies did.

Thus too with maps?

It’s already happened, admittedly to the ready dismay of cartographers everywhere. This makes me think of, randomly, the market share over time of mobile phone operating systems:

The graph works well as an analogy for any technical displacement over time. I’d be curious to see one for the use of various types of maps over time. Broadly paper was dominant for about 2,000 years and then the PND took, at a guess, half the market share within a decade or two. Shortly after that the internet arrived and with it MapQuest and MultiMap. In the blink of an eye Google took the eyeballs – but not the profit – from them.

With a bit of luck perhaps the next phase will be dominated by a more enlightened and open approach.
