When people talk about this space, you hear a lot of terminology
thrown around - Structured and Unstructured Data, Hadoop, NoSQL, Machine
Learning, Stream Computing, Remote Sensing... the list goes on. When I first
entered this world several years back, I was completely overwhelmed by the
alphabet soup of technologies built to get more mileage out of the world's
ever-growing pile of data.
I've found that the best way to keep everything in order is to
think in terms of the end product and then work your way back. All of these
technologies exist because somebody, somewhere has a problem he/she is trying
to solve. If a company is trying to help farmers improve yields by supplying
information on which types of seeds to plant in which locations at which times
(last week's Economist profiled these efforts),
then that company requires a collection of
technologies that work together to produce insight.
At a high level, you need a way to acquire the data, somewhere to
store it either temporarily or long-term, an efficient way to clean or process
it, analytic tools for unearthing key results, and tools for translating these
results into action steps in a way that makes sense to the person who has to
take that action. Put differently, you need a stack of technologies that
accomplish five functions:
1. Get the data
Data is everywhere and
growing exponentially as more people acquire and use smart phones and other
types of wired-up technologies, as more companies use RFIDs, scanners, and
other types of sensors to measure and monitor phenomena independently of
humans, and as more governments make data available to the public.
When I entered the field, the
only data I had ever worked with came from Microsoft Excel documents. I learned
quickly that data comes in many shapes, sizes, and formats. If you go to Data.gov and
browse some of the awesome information that the government is making available,
you'll see data in a variety of static formats including XLS, CSV, and JSON.
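To make the "many formats" point concrete, here is a minimal sketch of reading the same records from CSV and from JSON using only Python's standard library. The station names and rider counts are made up for illustration; they are not from Data.gov or any real feed.

```python
import csv
import io
import json

# Hypothetical sample data standing in for files downloaded from a data portal.
CSV_TEXT = "station,riders\nMetro Center,1200\nRosslyn,800\n"
JSON_TEXT = (
    '[{"station": "Metro Center", "riders": 1200},'
    ' {"station": "Rosslyn", "riders": 800}]'
)


def load_csv(text):
    """Parse CSV text into a list of dicts, converting riders to int."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(text))]
    for row in rows:
        # CSV has no types: every value arrives as a string.
        row["riders"] = int(row["riders"])
    return rows


def load_json(text):
    """Parse JSON text; numbers already arrive as numbers."""
    return json.loads(text)


csv_rows = load_csv(CSV_TEXT)
json_rows = load_json(JSON_TEXT)

# Once loaded, both sources have the same shape and can be handled alike.
print(csv_rows == json_rows)
```

The takeaway is that the format mostly matters at the door: once the data is loaded into a common shape, the rest of the stack doesn't need to care where it came from.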
One of the big pushes in
this space is towards data that is dynamic or real-time. The transit apps we
use on our smart phones are only valuable insofar as they provide us
information on what is happening right now in the subway system. And the
information you provide to the farmer on her seed strategy for the season is
more valuable to the extent that it is based on data that reflects recent
weather and soil patterns.
[Image: WMATA]
2. Store the data
Depending
on the application, you will need a place to store your data. The developer of a
transit app, for example, may not worry about storage, choosing instead to grab data
through an API and move right to steps 3 through 5. Most organizations, on the
other hand, will want a place to park the data longer-term so analysts or
whoever else can access it as business needs arise.
First
defined in 1970 by one Edgar
Codd, relational databases linking different tables together have been the
dominant database type for some time. A bike store could create a simple
relational database using three Excel data-tables with information on different
types of bikes, customers and the bikes they own, and store transactions. If you
can easily link or relate the bike table to the sales table, for example, it
becomes easy to answer questions like: how many bikes did ABC bike shop sell
ahead of this weekend's Fat Tire Festival?
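Here is a rough sketch of that bike shop as a tiny relational database, using Python's built-in sqlite3 module instead of Excel tables. The table names, column names, bike models, customers, and dates are all invented for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bikes (bike_id INTEGER PRIMARY KEY, model TEXT, category TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        bike_id INTEGER REFERENCES bikes(bike_id),
        customer_id INTEGER REFERENCES customers(customer_id),
        sale_date TEXT
    );
    -- Hypothetical rows; dates are ISO strings so they compare correctly as text.
    INSERT INTO bikes VALUES (1, 'Trailblazer', 'fat tire'), (2, 'Commuter', 'road');
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO sales VALUES
        (1, 1, 1, '2013-06-03'),
        (2, 1, 2, '2013-06-05'),
        (3, 2, 1, '2013-06-06'),
        (4, 1, 1, '2013-05-01');
    """
)


def bikes_sold(conn, category, start, end):
    """Count sales of one bike category between two dates (inclusive)."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM sales
        JOIN bikes ON sales.bike_id = bikes.bike_id
        WHERE bikes.category = ? AND sales.sale_date BETWEEN ? AND ?
        """,
        (category, start, end),
    ).fetchone()
    return row[0]


# Fat tire bikes sold in the (made-up) week before the festival.
festival_week_sales = bikes_sold(conn, "fat tire", "2013-06-01", "2013-06-07")
print(festival_week_sales)
```

The JOIN is the "relate" in relational: the sales table only stores a bike_id, and the database looks up the matching bike to find its category.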
Increasingly,
many companies are considering new types of non-relational databases that make
it easier to store standard
structured data as well as unstructured data that fails to fit nicely into
rows and columns. I'll get into more of the details in a later post, but like
anything else there is plenty of debate about which is better, faster, smarter.
Oftentimes, the right choice depends on the company, how much data they are
dealing with, and the kinds of things they want to do with that data. Some
organizations end up maintaining multiple kinds of databases.
From
our end-product/end-user perspective, you are interested in both data integrity
(is my database set up to safely preserve this information?) and
data usefulness (will my database allow me to answer
questions that I've thought of and perhaps those that I've yet to
consider?).
3. Ready the data
This
is where the people who get hands-on with data spend most of their time.
Wrangling data, as many call it, includes getting it into a workable format,
cleaning it, joining it with other data sources, and aggregating it.
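As a small illustration of those four chores (formatting, cleaning, joining, aggregating), here is a sketch with two hypothetical transit agencies that publish the same kind of information under different field names and with different levels of tidiness. All agency data, field names, and numbers below are invented.

```python
from collections import defaultdict

# Hypothetical raw records from two made-up agencies with different conventions.
agency_a = [
    {"stop": " Main St ", "route": "1A", "riders": "120"},
    {"stop": "main st", "route": "1B", "riders": "80"},
    {"stop": "Oak Ave", "route": "1A", "riders": ""},  # missing count
]
agency_b = [
    {"stop_name": "MAIN ST", "line": "38B", "boardings": 45},
    {"stop_name": "Oak Ave", "line": "38B", "boardings": 30},
]


def normalize_stop(name):
    """Collapse whitespace and standardize capitalization of a stop name."""
    return " ".join(name.split()).title()


def clean(records, stop_key, riders_key, agency):
    """Map one agency's records onto a shared schema, dropping unusable rows."""
    cleaned = []
    for rec in records:
        raw = rec[riders_key]
        if raw in ("", None):
            continue  # skip rows with no rider count rather than guess
        cleaned.append(
            {
                "agency": agency,
                "stop": normalize_stop(rec[stop_key]),
                "riders": int(raw),
            }
        )
    return cleaned


# Join: stack both agencies into one list with a common shape.
combined = clean(agency_a, "stop", "riders", "A") + clean(
    agency_b, "stop_name", "boardings", "B"
)

# Aggregate: total riders per stop across agencies.
riders_by_stop = defaultdict(int)
for rec in combined:
    riders_by_stop[rec["stop"]] += rec["riders"]

print(dict(riders_by_stop))
```

Dropping the row with the missing count is just one possible choice; depending on the application you might instead flag it or fill it in from another source.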
[Image courtesy of Kurt Raschke]
When
creating transit apps that cover multiple systems in multiple cities,
developers spend huge amounts of time massaging data that pumps out of transit
agencies in various forms and at various quality levels. At a recent Meetup
group, the room chuckled when the presenter put up a picture of an Arlington
bus stop buckling under the weight of not one, but three separate transit
agencies' signs, each with its own unique information.
The
Portland, Oregon-inspired OneBusAway project is working to fix
this, but a solution in the Capital region, where multiple transit agencies
crisscross three state borders, is a bit trickier.
4. Analyze the data
This
is where the magic happens, where you take data that you have gathered, stored,
and labored to prepare, and do something cool with it. At a high level, the
business goal here is to do something that reduces uncertainty. Organizations are always looking ahead,
thinking about what the next quarter or next year may bring.
Going back to our
bike shop example from above, you can look at the number of bikes sold over the
last year and build a few graphs that clearly show the best-selling bikes, the
months where sales were lowest/highest, and your best customers. That’s not a
bad place to start.
But if
you can use this information to paint a picture of what to expect in the
future, that is much more powerful. You want to get to the point where you can say
to the store manager, let’s keep x number
of this type of bike in inventory and launch a marketing campaign using y kind
of media targeting z kind of audience.
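To show the difference between looking back and looking ahead, here is a deliberately simple sketch. The monthly sales figures are made up, and the three-month moving average is only a stand-in for real forecasting; it is an assumption of this example, not a recommendation for how a bike shop should plan inventory.

```python
from statistics import mean

# Hypothetical monthly unit sales for one bike model, keyed by "YYYY-MM".
monthly_sales = {
    "2013-01": 8,
    "2013-02": 10,
    "2013-03": 15,
    "2013-04": 22,
    "2013-05": 30,
    "2013-06": 26,
}


def best_and_worst_month(sales):
    """Descriptive step: which months had the highest and lowest sales?"""
    best = max(sales, key=sales.get)
    worst = min(sales, key=sales.get)
    return best, worst


def moving_average_forecast(sales, window=3):
    """Naive predictive step: average the most recent `window` months."""
    recent_months = sorted(sales)[-window:]
    return mean(sales[month] for month in recent_months)


best, worst = best_and_worst_month(monthly_sales)
forecast = moving_average_forecast(monthly_sales)

print("Best month:", best)
print("Worst month:", worst)
print("Naive forecast for next month:", forecast)
```

The first function answers "what happened?", the second takes a crude stab at "what should we expect?". A real model would account for things this one ignores, such as seasonality or an upcoming festival.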
With some applications, simply
throwing together a couple graphs will bring forward patterns to guide decision
making, and you may get away without any heavy modeling. The more sophisticated
things get, however – the more complex your business, the more data you are
dealing with, the more variables bouncing around – the more you will want to
explore heavier modeling.
The
work of the Climate Corporation cited in the Economist article provides another
example. While there is incredible power in simply making information available
to people - the Climate Corporation would be doing agriculture a great service
by producing high-quality information on historical crop growth and weather
around the US - the real power is in the algorithm operating behind the scenes, sucking in
the data and spitting out actionable information.
5. Explain the data
You
can have it all going on in steps 1 through 4, but you are nowhere if your
end-user doesn't see the value or want to use your data product. Apple is known
for great technology and engineering, but it is great design that led me to
accumulate more Apple products than I care to admit - my iMac still looks good
after 4 years anchoring my small home office area.
Similarly,
the front-end (developer speak for the part of your product people actually see
and interact with) of any data product is absolutely critical if you want your
mobile transit app to sell, or your client to take the action that your
analytic model recommends.
[Image from D3.js.org]
And to finish things off...
At the risk of sounding
like a consultant (which I am), you can think of these five functions as a sort
of data value chain (writing the words data value chain hurts as much as
reading them must). The next time you read an article in Wired or Fast Company
about a new app or innovative, data-driven approach to addressing a problem in
business or society, you can be sure that getting to the point where a hip
publication would feature it required thinking through at least several (if not
all) of the above.
In the coming weeks,
I'll look at some technologies and approaches that fall into each bucket. My
hope is that this post can serve as a high-level roadmap for the blog, so when
things get specific or a little on the technical side, this initial post can
help right the ship for reader AND writer!


