Friday, May 30, 2014

Going from Data to Insight in 5 Steps


When people talk about this space, you hear a lot of terminology thrown around - Structured and Unstructured Data, Hadoop, NoSQL, Machine Learning, Stream Computing, Remote Sensing... the list goes on. When I first entered this world several years back, I was completely overwhelmed by the alphabet soup of technologies built to get more mileage out of the world's ever-growing pile of data.

I've found that the best way to keep everything in order is to think in terms of the end product and then work your way back. All of these technologies exist because somebody, somewhere has a problem he or she is trying to solve. If a company is trying to help farmers improve yields by supplying information on which types of seeds to plant in which locations at which times (last week's Economist profiled these efforts), then that company requires a collection of technologies that work together to produce insight.

At a high level, you need a way to acquire the data, somewhere to store it either temporarily or long-term, an efficient way to clean or process it, analytic tools for unearthing key results, and tools for translating these results into action steps in a way that makes sense to the person who has to take that action. Put differently, you need a stack of technologies that accomplish five functions:

1. Get the data

Data is everywhere and growing exponentially as more people acquire and use smart phones and other types of wired-up technologies, as more companies use RFIDs, scanners, and other types of sensors to measure and monitor phenomena independently of humans, and as more governments make data available to the public. 

Entering the field, the only data I had ever worked with came from Microsoft Excel documents. I learned quickly that data comes in many shapes, sizes, and formats. If you go to Data.gov and browse some of the awesome information that the government is making available, you'll see data in a variety of static formats including XLS, CSV, or JSON. 
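To make the "many formats" point concrete, here is a minimal sketch of reading the same two records from CSV and from JSON using only Python's standard library. The station names, rider counts, and function names are made up for illustration and do not come from any real Data.gov file.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in CSV form and in JSON form.
CSV_TEXT = "station,riders\nDupont Circle,120\nMetro Center,340\n"
JSON_TEXT = (
    '[{"station": "Dupont Circle", "riders": 120},'
    ' {"station": "Metro Center", "riders": 340}]'
)


def rows_from_csv(text):
    """Parse CSV text into a list of dicts, converting riders to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"station": r["station"], "riders": int(r["riders"])} for r in reader]


def rows_from_json(text):
    """Parse a JSON array of objects into the same list-of-dicts shape."""
    return [{"station": r["station"], "riders": int(r["riders"])} for r in json.loads(text)]


csv_rows = rows_from_csv(CSV_TEXT)
json_rows = rows_from_json(JSON_TEXT)
```

Once both sources land in the same shape, everything downstream can ignore where the data came from. Note that CSV hands you every value as text, which is why the rider counts are converted explicitly.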

One of the big pushes in this space is towards data that is dynamic or real-time. The transit apps we use on our smart phones are only valuable insofar as they provide us information on what is happening right now in the subway system. And the information you provide to the farmer on her seed strategy for the season is more valuable to the extent that it is based on data that reflects recent weather and soil patterns. 

Image: WMATA
To support these kinds of applications, the developer community has pushed government agencies to make data available in streaming form through an application programming interface or API. Much as a physical product requires different parts and processes, APIs consist of tools and protocols that developers stitch together to create software. The transit app on your smart phone works by directing questions to the transit agency API at regular intervals. "Hey, can you tell me the arrival times for the next 5 trains coming into Dupont Circle?" "Hey, which escalators are broken throughout the city?" (The answer in the Washington, D.C. area is usually "all of them," as any DCer will tell you.) 
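The "ask at regular intervals" pattern can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the URL is a placeholder rather than any real transit agency's endpoint, real APIs usually require a key, and the names `fetch_arrivals` and `poll` are my own.

```python
import json
import time
import urllib.request

# Hypothetical endpoint; a real agency publishes its own URL and usually requires an API key.
ARRIVALS_URL = "https://api.example.com/arrivals?station=dupont-circle"


def fetch_arrivals(url=ARRIVALS_URL):
    """Ask the (hypothetical) API for upcoming trains and decode the JSON reply."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def poll(fetch, times, interval_seconds, sleep=time.sleep):
    """Call fetch() `times` times, pausing between calls, and collect the replies."""
    replies = []
    for i in range(times):
        replies.append(fetch())
        if i < times - 1:  # no need to wait after the final request
            sleep(interval_seconds)
    return replies
```

Against a real endpoint, something like `poll(fetch_arrivals, times=3, interval_seconds=30)` would ask three times, half a minute apart. Passing the fetch function in as an argument keeps the timing logic separate from the network call, so it can be tried out with a stand-in function first.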

2. Store the data

Depending on the application, you will need a place to store your data. The developer of a transit app, for example, may not worry about storage, choosing instead to grab data through an API and move right to steps 3 through 5. Most organizations, on the other hand, will want a place to park the data longer-term so analysts or whoever else can access it as business needs arise. 

First defined in 1970 by one Edgar Codd, relational databases, which link different tables together, have been the dominant database type for some time. A bike store could create a simple relational database using three Excel data tables with information on different types of bikes, customers and the bikes they own, and store transactions. If you can easily link or relate the bike table to the sales table, for example, it becomes easy to answer questions like, how many bikes did ABC bike shop sell ahead of this weekend's Fat Tire Festival?
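Here is a rough sketch of that bike shop as an actual relational database, using the SQLite engine that ships with Python. The table layout, bike models, customer names, and dates are all invented for illustration, and I have simplified things so that ownership is recorded only through the sales table.

```python
import sqlite3

# An in-memory database with three related tables (all contents hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bikes (bike_id INTEGER PRIMARY KEY, model TEXT, category TEXT);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    bike_id INTEGER REFERENCES bikes(bike_id),
    customer_id INTEGER REFERENCES customers(customer_id),
    sale_date TEXT
);
INSERT INTO bikes VALUES (1, 'Trailblazer', 'fat tire'), (2, 'Commuter', 'road');
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO sales VALUES
    (1, 1, 1, '2014-05-26'),
    (2, 1, 2, '2014-05-28'),
    (3, 2, 2, '2014-05-29');
""")


def bikes_sold(conn, category, start, end):
    """Count sales of one bike category between two ISO dates (inclusive)."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM sales JOIN bikes ON sales.bike_id = bikes.bike_id
        WHERE bikes.category = ? AND sales.sale_date BETWEEN ? AND ?
        """,
        (category, start, end),
    ).fetchone()
    return row[0]


fat_tire_sold = bikes_sold(conn, "fat tire", "2014-05-24", "2014-05-30")
```

The `JOIN` is the "relate the bike table to the sales table" step: the sales table only stores a bike ID, and the join looks up what kind of bike that ID refers to.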

Increasingly, many companies are considering new types of non-relational databases that make it easier to store standard structured data as well as unstructured data that fails to fit nicely into rows and columns. I'll get into more of the details in a later post, but like anything else there is plenty of debate about which is better, faster, smarter. Oftentimes, the right choice depends on the company, how much data they are dealing with, and the kinds of things they want to do with that data. Some organizations end up maintaining multiple kinds of databases. 

From our end-product/end-user perspective, you are interested in both data integrity - is my database set up to safely preserve this information? - and data usefulness - will my database allow me to answer questions that I've thought of and perhaps those that I've yet to consider? 

3. Ready the data

This is where the people who get hands-on with data spend most of their time. Wrangling data, as many call it, includes getting it into a workable format, cleaning it, joining it with other data sources, and aggregating it. 
Image courtesy of Kurt Raschke

When creating transit apps that cover multiple systems in multiple cities, developers spend huge amounts of time massaging data that pumps out of transit agencies in various forms and at various quality levels. At a recent Meetup group, the room chuckled when the presenter put up a picture of an Arlington bus stop buckling under the weight of not one, but three separate transit agencies' signs, each with its own unique information. 
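A small sketch of what that massaging looks like in practice: two agencies report the same kind of fact (the next arrival at a stop) with different field names, units, and tidiness, and each gets its own small translator into one shared shape. The agencies, field names, and values below are hypothetical.

```python
# Hypothetical records from two agencies describing the next arrival at a stop.
AGENCY_A = [{"StopName": " Rosslyn ", "Route": "38B", "MinutesAway": "4"}]
AGENCY_B = [{"stop": "ROSSLYN", "line": "45", "eta_seconds": 540}]


def from_agency_a(rec):
    """Agency A: padded stop names, and minutes delivered as text."""
    return {
        "stop": rec["StopName"].strip().title(),
        "route": rec["Route"],
        "minutes": int(rec["MinutesAway"]),
    }


def from_agency_b(rec):
    """Agency B: upper-case stop names, and arrival times in seconds."""
    return {
        "stop": rec["stop"].strip().title(),
        "route": rec["line"],
        "minutes": rec["eta_seconds"] // 60,
    }


# Join the cleaned feeds into one list, soonest arrival first.
arrivals = [from_agency_a(r) for r in AGENCY_A] + [from_agency_b(r) for r in AGENCY_B]
arrivals.sort(key=lambda a: a["minutes"])
```

Multiply this by dozens of fields and a handful of agencies and it becomes clear where the hours go.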

The Portland, Oregon-inspired OneBusAway project is working to fix this, but a solution in the Capital region, where multiple transit agencies crisscross three state borders, is a bit trickier. 

4. Analyze the data

This is where the magic happens, where you take data that you have gathered, stored, and labored to prepare, and do something cool with it. At a high level, the business goal here is to do something that reduces uncertainty. Organizations are always looking ahead, thinking about what the next quarter or next year may bring. 

Going back to our bike shop example from above, you can look at the number of bikes sold over the last year and build a few graphs that clearly show the best-selling bikes, the months where sales were lowest/highest, and your best customers. That’s not a bad place to start. 
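Before any graphing, those three questions boil down to counting. A minimal sketch, assuming a made-up sales log of (date, bike model, customer) entries:

```python
from collections import Counter

# Hypothetical sales log: (ISO date, bike model, customer).
SALES = [
    ("2013-06-14", "Trailblazer", "Ana"),
    ("2013-06-20", "Commuter", "Ben"),
    ("2013-07-02", "Trailblazer", "Ben"),
    ("2013-12-18", "Trailblazer", "Ana"),
    ("2014-04-05", "Commuter", "Ana"),
]

# Tally sales by model, by month (the "YYYY-MM" prefix of the date), and by customer.
by_model = Counter(model for _, model, _ in SALES)
by_month = Counter(date[:7] for date, _, _ in SALES)
by_customer = Counter(customer for _, _, customer in SALES)

best_model, best_model_count = by_model.most_common(1)[0]
busiest_month, busiest_month_count = by_month.most_common(1)[0]
best_customer, best_customer_count = by_customer.most_common(1)[0]
```

Each of these tallies is exactly what you would feed into a bar chart.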

But if you can use this information to paint a picture of what to expect in the future, that is much more powerful. You want to get to the point where you can say to the store manager, let's keep x number of this type of bike in inventory and launch a marketing campaign using y kind of media targeting z kind of audience. 
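As a toy illustration of turning history into an inventory number, here is about the simplest forecast there is: average the last few months and pad the result. This is a sketch of one basic technique (a moving average), not a recommendation for how a real shop should plan; the sales figures, window, safety margin, and function names are all assumptions.

```python
import math


def moving_average_forecast(monthly_units, window=3):
    """Forecast next month's sales as the mean of the most recent `window` months."""
    if window < 1 or len(monthly_units) < window:
        raise ValueError("need at least `window` months of history")
    recent = monthly_units[-window:]
    return sum(recent) / window


def suggested_stock(monthly_units, window=3, safety_margin=0.25):
    """Pad the forecast by a safety margin and round up to whole bikes."""
    forecast = moving_average_forecast(monthly_units, window)
    return math.ceil(forecast * (1 + safety_margin))


# Hypothetical units sold of one bike model over the last five months.
units = [8, 12, 15, 18, 21]
forecast = moving_average_forecast(units)
stock = suggested_stock(units)
```

With these numbers the last three months average 18 bikes, and the 25% cushion rounds up to 23. A moving average ignores seasonality and trends, which is exactly where heavier modeling starts to earn its keep.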

With some applications, simply throwing together a couple of graphs will bring forward patterns to guide decision making, and you may get away without any heavy modeling. The more sophisticated things get, however - the more complex your business, the more data you are dealing with, the more variables bouncing around - the more you will want to explore something extra. 

The work of the Climate Corporation cited in the Economist article provides another example. While there is incredible power in simply making information available to people - the Climate Corporation would be doing agriculture a great service by producing high-quality information on historical crop growth and weather around the US - the real power is in the algorithm operating behind the scenes, sucking in the data and spitting out actionable information.

5. Explain the data

You can have it all going on in steps 1 through 4, but you are nowhere if your end-user doesn't see the value or want to use your data product. Apple is known for great technology and engineering, but it is great design that led me to accumulate more Apple products than I care to admit - my iMac still looks good after 4 years anchoring my small home office area. 

Similarly, the front-end (developer speak for the part of your product people actually see and interact with) of any data product is absolutely critical if you want your mobile transit app to sell, or your client to take the action that your analytic model recommends. 

The tools for nicely packaging information and delivering it to your user are better and cheaper than ever, and range from something as simple as a Pivot Table or custom dashboard in Excel to a fully customized web app that delivers maps, charts, and the like in a web browser. The power of the latter option, in addition to a level of customization and interactivity that you will never get with Microsoft Office, is the ability to share it over the Internet on desktops and/or on the go through mobile apps. Few people in 2014 would have the patience to open up a spreadsheet to determine which mode of public transit to take on a given day. 

From D3.js.org
For a sneak peek at what I am talking about, check out this incredible New York Times data visual showing how state politics have shifted over time. The data, the design, and the interactivity all come together to tell a powerful story efficiently (toggling between what is happening nationally and what is happening to the individual states is as easy as scrolling over the lines). 

And to finish things off... 

At the risk of sounding like a consultant (which I am), you can think of these five functions as a sort of data value chain (writing the words data value chain hurts as much as reading them must). The next time you read an article in Wired or Fast Company about a new app or innovative, data-driven approach to addressing a problem in business or society, you can be sure that getting to the point where a hip publication would feature it required thinking through at least several (if not all) of the above. 

In the coming weeks, I'll look at some technologies and approaches that fall into each bucket. My hope is that this post can serve as a high-level roadmap for the blog, so when things get specific or a little on the technical side, this initial post can help right the ship for reader AND writer! 
 


Friday, May 23, 2014

Getting the ball rolling!

I woke up this morning a whole 29 years old! That 9 number is dangerously close to flipping that 2 to a 3 and, much like the experience of waking up New Year's morning with the whole of the year stretched before you brimming with potential, I woke up this morning motivated to mix things up. Save for a couple posts in school, I have never been much of a blogger. And while I am known to write a mean thank you note or Christmas card, my experience writing regularly about a professional topic is rather limited.

I have been in the analytics space for close to 2 years, doing hands-on data work for a large technology company. It has been a great experience. I came in with more of a business/strategy background and learned quickly that success as a young person in my division meant learning how to work with data... how to find it, how to grab it from various sources, how to make it interoperable, how to clean it, how to aggregate it, analyze it, present it, all the while thinking about the end user and his or her business purpose. Like any of the technologies that we take for granted, so much goes into making checking the status of the next bus as simple as a few taps on a touch-screen.

For the first time, I could begin to piece together what makes this stuff possible. After many not-so-smooth conversations with database administrators, data engineers, statisticians, and software developers, as well as countless hours spent learning new technologies on my own, I understand how the pieces fit together and have begun dreaming about possibilities.

At the end of the day, analytics is about solving problems for people: enabling business leaders to be more confident in their decision-making, positioning cities to envision the outcome of a new program or infrastructure project before breaking ground on it, or equipping an individual with better information on potential career opportunities or safe bike routes between points in a city.

In my mind, analytics is so much more than a regression equation, time-series analysis, or optimization problem. While I won't dispute that the algorithm spinning in the background is the engine of any data-driven product, creating value through analytics is as much about great presentation and great design. It is also about integrating with the right databases, APIs, or sensors to bring information to your end-user when it is most valuable, which for many applications is as soon as it is created.

My plan for this blog is to focus on the technologies across the entire data value chain (more on that later) and how people and organizations are combining them to help people and solve problems. I understand that as blog topics go, this is very broad. I expect to jump around a little bit at the beginning as I get the hang of this and learn to parse out the interesting content from the sleep-inducing. The advantage of keeping this broad, as implied above, is that you have the opportunity to do justice to this exciting space, while in the process hopefully educating both folks who approach analytics from more of a business angle (as I did) and those who come at it from a more technical one.

And we're off...