Data Dash DC

Monday, June 30, 2014

How crowd-sourced data is solving transportation problems

A couple months back I attended a neat event here in DC hosted by the Meetup group Transportation Techies. The topic was buses and the technologies deployed to collect data on the number of passengers, routes, and service quality.

Kevin Webb of Conveyal gave a fascinating talk on the open-source-driven data products his company is building for the transport sector, including an online map of all the bus routes operating in Mexico City. Although it does not have a completely formalized bus network like we have here in DC, Mexico City does have a transit system that developed over time as entrepreneurs recognized unmet needs for transit from A to B.

Conveyal's approach to Getting the Data was innovative and truly grassroots. They built a small app that was Strava-esque in its ability to capture GPS coordinates on the fly. They then recruited a small army of university students, equipped them with this new location app, and sent them all over Mexico City with the goal of riding and tracking as many forms of transit as possible. All of this data funneled into a central data repository that Conveyal then tapped to Explain the Data through a mapping tool such as Mapbox.

Conveyal and Strava are among the new wave of tech companies whose products are fueled by crowd-sourced data. Waze, a mobile app providing information on road conditions, relies on its network of users to post real-time alerts on traffic flow and a host of other good-to-knows such as suspension-snapping potholes, outrageously low gas prices, and clever getaway routes. MTB Project, an initiative dear to my heart, allows off-roaders to create maps of their favorite rides and post detailed descriptions for the less-initiated.

As with almost any crowd-sourced project, the value of the offering grows with the number of data points. The more folks who download and contribute to Waze, exclusively through their voice recognition software while keeping hands on the wheel and eyes on the road, the better it is for everybody else. OpenStreetMap, which powers companies like Mapbox, is a crowd-sourced alternative to Google Maps in that it relies primarily on individuals to contribute roads and other geographic features that the Google car fails to pick up.

This summer, I plan to contribute to an open-source project for the first time when I take to the trails. We'll see how it goes!

Tuesday, June 24, 2014

A peek under the hood of a bikeshare web app

In a previous post, we looked at how data powers bike share systems. The operators of these systems have done a fantastic job coming up with a standard that allows data scientists and software developers to grab data and quickly begin working with it.

Last week, I put together my own, simplified version of a bike-app:

The live version is available here and works by regularly asking the worldwide bikeshare API for updated information on bike availability. Each system in turn sends data over the Internet in JavaScript Object Notation (JSON) form. Think of each set of sub-brackets as a packet of information that describes a different station:

[
  {
    "bikes": 6, 
    "name": "31000 - 20th & Bell St", 
    "idx": 0, 
    "lat": 38856100, 
    "timestamp": "2014-06-23T10:51:07.302Z", 
    "lng": -77051200, 
    "id": 0, 
    "free": 5, 
    "number": 1
  },  
  {
    "bikes": 8, 
    "name": "31002 - 20th & Crystal Dr", 
    "idx": 2, 
    "lat": 38856400, 
    "timestamp": "2014-06-23T10:51:07.316Z", 
    "lng": -77049200, 
    "id": 2, 
    "free": 7, 
    "number": 3
  },....

]

The equivalent data table in Excel would have 9 columns, one for each key, and as many rows as there are stations.
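
To make the comparison concrete, here is a minimal Python sketch that flattens the two sample stations above into the rows and columns a spreadsheet would show. The function name `stations_to_csv` and the alphabetical column order are my own choices for illustration, not part of the bikeshare API.

```python
import csv
import io
import json

# The two station records from the sample response above.
SAMPLE = """
[
  {"bikes": 6, "name": "31000 - 20th & Bell St", "idx": 0, "lat": 38856100,
   "timestamp": "2014-06-23T10:51:07.302Z", "lng": -77051200, "id": 0,
   "free": 5, "number": 1},
  {"bikes": 8, "name": "31002 - 20th & Crystal Dr", "idx": 2, "lat": 38856400,
   "timestamp": "2014-06-23T10:51:07.316Z", "lng": -77049200, "id": 2,
   "free": 7, "number": 3}
]
"""


def stations_to_csv(json_text):
    """Turn a JSON list of station objects into CSV text, one row per station."""
    stations = json.loads(json_text)
    # Assumption: every station has the same keys; columns are sorted by name.
    columns = sorted(stations[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(stations)
    return buffer.getvalue()


table = stations_to_csv(SAMPLE)
print(table)
```

Each key in the JSON becomes a column header and each set of sub-brackets becomes one row, which is all the "table" view really is.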

While less intuitive than data stored as an Excel table, JSON's key advantage is its workability in the browser. Capital Bikeshare's API pipes data to this website where it is available for developers.

My front-end code (developer speak for the part of the program the user actually sees), written in JavaScript, pulls in the data and loops through each sub-bracket to grab the information needed for the map: the station name (name), the current number of bikes (bikes), the current number of free slots (free), and the station's latitude and longitude (lat and lng).
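
The loop itself is short. The sketch below shows the same idea in Python rather than the JavaScript my app uses, so treat it as an illustration and not the app's actual code. It assumes the lat and lng integers in the sample are millionths of a degree (38856100 would be 38.8561), which fits DC's location but is my reading of the sample, and the function name `to_markers` is hypothetical.

```python
import json

# Two stations from the sample response, trimmed to the fields the map needs.
SAMPLE = """
[
  {"bikes": 6, "name": "31000 - 20th & Bell St",
   "lat": 38856100, "lng": -77051200, "free": 5},
  {"bikes": 8, "name": "31002 - 20th & Crystal Dr",
   "lat": 38856400, "lng": -77049200, "free": 7}
]
"""


def to_markers(json_text):
    """Loop through each station and keep only what a map marker needs."""
    markers = []
    for station in json.loads(json_text):
        markers.append({
            "label": station["name"],
            "bikes": station["bikes"],
            "free": station["free"],
            # Assumption: coordinates arrive as integer millionths of a degree.
            "lat": station["lat"] / 1e6,
            "lng": station["lng"] / 1e6,
        })
    return markers


markers = to_markers(SAMPLE)
for marker in markers:
    print(marker["label"], marker["bikes"], marker["free"], marker["lat"], marker["lng"])
```

A mapping library then takes each marker's latitude and longitude and drops a pin with the bike and slot counts attached.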

The mapping magic happens thanks to a JavaScript library called Mapbox.js. Lightweight, open-source, and robust, Mapbox greatly simplifies the process of plotting data on a map, making it possible for less-seasoned programmers (such as yours truly) to quickly set something up.

What I love about Mapbox and the rest of the open-source mapping community is the stock they put into great design. I think of DC as one big park with a bunch of buildings in the middle of it and chose this base map because it really highlights this quality of the city.

Now that I have my own personal bikeshare app, I have absolutely no excuse for missing out on a bike this morning. Time for me to wrap this up and get out there!

Friday, June 13, 2014

How data helps bike share operators keep the good times rolling

DC is home to one of the many bike share systems that dot the US. Having launched in September 2010, Capital Bikeshare (CaBi for short) is now everywhere in the Capital. Tourists grab the friendly red bikes and use them to jet around the National Mall. Commuters use them to bike to work downtown, forgoing the hassle of the Bus/Train system and the worry of stashing a personal bike for the day.

Joining CaBi was one of the first things I did after moving back to DC after a brief stint up north. I absolutely love it and on mornings when I have my affairs in order, CaBi rewards me with a killer commute that descends from the National Cathedral, skirts the Potomac, and cruises up the National Mall. Not bad, right?

More often than not, however, my mornings are anything but orderly. And on days when the sun is shining and the humidity is in check, I often stumble out to an empty bike share station and am forced onto the bus, helmet in hand.

Fortunately for folks like me, there are some good data efforts under way to help manage the problem for individuals and the system at large. Last week, a group of civic hackers from Code for DC released this CaBi Bikeshare Odds app allowing users to estimate the probability of getting a bike based on the time of day.
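
I don't know the exact method behind the Odds app, but the simplest version of the idea is easy to sketch: look at past snapshots of a station and count how often at least one bike was there at a given hour. The Python below is my own simplification with made-up observations, and the function name `bike_odds_by_hour` is hypothetical.

```python
from collections import defaultdict


def bike_odds_by_hour(observations):
    """Estimate the chance of finding a bike at each hour of the day.

    observations: iterable of (hour, bikes_available) pairs for one station.
    Returns {hour: share of observations with at least one bike}.
    """
    seen = defaultdict(int)
    with_bike = defaultdict(int)
    for hour, bikes in observations:
        seen[hour] += 1
        if bikes > 0:
            with_bike[hour] += 1
    return {hour: with_bike[hour] / seen[hour] for hour in seen}


# Made-up history for one station: (hour of day, bikes at the dock).
history = [(8, 0), (8, 2), (8, 0), (8, 1), (9, 5), (9, 3)]
odds = bike_odds_by_hour(history)
print(odds)
```

With this toy history, the 8 a.m. rush leaves a bike on the dock only half the time, while 9 a.m. looks safe. A real app would need far more history and would likely account for weekdays and weather too.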

Code for DC drew data and much of their inspiration from the Data Science for Social Good Bikeshare project. The project works all 5 steps of the Data to Insight chain to help bike share companies re-balance their bike fleets so that people like me stand a decent chance at grabbing a rig when and where we need it.

At a basic level, the Code for DC product works by pulling real-time bike and weather data in through APIs (Get the Data), using the Python programming language to prepare the data for modeling (Ready the Data), running the data through a model built to predict the number of bikes at a station based on things like weather, time of day, number of bikes currently at the station, and historical usage (Analyze the Data), and sharing the results at frequent intervals by way of a simple, map-based web app that shows the current and predicted number of bikes around the city (Explain the Data). For a more detailed description of tools and methods used, check out the team's awesome documentation here!

While both projects are fantastic, I think they could do more to Explain the Data. By that I mean take the interface development several steps further by making it available to people on the fly. Time-challenged people like me aren't going to power up a computer and navigate to a website. Why not flash relevant information about nearby bike stations and the likely flows in and out of them? That way, your users seamlessly pick up pieces of information that they can use to plan their mornings. Getting information to users seamlessly is the problem that emerging wearable technologies like Google Glass are trying to solve, and something I will explore in a later post.

It is also worth checking out some of the other projects the Data Science for Social Good folks are churning out. As government agencies continue to make more and better data available on issues ranging from health, to transportation, to employment, we'll see more projects like the one above.

Thursday, June 5, 2014

Data from smart phones

In my last post, we looked at the 5 steps needed to go from data to insight. I think of the first step, Get the Data, as really the fuel that powers everything else. Data's ever-increasing availability and variety of form is spawning the innovations introduced in the remaining four. What good is a database if you have nothing interesting to store in it? Not much. Why bother with a Hadoop processing system if you are working with data that fits nicely into one Excel workbook? You wouldn't.

The cool, headline-grabbing technologies are there because data is exploding. So where specifically is this stuff coming from?

Our mobile devices throw off an incredible amount of information. Traffic in Google Maps works by crowd-sourcing location information telecommunication companies pull from phones. While not every person stuck in traffic along the Capital Beltway is a proud owner of an Android, enough of these devices are out there so that Google knows when to paint I-495 a heartbreakingly dark shade of red.

Similarly, phone-generated location data in developing countries is helping the medical community manage malaria outbreaks. By understanding patterns of movement across at-risk regions, public health professionals can craft better communications and target areas to spray with more precision.

When our devices aren't passively generating data from our everyday motions, they are generating data based on actions we take. As an avid bike rider, I am a regular user of the Strava Bike App. The app allows me to track my ride and easily share how much slower I shredded Skyline Drive in the Shenandoah Valley than the 300 people ahead of me. Every time I push the record button, I am generating information that is useful not just to me the individual, but also to transportation planners looking to re-think city infrastructure, and companies looking to market to active types such as myself.

For each activity, Strava records my ride with an incredible amount of detail. I get great summary information like average or max speed, vertical feet climbed, and calories burned. If I want more detail, I can get a graph like the one below plotting speed on the y-axis versus mileage on the x. I can even play back my ride to see at which points on the map I achieved what speeds and what elevations.
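
Those summary numbers all fall out of a list of timestamped GPS points. Here is a rough Python sketch of how that could work, using the standard haversine formula for distance between two coordinates. It is not Strava's method (real apps smooth noisy GPS readings first), the track is made up, and it reports kilometers and meters rather than miles and feet.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def summarize(track):
    """track: list of (seconds, lat, lng, elevation_m) tuples in time order."""
    distance_km = 0.0
    climb_m = 0.0
    max_speed_kmh = 0.0
    for (t0, la0, lo0, e0), (t1, la1, lo1, e1) in zip(track, track[1:]):
        leg_km = haversine_km(la0, lo0, la1, lo1)
        distance_km += leg_km
        climb_m += max(0.0, e1 - e0)  # count only the uphill part
        hours = (t1 - t0) / 3600
        if hours > 0:
            max_speed_kmh = max(max_speed_kmh, leg_km / hours)
    total_hours = (track[-1][0] - track[0][0]) / 3600
    avg_speed_kmh = distance_km / total_hours if total_hours > 0 else 0.0
    return {
        "distance_km": distance_km,
        "climb_m": climb_m,
        "avg_speed_kmh": avg_speed_kmh,
        "max_speed_kmh": max_speed_kmh,
    }


# Made-up track: three points along the equator, one hour apart.
track = [(0, 0.0, 0.0, 100.0), (3600, 0.0, 0.1, 150.0), (7200, 0.0, 0.2, 120.0)]
summary = summarize(track)
print(summary)
```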

This is great for me, but also great for Strava and hopefully the rest of the United States' growing cycling community. Oregon's Department of Transportation recently purchased Strava data from Portland for $20,000 and is using it to understand bike traffic at a level of detail that traditional methods such as traffic surveys couldn't approach. Every time I record a ride through Strava, I am contributing to a body of evidence that city planners may one day use to re-think sketchy intersections and head-spinning traffic circles. On that subject, points to whoever can explain the circle inside a circle-type intersection near American University!

Smart phones are just one of a handful of technologies tuned to record the world around them and make data available for later use. From a data privacy perspective, the problem is that data from these devices is at the individual level. So without proper care or intent, it becomes very easy for another person or organization to learn about you the individual - where you spend your time, what you like to eat, what you spend your money on - rather than the group you belong to - your generation, your income bracket, your likely political leanings. In data speak, we have moved from inference to here is what is actually going on with this person.

On a personal level, I am happy to share my rides data with Strava to the extent that they use it to make biking better and safer. But I am still working through how I would feel if they began selling my information so outdoor companies could improve their marketing to me. I find a lot of the personalized marketing I receive now to be annoying because it is ill-timed and imprecise. But suppose a bike company got it right and threw an offer for the perfect mountain bike at a time when I was considering an upgrade. Would I really turn it down?

Friday, May 30, 2014

Going from Data to Insight in 5 Steps

When people talk about this space, you hear a lot of terminology thrown around - Structured and Unstructured Data, Hadoop, NoSQL, Machine Learning, Stream Computing, Remote Sensing... the list goes on. When I first entered this world several years back, I was completely overwhelmed by the alphabet soup of technologies built to get more mileage out of the world's ever-growing pile of data.

I've found that the best way to keep everything in order is to think in terms of the end product and then work your way back. All of these technologies exist because somebody, somewhere has a problem he/she is trying to solve. If a company is trying to help farmers improve yields by supplying information on which types of seeds to plant in which locations at which times (last week's Economist profiled these efforts), then that company requires a collection of technologies that work together to produce insight.

At a high level, you need a way to acquire the data, somewhere to store it either temporarily or long-term, an efficient way to clean or process it, analytic tools for unearthing key results, and tools for translating these results into action steps in a way that makes sense to the person who has to take that action. Put differently, you need a stack of technologies that accomplish five functions:

1. Get the data

Data is everywhere and growing exponentially as more people acquire and use smart phones and other types of wired-up technologies, as more companies use RFIDs, scanners, and other types of sensors to measure and monitor phenomena independently of humans, and as more governments make data available to the public.

Entering the field, the only data I had ever worked with came from Microsoft Excel documents. I learned quickly that data comes in many shapes, sizes, and formats. If you go to Data.gov and browse some of the awesome information that the government is making available, you'll see data in a variety of static formats including XLS, CSV, and JSON.

One of the big pushes in this space is towards data that is dynamic or real-time. The transit apps we use on our smart phones are only valuable insofar as they provide us information on what is happening right now in the subway system. And the information you provide to the farmer on her seed strategy for the season is more valuable to the extent that it is based on data that reflects recent weather and soil patterns.

Image: WMATA

To support these kinds of applications, the developer community has pushed government agencies to make data available in streaming form through an application programming interface or API. Much like a physical product requires different parts and processes, APIs consist of tools and protocols that developers stitch together to create software. The transit app on your smart phone works by directing questions to the transit agency API at regular intervals. Hey, can you tell me the arrival times for the next 5 trains coming into Dupont Circle? Hey, which escalators are broken throughout the city (the answer in the Washington D.C. area is usually “all of them” as any DCer will tell you)?
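
The "ask at regular intervals" pattern is simple enough to sketch in a few lines of Python. To keep the example runnable without a network connection or an API key, `fake_fetch` stands in for the real HTTP request, and its response format is invented for illustration. It is not WMATA's actual API.

```python
import time


def poll(fetch, interval_seconds, rounds, sleep=time.sleep):
    """Call fetch() `rounds` times, pausing between calls, and collect the answers."""
    answers = []
    for i in range(rounds):
        answers.append(fetch())
        if i < rounds - 1:
            sleep(interval_seconds)  # wait before asking again
    return answers


def fake_fetch():
    # Stand-in for a request to a transit agency API; the data is made up.
    return {"station": "Dupont Circle", "next_trains_min": [2, 6, 11, 15, 19]}


# A real app might ask every 30 seconds; 0 keeps the demo instant.
answers = poll(fake_fetch, interval_seconds=0, rounds=3)
print(len(answers), answers[-1]["next_trains_min"])
```

Swapping `fake_fetch` for a function that makes a real request is the only change needed to point this at a live feed.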

2. Store the data

Depending on the application, you will need a place to store your data. A developer of a transit app, for example, may not worry about storage, choosing instead to grab data through an API and move right to steps 3 through 5. Most organizations, on the other hand, will want a place to park the data longer-term so analysts or whoever else can access it as business needs arise.

First defined in 1970 by one Edgar Codd, relational databases linking different tables together have been the dominant database type for some time. A bike store could create a simple relational database using three Excel data-tables with information on different types of bikes, customers and the bikes they own, and store transactions. If you can easily link or relate the bike table to the sales table for example, it becomes easy to answer questions like, how many bikes did ABC bike shop sell ahead of this weekend's Fat Tire Festival?
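
Here is a minimal sketch of that bike store database using SQLite, which ships with Python. The table layouts, bike models, customer names, and sale dates are all invented for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bikes (bike_id INTEGER PRIMARY KEY, model TEXT, category TEXT);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    bike_id INTEGER REFERENCES bikes(bike_id),
    customer_id INTEGER REFERENCES customers(customer_id),
    sale_date TEXT
);
INSERT INTO bikes VALUES (1, 'Trailblazer', 'mountain'), (2, 'Commuter', 'road');
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO sales VALUES
    (1, 1, 1, '2014-05-20'),
    (2, 1, 2, '2014-05-22'),
    (3, 2, 2, '2014-05-23');
""")

# Relate the sales table to the bikes table to count sales per category
# in the days leading up to the (made-up) festival weekend.
query = """
SELECT b.category, COUNT(*)
FROM sales AS s
JOIN bikes AS b ON b.bike_id = s.bike_id
WHERE s.sale_date BETWEEN ? AND ?
GROUP BY b.category
ORDER BY b.category
"""
rows = conn.execute(query, ("2014-05-19", "2014-05-23")).fetchall()
print(rows)
```

The shared `bike_id` column is what makes the tables "relational": the sales table never repeats the bike's details, it just points at them.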

Increasingly, many companies are considering new types of non-relational databases that make it easier to store standard structured data as well as unstructured data that fails to fit nicely into rows and columns. I'll get into more of the details in a later post, but like anything else there is plenty of debate about which is better, faster, smarter. Oftentimes, the right choice depends on the company, how much data they are dealing with, and the kinds of things they want to do with that data. Some organizations end up maintaining multiple kinds of databases.

From our end-product/end-user perspective, you are interested in both data integrity (is my database set up to safely preserve this information?) and data usefulness (will my database allow me to answer the questions I've thought of and perhaps those I've yet to consider?).

3. Ready the data

This is where the people who get hands-on with data spend most of their time. Wrangling data, as many call it, includes getting it into a workable format, cleaning it, joining it with other data sources, and aggregating it.
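
A tiny Python sketch shows the three moves in order: clean, join, aggregate. The station numbers and names come from the bikeshare sample elsewhere on this blog, while the trip records and function names are made up for illustration.

```python
from collections import defaultdict

# Made-up raw trip records, with the kinds of flaws real feeds have.
raw_trips = [
    {"station": " 31000 ", "minutes": "12"},   # stray whitespace
    {"station": "31002", "minutes": "n/a"},    # unusable duration
    {"station": "31000", "minutes": "8"},
    {"station": "31002", "minutes": "20"},
]

# A second data source to join against: station number -> station name.
station_names = {"31000": "20th & Bell St", "31002": "20th & Crystal Dr"}


def clean(rows):
    """Strip whitespace, convert minutes to a number, and skip unusable rows."""
    for row in rows:
        minutes = row["minutes"].strip()
        if not minutes.isdigit():
            continue
        yield {"station": row["station"].strip(), "minutes": int(minutes)}


def total_minutes_by_station(rows, names):
    """Join cleaned trips to station names and add up minutes per station."""
    totals = defaultdict(int)
    for row in clean(rows):
        totals[names[row["station"]]] += row["minutes"]
    return dict(totals)


totals = total_minutes_by_station(raw_trips, station_names)
print(totals)
```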

Image courtesy of Kurt Raschke

When creating transit apps that cover multiple systems in multiple cities, developers spend huge amounts of time massaging data that pumps out of transit agencies in various forms and at various quality levels. At a recent Meetup, the room chuckled when the presenter put up a picture of an Arlington bus stop buckling under the weight of not one, but three separate transit agencies' signs, each with its own unique information.

The Portland, Oregon-inspired OneBusAway project is working to fix this, but a solution in the Capital region, where multiple transit agencies crisscross three state borders, is a bit trickier.

4. Analyze the data

This is where the magic happens, where you take data that you have gathered, stored, and labored to prepare, and do something cool with it. At a high level, the business goal here is to do something that reduces uncertainty. Organizations are always looking ahead, thinking about what the next quarter or next year may bring.

Going back to our bike shop example from above, you can look at the number of bikes sold over the last year and build a few graphs that clearly show the best-selling bikes, the months where sales were lowest/highest, and your best customers. That’s not a bad place to start.

But if you can use this information to paint a picture of what to expect in the future, that is much more powerful. You want to get to the point where you can say to the store manager, let’s keep x number of this type of bike in inventory and launch a marketing campaign using y kind of media targeting z kind of audience.
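
The simplest possible version of "what to expect" is a moving average: guess that next month will look like the average of the last few. The Python sketch below uses made-up monthly sales and a hypothetical function name. A real forecast would also want to account for seasonality and trend.

```python
def moving_average_forecast(monthly_sales, window=3):
    """Forecast next month's sales as the average of the last `window` months."""
    if window <= 0 or len(monthly_sales) < window:
        raise ValueError("need at least `window` months of sales")
    recent = monthly_sales[-window:]
    return sum(recent) / window


# Made-up unit sales for one bike model over the last six months.
sales = [10, 12, 15, 18, 21, 24]
forecast = moving_average_forecast(sales)
print(forecast)
```

Even a crude number like this gives the store manager a starting point for the inventory conversation, and it makes the gap obvious when sales are climbing faster than the average can keep up with.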

With some applications, simply throwing together a couple graphs will bring forward patterns to guide decision making, and you may get away without any heavy modeling. The more sophisticated things get, however – the more complex your business, the more data you are dealing with, the more variables bouncing around – the more you will want to explore something extra.

The work of the Climate Corporation cited in the Economist article provides another example. While there is incredible power in simply making information available to people - the Climate Corporation would be doing agriculture a great service by producing high-quality information on historical crop growth and weather around the US - the real power is in the algorithm operating behind the scenes, sucking in the data and spitting out actionable information.

5. Explain the data

You can have it all going on in steps 1 through 4, but you are nowhere if your end-user doesn't see the value or want to use your data product. Apple is known for great technology and engineering, but it is great design that led me to accumulate more Apple products than I care to admit - my iMac still looks good after 4 years anchoring my small home office area.

Similarly, the front-end (developer speak for the part of your product people actually see and interact with) of any data product is absolutely critical if you want your mobile transit app to sell, or your client to take the action that your analytic model recommends.

The tools for nicely packaging information and delivering it to your user are better and cheaper than ever, and range from things as simple as a Pivot Table or Custom Dashboard in Excel to a fully customized web app that delivers maps, charts, and the like in a web browser. The power with the latter option, in addition to a level of customization and interactivity that you will never get with Microsoft Office, is the ability to share it over the Internet on desktops and/or on the go through mobile apps. Few people in 2014 would have the patience to open up a spreadsheet to determine which mode of public transit to take on a given day.

From D3.js.org

For a sneak peek at what I am talking about, check out this incredible New York Times data visual showing how state politics have shifted over time. The data, the design, and the interactivity all come together to tell a powerful story efficiently (toggling between what is happening nationally and what is happening to the individual states is as easy as scrolling over the lines).

And to finish things off...

At the risk of sounding like a consultant (which I am), you can think of these five functions as a sort of data value chain (writing the words data value chain hurts as much as reading them must). The next time you read an article in Wired or Fast Company about a new app or innovative, data-driven approach to addressing a problem in business or society, you can be sure that getting to the point where a hip publication would feature it required thinking through at least several (if not all) of the above.

In the coming weeks, I'll look at some technologies and approaches that fall into each bucket. My hope is that this post can serve as a high-level roadmap for the blog, so when things get specific or a little on the technical side, this initial post can help right the ship for reader AND writer!

Friday, May 23, 2014

Getting the ball rolling!

I woke up this morning a whole 29 years old! That 9 number is dangerously close to flipping that 2 to a 3 and, much like the experience of waking up New Year's morning with the whole of the year stretched before you brimming with potential, I woke up this morning motivated to mix things up. Save for a couple posts in school, I have never been much of a blogger. And while I am known to write a mean thank you note or Christmas card, my experience writing regularly about a professional topic is rather limited.

I have been in the analytics space for close to 2 years, doing hands-on data work for a large technology company. It has been a great experience. I came in with more of a business/strategy background and learned quickly that success as a young person in my division meant learning how to work with data... how to find it, how to grab it from various sources, how to make it interoperable, how to clean it, how to aggregate it, analyze it, present it, all the while thinking about the end user and his or her business purpose. Like any of the technologies that we take for granted, so much goes into making it as simple as a few taps on a touch-screen to check the status of the next bus.

For the first time, I could begin to piece together what makes this stuff possible. After many not-so-smooth conversations with database administrators, data engineers, statisticians, and software developers, as well as countless hours spent learning new technologies on my own, I understand how the pieces fit together and have begun dreaming about possibilities.

At the end of the day, analytics is about solving problems for people: enabling business leaders to be more confident in their decision-making, positioning cities to envision the outcome of a new program or infrastructure project before breaking ground on it, or equipping an individual with better information on potential career opportunities or safe bike routes between points in a city.

In my mind, analytics is so much more than a regression equation, time-series analysis, or optimization problem. While I won't dispute that the algorithm spinning in the background is the engine of any data-driven product, creating value through analytics is as much about great presentation and great design. It is also about integrating with the right databases, APIs, or sensors to bring information to your end-user when it is most valuable, which for many applications is as soon as it is created.

My plan for this blog is to focus on the technologies across the entire data value chain (more on that later) and how people and organizations are combining them to help people and solve problems. I understand that as blog topics go, this is very broad. I expect to jump around a little bit at the beginning as I get the hang of this and learn to parse out the interesting content from the sleep-inducing. The advantage of keeping this broad, as implied above, is that I have the opportunity to do justice to this exciting space, while in the process hopefully educating both folks who approach analytics from more of a business angle (as I did) and those who come at it from a more technical one.

And we're off...