Want to get beyond data grunt work? Try these steps for data setup.

With so much information spread out across various workflow software, Excel files, and even scraps of paper, it takes enormous effort to gather all the data and normalize it so everything can work together toward a consistent story. It’s no surprise that the grunt work of “data wrangling” takes up to 80% of the effort in a typical data project.

How do you keep the data wrangling to a minimum? Setting up your data in the right way is key. I’ll offer a few best practices below.

Know what you’re getting
When a retail supplier ships goods to a store, everyone is in agreement on what’s being shipped—so the folks in the store can easily take the items off the truck and onto the shelves.

Shipping data is no different. As clients (including internal clients) hand their data over to you, one of the first questions you need to be able to answer is: What data am I receiving right now? The more precisely you can answer that question, the more effectively you’ll avoid ambiguities—so you can simplify data processing down the road.

For this to work, it’s important that both the providing party and the accepting party agree on the data content. This means being as explicit as possible about nuances like how the data was collected (rarely obvious without some kind of metadata). It also means pinning down ambiguous terms—like whether the “dollars” you’re referring to are US or Canadian (you’d be amazed how often people get tripped up on that one).
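To make that agreement concrete, I like to write it down as a lightweight “data contract” and check every delivery against it. Here’s a minimal sketch in Python with pandas; the column names, types, and units are hypothetical, and note that the contract spells out the currency (“revenue_cad”) rather than leaving “dollars” ambiguous:

```python
import pandas as pd

# A hypothetical, hand-rolled "data contract": the columns we expect,
# their types, and the units we agreed on with the provider.
CONTRACT = {
    "order_date": "datetime64[ns]",
    "store_id": "int64",
    "revenue_cad": "float64",  # Canadian dollars, agreed explicitly
    "units_sold": "int64",
}

def check_delivery(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found in an incoming delivery."""
    problems = []
    for col, dtype in CONTRACT.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    extras = set(df.columns) - set(CONTRACT)
    if extras:
        problems.append(f"unexpected columns: {sorted(extras)}")
    return problems
```

The specific checks matter less than the habit: both parties can read the contract and agree on exactly what a good delivery looks like.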

Repeat it back

At the end of the data acceptance process, you have to feel that you own and understand the content. That means part of acceptance is articulating assumptions, looking at summaries and trends, and comparing the data to other sources. To make acceptance really work, take the time to articulate back to the provider what you found in the data. That makes the data trusted and memorable. It also introduces common terminology and gets everyone on the same page. And you’d be amazed how much that articulation forces you to conquer data formatting problems up front.
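One easy way to structure that play-back is a one-page profile of what you received. Here’s a minimal sketch, again in pandas and with the same hypothetical delivery in mind:

```python
import pandas as pd

def delivery_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Profile a delivery: one row per column, ready to play back
    to the provider alongside your assumptions."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })

# A monthly trend to sanity-check with the provider, assuming the
# hypothetical columns from the contract above:
# df.resample("M", on="order_date")["revenue_cad"].sum()
```

If a number in that profile surprises you, it will probably surprise the provider too, and that conversation is exactly the point.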

Finding the analysis-ready data points

Once you have all the data in hand, your next job is to figure out which data will actually be useful for your analyses—and what is best set aside. You’ll need to strike a subtle balance between storing as much data as possible—and not trying to boil the ocean. To find that balance, ask yourself a few critical questions:

  • What data readily falls into the scope of the project, what clearly falls outside of it—and what lands in between?
  • What data will provide the best insight? Compare multiple sources for quality, granularity, data collection methodologies, and the resulting differences in data volume, coverage, and trends over time (there’s a sketch of this kind of comparison right after the list). When you see the range of the data you have, you’ll have a better grasp on which data is the most valuable—and which might be a waste of time, given your other options.
  • Where can it fit? Some data can be accepted whole; other data might only become valuable when mashed up with other sources to create a full picture.
  • Where are the hidden gems? A lot of data will contain interesting facts that might look like errors or inconsistencies to an untrained eye. Keep an open, creative perspective that lets you realize where an “unwanted” piece of data might really be valuable—so you can uncover the less-obvious data points that are especially (and unexpectedly) worth keeping.
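As promised after the second bullet, here’s what a side-by-side source comparison might look like. It’s a sketch under assumptions: the source names, key column, and date column are all hypothetical, and a real comparison would add quality dimensions that don’t fit in one table:

```python
import pandas as pd

def compare_sources(sources: dict[str, pd.DataFrame],
                    key: str, date_col: str) -> pd.DataFrame:
    """Compare candidate sources on volume, coverage, and time span."""
    rows = []
    for name, df in sources.items():
        rows.append({
            "source": name,
            "rows": len(df),
            "unique_keys": df[key].nunique(),
            "pct_missing_key": round(df[key].isna().mean() * 100, 1),
            "first_date": df[date_col].min(),
            "last_date": df[date_col].max(),
        })
    return pd.DataFrame(rows).set_index("source")

# e.g. compare_sources({"pos": pos_df, "erp": erp_df},
#                      key="store_id", date_col="order_date")
```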

Beware of over-normalizing

Once you have all your data in hand, you’ll want to normalize and harmonize it into something consistent and usable that lends itself readily to insights. This is where the shape of the data becomes important. But beware of over-normalization. Since normalization reduces information to just a few variables (to make apples-to-apples comparisons easier), there’s always the risk of scrubbing the data so well that you rub away the critical nuances. And those nuances can be painstaking to put back into the mix once you realize you need them. To avoid that, be deliberate up front in defining your analytical variables. In other words, keep your data’s “native tongue” intact: if it speaks in terms of marketing spend, revenues, or units in stock, don’t translate it into generic amounts and counts (even if your “inner engineer” is pushing you toward common terminology).
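To make that concrete, here’s a small sketch of normalizing the format of the data without flattening its vocabulary. The raw columns are hypothetical:

```python
import pandas as pd

raw = pd.DataFrame({
    "Mktg Spend ($K)": ["12.5", "9.8"],
    "Units In Stock": ["1,204", "987"],
})

# Clean up names and types, but keep the domain vocabulary intact:
# "marketing_spend" and "units_in_stock" stay distinct variables,
# rather than collapsing into generic "amount" and "count" columns.
tidy = raw.rename(columns={
    "Mktg Spend ($K)": "marketing_spend_kusd",
    "Units In Stock": "units_in_stock",
})
tidy["marketing_spend_kusd"] = tidy["marketing_spend_kusd"].astype(float)
tidy["units_in_stock"] = (
    tidy["units_in_stock"].str.replace(",", "", regex=False).astype(int)
)
```

The formats are now consistent, but the data still speaks its native tongue.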

Of course, none of what I’ve described here is exhaustive. And I’ve left out the ways that automation can be an enormous resource—a topic I hope to return to in a follow-up post on data agility. But what I hope I have offered is a catalyst to get you thinking about taking the data wrangling out of the data process—and putting the real analysis back in.