Don’t they know how hard it is? Maybe, but no one said big data was clean and easy. In fact, most things described as ‘big’ are hard and daunting…except perhaps for Clifford, my daughter’s favorite fictional Big Red Dog.
Some describe big data as simply a data set too large to be analyzed (or even opened) in a standard spreadsheet. It requires tools such as R, SQL, or Python—that last name alone seems scary—as well as the somewhat rare (though increasingly common) skill set of a ‘data scientist’. Compounding the problem, once data is aggregated it will almost always be unstructured, which translates to messy and incomplete.
Your data will always be imperfect, falling somewhere along the spectrum between insightful, perfect data and unstructured white noise. Where your data lands depends on: 1) a clear articulation of the specific measurement you wish to capture, 2) how carefully you design your data collection systems, 3) how rigorously you enforce data validation rules, and 4) the availability of data to begin with. Each of these topics is worthy of a separate discussion that I’ll address in future postings.
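To make rule 3 a little more concrete, here’s a minimal sketch of what a data validation rule might look like in plain Python. The field names and thresholds are hypothetical—your own rules will depend on the measurement you articulated in rule 1:

```python
# A minimal sketch of rule 3 (data validation); the field
# names ("user_id", "age") and the age range are hypothetical.
def validate_record(record):
    """Return a list of rule violations for one incoming record."""
    errors = []

    # Rule: every record must identify its user.
    if not record.get("user_id"):
        errors.append("user_id is missing")

    # Rule: age, when present, must fall in a plausible range.
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        errors.append(f"age out of range: {age}")

    return errors

# A clean record passes; a dirty one reports each violation.
print(validate_record({"user_id": "u1", "age": 30}))   # []
print(validate_record({"user_id": "", "age": 150}))
```

Rejecting (or flagging) records at the point of collection is far cheaper than untangling them after aggregation.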
Often, simple is best. For example, joining a new data set might add a marginally useful insight, but it could risk generating a significant number of duplicates. Evaluate whether the trade-off makes sense and proceed with caution…and keep a backup so you can revert, just in case.
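The duplicate risk above is easy to check for. Here’s a small sketch using pandas with made-up order and shipment tables: a one-to-many join silently fans out the rows, and a simple row count before and after catches it:

```python
import pandas as pd

# Hypothetical starting data: one row per order.
orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "amount": [10.0, 25.0, 7.5]})

# A new data set to join: shipping events, where one order can
# ship in multiple packages -- a classic source of duplicates.
shipments = pd.DataFrame({"order_id": [1, 1, 2],
                          "package": ["A", "B", "C"]})

before = len(orders)
merged = orders.merge(shipments, on="order_id", how="left")

# Comparing row counts flags the fan-out before it poisons later analysis.
if len(merged) > before:
    print(f"Join added {len(merged) - before} extra rows; "
          "review before proceeding.")
```

Keeping the original `orders` frame untouched is the “backup for reversion” in code form: the merge builds a new frame rather than overwriting the old one.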
Lastly, big data analysis is only as good as the trust and decisions that flow from it. Imperfections need to be identified and disclosed so that the data set can be trusted within its stated limitations. Failing to disclose those limitations makes any big data project a potentially interesting, but ultimately useless, endeavor.