An Adaptive Approach for Handling Messy Big Data
The Adaptive approach for handling big data relies on the following principles:
1) Agnostic to statistical tool – A frequency analysis or a cross tab is the same whether we do it in R, Python, Julia, or SAS. The savings in license cost must be weighed against the liability for a bug in the software, as well as the cost of transitioning to newer software and training people on it.
2) Believers in choosing the right database for the right data – Some things are best stored in Hadoop. Some data are best stored in a NoSQL database like MongoDB or Cassandra. Others are best in a shared RDBMS. We balance the need for diverse databases to handle diverse data against the need for a coherent and consistent ETL environment that gives businesses an easy-to-query analytics environment.
3) Right tool for the right situation – Sometimes we should use SQL to query. Sometimes we should choose another language (such as Pig or HiveQL). The analytics situation should dictate the tool, not the training needs of the analyst. If nobody on your internal analytics team knows how to parse JSON, it is cheaper to hire a consultant who knows the tools that can do it than to pursue XML just because that is what your team members are familiar with or because your statistical tool can only handle XML.
4) Following the right mix of tools and techniques – Some tools are great for some techniques. For example, R is great for data visualization. Knowing the R GUI Deducer or the KMggplot2 plugin for R Commander further cuts down the time it takes to make great visualizations. Would you like to do data mining on gigabytes of data? Maybe we should choose Python. Do you have sensitive (and huge) data for propensity modeling in financial services and need a quick and robust model? Aha! The SAS language is the tool for you. We think an adaptive approach should let the customer (you), YOUR needs, and YOUR situation dictate the mix of tools and techniques. Oh, and one more thing! All software has bugs, and there is no such thing as perfect software (created by any human, regardless of open or closed source).
5) Basic principles still apply – Yes, the year is 2014! But we still think random sampling is relevant. So is distribution analysis. Oh, and so are the Bonferroni outlier test and some (but not all) of those other statistical tests.
6) Statistics jargon can be made easier for non-statisticians to understand – The Adaptive approach adjusts itself to the audience and not the speaker. Harry Truman said, "If you can't convince them, confuse them." He would have been a great statistician and an awful data scientist. We believe in simple words to explain complex techniques.
7) Garbage in, garbage out – Okay, we did not invent this one. But we would rather use our experience with data quality and automated checks for data hygiene than rush into stage 2 and find that we missed that data column again. We adjust the analysis to the data rather than adopting an analysis and forcing the data to fit it.
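To make the "automated checks for data hygiene" in principle 7 a little more concrete, here is a minimal sketch in plain Python. It is only an illustration under our own assumptions, not a description of any particular product: the function name `hygiene_report`, the column names, and the sample records are all hypothetical. It counts missing values per required column and flags exact duplicate records before any analysis starts.

```python
from collections import Counter


def hygiene_report(rows, required):
    """Summarize basic data-quality problems in a list of dict records.

    Hypothetical helper for illustration: `rows` is a list of dicts and
    `required` is the list of columns every record is expected to fill.
    """
    missing = Counter()
    for row in rows:
        for col in required:
            value = row.get(col)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[col] += 1

    # Count exact duplicates, comparing records on the required columns only.
    keys = Counter(tuple(row.get(col) for col in required) for row in rows)
    duplicates = sum(count - 1 for count in keys.values() if count > 1)

    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


# Made-up sample records, purely for demonstration.
sample = [
    {"id": 1, "segment": "retail", "balance": 120.0},
    {"id": 2, "segment": "", "balance": 75.5},
    {"id": 3, "segment": "retail", "balance": None},
    {"id": 1, "segment": "retail", "balance": 120.0},
]

report = hygiene_report(sample, required=["id", "segment", "balance"])
print(report)
```

Running a report like this first is one way to catch "that data column" before stage 2; the same checks could just as well be written in R, SAS, or SQL.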
So that is our Adaptive approach! Did we miss something? Let us know.