Selecting the Right Software for Data Analytics and Data Integration
By Ajay Ohri.
In our previous post we discussed the adaptive approach to data integration by looking at the existing phases of the data and analytics pipeline as they relate to handling big data. In this and the next few posts, we will share our experiences as we review some software. Part of the reason for sharing is the confusion we have sometimes experienced as technical users. It seems illogical for every piece of software to be the best, doesn’t it?
Our methodology for reviewing Big Data software is as follows and is open to modification and scrutiny.
1) We ask questions regarding customization and flexibility of options. These include the differences between cloud-hosted and enterprise-site-hosted deployment, as well as Hadoop compatibility and RDBMS compatibility. The greater the fit between the offered software and the targeted database environment, the better.
2) We ask questions regarding benchmarking the particular software against its competition. We do our own benchmarks anyway, but it is important to hear a reality check from the software vendor, as well as the reasons why they made certain platform decisions.
3) We interrupt the nicely flowing demo with, you guessed it, more questions. These cover variations in data quality and dataset size and how well they are handled. What is the size of the data beyond which the software will slow down? What are some of the assumptions when cleaning the data? Neat and tidy data is unfortunately an intermediate output, not a preliminary input, in the real-life Big Data Analytics we see.
4) Black-box algorithms under the hood of machine learning are not as good as knowing exactly which algorithms are being used. Machine learning is a very broad term, and algorithms in particular come with tradeoffs in rigor and accuracy.
5) Demos are good. But a trial license to kickstart our own research on our data for a few weeks is even better. It also shows the confidence (or the lack of it) of a particular software vendor.
6) Pricing that is reasonable and flexible (for users) beats annual pricing of a few hundred thousand dollars for a big enterprise license any day. This allows customers to test and then scale up.
7) Visual solutions to showcase analysis beat tabular reports showing the same analysis. Visual and graphical cognition is still faster than tabular row-by-row cognition. At least we find it much easier to understand graphs than huge tables. Also, a minimal approach to graphs or data visualization beats flashy animation any day.
8) It is important to distinguish between “good to have” features and essential time-saving functionality. Collaboration and social integration are clearly desirable but less essential than database connectivity through JDBC.
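To make the dataset-size question in point 3 concrete, here is a minimal sketch of the kind of check we might run during a trial. It is an illustration under our own assumptions, not any vendor's method: the function names time_at_sizes and first_slowdown, the make_data and process callables, and the factor threshold are all hypothetical, and the per-row comparison is only one rough way to define "slowing down".

```python
import time


def time_at_sizes(process, make_data, sizes):
    """Run `process` on data of each size and return (size, seconds) pairs.

    `process` and `make_data` are hypothetical stand-ins for the tool under
    review and for a generator of test data of a given row count.
    """
    results = []
    for n in sizes:
        data = make_data(n)
        start = time.perf_counter()
        process(data)
        results.append((n, time.perf_counter() - start))
    return results


def first_slowdown(results, factor=2.0):
    """Return the first size whose per-row time exceeds `factor` times the
    per-row time at the smallest size, or None if no size does.

    The threshold `factor` is an assumption; choose one that fits your needs.
    """
    base_n, base_t = results[0]
    base_rate = base_t / base_n
    for n, t in results[1:]:
        if t / n > factor * base_rate:
            return n
    return None


if __name__ == "__main__":
    # Example: time Python's built-in sort on reversed integer lists.
    timings = time_at_sizes(sorted, lambda n: list(range(n, 0, -1)),
                            [1_000, 10_000, 100_000])
    for n, seconds in timings:
        print(f"{n:>8} rows: {seconds:.6f} s")
    print("first slowdown at:", first_slowdown(timings))
```

Timings from a single run are noisy, so in practice we would repeat each size several times and on our own data before drawing conclusions.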
In the following posts we will review software based on the methodology here.
Ajay Ohri is the senior data scientist consultant for Adaptive Systems Inc. He is the author of two books on R (R for Business Analytics and R for Cloud Computing: An Approach for Data Scientists).