BigDataHerd

Data Integration and Data Cleaning Software Reviews

In this blog post we review demos of four next-generation software products for data integration and data cleansing: ClearStory, Trifacta, Paxata and Tamr.
ClearStory is a story in progress
ClearStory (http://www.clearstorydata.com/)
ClearStory is software that promises to turn the art of data science into the art of telling data stories; you need not be a data scientist to use it. Our first impression is that it tries to be good at everything along the data pipeline without being great at anything. It mixes Salesforce-like collaboration (in the style of Chatter) with augmentation from additional data sources, a lot of data visualization, and some added functionality for creating derived variables.

ClearStory's version of data visualization is like Tableau "lite", but it isn't quite there. Data harmonization relies on additional sources and on Apache Spark. If Apache Spark really takes off as the next generation of Hadoop projects for faster Big Data processing, ClearStory could occupy a unique and advantageous niche, provided it manages to marry collaborative data visualization for Big Data with enhanced sources and custom analytics functions. To get there, it would clearly have to deepen its pool of analytics functions while keeping the interface simple and intuitive enough for end users to use and share with others.
Strengths
Integrated solution from data input to data visualization
Good mix of social collaboration with data science
Additional data sources library is available
Data from APIs has good functionality
Weaknesses
No on-site deployment version; over-reliant on the cloud-deployed version.
Data quality and transformation features are not impressive and need more rigor to handle a wider variety of data quality errors.
Opportunities
Better data visualization and analytics could make this a more integrated visual BI solution built on next-generation Hadoop software.
ClearStory tries to be good at everything and, in its current iteration, is not best in class at anything. The good pedigree of the people involved makes this an interesting startup to keep an eye on, given rapid changes in technology. Its overall unique selling point, making data science a collaborative story to share, seems to be a differentiating innovation.
Trifacta impresses with machine learning to cleanse data
Trifacta (http://www.trifacta.com/) aims to reduce the pain of data cleansing. It does so by recommending data quality fixes based on machine learning. However, Trifacta is further down the curve: it is almost completely Hadoop-based and has no direct connection to an RDBMS. Since most data cleansing needs to occur on both historic data and new data feeds, we feel that adding a connector, or even some RDBMS capability, could greatly enhance Trifacta's appeal.
Paxata is the best in class for data transformation for Big Data
We found Paxata (http://www.paxata.com/) to be the best in class for Big Data transformation and quality enhancement. We were further heartened by the fact that it can use both Hadoop-based and RDBMS-based data. An even more encouraging sign is that it has both cloud-based and enterprise deployment options. This is a well-funded startup whose platform is improving rapidly. Paxata is also affordable and offers a trial version. Another positive is its honesty and transparency about which algorithms are used for imputing invalid data, a welcome change from the black-box approach of other software.
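To make the point about transparent imputation concrete, here is a minimal Python sketch. It is not Paxata's algorithm, which the demo did not expose in code form; it assumes a simple median rule, and the function name impute_with_median and the sample ages list are hypothetical, chosen only for illustration. The idea it shows is that every filled-in value is recorded along with the rule that produced it.

```python
from statistics import median


def impute_with_median(values):
    """Replace None entries with the median of the valid entries.

    Hypothetical helper for illustration only. Returns the cleaned list
    plus a log of every change, so the imputation rule stays visible
    instead of being hidden in a black box.
    """
    valid = [v for v in values if v is not None]
    if not valid:
        raise ValueError("no valid values to compute a median from")
    fill = median(valid)

    cleaned, audit_log = [], []
    for index, value in enumerate(values):
        if value is None:
            cleaned.append(fill)
            # Record which row changed, which rule was used, and the new value.
            audit_log.append({"row": index, "rule": "median", "filled_with": fill})
        else:
            cleaned.append(value)
    return cleaned, audit_log


# Made-up sample column with two missing entries.
ages = [34, None, 29, 41, None]
cleaned, audit_log = impute_with_median(ages)
print(cleaned)
print(audit_log)
```

A reviewer can read the audit log to see exactly which rows were changed and why, which is the kind of openness we are contrasting with black-box cleansing.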
Tamr tames wild data pretty well
Tamr (http://www.tamr.com/) is a startup out of Boston, fresh out of MIT. It has been built with a clear focus: taming messy data and the lack of consistency (or variety) in input data. Tamr offers both cloud and on-premise deployments for enterprises. Tamr actually sees Trifacta and ClearStory as complementary software. The best thing about Tamr is its emphasis on simple transformations and on viewing data through relationships. However, the user interface is clearly a work in progress. RESTful APIs are another plus point for Tamr.
Overall summary: We found these software solutions a refreshing change from the old Business Intelligence software we have been seeing. However, one drawback is the rapid change in their models, which can cause risk-averse large enterprises to pause before investing in large-scale data integration. One additional observation is that product development clearly seems to be running ahead of documentation; the demos were quite a revelation compared to the information we could find on the vendors' websites.
A lot of time is spent on data integration, and data analysis is only as good as the data it gets. Test out these new-age software solutions and let us know how you think they can fit in and drive additional productivity out of your data silos.
