


5 Key Components of a Solid Big Data Strategy

For most organizations, the answers to many questions lie in Big Data – the massive volumes of structured and unstructured data generated both inside and outside the organization. Being able to analyze all of this data in a meaningful way can be a daunting task if the proper infrastructure is not in place, and if you don’t have the means to process data from multiple sources quickly and effectively. And once you have processed it, it’s a whole other battle to make it meaningful to the people in your organization who need to understand it. To help organizations build the right Big Data strategy, here are five key components they should consider:


  1. Establish a common data model. Ensure that all of your data is centralized in a common data model to provide a single accurate view of the business. The common data model establishes conventions such as fields, naming, attributes and relationships so that everything is aligned across transactional and other systems.
  2. Harness the power of external data. Truly capturing meaning from Big Data means effectively integrating foundational data from internal data sources with external data from third-party environments (e.g. vendor data, social media and demographics). The platform must be able to harness information in multiple ways, from structured databases and distributed predictive analytic systems to mining unstructured data.
  3. Focus on scalability and open standards. By using an open-standards platform, organizations can leverage existing systems while reducing IT costs and gaining flexibility in terms of serving the business. Systems adhering to open industry standards are readily available and are preferred to proprietary systems for a number of reasons, not the least of which is their ability to integrate with existing legacy systems, systems from multiple other vendors and future add-on solutions.
  4. Model once, consume anywhere. Today, information can be accessed on almost any mobile device, from cloud-based netbooks to in-store portals. Organizations need a common infrastructure for producing and delivering enterprise reports, scorecards, dashboards, and ad-hoc analysis, while empowering end users with real-time, 24/7 access to self-service BI, mobile BI, and the ability to create their own BI content and personalized dashboards through a simple point-and-click interface.
  5. Provide users with actionable insights. Users need to be able to act on information without leaving one application and opening another. This type of closed-loop, cross-domain analytics ensures that Big Data has an immediate, beneficial impact on day-to-day operations.
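As a sketch of what point 1 looks like in practice, here is a minimal common data model in Python. The entities, field names and the hypothetical CRM export format are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical shared definitions: every source system maps its records
# into these conventions (field names, types, relationships) before loading.
@dataclass
class Customer:
    customer_id: str      # canonical key shared across CRM, ERP and web systems
    name: str
    segment: str          # e.g. "retail", "wholesale"

@dataclass
class Order:
    order_id: str
    customer_id: str      # foreign key back to Customer.customer_id
    order_date: date
    total_usd: float

def normalize_crm_record(raw: dict) -> Customer:
    """Map a source-specific record (here, an invented CRM export with
    its own field names) onto the common model."""
    return Customer(
        customer_id=raw["CustID"].strip().upper(),
        name=raw["FullName"].strip(),
        segment=raw.get("Seg", "unknown").lower(),
    )

cust = normalize_crm_record({"CustID": " c-1001 ", "FullName": "Acme Corp", "Seg": "Wholesale"})
```

The point is that every source system passes through one normalizer, so a field like `customer_id` means the same thing everywhere downstream.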

Establishing the foundation for leveraging Big Data is worth the effort.  When business users can take action right from the retail analytics dashboard, the impact on operations and customer experience is immediate.

Leveraging Big Data to Improve Customer Satisfaction

Companies today understand that improving customer satisfaction is vital to their success, and that it means more than simply tracking complaints. Combining structured data from areas such as sales, marketing and supply chain with unstructured or semi-structured data from surveys, syndication data and other outside sources can give companies a new perspective on their customers.


For example, merging structured with unstructured content to surface underlying customer satisfaction issues lets enterprises proactively monitor satisfaction levels. In many organizations, sales and customer service still work in separate silos, and customer feedback often does not flow freely between the different operations, resulting in ineffective distribution channels. A COO would be interested in the convergence of sales information, call center operations and social media: Big Data can correlate product sales, support and the customer voice to validate the true issues impacting customer satisfaction. And for targeting new customer segments, even competitors’ customers can be analyzed for industry trends that reveal a propensity to buy certain products or services.

Another customer satisfaction challenge solved by Big Data is identifying the most valuable customers from a 360-degree view, with the goal of presenting them with offers and benefits relevant to their interests, while excluding customers who merely take advantage of discounts without any loyalty to the merchant. Store operations, customer service, and to some extent marketing would be interested in this solution to get the most benefit from sales and promotions. The purpose is to keep loyal customers by making them feel rewarded and special, and these insights enable better focus and less waste in that effort.

Predictive Analytics: Moving beyond the buzzword to the action

By Ajay Ohri

Predictive analytics has been around for some time, championed by the likes of SAS, SAP and IBM. While data science and big data analytics bring hope, joy, despair and confusion to CIOs, it is acknowledged that predictive analytics is a tested and mature industry. By extension, then, we should see predictive analytics in most industries and corporations around us. But do we? Is predictive analytics even close to reaching its full utilization and potential? The answer is a resounding “no” – even though awareness of analytics is increasing and more and more people are exploring it.


The true benefits that predictive analytics-led decision making imparts to an enterprise are many, and they keep getting better. The reason is quite simple: data storage has become much less expensive, data pipelines have increasingly been digitized end to end, and the basic building blocks for predictive analytics, including business reporting and baseline metrics, already exist in a majority of organizations.

What predictive analytics can do is give you a lift over traditional decision making, including historic planning and naive forecasting paradigms. An added emphasis is not just on increasing revenue but on decreasing costs: an enterprise realizes a greater advantage from cutting a single dollar of costs than from adding a single dollar of revenue, because cost decreases go straight to profit. A more methodical way of utilizing predictive analytics is key to a good return on investment.

We can now predict which employees are going to leave and reduce attrition slightly; cut litigation costs by analyzing legal expenses; tie website and social media analytics to text mining and social network models to analyze interactions and relationships; build clickstream models for enhanced digital revenue; use lifetime-value modeling for customer revenue; and apply recency, frequency and monetization for segmentation. The decreased hardware costs brought by the cloud computing paradigm, and the added software capability to handle big data using distributed paradigms like Hadoop, have further ushered in a golden age of analytics.
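One of the techniques mentioned above, recency-frequency-monetization segmentation, can be sketched in a few lines of Python. The transaction data is invented and the raw scores are left unbucketed; this is an illustration, not a production scoring scheme:

```python
from datetime import date
from collections import defaultdict

def rfm_scores(transactions, today):
    """Compute recency (days since last purchase), frequency (number of
    purchases) and monetization (total spend) per customer."""
    last_seen = {}
    freq = defaultdict(int)
    spend = defaultdict(float)
    for cust, d, amount in transactions:
        freq[cust] += 1
        spend[cust] += amount
        if cust not in last_seen or d > last_seen[cust]:
            last_seen[cust] = d
    return {c: ((today - last_seen[c]).days, freq[c], spend[c]) for c in freq}

# Hypothetical (customer, date, amount) transactions:
txns = [
    ("a", date(2014, 1, 5), 20.0),
    ("a", date(2014, 3, 1), 35.0),
    ("b", date(2013, 11, 20), 500.0),
]
scores = rfm_scores(txns, date(2014, 3, 15))
# scores["a"] -> (14, 2, 55.0): recent, repeat, modest spend
# scores["b"] -> (115, 1, 500.0): lapsed but high-value
```

A real deployment would bucket these raw numbers into quantiles and assign segments, but the three dimensions are exactly the ones named in the post.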

It is not enough to do analytics, though; one must model the right data, question one’s assumptions constantly, and refine the models. Some critical missteps in predictive analytics are ignoring basic infrastructure like data quality and master data management, expensive capital outlays due to legacy reasons, and risk aversion toward both cloud computing hardware and inexpensive open source software. Another is failing to keep adequate test-and-control (champion and challenger) strategies, which leads to inaccurate baselining. Peter Drucker said, “Culture eats strategy for breakfast.” Alongside a predictive analytics strategy one needs an analytical culture as well, where data-driven questioning and investigation is both encouraged and rewarded.

How do you measure the return on investment of predictive analytics? How do you measure and analyze which software in your suite of analytics choices will give you better ROI? In a forthcoming post we will discuss how the analytics industry can provide even greater depth and breadth of analysis, in addition to implications for customer data.

Considering Healthy Data before Big Data


We’ve been giving some thought to the big data trend and the effect it’s had on the business intelligence space. The hype alone has pushed many big businesses to amass huge amounts of structured and unstructured data with the intent of gaining strategic insight and a competitive advantage. In some instances, the collection and analysis of big data is seen as integral to an organization’s genetic code – think behemoths like Walmart and Target.

The benefits are easy to ascertain: enterprise-wide insight, unprecedented dialogue with consumers and the ability to redevelop your products based on up-to-the-minute feedback.


Yet the pursuit of volume brings with it the pesky (and often overlooked) issue of data quality. Considering that the best of decisions are made by leveraging sound data, putting the issue of quality front and center may well turn out to be your company’s best asset.

Let’s touch on three essential reasons why this is so:

1. Excellent Data Quality = well-made decisions

Although data quality may never be as “sexy” a topic as mobile BI or the cloud, it will always set the stage for skillful analysis and for staying a few leaps ahead of the competition (who may not be as committed to the ongoing discipline).

Achieving quality may involve a complex process: establishing common definitions (naming conventions; what defines a customer? a prospect?), executing an initial data cleanse, and maintaining order via ongoing data monitoring, ETL and other technologies.
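A minimal sketch of what that initial cleanse step might look like, assuming hypothetical customer records keyed by email (the field names and rules are illustrative):

```python
def clean_customers(rows):
    """Initial cleanse: trim whitespace, standardize case, and drop
    duplicate records by a canonical key (lowercased email)."""
    seen = set()
    cleaned = []
    for row in rows:
        key = row["email"].strip().lower()
        if not key or key in seen:
            continue                      # skip blanks and duplicates
        seen.add(key)
        cleaned.append({"email": key, "name": row["name"].strip().title()})
    return cleaned

raw = [
    {"email": " Ann@Example.com ", "name": "ann smith"},
    {"email": "ann@example.com", "name": "Ann Smith"},   # duplicate of the first
    {"email": "bob@example.com", "name": " bob jones "},
]
cleaned = clean_customers(raw)   # two records survive, normalized
```

Real cleansing adds validation, survivorship rules and monitoring, but even this toy version shows why common definitions (what counts as “the same customer”?) have to come first.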

Organizations may even consider hiring a Data Quality Manager and a supporting team to govern data processes, including migration, manipulation and analysis. The Data Quality Team must also ensure that reliable information is loaded into the data warehouse. Responsibilities may cross organizational silos, in that both IT and business users will need to step up and take responsibility for this initiative.

Keep in mind that many BI systems allow users to write back directly to the data source. If this is the case within your organization, it’s imperative that the user is authorized so as not to corrupt the system with erroneous data. In short, it takes strong communication and a team acting in tandem to ensure superior data quality.

Ultimately, success will depend upon trust. Users must believe that the information they are analyzing is “healthy,” timely and accurate in order to use it well.

Enforcing good data practices, particularly at the source system level, will boost your credibility, reinforce sound analysis and save you volumes of time-consuming inefficiencies down the road.

2. Compliance Matters

In a post-Enron world, it’s wise to consider that poor data quality may be leaving you out of compliance with the law. Although Sarbanes-Oxley (SOX) primarily affects public companies, businesses that undergo a merger, acquisition or IPO must also comply, lest they face fines and potential lawsuits. Just another reason to keep data quality “top of mind” and perhaps allot a little extra to your data quality column during budget season.

3. It’s ineffectual to ignore

We won’t hit you with a bunch of catastrophic stories about the side effects of “unhealthy” data. We’d rather you consider the lost productivity of workers tasked with fixing data problems instead of performing their actual work, which leads to outsized costs in the end.

It pays to keep in mind that a task may cost approximately 10 times more when the data isn’t clean, compared to when it is. So go ahead and figure out the cost of a particular task and then multiply it by 10. That number should be enough to persuade you to dive in and make an investment in data quality, considering that the ROI is so easy to prove.
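That back-of-the-envelope multiplication can be made concrete. The task counts and per-task cost below are invented purely for illustration:

```python
def total_task_cost(task_cost, clean_tasks, dirty_tasks, multiplier=10):
    """Annual task cost when dirty-data tasks run ~10x the clean-data cost."""
    return clean_tasks * task_cost + dirty_tasks * task_cost * multiplier

# Hypothetical numbers: 1,000 tasks a year at $50 each, 200 of them
# hampered by dirty data.
all_clean = 1000 * 50                    # $50,000 if every task ran on clean data
actual = total_task_cost(50, 800, 200)   # $40,000 + $100,000 = $140,000
```

Even with only a fifth of the tasks affected, the dirty-data penalty nearly triples the annual bill, which is the kind of number that makes the data quality budget line easy to defend.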

We hope we’ve convinced you to circle back to square one when mulling over the big data trend. It’s not that bigger isn’t better, but without data quality soundness, the hype is just an empty promise. Better to allocate your resources wisely concerning quality, than to face the music on a down note of lost time and effort.

An Adaptive Approach for Handling Messy Big Data

The Adaptive approach for handling big data relies on the following principles


1)  Agnostic to statistical tool – A frequency analysis or a cross tab is the same whether we do it in R, Python, Julia or the SAS language. License cost tradeoffs must be balanced against the liability of a bug in the software as well as the cost of transitioning and training to newer software.

2)  Believers in choosing the right database for the right data – Some things are best stored in Hadoop. Some data are best stored in a NoSQL store like MongoDB or Cassandra. Other data are best kept in a traditional RDBMS. We balance the need for diverse databases to handle diverse data with the need for a coherent and consistent ETL environment that can give businesses an easy-to-query analytics environment.

3)  Right tool for right situation – Sometimes we should use SQL to query. Sometimes we should select another language (including Pig or HiveQL). The analytics situation should dictate the tool, not the training needs of the analyst. If your internal analytics team has no one who knows how to parse JSON, it is cheaper to hire a consultant who knows the right tools than to pursue XML just because that is what your team members are familiar with or what your statistical tool can handle.
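For instance, parsing a JSON feed takes only a few lines in a general-purpose language. The clickstream record below is a made-up example, not a format from any particular vendor:

```python
import json

# A hypothetical clickstream event delivered as JSON.
raw = '{"user": "u42", "events": [{"page": "/home", "ms": 1200}, {"page": "/cart", "ms": 4300}]}'

record = json.loads(raw)                          # string -> nested dicts/lists
pages = [e["page"] for e in record["events"]]     # the pages visited, in order
total_ms = sum(e["ms"] for e in record["events"]) # total time on site
```

The skill is cheap to acquire or hire, which is the point: the data format should drive the tooling, not the other way around.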

4)  Following the right mix of tools and techniques – Some tools are great for some techniques. For example, R is great for data visualization, and knowing the R GUI Deducer or the KMggplot2 plugin for R Commander further cuts down the time to make great visualizations. Would you like to do data mining on gigabytes of data? Maybe we should choose Python. Do you have sensitive (and huge) data for propensity modeling for financial services and need a quick and robust model? Aha! The SAS language is the tool for you. We think an adaptive approach should let the customer (you) and YOUR needs and YOUR situation dictate the mix of tools and techniques. Oh, and one more thing! All software has bugs; there is no such thing as perfect software (created by any human, open or closed source).

5)  Basic principles still apply – Yes, the year is 2014, but we still think random sampling is relevant. So is distribution analysis. Oh, and so are Bonferroni outlier tests and some (but not all) of those statistical tests.
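Those basics are cheap to apply. Here is a rough sketch of random sampling plus a Bonferroni-style outlier screen on synthetic data; the 3.29 cutoff is the two-sided normal quantile for a 0.001 per-test level, and this is a simplification of a formal Bonferroni outlier test, not a replacement for one:

```python
import random
import statistics

def bonferroni_outliers(values, z_crit=3.29):
    """Flag values whose z-score exceeds a Bonferroni-style cutoff.
    3.29 (rather than the usual 1.96) reflects that every point is
    being tested, so the per-test level must be much stricter."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sd > z_crit]

random.seed(42)
data = [random.gauss(100, 5) for _ in range(500)] + [200.0]  # one planted gross outlier

sample = random.sample(data, 50)      # random sampling is still relevant
flagged = bonferroni_outliers(data)   # the planted 200.0 should be flagged
```

Distribution analysis would come next (is the data even roughly normal?), since the z-score screen above quietly assumes it is.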

6) Statistics jargon can be made easier for non-statisticians to understand – The Adaptive approach adjusts itself to the audience, not the speaker. Harry Truman said, “If you can’t convince them, confuse them.” He would have been a great statistician and an awful data scientist. We believe in using simple words to explain complex techniques.

7)  Garbage in, garbage out – Okay, we did not invent this one. But we would rather use our experience with data quality and automated checks for data hygiene than rush into stage 2 and find that we missed that data column again. Adjust the analysis to the data rather than adopting an analysis and forcing the data to fit it.

 So that is our Adaptive approach! Did we miss something? Let us know.

Selecting the Right Software for Data Analytics and Data Integration

By Ajay Ohri.

In our previous post we talked about the adaptive approach to data integration, looking at the existing phases of the data and analytics pipeline as related to handling big data. In this and the next few posts we will share our experiences as we review some software. Part of the reason for sharing is the confusion we have sometimes experienced as technical users. It seems illogical for every software product to be the best, doesn’t it?


Our methodology for reviewing Big Data software is as follows and is open to modification and scrutiny.

1) We ask questions regarding customization and flexibility of options. These include the difference between cloud hosted and enterprise site hosted. This also includes features of Hadoop compatibility and RDBMS compatibility. The greater the fit between offered software and the targeted database environment, the better it is.

2) We ask questions regarding benchmarking the particular software against its own competition. We do our own benchmarks anyway but it is important to listen to a reality check from the software vendor as well as the reasons why they made certain platform decisions.

3) We interrupt the nicely flowing demo by, you guessed it, more questions. These include variations for data quality and dataset size and how well that is handled. What is the size of the data beyond which the software will slow down? What are some of the assumptions when cleaning the data? Neat and tidy data is unfortunately an intermediate output, not a preliminary input in the real life Big Data Analytics we see.

4) Black box algorithms under the hood of machine learning are not as good as knowing exactly which algorithms are being used. Machine learning is a very broad term, and algorithms come with particular tradeoffs in rigor and accuracy.

5) Demos are good. But a trial license to kickstart our own research on our data for a few weeks is even better. It also shows confidence (or the lack of it) in a particular software vendor.

6) Pricing that is reasonable and flexible (per user) beats annual pricing of a few hundred thousand dollars for a big enterprise license any day. This allows customers to test and then scale up.

7) Visual solutions to showcase analysis beat tabular reports showing the same analysis. Visual and graphical cognition is still faster than row-by-row tabular cognition; at least we find it much easier to understand graphs than huge tables. Also, a minimal approach to graphs or data visualization beats flashy animation any day.

8) It is important to distinguish between “good to have” features and essential time-saving functionality. Collaboration and social integration are clearly desirable but less essential than database connectivity through JDBC.

In the following posts we will review software based on the methodology here.

Ajay Ohri is the senior data scientist consultant for Adaptive Systems Inc. He is the author of two books on R (R for Business Analytics and R for Cloud Computing: An Approach for Data Scientists).

Data Integration and Data Cleaning Software Reviews

In this blog we review demos of four next-generation software products meant for data integration and data cleansing: ClearStory, Trifacta, Paxata and Tamr.
ClearStory is a story in progress
ClearStory (http://www.clearstorydata.com/ )
ClearStory is software that promises to turn the art of data science into the art of telling data stories. You need not be a data scientist to use ClearStory. Our first impression is that it is trying to be good at everything without being great at anything along the data pipeline route. There is a mixture of Salesforce-like collaboration (Chatter), some augmentation of sources, a lot of data visualization and some added functionality for creating derived variables.

ClearStory’s version of data visualization is like Tableau Software “lite,” but it isn’t quite there. Data harmonization comes from additional sources and from leveraging Apache Spark. If Apache Spark really takes off as the next generation of Hadoop projects for faster Big Data processing, ClearStory could occupy a unique and advantageous niche if it manages to marry collaborative data visualization for Big Data with enhanced sources and custom analytics functions. However, it would quite clearly have to deepen its pool of analytics functions while keeping the interface simple and intuitive enough for end users to use and share with others.
- Integrated solution from data input to data visualization
- Good mix of social collaboration with data science
- Library of additional data sources is available
- Data from APIs has good functionality
- No on-site deployment version; over-reliant on the cloud-deployed version
- Data quality and transformation are not impressive and need more rigor for a higher variety of data quality errors
- Better data visualization and analytics could make this a more integrated visual BI solution based on next-gen Hadoop software
ClearStory tries to be good at everything without yet being best in class at anything in its current iteration. The good pedigree of the people involved makes this an interesting startup to keep an eye on, given rapid changes in technology. The overall unique selling point of making data science a collaborative story to share seems a differentiating innovation.
Trifacta impresses with machine learning to cleanse data
Trifacta (http://www.trifacta.com/) aims to reduce the pain of data cleansing. It does so by imputing data quality based on machine learning recommendations. However, Trifacta is further down the Hadoop curve: it is almost completely Hadoop based and does not have a straight connection to an RDBMS. Since most data cleansing needs to occur on historic as well as new data feeds, we feel that adding a connector or even some RDBMS capability could greatly enhance Trifacta’s appeal.
Paxata is the best in class for data transformation for Big Data
We found Paxata (http://www.paxata.com/) the best in class for Big Data transformation and quality enhancement. We were further heartened by the fact that it can use data that is both Hadoop and RDBMS based. An even more encouraging sign is that it has both cloud-based and enterprise-based deployment options. This is a well-funded startup with rapid improvements to its platform. Overall, Paxata is affordable and has a trial version. Another positive is the honesty and transparency about which algorithms are used for imputing invalid data, a change from the black box approach of other software.
Tamr tames wild data pretty well
Tamr (http://www.tamr.com/) is a startup out of Boston, fresh out of MIT. It has been built with a clear focus: taming messy data and the lack of consistency in input data (or variety). Tamr has both cloud and on-premise deployments for enterprises. Tamr actually sees Trifacta and ClearStory as complementary software. The best thing about Tamr is the emphasis on simple transformations and on seeing data through its relationships. However, the user interface is clearly a work in progress. RESTful APIs are another plus point for Tamr.
Overall summary – We found these software solutions a refreshing change from the old Business Intelligence software we have been seeing. However, one drawback is the rapid change in their models, which can cause risk-averse large enterprises to pause before investing in large-scale data integration. We also observed that product development clearly seems to be running ahead of documentation; the demos were quite a revelation compared to the information we could find on their websites.
A lot of time is spent on data integration, and data analysis is only as good as the data it gets. Test out these new-age software solutions and let us know how you think they can fit in and drive additional productivity out of your data silos.

Four Essential Drivers of Cloud BI


These days, an enterprise’s ability to respond to change in a cost effective manner still has a “sink or swim” impact on its success. That’s probably why in recent years, there’s no shortage of talk surrounding business agility.

Cloud computing has emerged as a major driver of business effectiveness, as it enables the automated scalability of IT resources at a moment’s notice in response to internal or external demands. Inevitably, when full resources aren’t required, this drives cost savings and delivers an immediate ROI.


Business intelligence is also a mainstay of business agility, as it dispenses massive insight into company data, along with the ability to make strategic decisions and forecasts in an up-to-the minute, on-the-fly fashion.

As the assets of these two technologies merge in the form of cloud BI, it’s easy to see why their popularity is on the rise— over one third of companies deploy cloud components in their business intelligence strategy. Such agility has a compelling impact on performance. It’s fair to say that cloud computing is fast becoming a veritable business strategy, based on the following drivers:

1) Reduced overall IT costs

2) Up-to-the-minute flexibility

3) Speed of implementation

4) Hardware and software maintenance reduction

Examining these drivers in greater detail leads to some pretty interesting findings, which we’ll discuss below:

1. Reduced Cost

Departments often put off getting the BI system that they know will enhance their performance, because they dread having to go through the capital expenditure approval process, not to mention absorbing the costs of pricey hardware and upgrade costs. They also don’t want to have to face the challenge of reallocating staff to manage BI infrastructure.

The carrot that cloud BI dangles is certainly one of reduced costs and headaches—both are viable arguments in favor of cloud BI.

But keep in mind that costs are primarily reduced when it comes to hardware. In terms of actual software costs, phrases like “reduced cost up-front” or “quicker time to ROI” are actually closer to the truth.

For instance, many hosted BI solutions offer cloud BI licensing as an option and many cloud BI solutions are SaaS (software as a service) solutions. User licenses are paid for on a monthly or yearly basis and companies don’t truly own them; they simply pay for use of the software. The applications are hosted outside of the enterprise and accessed via the internet.

Companies that can’t afford a capital expenditure for the upfront costs of BI software licenses often opt for this model. But once the numbers are crunched, it’s clear that after paying their monthly fee for two years, they would “break even” on what would have been spent on an up-front software license purchase.
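The break-even arithmetic is easy to sketch. The license and fee figures below are hypothetical, not actual vendor pricing:

```python
def breakeven_months(upfront_license, monthly_fee):
    """Months of SaaS payments until cumulative fees match an
    up-front perpetual-license purchase."""
    months, paid = 0, 0.0
    while paid < upfront_license:
        paid += monthly_fee
        months += 1
    return months

# Hypothetical figures: a $48,000 up-front license versus $2,000/month
# SaaS reaches break-even at the two-year mark.
months = breakeven_months(48000, 2000)
```

Beyond that point the subscription costs more in total, so the SaaS model is really a trade of long-run cost for low up-front commitment and faster time to ROI.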

2. Flexibility

Flexibility is all about the ability to match service capacity to the fluctuating demands of business users, as well as the capacity to select from a wide variety of cloud BI deployment models.

Cloud computing has the edge in malleability, in that it allows resources to be scaled high or low, depending on a company’s current needs. With SaaS BI, it’s easier and more cost-efficient to add users to an application, compared to the process involved with traditional on-premise. With the SaaS model, whether you are adding 5 or 500 users to a dashboard, you simply pay for an additional “seat” and supply the user(s) with appropriate credentials for accessing the dashboard, using a web browser.

On the other hand, adding a user to an on-premise BI application requires installing the software on the local computer, perhaps upgrading the computer, and then syncing the application with the other computers on the network. It’s definitely a more cumbersome, if not more expensive, process.

With technologies like Amazon Redshift, a cloud data warehouse solution, managing vast amounts of data in the cloud becomes not only flexible, but scalable and affordable. Users can access any number of servers on demand and scale BI deployments up or down as needed.

And at less than $1,000 per terabyte per year, Redshift is a fraction of the cost of most traditional data warehousing solutions. It can easily be deployed as part of a cloud BI solution, where some or all of the BI stack is deployed in the cloud.

For example, a company may choose to house their data in the cloud with Redshift and also choose a SaaS BI provider (full cloud environment). Or they may use Redshift for data warehousing and deploy an on-premise BI solution (hybrid cloud environment.)

They also have the option to keep their data warehouse on-premise but go with a SaaS BI provider, which is another kind of hybrid cloud environment. Lastly, they might use Redshift plus BI that’s not SaaS but is hosted in the cloud (marrying cloud with a traditional BI license model that’s simply deployed on cloud servers rather than in-house servers). If those options aren’t enough to get you thinking, there’s also the choice of public, private or hybrid cloud servers.

You get the idea – there is a multitude of choices regarding cloud BI deployment models! But of course, it’s not a black and white choice. Items like sensitive financial data are likely to be hosted on-premise, while other data, like that contained in a dashboard application, can be kept in the cloud. Whatever an enterprise’s requirements, the vast flexibility of cloud computing makes it clear that an appropriate cloud model exists.

3. Speed of Implementation

The fact that cloud BI can be implemented much faster than a traditional on-premise solution stands as a key factor in the IT decision-making process. Cloud BI over on-premise translates to fast environment availability without the hassles of acquiring infrastructure and the delays associated with software deployment. Removing these obstacles delivers an immediate benefit in terms of reducing the duration of the BI implementation, and of course, time saved is an immediate pay-off for most businesses. But organizations should also keep in mind that customizing SaaS BI solutions can oftentimes be more complex, especially for applications that arrive pre-configured.

4. Hardware/Software Maintenance Reduction

On-premise infrastructure for business intelligence includes data warehouse appliances, BI servers, and human capital to manage and maintain hardware. For some organizations, it may be more cost-efficient to outsource these tasks to an IaaS (infrastructure-as-a-service) or PaaS (platform-as-a-service) cloud provider. This enables the enterprise to run large-scale data analysis for a fraction of the cost it would take to house and maintain the necessary physical and human capital internally. Since the need to configure, optimize and update hardware and software is handled externally, choosing a cloud BI model where infrastructure is the cloud provider’s responsibility turns out to be a more affordable option, cost-and time-wise.

The number of companies with cloud components in their BI stack is on the rise. As trust in cloud computing evolves and more businesses seek a nimble infrastructure, Cloud BI makes it possible for smaller organizations to get into the game—namely, businesses that don’t need all the extras of fully customizable platforms or don’t have an IT staff that can maintain BI in-house.

Keeping the above factors in mind, any option that sanctions the rise of analytics certainly scores our nod of approval.

Operational Analytics for Greater Efficiency and Cost Reduction

Enterprises of all sizes have embraced the ability to make smarter decisions based on the analysis of clean, timely data. These days, data analytics is commonly used across many organizational silos—from marketing and sales to finance, and even the areas of risk and fraud.

Using analytics on data-driven systems throughout the entire value chain has indeed become standard best practice. But if you ask the average knowledge worker about the benefits of data analytics, you’ll probably get a response along the lines of its ability to analyze, acquire and retain customers as a function of marketing. And with all the buzz about enhanced marketing effectiveness, it’s easy to lose sight of the cost savings and process improvement associated with Operational Analytics.

Consider this: if an enterprise generates $10 in additional revenue via a new customer (assuming a 60% profit margin), the company profits $6.00. However, if an organization realizes a $10 cost savings through operational analytics, it gains $10 in pure profit. It’s a simple yet powerful example of the value of Operational Analytics, which contributes to both the top and bottom lines of an enterprise.
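The arithmetic above can be expressed as a tiny sketch (the 60% margin is the figure assumed in the example, not a general rule):

```python
# Compare the bottom-line impact of new revenue vs. operational savings.
def profit_from_revenue(revenue, margin=0.60):
    """Profit generated by additional revenue at a given margin."""
    return revenue * margin

def profit_from_savings(savings):
    """Cost savings flow straight through to profit."""
    return savings

print(profit_from_revenue(10))  # 6.0
print(profit_from_savings(10))  # 10
```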

The technological and digital infrastructure created by Supply Chain Management (SCM) and Enterprise Resource Planning (ERP) software can be credited for ramping up the focus on Operational Analytics. Improvements in data science such as process mining (a method of analyzing business processes based on actual event logs) have certainly added to the boost as well. In fact, process mining can be used to simulate the sequence of real-time events (complete with time stamps) and alter the chain of events if need be, thus creating a stronger model of operational efficiency.
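To make the idea concrete, here is a minimal process-mining sketch over a hypothetical event log: it derives the “directly-follows” relation between activities, the basic building block used to reconstruct a process model from timestamped events.

```python
# Derive the directly-follows relation from a timestamped event log.
from collections import Counter, defaultdict

event_log = [  # (case_id, activity, timestamp) -- illustrative data
    ("order-1", "receive", 1), ("order-1", "pick", 2), ("order-1", "ship", 3),
    ("order-2", "receive", 1), ("order-2", "pick", 3), ("order-2", "ship", 4),
    ("order-3", "receive", 2), ("order-3", "ship", 5),  # pick step skipped
]

# Group events into per-case traces, ordered by timestamp.
traces = defaultdict(list)
for case, activity, ts in sorted(event_log, key=lambda e: (e[0], e[2])):
    traces[case].append(activity)

# Count how often activity A is directly followed by activity B.
follows = Counter()
for steps in traces.values():
    for a, b in zip(steps, steps[1:]):
        follows[(a, b)] += 1

for (a, b), n in sorted(follows.items()):
    print(f"{a} -> {b}: {n}")
```

Deviations from the expected flow (here, an order shipped without a pick step) show up immediately as unusual transitions in the counts.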

Over the past two decades, these technologies have allowed businesses to not only integrate and automate business processes more sharply, but also increase velocity and enhance customer relations.

Deploying analytics to enhance operational efficiencies can lead to the discovery and elimination of unnecessary costs. In fact, operational data analysis is often the key to identifying process bottlenecks and resolving them programmatically. Instead of repeatedly allotting time to troubleshoot operations, decision makers are now free to spend their time addressing strategic and tactical approaches to business management.

In fact, it’s fair to say that if SCM software is the first step in enhancing data-driven decision making, the next steps are Business Intelligence and Data Visualization. Analytics most certainly serves as the engine that drives the data-driven decision making triad.

Consider the case of a Montana-based LTL (“less than a truckload”) shipper, who wanted to improve its scheduled delivery rates by analyzing traffic and weather data. “Severe weather conditions are an everyday part of life in the northwestern region,” stated the owner. “My carriers are often at great risk, as they strive to provide on-time, day-definite delivery service.”

Utilizing operational analytics that provided instantaneous reports on a specific region’s traffic and weather patterns helped this operation avert severe weather and road conditions and dramatically decreased crash rates. For this small business owner, the results were deemed “priceless.”

Consider also the case of a mining company that operates thousands of dollars’ worth of trucks and extractors and has amassed a treasure trove of data—including volumes of information regarding fuel usage, truck load allowances and repair history.

By plugging that data into an analytics framework, they’ve been able to optimize the entire operation, simply by pinpointing patterns and ascertaining risk assessments, such as equipment failure and the associated costs.

Oftentimes in a mining operation, it’s not a sole component or KPI that predicts failure, but rather a combination of factors such as load size and daily equipment wear and tear; thus, deriving a multitude of leading indicators to base predictions on proves useful. For example, tracking small items such as metal parts and the amount of stress they can endure before they fail has provided this enterprise with several thousand dollars in productivity gains.

If that sounds excessive, consider the cost of having an excavator malfunction in a field and being out of commission for a week; or perhaps the cost of losing a hauling truck for the same duration. Both setbacks to a week’s work could translate to thousands of dollars in lost productivity.
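A toy sketch of the multi-indicator idea described above: several leading indicators are scaled and combined into a single failure-risk score. The weights and thresholds here are illustrative assumptions, not real maintenance parameters.

```python
# Combine several leading indicators into one failure-risk score (0-1).
def failure_risk(load_ratio, wear_hours, part_stress_ratio):
    """Weighted combination of leading indicators, each capped at 1.0.

    load_ratio        -- current load vs. the truck's load allowance
    wear_hours        -- hours of operation since last overhaul
    part_stress_ratio -- measured stress vs. a part's endurance limit
    (weights and the 10,000-hour scale are illustrative assumptions)
    """
    return (0.4 * min(load_ratio, 1.0)
            + 0.3 * min(wear_hours / 10_000, 1.0)
            + 0.3 * min(part_stress_ratio, 1.0))

# A truck running near its load limit with heavily worn parts:
risk = failure_risk(load_ratio=0.95, wear_hours=9_000, part_stress_ratio=0.9)
print(f"risk score: {risk:.2f}")  # scores near 1.0 suggest scheduling maintenance
```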

With the use of predictive analytics, both disasters are now easily averted. “Taking our data from a wide variety of sources and ‘crunching it,’ if you will — has truly improved our entire enterprise,” states the Operations Manager. “Being able to make detailed predictions regarding key, specific pieces of equipment in real-time has proved essential to optimizing our operations and preventing disaster. It has certainly been worth the investment.”

What are your thoughts? Do you believe that more agile Operational Analytics can free up your time for higher value-add activities? From our experience and the case studies above, we’ve certainly seen evidence that data analytics is the key to correcting issues upfront, and enabling smarter, more fluid operational processes—we certainly welcome your thoughts on all aspects of this discussion.

The Difference Between Econometric Modeling and Machine Learning

I was talking shop the other day with a colleague who also runs a big data analytics firm. One of the things he mentioned briefly was econometric modeling vs. machine learning. I don’t know if it’s applicable or substantive enough for our potential audience, but it may be.

Essentially, he said econometrics is great but not of much interest in his world because the focus is on WHY things happen; it’s “explanatory” in nature. His attention is focused more on machine learning because it is “predictive” in nature. He and his customers aren’t too concerned about the “why”; they are more interested in knowing where things are going next and, given enough time, figuring out how to address that before it happens.

Frankly, this is not a new debate. The difference between computational statistics and statistical computing is just one more analogy to the debate above. Prior to the current Big Data explosion, statistics and computer science operated in well-defined silos at both universities and organizations. Now there is a convergence between the two to get what is needed: explain why the customer is acting in a particular way, and forecast what they will want next.

Enter the twin paradigms of econometric modeling and machine learning. At first glance they show similarities as well as differences. Some techniques, like regression modeling, are taught in both courses. Yet they are different by definition: econometric models are statistical models used in econometrics. An econometric model specifies the statistical relationship believed to hold between the various economic quantities pertaining to a particular economic phenomenon under study.

Machine learning, on the other hand, is a scientific discipline that explores the construction and study of algorithms that can learn from data. So that makes a clear distinction, right? If it learns on its own from data, it is machine learning. If it is used for economic phenomena, it is an econometric model. However, the confusion arises in the way these two paradigms are championed. The computer science major will always say machine learning, and the statistics major will always emphasize modeling. Since computer science majors now rule at Facebook, Google and almost every technology company, you would think that machine learning is dominating the field and beating poor old econometric modeling.

But what if you can make econometric models learn from data?

Let’s dig deeper into these algorithms. The way machine learning works is to optimize some particular quantity, say cost. A loss function or cost function maps the values of one or more variables onto a number intuitively representing some “cost” associated with an event. An optimization problem seeks to minimize a loss function, and machine learning methods frequently use optimization to pick the best of many alternatives.
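A minimal sketch of that optimization loop: gradient descent driving a simple quadratic loss toward its minimum. The loss function and learning rate are illustrative choices, not tied to any particular library.

```python
# Gradient descent minimizing a simple quadratic loss function.
def loss(w):
    return (w - 3.0) ** 2       # minimized at w = 3

def grad(w):
    return 2.0 * (w - 3.0)      # derivative of the loss

w = 0.0                          # starting guess
learning_rate = 0.1
for _ in range(200):
    w -= learning_rate * grad(w)  # step downhill along the gradient

print(round(w, 4))  # converges toward 3.0
```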

Now, cost or loss holds a different meaning in econometric modeling. In econometric modeling we are trying to minimize the error, typically the root mean squared error (RMSE): the square root of the mean of the squared errors, where an error is the difference between the actual value and the value predicted by the model for previous data.
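The RMSE definition above translates directly to code (the sample values are illustrative):

```python
# Root mean squared error: square root of the mean of squared errors.
import math

def rmse(actual, predicted):
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(errors) / len(errors))

print(rmse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))
```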

The difference in jargon lies solely in the way statisticians and computer scientists are trained. Computer scientists try to account for both actual error and computational cost, that is, the time taken to run a particular algorithm. Statisticians, on the other hand, are trained primarily to think in terms of confidence levels, or error in terms of predicted versus actual values, without worrying about the time the model takes to run.

That is why data science is often defined as the intersection of hacking skills (computer science) and statistical knowledge (and math). Something like K-means clustering, just like regression, can be taught in two different ways based on these two approaches. I wrote back to my colleague in Marketing: we have data scientists, trained in both econometric modeling and machine learning. I looked back and had a beer. If university professors don’t shed their departmental attitudes towards data science, we will very shortly have a very confused set of students arguing without knowing how close they actually are.
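Since K-means came up as an algorithm taught both ways, here is a bare-bones one-dimensional version (k=2, toy data) written with no libraries, so both the “statistical” view (cluster means) and the “computational” view (the iterative loop) are visible at once:

```python
# Bare-bones 1-D k-means: alternate assignment and update steps.
def kmeans_1d(points, centers, iterations=20):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print(sorted(kmeans_1d(data, centers=[0.0, 5.0])))  # centers near 1.0 and 10.0
```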

So the next time someone tries to argue machine learning VERSUS econometric modeling, take them here (http://cran.r-project.org/web/views/MachineLearning.html) and here (http://cran.r-project.org/web/views/Econometrics.html). Smile and say, “Don’t worry. We’ve got both of them, here.”
