BigDataHerd

From Napoleon to Tableau: A Brief History of Data Visualization

The Big Data revolution has greatly increased the need for data visualization. Historically, data visualization evolved through the work of a handful of noted practitioners. William Playfair, the founder of graphical methods in statistics, invented four types of graphs: the line graph and bar chart of economic data (1786), and the pie chart and circle graph (1801). Joseph Priestley created the first timeline charts, in which individual bars were used to visualize the life span of a person (1765). That’s right, timelines were invented 250 years ago, and not by Facebook!

Among the most famous early data visualizations is Napoleon’s March as depicted by Charles Minard. The visualization packs extensive information about the effect of temperature on Napoleon’s invasion of Russia into a single timeline. The graphic is notable for representing six types of data in two dimensions: the number of Napoleon’s troops, distance, temperature, latitude and longitude, direction of travel, and location relative to specific dates.

Minard’s graphic of Napoleon’s march

 

Florence Nightingale was also a pioneer in data visualization. She drew coxcomb charts depicting the effect of disease on troop mortality during the Crimean War (1858).

The use of maps in graphs, or spatial analytics, was pioneered by John Snow (not the one from Game of Thrones!). His map plotted deaths from the 1854 cholera outbreak in London against the locations of public water pumps, and it helped pinpoint the outbreak to a single pump on Broad Street.

John Snow’s map of the London cholera outbreak

If you are interested in learning more about the history of data visualization, you can see more references at http://www.datavis.ca/gallery/historical.php . For ancient data visualization, including the times of the Romans and Egyptians, see http://data-art.net/resources/history_of_vis.php

Alternatively, you can try the R package HistData, which provides a collection of small data sets that are interesting and important in the history of statistics and data visualization.

“Lies, damned lies, and statistics” is a phrase describing the persuasive power of numbers, particularly the use of statistics to bolster a weak argument. Anscombe’s famous case study further underlined the need for data visualization. In his influential paper “Graphs in Statistical Analysis”, F. J. Anscombe constructed a quartet of datasets that have nearly identical descriptive statistics, yet look very different when presented graphically.
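To see why the quartet became such a persuasive argument for plotting your data, here is a minimal Python sketch (the data values are the ones Anscombe published in 1973) showing that the four sets share almost identical summary statistics:

```python
# Anscombe's quartet: four data sets with nearly identical summary statistics
# that only reveal how different they are when plotted.
import numpy as np

quartet = {
    "I":   ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.array(x), np.array(y)
    print(f"Set {name}: mean(x)={x.mean():.2f}  mean(y)={y.mean():.2f}  "
          f"var(x)={x.var(ddof=1):.2f}  var(y)={y.var(ddof=1):.2f}  "
          f"corr={np.corrcoef(x, y)[0, 1]:.3f}")

# Every set prints roughly the same numbers (mean x = 9, mean y about 7.5,
# correlation about 0.816), yet scatter plots of the four sets look nothing alike.
```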

Coming down to the modern era, the work of Edward Tufte has been seminal in establishing data visualization as a science. Tufte wrote the influential book The Visual Display of Quantitative Information.

Key concepts of Tufte’s principles are:

  • The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented.
  • Above all else show the data: A large share of ink on a graphic should present data-information, the ink changing as the data change.
  • Maximize data density: most graphs can be shrunk way down without losing legibility or information.

Tufte created sparklines. Whereas the typical chart is designed to show as much data as possible, and is set off from the flow of text, sparklines are intended to be compact. The sparkline should be about the same height as the text around it.

Sparklines

The image above shows an example of sparklines. Their compactness has made them especially useful for stock exchange time series data.
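As a quick illustration of the idea, here is a minimal matplotlib sketch (the figure size, styling, and random stand-in data are my own choices, not Tufte’s specification) of a word-sized sparkline:

```python
# A word-sized sparkline: a tiny, axis-free line chart about the height of the
# surrounding text, drawn here from a made-up stand-in for a stock price series.
import numpy as np
import matplotlib.pyplot as plt

prices = 100 + np.cumsum(np.random.randn(250))    # pretend daily closing prices

fig, ax = plt.subplots(figsize=(1.5, 0.25))       # very small canvas
ax.plot(prices, linewidth=0.8)
ax.plot(len(prices) - 1, prices[-1], "r.")        # mark the latest value
ax.axis("off")                                    # no axes, ticks, or frame
fig.savefig("sparkline.png", dpi=200, bbox_inches="tight")
```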

The bullet graph below is a variation of the bar graph developed by Stephen Few. Few helped establish data visualization as a practical tool for business, particularly for the design of dashboards.

Bullet graph

 

Today Stephen Few’s work informs visualization software like that developed by Tableau Software. The eight core principles espoused by Few are:

  • Simplify – Good data visualization captures the essence of the data without oversimplifying.
  • Compare – We need to be able to compare our data visualizations side by side.
  • Attend – The tool needs to make it easy for us to attend to the data that’s really important.
  • Explore – Data visualization tools should let us just look. Not just to answer a specific question, but to explore data and discover things.
  • View Diversely – Different views of the same data provide different insights.
  • Ask why – More than knowing “what’s happening”, we need to know “why it’s happening”.
  • Be skeptical – We too rarely question the answers we get from our data because traditional tools have made data analysis so hard. We accept the first answer we get simply because exploring any further is too hard.
  • Respond – It’s the ability to share our data that leads to value.

Isaac Newton once wrote, “If I have seen further it is by standing on the shoulders of Giants”. It is by standing on the work of data visualization gurus, from Charles Minard to Stephen Few, that we can see further into the ever increasing amounts of data that businesses and technology put in front of us.


To Schema On Read or to Schema On Write, That is the Hadoop Data Lake Question

By Paige Roberts

Whether ‘tis nobler to suffer the slings and arrows of outrageous data structures, or to structure, to query, perchance to find answers from specific data more quickly. That is the quandary of the Hadoop data lake.

The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data, if you have no idea what lies under the shiny surface.

At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write.

Before we dig deeper into this Hadoop data lake conundrum, let’s start with some definitions.

What is Schema on Write?

Schema on write has been the standard for many years in relational databases. Before any data is written to the database, the structure of that data is strictly defined, and that metadata is stored and tracked. Irrelevant data is discarded; data types, lengths, and positions are all delineated. The schema (the columns, rows, tables, and relationships) is defined up front for the specific purpose that database will serve. Then the data is filled into its pre-defined positions. The data must all be cleansed, transformed, and made to fit that structure before it can be stored, in a process generally referred to as ETL (Extract, Transform, Load).

That is why it is called “schema on write”: the data structure is already defined when the data is written and stored. For a very long time, it was believed that this was the only right way to manage data.
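As a minimal sketch of the idea (using Python’s built-in sqlite3 as a stand-in for any schema-on-write store; the table, columns, and sample rows are made up for the illustration), the structure is declared before a single row is loaded, and anything that doesn’t fit is cleaned up or rejected first:

```python
# Schema on write, sketched with sqlite3: define the structure first,
# transform incoming data to fit it, then load it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount_usd REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")

raw_rows = [
    {"id": "1", "cust": "acme ",  "amt": "19.99",     "date": "2015-06-01"},
    {"id": "2", "cust": "globex", "amt": "bad-value", "date": "2015-06-02"},
]

for row in raw_rows:
    try:
        # The "T" in ETL: coerce, trim, and validate before anything is written.
        cleaned = (int(row["id"]), row["cust"].strip().title(),
                   float(row["amt"]), row["date"])
    except ValueError:
        continue  # rows that do not fit the schema are rejected up front

    conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", cleaned)

# Queries are simple and fast because the structure is already known.
print(conn.execute("SELECT customer, amount_usd FROM orders").fetchall())
```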

But there are more things in heaven and earth than are dreamt of in that philosophy.

What is Schema on Read?

Schema on read is the revolutionary concept that you don’t have to know what you’re going to do with your data before you store it. Data of many types, sizes, shapes, and structures can all be thrown willy-nilly into the Hadoop Distributed File System and other Hadoop data storage systems. While some metadata (data about that data) needs to be stored so that you know what’s in there, you don’t yet know how it will be structured. It is entirely possible that data stored for one purpose might even be used for a completely different purpose than originally intended.

The data is stored without first deciding what piece of information will be important, what should be used as a unique identifier, or what part of the data needs to be summed and aggregated to be useful. Therefore, the data is stored in its original granular form, with nothing thrown away because it is unimportant, nothing consolidated into a composite, and nothing defined as key information.

In fact, no structural information is defined at all when the data is stored.

When someone is ready to use that data, then, at that time, they define what pieces are essential to their purpose. They define where to find those pieces of information that matter for that purpose, and which pieces of the data set to ignore.

This is why it is called “schema on read” since the schema is defined at the time the data is read and used, not at the time that it is written and stored.
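Here is a minimal Python sketch of that idea (the file name, event types, and fields are hypothetical): raw records of different shapes are dumped to storage as-is, and only the code that reads them decides which fields matter:

```python
# Schema on read, sketched with plain JSON lines: store everything as-is,
# and impose a structure only when somebody asks a question.
import json

# "Write" step: raw events of different shapes land in the same file, untouched.
with open("events.jsonl", "w") as f:
    f.write(json.dumps({"type": "click", "user": "u1", "page": "/home"}) + "\n")
    f.write(json.dumps({"type": "purchase", "user": "u2", "amount_usd": 19.99}) + "\n")

# "Read" step: the schema (which fields matter and how to interpret them)
# is decided here, at query time, for this particular question.
total = 0.0
with open("events.jsonl") as f:
    for line in f:
        event = json.loads(line)
        if event.get("type") == "purchase":
            total += float(event.get("amount_usd", 0))

print(f"Revenue sitting in the lake: ${total:.2f}")
```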

Advantages of Schema on Write

The main advantages of schema on write are precision and query speed.

Because you define your data structure ahead of time, when you query, you know exactly where your data is. The structure is generally optimized for the fastest possible return of data for the types of questions the data store was designed to answer. This means you write very simple SQL and get back very fast answers.

In addition, before data is stored in a database, the data must go through a rigorous process to make sure it matches the structure exactly, and will serve the purpose of the database as it is meant to. The data’s quality is checked and enhanced or scrubbed. Duplicates are found and resolved. The data is checked against business rules to make certain it is valid and useful for the purpose defined. This means that the answers you get from querying this data are sharply defined, precise and trustworthy, with little margin for error if your ETL processes and your validation checking have done their job.
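To make that cleansing step concrete, here is a small Python sketch (the record fields and the business rule are made up for the illustration) of the kind of de-duplication and validation that happens before a schema-on-write load:

```python
# Pre-load data quality checks: resolve duplicates, then apply business rules.
raw_records = [
    {"order_id": 1, "amount_usd": 19.99, "country": "US"},
    {"order_id": 1, "amount_usd": 19.99, "country": "US"},   # duplicate
    {"order_id": 2, "amount_usd": -5.00, "country": "DE"},   # breaks a rule
]

seen_ids = set()
clean, rejected = [], []
for record in raw_records:
    if record["order_id"] in seen_ids:
        continue                         # duplicates are resolved before loading
    seen_ids.add(record["order_id"])
    if record["amount_usd"] <= 0:        # business rule: order amounts are positive
        rejected.append(record)
        continue
    clean.append(record)

print(f"{len(clean)} rows ready to load, {len(rejected)} sent back for review")
```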

Advantages of Schema on Read

The main advantages of schema on read are flexibility in purpose and query power.

Because your data is stored in its original form, nothing is discarded, or altered for a specific purpose. This means that your query capabilities are very flexible. You can ask any question that the original data set might hold answers for, not just the type of questions a data store was originally created to answer. You have the flexibility to ask things you hadn’t even thought of when the data was stored.

Also, different types of data generated by different sources can be stored in the same place. This allows you to query multiple data stores and types at once. If the answer you need isn’t in the data you originally thought it would be in, perhaps it could be found by combining it with other data sources. The power of this ability should not be underestimated. This is what makes the Hadoop data lake concept, which puts all your available data sets in their original form in a single location, such a potent one.
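To make the combining point concrete, here is a small Python sketch (file names and fields are hypothetical, continuing the earlier event-log example) that joins raw JSON events with a CSV of customer records, two data sets that were never stored with each other in mind:

```python
# Two raw data sets of different shapes, stored separately and never designed
# to be joined, combined at read time to answer a brand-new question.
import csv
import json

# A CSV of customers from one source system (expects columns: user, region)...
customers = {}
with open("customers.csv", newline="") as f:
    for row in csv.DictReader(f):
        customers[row["user"]] = row["region"]

# ...joined on the fly with raw JSON events from another system.
revenue_by_region = {}
with open("events.jsonl") as f:
    for line in f:
        event = json.loads(line)
        if event.get("type") == "purchase":
            region = customers.get(event["user"], "unknown")
            revenue_by_region[region] = (revenue_by_region.get(region, 0.0)
                                         + float(event["amount_usd"]))

print(revenue_by_region)
```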

Disadvantages of Schema on Write

The main disadvantages of schema on write are query limitations and inflexible purpose.

The dark side of the tightly controlled precision of a schema on write data store is that the data has been altered and structured to serve one specific purpose. Chances are high that, if another purpose is found for that data, the data store will not suit it well. All the speed that you got from customizing the data structure to match a specific problem set will cost you if you try to use it for a different problem set. And there’s no guarantee that the altered version of the data will even be useful at all for the new, unanticipated need. There’s no ability to query the data in its original form, and certainly no ability to query any other data set that isn’t in the structured format.

Also, to fit the data into the structure, ETL processes and validation rules are needed to clean, de-dupe, check, and transform that data. Those processes take time to build, time to execute, and time to alter if you need to change them to suit a different purpose.

There is always a time cost to imposing a schema on data. In schema on write strategies, that time cost is paid in the data loading stage.

Disadvantages of Schema on Read

The main disadvantages of schema on read are inaccuracies and slow query speed.

Since the data is not subjected to rigorous ETL and data cleansing processes, nor does it pass through any validation, it may be riddled with missing or invalid values, duplicates, and a bunch of other problems that can lead to inaccurate or incomplete query results.

In addition, since the structure must be defined when the data is queried, the SQL queries tend to be very complex. They take time to write, and even more time to execute.

As I said before, there is always a time cost to imposing schema. In schema on read strategies, that time cost is paid when you query the data.

So, Which Should I Choose, To Schema on Read or To Schema on Write?

That is the question. In my next blog post, I’ll look at some examples of schema on read and schema on write Hadoop SQL technologies, and discuss what criteria a person should use to choose between the two. Until then, remember this above all, to thine own business goals be true.

————–

Paige Roberts has spent a lot of her life stuffing her brain full of information about big data, data integration, data quality, and analytics software, markets, and systems, and is likely to tell you about it in great detail if you don’t run away fast enough. Find her on Twitter or LinkedIn at RobertsPaige, or check out her blog at bigdatapage.com.


To Schema On Read or to Schema On Write – Part 2

By Paige Roberts

In my previous post, we pondered the merits of the philosophies of schema on read and schema on write (link to previous post). But schema, wherefore art thou schema? Would not a Hadoop data lake by any other strategy be as useful?

Let’s have a look at some of the SQL-in-Hadoop technologies used in data lakes today. Before you settle on a strategy, it helps to know which Hadoop ecosystem technologies use which approach.

Examples of the Two Schema Strategies in Hadoop Ecosystem Technologies

Drill is probably the best example of a pure schema on read SQL engine in the Hadoop ecosystem today. It gives you the power to query a broad set of data, from a wide variety of different data stores, including hierarchical data such as JSON and XML. It also gives you the flexibility to ask any question of that data. Drill is still a young technology, but it shows tremendous promise as the ideal data exploration tool for the Hadoop data lake.
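As a hedged sketch of what that looks like in practice (assuming a Drill instance on its default port 8047; the file path and field names are hypothetical, so adjust both for a real cluster), the SQL below is sent to Drill’s REST API and queries raw JSON files in place, with no table defined in advance:

```python
# Querying raw JSON files in place through Apache Drill's REST API.
# Assumes a Drill instance on localhost:8047; the file path is hypothetical.
import requests

query = """
    SELECT t.`user`, t.amount_usd
    FROM dfs.`/data/lake/events.jsonl` AS t
    WHERE t.`type` = 'purchase'
"""

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": query},
)
resp.raise_for_status()

for row in resp.json().get("rows", []):
    print(row)
```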

Hive is the original schema on read technology, but it is, in fact, a marvelous hybrid of the two strategies. It can run queries over a broad set of data types and sources. However, like Drill, it has speed issues due to the complexity of imposing schema at query time. Under the covers, it used to generate MapReduce to essentially do ETL at query time, but due to the limitations of MapReduce, many Hive implementations have moved to using Tez for that same purpose. This has given Hive a nice speed boost.

In order for Hive to gain the advantages of a schema on write data store, the ORC file format was created. This is a pre-structured format optimized for Hive queries. By combining strategies, Hive has gained many of the advantages of both camps.
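As a hedged sketch of that hybrid (run here through PySpark’s HiveQL support and assuming a Hive metastore is available; the table names, columns, and paths are hypothetical), an external table simply lays a schema over files already sitting in the lake, while an ORC table pays the structuring cost once at load time:

```python
# Hive's two faces, sketched through PySpark's HiveQL support.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-schema-strategies")
         .enableHiveSupport()
         .getOrCreate())

# Schema on read: an EXTERNAL table is just a schema laid over files that are
# already in the lake; nothing is moved or rewritten.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (line STRING)
    STORED AS TEXTFILE
    LOCATION '/data/lake/events/'
""")

# Schema on write: copying the same data into an ORC table pays the
# structuring cost once, at load time, in exchange for faster queries later.
spark.sql("""
    CREATE TABLE IF NOT EXISTS purchases_orc (customer STRING, amount_usd DOUBLE)
    STORED AS ORC
""")
spark.sql("""
    INSERT INTO purchases_orc
    SELECT get_json_object(line, '$.user'),
           CAST(get_json_object(line, '$.amount_usd') AS DOUBLE)
    FROM raw_events
    WHERE get_json_object(line, '$.type') = 'purchase'
""")
```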

Facebook’s Presto has a sort of schema on read technology that gives you the ability to query a lot of different data sets, except that there is an intermediate step before querying in which schemas are defined or fetched from the various source data sets. So it’s a bit of a hybrid as well, and it has shown some really impressive speed and flexibility because of it.

Spark SQL is entirely a schema on write technology, but it has a unique way of short-cutting a lot of the slow ETL processes normally associated with schema on write. Spark itself is not just an in-memory database format, but also a data processing engine that can do high speed ETL processes entirely in memory. This shortens the up-front cost of the schema on write strategy. Like Hive, Spark SQL often does high speed ETL in the background at query time.
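A minimal PySpark sketch of that pattern (the path and field names are hypothetical): the schema is applied while the data is loaded and transformed in memory, and the SQL that follows stays simple:

```python
# Spark SQL: impose the schema during an in-memory load-and-transform step,
# then query the structured result with plain SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("in-memory-schema-on-write").getOrCreate()

# Load raw JSON events and shape them in memory (the high-speed "ETL" step).
purchases = (spark.read.json("/data/lake/events/")
             .where(F.col("type") == "purchase")
             .select("user", F.col("amount_usd").cast("double").alias("amount_usd")))

purchases.createOrReplaceTempView("purchases")

# Because the structure is now defined, the query itself stays simple.
spark.sql("""
    SELECT `user`, SUM(amount_usd) AS revenue
    FROM purchases
    GROUP BY `user`
""").show()
```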

There are several other schema on write options, each with its own way of doing things that offers some unique advantage or disadvantage: Impala with Parquet, Actian Vortex with Vector in Hadoop, IBM Big SQL with BigInsights, HAWQ with Greenplum, and the list goes on.

There are, of course, the NoSQL databases as well, which have their own ways of handling the schema dilemma, but that’s a whole other blog post.

So, Which One Should I Choose?

As to which one to choose, that depends entirely on what your goals are, what balcony you want to climb, and which features are important to accomplishing those goals.

Schema on read options tend to be a better choice for exploration, for “unknown unknowns,” when you don’t know what kind of questions you might want to ask, or when the kinds of questions might change over time. They’re also a better option when you don’t have a strong need for immediate responses. They’re ideal for data exploration projects and for looking for new insights with no specific goal in mind.

Schema on write options tend to be very efficient for “known unknowns.” When you know what questions you’re going to need to ask, especially if you will need the answers fast, schema on write is the only sensible way to go. This strategy works best for old school BI types of scenarios on new school big data sets.

In the end, you must decide what you need most, flexibility or precision, speed or power, or some combination of each. Start by deciding what your business most needs to accomplish with its new Hadoop data lake.

No one says that you have to choose only one option. The many parts of the Hadoop ecosystem are designed to live in harmony, with YARN keeping the peace. Just remember that choosing to go with a Hadoop data lake may mean some trade-offs, but it doesn’t mean giving up SQL.

A good SQL query by any other name still can’t be beat.

———–

Paige Roberts has spent a lot of her life stuffing her brain full of information about big data, data integration, data quality, and analytics software, markets, and systems, and is likely to tell you about it in great detail if you don’t run away fast enough. Find her on Twitter or LinkedIn at RobertsPaige, or check out her blog at bigdatapage.com.

