From the course: Data Literacy: Exploring and Describing Data
Visual primacy: The importance of starting with pictures
- [Narrator] I'm willing to bet that when I talk about data, most people imagine rows and columns and numbers. And while those are definitely part of working with data, it's sometimes surprising to people to learn that looking at pictures, pictures of data, is often more important, more useful, and really should be the first priority when we're learning to work with data. And so that's where I always recommend that we start when dealing with data.
To give this a little historical context, let's take a look at one of the most influential graphics in the history of epidemiology, or the study of the causes and spread of diseases. This map, which shows a section of London, was created by the English physician John Snow to help solve the cholera epidemic in London in 1854. What each of these squares shows is where a case of cholera was diagnosed. Now, if we zoom in a little bit, you'll see a whole bunch of squares here. There's a significant concentration of cases of cholera right here on Broad Street. And right at the center of those was a public water pump that had contaminated water. As a result of this graphical analysis, Snow had the handle of the water pump removed so people couldn't use it. And that signaled the beginning of the end of the cholera epidemic in London. And so that lets you know that graphics not only can give you insight into data, they can actually help save lives in certain situations.
To emphasize the importance of graphical analysis, I want to share two quotes from a person I've quoted elsewhere, statistician John Tukey, who is really one of the most important figures in the practice of data visualization. First, Tukey said, "The greatest value of a picture is when it forces us to notice what we never expected to see."
Again, it's going to cause you to notice anomalies, things that stand out. And in another line, Tukey said this: "Numerical quantities," that's statistical things like the mean and the standard deviation, "focus on expected values," the things that you anticipate being there, "graphical summaries on unexpected values." They will show you, again, what you did not expect to see.
And there are a lot of good reasons to begin with graphics of different kinds. So here are some of the advantages. First off, it's easy to spot patterns. You can see clusters, you can see big gaps, you can see exceptions to the overall pattern. Also, as Tukey said, they're really good for finding the unexpected, getting something that really is undeniable. And of course, graphics and visualizations in general are good not just for discovery, but for communication and sharing your insights with other people. That's another big purpose of graphics.
And the reason you want to start with them is that they're holistic. That means they give you the big picture. Numerical summaries are simplifications. Now, that's intentional. That's not a mark against them. It's just saying that they reduce the bandwidth of the data to help you make sense of it. Graphics don't always have to do that. There are times when they can actually show you every single observation. And this is important because very different data sets, that is, data sets with very different numbers and patterns in them, can have identical summaries. And those differences, while you may not be able to see them when you're looking at summary statistics, become abundantly clear when you use a visualization.
To show you how that works, I'm going to share with you Anscombe's quartet, which is a set of four data collections. And let me show you how these work. So what we have here are four sets of data, each with an X and a Y variable. We have X1 and Y1, and X2 and Y2, and so on and so forth.
And you can see there are a bunch of different numbers there. But the important thing about Anscombe's quartet is that he developed it very specifically so that there were some important similarities across these data sets. So if we come down here to the bottom, we see that all of them have 11 observations on each variable. Second, you can see that the mean on X is the same for all of them, and the mean on Y is the same for all of them. Also, something called the standard deviation, which we'll talk about a little bit later and which has to do with how spread out the data is, is identical for every X and every Y. This measure is a correlation coefficient. Again, I'll talk about it later, but it describes the strength of the relationship between the two variables, and it's identical in all four sets. And then these last two numbers have to do with a regression line, drawing a line through the data. And these two values, the intercept and the slope, are identical. So on these six different things, these four data sets are identical.
And the problem is that if you were just doing a numerical analysis, you might think that the data sets themselves are identical. But then you do the graphics. Here's the first data set. This is a scatter plot. And you can see that we've got this line going up in the middle. That's our regression line. And we've got these dots that are scattered a little bit randomly around it. This is what we expect to see when we're looking at data. But the second pair has this perfect curved pattern. You can draw a straight line through it, but the straight line doesn't fit it, right? And yet, based on the summaries that we have, it's identical in many ways. Then we have this other pattern, which has a perfectly linear relationship except for this one dot showing way up at the top, which appears to be an anomaly. Maybe it got recorded wrong, or maybe something strange is going on there.
And then, for a very strange data set, there's this one, where you have all these dots on the left that all have the same value on X, and then you've got one way off to the right, an enormous outlier. And so these four data sets are obviously very different from one another. By looking at them, it's really easy to see, even though it's hard to tell by looking at the tables of numbers.
Now, in the sections that follow, we're going to look at a few different kinds of graphs. First, we're going to look at graphs for categories. That's when you're counting how many cases or observations there are in a particular bucket. Those graphs include bar charts, grouped bar charts, and pie charts. So those are three very common tools. Next, we're going to look at some graphs that are used for quantities, or things that you measure on a scale, and those include dot plots, box plots, and histograms. And then third, we're going to look at a few visualizations that work well for looking at the association between variables. Those include line charts and sparklines as well as scatter plots. And each of these charts is easy to do. Most of them come prebuilt in spreadsheets. They're all very fundamental things. And it turns out they're going to give you an enormous amount of insight into the data that you have.
And so, as an overview of the next several sections, we're going to look at each of these kinds of charts and see what kinds of questions each one of them can answer. What can you get out of it? We'll talk about what to look for when you create the chart: where are the most informative elements in it? And we'll talk about how to interpret the results and especially how to apply them, how to get something actionable and useful out of the data that you've got in that very basic chart.
And so, again, it's always important to begin by simply looking. See what's there, do some basic graphs, and you'll get the overall picture of your data and potentially some very surprising insights that can guide you in your further analyses.
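If you'd like to check the Anscombe's quartet comparison yourself, here is a minimal Python sketch. It assumes the data values as Anscombe published them in 1973. The names `QUARTET` and `summarize` are made up for this illustration and are not part of the course's exercise files. It computes the same summaries described above (the count, the means, the standard deviations, the correlation, and the slope and intercept of the regression line) for each of the four pairs.

```python
from math import sqrt
from statistics import mean, stdev

# Anscombe's quartet (1973). Sets I-III share the same X values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the numerical summaries discussed in the transcript."""
    mx, my = mean(x), mean(y)
    # Sums of squares and cross-products around the means.
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx  # least-squares slope of Y on X
    return {
        "n": len(x),
        "mean_x": mx,
        "mean_y": my,
        "sd_x": stdev(x),  # sample standard deviation
        "sd_y": stdev(y),
        "r": sxy / sqrt(sxx * syy),  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


if __name__ == "__main__":
    for name, (x, y) in QUARTET.items():
        stats = summarize(x, y)
        row = ", ".join(f"{k}={v:.2f}" for k, v in stats.items())
        print(f"Set {name}: {row}")
```

Rounded to two decimal places, the four rows should come out essentially the same, which is exactly the trap: nothing in those numbers hints at the curve in the second set or the outliers in the third and fourth. Feeding each pair to a scatter plot, in a spreadsheet or any plotting tool you like, is what brings the differences out.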