Anscombe’s quartet is a set of four datasets, each with two variables (x and y) and 11 observations. It has been used to demonstrate the importance of graphically displaying data. It has appeared not only in books (for example, on the first page of the first chapter of Tufte’s seminal work, The Visual Display of Quantitative Information), but also in scholarly papers (for example, see Healy and Moody, 2014) and blog posts (for example, see Hirst). Here, I use ggvis within a shiny application to explore the quartet. The code for the post and the accompanying shiny app can be found on my GitHub site.

I must acknowledge the work of the team at RStudio, which has provided all of the packages used for this post.

Anscombe’s Quartet

library(ggplot2)
library(dplyr)
library(ggvis)
library(knitr)
# library(shiny)
anscombe <- as.data.frame(anscombe)
anscombereorder <- anscombe[, c(1, 5, 2, 6, 3, 7, 4, 8)]
kable(anscombereorder, format = "html", table.attr = "cellpadding=\"7\"", 
    row.names = TRUE)
obs  x1     y1  x2     y2  x3     y3  x4     y4
1    10   8.04  10   9.14  10   7.46   8   6.58
2     8   6.95   8   8.14   8   6.77   8   5.76
3    13   7.58  13   8.74  13  12.74   8   7.71
4     9   8.81   9   8.77   9   7.11   8   8.84
5    11   8.33  11   9.26  11   7.81   8   8.47
6    14   9.96  14   8.10  14   8.84   8   7.04
7     6   7.24   6   6.13   6   6.08   8   5.25
8     4   4.26   4   3.10   4   5.39  19  12.50
9    12  10.84  12   9.13  12   8.15   8   5.56
10    7   4.82   7   7.26   7   6.42   8   7.91
11    5   5.68   5   4.74   5   5.73   8   6.89

The quartet consists of four datasets, each with two variables, x and y. They are displayed above as the pairs x1 and y1, x2 and y2, x3 and y3, and x4 and y4.

Beauty of the Quartet

The basic summary statistics of these datasets (the means and variances of x and y, and the correlation between them) are almost identical, as the table below shows. However, when the datasets are graphed, the differences between them are clearly visible.

anscombelong <- data.frame(x = unlist(anscombe[, 1:4]), y = unlist(anscombe[, 
    5:8]), datasource = rep(1:4, each = 11))
kable(anscombelong %>% group_by(datasource) %>% summarise(`x-mean` = mean(x), 
    `y-mean` = mean(y), `x-variance` = var(x), `y-variance` = var(y), `correlation-xy` = cor(x, 
        y)), table.attr = "cellpadding=\"3\"", format = "html", row.names = FALSE)
datasource  x-mean  y-mean  x-variance  y-variance  correlation-xy
1                9   7.501          11       4.127          0.8164
2                9   7.501          11       4.128          0.8162
3                9   7.500          11       4.123          0.8163
4                9   7.501          11       4.123          0.8165

The plots…

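# lm_eqn() is a helper that is not shown in this post; it formats the fitted equation.
# theme_bw() comes before theme() so that it does not overwrite the legend setting.
ggplot(anscombelong, aes(x = x, y = y)) + geom_point() + geom_smooth(method = "lm", 
    se = FALSE) + annotate("text", x = 12, y = 4, label = lm_eqn(lm(y ~ 
    x, anscombelong)), color = "black", parse = TRUE) + ylim(3, 13) + xlim(4, 
    19) + facet_wrap(~datasource) + theme_bw() + theme(legend.position = "none")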

[Figure: scatter plots of the four datasets, one panel per dataset, each with its fitted regression line]

The fitted linear regression lines are practically identical across all four datasets: in each case the equation is approximately y = 3.00 + 0.50x.
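One way to check this claim directly is to fit a separate regression for each dataset and compare the coefficients. The following is a minimal sketch using only base R and the built-in anscombe data; the names fits and coefs are introduced here for illustration and do not appear in the original code.

```r
# Fit y ~ x separately for each of the four datasets (x1/y1, ..., x4/y4).
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect the intercept and slope of each fit, one row per dataset.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
rownames(coefs) <- paste("dataset", 1:4)

# Rounded to two decimals, every row shows an intercept of 3 and a slope of 0.5.
print(round(coefs, 2))
```

The coefficients differ only from the third decimal place onward, which is why the four fitted lines look the same on the plots.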

If you are replicating the graph above, note that it relies on a helper function, lm_eqn, whose code is not displayed here. The function prints the regression line equation on the graphs. It was copied from a discussion on StackOverflow and is available with the code for this R Markdown document.
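For readers who do not have the R Markdown source at hand, here is a hedged sketch of what such a helper might look like. It follows a pattern commonly seen in StackOverflow answers on annotating ggplot2 plots with a regression equation; it is a reconstruction under that assumption, not necessarily the code used for this post.

```r
# Hypothetical reconstruction of lm_eqn(); the code used for the post may differ.
lm_eqn <- function(m) {
  # Build a plotmath expression of the form: y = a + b . x, r^2 = r2
  eq <- substitute(
    italic(y) == a + b %.% italic(x) * "," ~ ~italic(r)^2 ~ "=" ~ r2,
    list(
      a  = format(unname(coef(m)[1]), digits = 2),
      b  = format(unname(coef(m)[2]), digits = 2),
      r2 = format(summary(m)$r.squared, digits = 3)
    )
  )
  # Return a single string so it can be passed to annotate(..., parse = TRUE).
  as.character(as.expression(eq))
}

# Usage: build the label for the first dataset of the built-in anscombe data.
label <- lm_eqn(lm(y1 ~ x1, data = anscombe))
```

The function takes a fitted lm object, which matches how it is called in the plotting code, and returns text that ggplot2 parses back into a formatted expression.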

Interactivity using ggvis in a shiny application

  • You can hover over the points to see the specific x- and y-values.
  • This animated visualization can flip across the 4 datasets and show how the regression line remains the same. Press the “Start Flipping” link at the top right of this plot.
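The hover behaviour described above can be sketched roughly as follows. This is an illustration for a single dataset rather than the code of the accompanying shiny app: the names point_tooltip, one_set and vis are assumptions introduced here, and the ggvis part only runs if the package is installed. The flipping animation is not covered by this sketch.

```r
# Build the tooltip text for a hovered point; ggvis passes the point's data as a list.
point_tooltip <- function(d) {
  if (is.null(d)) return(NULL)
  paste0("x: ", d$x, "<br>y: ", d$y)
}

# Long-format quartet, built the same way as earlier in the post.
anscombelong <- data.frame(x = unlist(anscombe[, 1:4]),
                           y = unlist(anscombe[, 5:8]),
                           datasource = rep(1:4, each = 11))

if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)
  # Show one dataset at a time; here, the first one.
  one_set <- subset(anscombelong, datasource == 1)
  vis <- one_set %>%
    ggvis(~x, ~y) %>%
    layer_points() %>%
    layer_model_predictions(model = "lm") %>%  # fitted regression line
    add_tooltip(point_tooltip, "hover")        # show values on hover
  # print(vis)  # uncomment to open the interactive plot
}
```

Keeping the tooltip text in its own function makes it easy to check outside the plot, for example with point_tooltip(list(x = 10, y = 8.04)).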
