In these notes, we provide instructions for using the ggplot2 package to create histograms for sample data in R.
A histogram is a graphic that can be used to visualize the distribution of a random variable using a sample of a large number of values from that distribution.
Consider the following sample of \(1000\) values from the \(\mathrm{Unif}(0,1)\) distribution:
set.seed(13)
values <- runif(1000, min = 0, max = 1)
We can partition the interval \((0,1)\) into several disjoint subintervals and count how many of our sample values were in each subinterval. We create a graphic summarizing this information using rectangles: for each subinterval, we plot a rectangle whose base lies along that interval, and whose height is equal to the number of points in that interval. A histogram for the sample of 1000 uniform values using 10 subintervals is shown below:
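The counting step described above can be sketched in base R before any plotting. This is only an illustration: the names breaks and counts are ours, and the ten equal-width subintervals mirror the description rather than any particular plotting default.

```r
set.seed(13)
values <- runif(1000, min = 0, max = 1)

# Endpoints of 10 equal-width subintervals of (0, 1)
breaks <- seq(0, 1, by = 0.1)

# Assign each value to its subinterval, then count the values in each one
counts <- table(cut(values, breaks = breaks))
counts
```

Each entry of counts is the height of the corresponding rectangle in the histogram.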
Note that each rectangle has height approximately 100 (i.e. each rectangle contains approximately 100 values), which makes sense, since we sampled from the uniform distribution. However, the heights aren’t all exactly equal to 100, since this was a random sample.
We can repeat the experiment, but with 10000 values instead of 1000.
set.seed(13)
values <- runif(10000, min = 0, max = 1)
As our sample size increases, we expect to see that the shape of our histogram gets closer and closer to the shape of the density curve for the distribution of the variable we sampled.
We can visualize the same data using a different number of subintervals as well.
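The plotting syntax is explained later in these notes; as a preview, a comparison like this one might be produced as follows. This is a sketch, assuming ggplot2 is installed: the names uniform_data, p_coarse, and p_fine are ours, and the choice of 5 and 50 bins is illustrative rather than taken from the original figures.

```r
library(ggplot2)

set.seed(13)
values <- runif(10000, min = 0, max = 1)
uniform_data <- data.frame(values)

# 5 wide bins: explicit bin edges at 0, 0.2, ..., 1
p_coarse <- ggplot(uniform_data, aes(values)) +
  geom_histogram(color = "white", breaks = seq(0, 1, length.out = 6)) +
  theme_minimal()

# 50 narrow bins: explicit bin edges at 0, 0.02, ..., 1
p_fine <- ggplot(uniform_data, aes(values)) +
  geom_histogram(color = "white", breaks = seq(0, 1, length.out = 51)) +
  theme_minimal()

p_coarse
p_fine
```

Supplying breaks directly fixes the bin edges, so both plots partition exactly the interval from 0 to 1.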
Notice that the histogram appears “noisier” when using a large number of rectangles compared to a small number of rectangles. This is because the height of each rectangle is actually a random variable! And it turns out that the relative height of each bin has higher variance when the bin contains fewer observations (a fact we will prove later when discussing the Central Limit Theorem).
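One way to see this ahead of the proof: if each of the n sample values independently lands in a given bin with probability p, the bin's count is Binomial(n, p), so its standard deviation relative to its mean is sqrt((1 - p) / (n p)). The sketch below evaluates this under that assumption; relative_sd is a hypothetical helper introduced for illustration, and the bin counts of 5 and 100 are arbitrary.

```r
# Relative standard deviation (sd / mean) of a Binomial(n, p) bin count.
# relative_sd is an illustrative helper, not part of any package.
relative_sd <- function(n, p) {
  sqrt(n * p * (1 - p)) / (n * p)
}

n <- 10000

# Uniform(0, 1) sample split into equal-width bins: p = 1 / (number of bins)
relative_sd(n, p = 1 / 5)    # 5 wide bins
relative_sd(n, p = 1 / 100)  # 100 narrow bins
```

Under this model the relative spread is about 2% of the bin height with 5 bins and about 10% with 100 bins, which matches the noisier appearance of the finer histogram.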
However, one advantage of using a large number of rectangles is that we have a better “resolution” for viewing the shape of the distribution. With only 5 rectangles, we have limited ability to detect the distribution’s shape. (This isn’t as evident with the uniform distribution, however, since its density function is just a horizontal line.)
When creating histograms, we need to balance the trade-off between high variability with a large number of rectangles and loss of detail with a small number of rectangles. The appropriate number of rectangles is often a subjective decision based on what you think best represents the true distribution.
To create a histogram in R, we first generate data. Below, we sample 1000 points from the standard Normal distribution and record them in a data frame:
set.seed(3)
z <- rnorm(1000, mean = 0, sd = 1)
my_data <- data.frame(z)
We then load the ggplot2 package. We create our histogram using a similar template to the one we used to create plots of functions:
library(ggplot2)
ggplot(my_data, aes(z))+
geom_histogram()+
theme_minimal()
In order to see the edges of each rectangle, it is recommended to include color = "white" inside the geom_histogram layer, as seen below:
ggplot(my_data, aes(z))+
geom_histogram(color = "white")+
theme_minimal()
We can control how many rectangles are used by including a bins = ... input inside the geom_histogram layer as well:
ggplot(my_data, aes(z))+
geom_histogram(color = "white", bins = 8)+
theme_minimal()
Alternatively, we can specify the width of each rectangle, rather than the number of rectangles, using binwidth = ... inside the geom_histogram layer:
ggplot(my_data, aes(z))+
geom_histogram(color = "white", binwidth = .5)+
theme_minimal()
Occasionally, we might want to modify the vertical scale of the histogram so that its total area is 1; this is useful if we want to compare the histogram to a specific density curve. To do so, we include after_stat(density) inside aes() in the ggplot layer:
ggplot(my_data, aes(z, after_stat(density)))+
geom_histogram(color = "white", binwidth = .5)+
theme_minimal()
Notice that this didn’t change the shape of the histogram at all, but did change the value of the heights of each rectangle.
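The density scaling can be checked by hand: each height is the bin count divided by the product of the sample size and the bin width, so the rectangle areas sum to 1. The sketch below does this in base R with hist(..., plot = FALSE); the bin edges in edges are our own choice and need not coincide with the ones geom_histogram picks.

```r
set.seed(3)
z <- rnorm(1000, mean = 0, sd = 1)
w <- 0.5  # bin width, matching binwidth = .5 above

# Bin edges at multiples of w that cover the whole sample (an arbitrary alignment)
edges <- seq(floor(min(z) / w) * w, ceiling(max(z) / w) * w, by = w)

# Count the values in each bin without drawing anything
h <- hist(z, breaks = edges, plot = FALSE)

# Density-scale heights: count / (sample size * bin width)
heights <- h$counts / (length(z) * w)

sum(heights * w)  # total area of the rectangles
```

Because every count is divided by the same constant, the relative heights of the rectangles, and hence the shape of the histogram, are unchanged.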
Suppose we wanted to superimpose the normal density curve on this plot. To do so, we create a sequence of inputs spanning the range of our sample data, compute the value of the normal density at each point, and record the data in a new data frame:
# Note: this reuses the name z for the grid of inputs; my_data keeps the original sample
z <- seq(-3, 4, length.out = 100)
norm_density <- dnorm(z, mean = 0, sd = 1)
my_density_data <- data.frame(z,norm_density)
Now, we add a new geom_line() layer to the previous histogram plot. Note that we need to include a new data = and aes() input, since this layer will be using a different data set than the one used for the histogram:
ggplot(my_data, aes(z, after_stat(density)))+
geom_histogram(color = "white", binwidth = .5)+
geom_line(data = my_density_data, aes(z, norm_density))+
theme_minimal()
As expected, the shape of the histogram closely follows the Normal density curve.
We’ve now created basic histograms in R. Further customization of the histogram is possible, and is discussed in the following vignettes: