This document illustrates how to use ggplot2 and scattermore to make a scatter plot with many data points that:
shows the density of points without saturation
is fast to view and arrange (for example when assembling figure panels)
can be stored in a compact file without loss of quality
Prepare data
Run the following code to prepare the data used in this document:
suppressPackageStartupMessages({
  library(ggplot2)
  library(tibble)
})

# built-in `diamonds` dataset from the `ggplot2` package (see ?ggplot2::diamonds)
tibble(diamonds)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Create figure
Load packages
library(ggplot2)
library(scattermore)
Plot
Let's first create a simple scatter plot to illustrate the problems. The diamonds dataset has 53,940 observations. If the plot is saved with a vector graphics device such as pdf() or svg(), the file will be large (each observation is stored as its own graphic element) and slow to open or arrange. Furthermore, the large number of data points leads to saturation, so we do not see the full underlying density of data points.
# create base plot
gg <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  labs(x = "Weight of the diamond (carat)", y = "Price (US dollars)") +
  theme_bw(20) +
  theme(panel.grid = element_blank(),
        legend.position = "bottom")

gg + geom_point()
A simple way to reduce the saturation is to use transparency, so that overlapping observations lead to darker colors. However, this does not yet solve the "many points" problem: every observation is still drawn as an individual graphic element.
# ... with transparency
gg + geom_point(color = alpha("black", 0.02))
A simple way to also solve the "many points" problem is to avoid showing individual observations and instead show the local density of points using a color scale.
# ... with a 2D density estimate shown as filled contours
gg +
  geom_density_2d_filled(bins = 48) +
  coord_cartesian(expand = FALSE) +
  theme(legend.position = "none")
The linear contour levels or color intervals (controlled by bins or breaks) may not work well in a case like ours: the density is very high in a few regions, which then occupy almost the entire color scale, and we lose resolution in low-density regions. You can use breaks to create non-linear intervals (here combined with contour_var = "ndensity", so that the densities are scaled to the known range [0, 1]) and theme(panel.background = ...) to fill the zero-density area with the dark color at the low end of the palette.
# ... with logarithmically spaced contour levels on the normalized density
gg +
  geom_density_2d_filled(
    contour_var = "ndensity",
    breaks = exp(seq(log(1e-4), log(1), length.out = 64))
  ) +
  coord_cartesian(expand = FALSE) +
  theme(legend.position = "none",
        panel.background = element_rect(fill = hcl.colors(64)[1]))
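To see why these breaks preserve resolution at low densities, it can help to inspect them on their own. The following is a minimal sketch in base R; the variable names breaks and ratios are only for illustration. Because the breaks are evenly spaced on a logarithmic scale, consecutive levels differ by a constant factor, and roughly three quarters of the bands cover normalized densities below 0.1.

```r
# Log-spaced breaks between 1e-4 and 1, as in the call above
breaks <- exp(seq(log(1e-4), log(1), length.out = 64))

# 64 breaks delimit 63 filled bands
length(breaks)

# Consecutive breaks differ by a constant factor rather than a constant step
ratios <- breaks[-1] / breaks[-length(breaks)]
range(ratios)

# Share of bands whose upper limit lies at or below 10% of the maximum density
mean(breaks[-1] <= 0.1)
```

With linearly spaced breaks, only about one tenth of the bands would fall below 0.1.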
Finally, if you prefer to show individual observations and need something that will scale to millions of points without getting slow or hard to use, scattermore provides a solution for that too.
# ... with individual points rasterized by scattermore
gg + geom_scattermore(pointsize = 2, alpha = 0.02)
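One way to check the claim about compact files is to save the same plot with geom_point() and with geom_scattermore() and compare the file sizes. This is a rough sketch, not a benchmark: the temporary file names and the 7 x 7 inch size are arbitrary choices for illustration, and the exact sizes will depend on the graphics device and package versions.

```r
library(ggplot2)
library(scattermore)

gg <- ggplot(diamonds, aes(x = carat, y = price))

# Hypothetical output locations; temporary files are used for illustration
file_vector <- tempfile(fileext = ".pdf")
file_raster <- tempfile(fileext = ".pdf")

# Same data, once as individual vector points and once as a rasterized layer
ggsave(file_vector, gg + geom_point(color = alpha("black", 0.02)),
       width = 7, height = 7)
ggsave(file_raster, gg + geom_scattermore(pointsize = 2, alpha = 0.02),
       width = 7, height = 7)

# File sizes in bytes
sizes <- file.size(c(file_vector, file_raster))
names(sizes) <- c("geom_point", "geom_scattermore")
sizes
```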
Remarks
scattermore does its magic by rendering the points into a bitmap while keeping all other layers as they are, so that you can create pdf() files in which the axes and labels stay sharp when magnified.
scattermore provides two ggplot2 layers: geom_scattermore() (which behaves mostly like geom_point()), and geom_scattermost() which avoids much of the overhead of a normal ggplot2 layer and thus is even more efficient, at the price that it has a slightly different interface and needs to get the data directly as the xy argument.
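As a minimal sketch of the different interface, the diamonds plot could be drawn with geom_scattermost() roughly as follows. The coordinates are passed directly as a two-column matrix instead of through aes(), so the axis labels are set by hand; the object names xy and p are only for illustration, and the point size and color simply mirror the earlier example.

```r
library(ggplot2)
library(scattermore)

# geom_scattermost() takes the coordinates as a two-column matrix (x, y)
xy <- cbind(diamonds$carat, diamonds$price)

p <- ggplot() +
  geom_scattermost(xy, color = alpha("black", 0.02), pointsize = 2) +
  labs(x = "Weight of the diamond (carat)", y = "Price (US dollars)") +
  theme_bw(20)
p
```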
An alternative, maybe even more convenient drop-in replacement for geom_point() and other ggplot2 geoms that follows a similar strategy is geom_point_rast() from the ggrastr package. This package also provides the rasterize() function, which can take any existing ggplot2 plot object and rasterize suitable layers in it. Compared to scattermore, however, ggrastr does not seem to be as fast or to scale as well.
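For comparison, a sketch of both ggrastr approaches is shown below, assuming the ggrastr package is installed. The resolution of 300 dpi is an arbitrary choice, and the function is called here under the spelling rasterise(), which is the one used in the package documentation.

```r
library(ggplot2)
library(ggrastr)

gg <- ggplot(diamonds, aes(x = carat, y = price))

# Option 1: rasterized drop-in replacement for geom_point()
p1 <- gg + geom_point_rast(alpha = 0.02, raster.dpi = 300)

# Option 2: rasterize the point layer of an existing plot object
p2 <- rasterise(gg + geom_point(alpha = 0.02), layers = "Point", dpi = 300)

p1
p2
```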