**Mass Shooting Deaths and the Flaw of Averages**

**Introduction**

I have been watching the news on television today: another mass murder shooting rampage, again in Texas. This one in Odessa. Mass shootings. *“Our thoughts and prayers”*. Second amendment rights. Background checks. Assault rifles. Mental illness. Mass shootings. *“Our thoughts …”*. What can be done? Perhaps nothing, but almost certainly nothing in the absence of valid data, carefully analyzed, artfully visualized, and clearly communicated.

There is no shortage of opinions regarding what should be done to curtail mass murders, but as yet nothing approaching a consensus. While it is irrefutable that guns are the method of choice for mass murders, some believe that civilian gun ownership is the best defense against becoming a victim, while others downplay the notion that the United States has a particularly acute mass shooting death problem relative to other countries.

Fox News contributed to the latter perspective with a prime-time “Special Report” (2), based on the following graphic, displayed here using the “imager” package in R following some setup commands.

```
knitr::opts_chunk$set(echo = TRUE)
library(imager)
library(rvest)
library(tidyverse)
# Load the graphic from the working directory (assumes Fox_MEAN.jpg was saved there)
fox <- load.image('Fox_MEAN.jpg')
plot(fox, axes=FALSE, frame.plot=FALSE)
```

**The Flaw of Averages …**

Relying on this visualization, “Educating Liberals” (1) wrote: *“Facts don’t care about your feelings. Show this to your anti-gun friends if you want to witness a good triggering today”*. The data upon which Fox News (2) and Educating Liberals (3) rely are taken from a demonstrably nebulous source (4), and their presentation is deceptively misleading while remaining marginally defensible. The underlying raw data may be accurate, but if tortured enough they will confess to anything. What follows reimagines how a neutral observer might let the data speak for themselves.

It is correct that Norway had the highest average number of mass shooting deaths per capita over the 2009-2015 period, when measured by arithmetic mean, but it had an average of 0.00 when measured by the median over the same period. It also is correct that the USA had an annual average of 0.09 mass shooting deaths per million population over the 2009-2015 period, when measured by arithmetic mean, but it had an average of 0.058 when measured by the median over the same period. Finally, it is correct that the USA average ranked dead last among countries listed by Fox News, but their list comprises 7 countries selected from the data source such that the other 6 seem worse.

Leaving aside the latter point regarding selective sampling to bias an impression, let’s consider more carefully the choice of arithmetic mean versus median to represent something that approximates our notion of an “average”. If the data of interest have a distribution of values that is roughly symmetrical, such as the “bell-shaped curve” of the normal distribution, then it matters little whether one relies on the mean or the median, as there will be little difference between them. However, the choice matters in proportion to the departure from symmetry, as can be demonstrated with a simple example.

Using R, we generated ten numbers, with nine drawn at random, each equally likely to be any value between 1 and 5, and with the tenth number set equal to 1,000. Setting the seed before the random draws allows reproducibility. In this example 90% of the ten numbers are between 1 and 5; if 100% had been, we would expect both the mean and median to be close to 3.00 and to deviate from it only due to the randomness of the draws. With our sample, the median equals 4.052, somewhat larger than randomness alone would likely produce, because the tenth observation pushes the middle of the sorted values upward. Notice how the mean (103.234) went chasing after the single aberrant observation, whereas the median resisted this urge. If you had to choose between them, does the mean or median better express the “average” data point? As is often the case, visualizing the data can be helpful, and the “dotchart” drives home the absurdity of an average that chases one point of 1,000.

```
set.seed(102456)
x <- c(runif(9, 1, 5), 1000)
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.065 3.079 4.052 103.234 4.551 1000.000
dotchart(x, pch = 19, cex = 1.5)
```

Clearly, the median (~4) adequately represents 90% of the data, with the remaining outlier point off on its own. By contrast, the arithmetic mean (~100) is 25 times the value of the median, yet still only 10% of the outlying point. It fails to adequately represent the actual value of any single point in our sample, a poor choice for our “average”.
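To see how each kind of average reacts as the outlier grows, here is a minimal sketch. The nine base values and the outlier sizes are made up for illustration and are not drawn from the seeded sample above; the 10% trimmed mean is included only as one further robust alternative.

```r
# Nine "ordinary" values between 1 and 5 (illustrative, NOT the seeded sample above)
base_values <- c(1.2, 1.9, 2.4, 2.8, 3.0, 3.3, 3.9, 4.4, 4.8)

# Try a tenth value of increasing size and recompute each average
outliers <- c(5, 10, 100, 1000)
comparison <- data.frame(
  outlier = outliers,
  mean    = sapply(outliers, function(o) mean(c(base_values, o))),
  median  = sapply(outliers, function(o) median(c(base_values, o))),
  # trim = 0.1 drops the smallest and largest of the ten values before averaging
  trimmed = sapply(outliers, function(o) mean(c(base_values, o), trim = 0.1))
)
print(comparison)
```

In this sketch the mean column climbs with the outlier, while the median and the trimmed mean stay put, because neither gives the tenth value any weight once it is the largest.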

**… Applied to the Fox News Report**

For those who dismiss this as a mere technical debate of little practical importance, consider the following graph that displays averages for each of the 17 countries in two dimensions: arithmetic mean and median. The arithmetic mean of annual mass shooting deaths in Norway clearly is far greater than for any of the other 16 countries. Taken in isolation, this implies Norway is a dangerous place, far more so than, say, the USA. As it happens, there were no deadly mass shootings in Norway for six of the seven years under consideration. In 2011 a Norwegian right-wing Christian extremist bombed the government building that houses the Prime Minister, killing eight, then took a ferry to an island campground wearing a police uniform, and opened fire on campers, killing 69. That implies a mean of ~10 mass shooting deaths annually (69 deaths over seven years), which translates to ~2 per million population (given a population of about 5 million).

```
ave_mass_shooting_deaths_perM <- read.table(header = FALSE, text = "
Norway 1.99 USA 0.058
Serbia 0.38 Albania 0
Macedonia 0.34 Austria 0
France 0.34 Belgium 0
Albania 0.2 CzechRepublic 0
Slovakia 0.19 Finland 0
Switzerland 0.14 France 0
Finland 0.13 Germany 0
Belgium 0.13 Italy 0
CzechRepublic 0.12 Macedonia 0
USA 0.09 Netherlands 0
Austria 0.07 Norway 0
Netherlands 0.05 Russia 0
UnitedKingdom 0.05 Serbia 0
Germany 0.02 Slovakia 0
Russia 0.01 Switzerland 0
Italy 0.01 UnitedKingdom 0
")
library(gtools)
d_mean <- ave_mass_shooting_deaths_perM[ , 1:2]
colnames(d_mean)<- c("country", "mean")
d_mean <- d_mean[mixedorder(d_mean$country),]
d_median <- ave_mass_shooting_deaths_perM[ , 3:4]
colnames(d_median)<- c("country", "median")
d_median <- d_median[mixedorder(d_median$country),]
d <- merge(d_median, d_mean, sort = TRUE)
d <- d %>% mutate_if(is.factor, as.character)
library(ggrepel)
( m1 <- ggplot(d, aes(x = mean, y = median)) +
  labs(title = "Average Annual Mass Shooting Deaths",
       subtitle = "per Million Population \n 2009-2015") +
  # label the only country with a non-zero median, and the one with an outlying mean
  geom_text_repel(data = subset(d, median > 0.01), aes(label = country)) +
  geom_text_repel(data = subset(d, mean > 0.50), aes(label = country)) +
  geom_point(size = 3) )
```
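The Norway arithmetic quoted before the plot can be checked in a few lines. This is a minimal sketch: the yearly vector encodes the narrative (69 deaths in 2011, none in the other six years), and the population of 5 million is the rounded figure used in the text, so the result only approximates the 1.99 in the table.

```r
# Norway, 2009-2015: a single year with 69 deaths, zero otherwise (from the narrative above)
norway_deaths <- c(0, 0, 69, 0, 0, 0, 0)
population_millions <- 5   # "about 5 million", a rounded assumption

mean_deaths        <- mean(norway_deaths)                       # roughly 10 per year
mean_per_million   <- mean_deaths / population_millions         # roughly 2 per million
median_per_million <- median(norway_deaths) / population_millions

round(c(mean_deaths = mean_deaths,
        mean_per_million = mean_per_million,
        median_per_million = median_per_million), 2)
```

Same country, same seven years: the mean reports about two deaths per million per year, while the median reports none.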

Notice that 16 of the 17 countries appear to have a median number of annual mass shooting deaths equal to 0. Why is that? To find an answer we need the underlying data that generated the arithmetic means and medians for each country.

**The 17 Countries, Year by Year**

Below we load into R the annual data on mass shooting deaths for each country, including the total deaths, the arithmetic mean and the median. For the seven years under study, we compute counts of the number of years that deaths for a country exceeded 5, 10, and 15, respectively.

```
mass_shooting_deaths_perM_byYear <- read.table(header = FALSE, sep = "", text = "
Albania 0 0 0 0 0 4 0 4 0.57 0
Austria 0 0 0 0 4 0 0 4 0.57 0
Belgium 0 0 6 0 0 4 0 10 1.43 0
CzechRep. 0 0 0 0 0 0 9 9 1.29 0
Finland 5 0 0 0 0 0 0 5 0.71 0
France 0 0 0 8 0 0 150 158 22.57 0
Germany 13 0 0 0 0 0 0 13 1.86 0
Italy 0 0 0 0 0 0 4 4 0.57 0
Macedonia 0 0 0 5 0 0 0 5 0.71 0
Netherlands 0 0 6 0 0 0 0 6 0.86 0
Norway 0 0 69 0 0 0 0 69 9.86 0
Russia 0 0 0 6 6 0 0 12 1.71 0
Serbia 0 0 0 0 13 0 6 19 2.71 0
Slovakia 0 7 0 0 0 0 0 7 1 0
Switzerland 0 0 0 0 4 0 4 8 1.14 0
U.K. 0 12 0 0 0 0 0 12 1.71 0
U.S.A. 38 12 18 66 16 12 37 199 28.43 18")
colnames(mass_shooting_deaths_perM_byYear) <-
c("country", "yr2009", "yr2010", "yr2011",
"yr2012", "yr2013", "yr2014", "yr2015",
"Total", "Mean", "Median")
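library(textshape)
# keep the country and the seven yearly columns, then move country into the row names
d_by_yr <- mass_shooting_deaths_perM_byYear[ , 1:8]
d_by_yr <- column_to_rownames(d_by_yr, loc = 1)
# count, over the seven yearly columns only, the years above each threshold
d_by_yr$gt_5 <- rowSums(d_by_yr[ , 1:7] > 5)
d_by_yr$gt_10 <- rowSums(d_by_yr[ , 1:7] > 10)
d_by_yr$gt_15 <- rowSums(d_by_yr[ , 1:7] > 15)
d_gt_x <- d_by_yr[ , 8:10]
d_gt_x <- d_gt_x %>% rownames_to_column("country")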
```

The following graph makes the explanation clear. Of the 16 countries other than the USA, 11 experienced mass shooting deaths in a single year, and the remaining five in two years of the seven studied. Therefore, these countries each had zero such deaths in either 5 or 6 of the seven years, a majority, so their median values must be 0.00. By contrast, the USA had double-digit deaths every single year. The USA experiences annual mass shooting deaths as the norm, whereas such tragic events are the exception for all the other 16 countries in the sample.

```
keycol <- "deaths"
valuecol <- "years_of_7"
gathercols <- c("gt_5", "gt_10", "gt_15")
d_gt_x_long <- gather_(d_gt_x, valuecol, keycol, gathercols)
d_gt_x_long$deaths <- as.factor(d_gt_x_long$deaths)
library(ggrepel)
( c2 <- ggplot(data = d_gt_x_long) +
aes(x = country, y = deaths, group = deaths) +
geom_bar(stat = "identity", color = "yellow") +
geom_text(aes(label = deaths), size = 3, hjust = 0.5,
vjust = 2, position = "stack", color = "white") +
labs(title = "Number of Years Exceeding 5, 10, and 15 Mass Shooting Deaths, 2009-2015",
subtitle = "Bottom number = # Years with >5 Mass Shooting Deaths, then >10, then >15") +
ylab("# Years > 5, 10, and 15 Mass Shooting Deaths") +
theme(legend.position = "none",
panel.grid = element_blank(),
axis.text.y = element_blank()) + theme(axis.text.x = element_text(angle = 60, hjust = 1)) )
```
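The counting argument above can also be checked directly from the yearly figures. The sketch below re-enters the same 17 rows as a plain matrix (the object names are ours and the country labels are shortened) and counts the non-zero years alongside each median.

```r
# Yearly mass shooting deaths, 2009-2015, copied from the table loaded earlier
deaths_matrix <- rbind(
  Albania     = c(0, 0, 0, 0, 0, 4, 0),
  Austria     = c(0, 0, 0, 0, 4, 0, 0),
  Belgium     = c(0, 0, 6, 0, 0, 4, 0),
  CzechRep    = c(0, 0, 0, 0, 0, 0, 9),
  Finland     = c(5, 0, 0, 0, 0, 0, 0),
  France      = c(0, 0, 0, 8, 0, 0, 150),
  Germany     = c(13, 0, 0, 0, 0, 0, 0),
  Italy       = c(0, 0, 0, 0, 0, 0, 4),
  Macedonia   = c(0, 0, 0, 5, 0, 0, 0),
  Netherlands = c(0, 0, 6, 0, 0, 0, 0),
  Norway      = c(0, 0, 69, 0, 0, 0, 0),
  Russia      = c(0, 0, 0, 6, 6, 0, 0),
  Serbia      = c(0, 0, 0, 0, 13, 0, 6),
  Slovakia    = c(0, 7, 0, 0, 0, 0, 0),
  Switzerland = c(0, 0, 0, 0, 4, 0, 4),
  UK          = c(0, 12, 0, 0, 0, 0, 0),
  USA         = c(38, 12, 18, 66, 16, 12, 37)
)
colnames(deaths_matrix) <- 2009:2015

nonzero_years <- rowSums(deaths_matrix > 0)      # years with any deaths
medians       <- apply(deaths_matrix, 1, median) # middle value of the 7 years
check <- data.frame(nonzero_years, median = medians,
                    mean = round(rowMeans(deaths_matrix), 2))
print(check[order(-check$nonzero_years), ])
```

Any country with three or fewer non-zero years out of seven has a zero in the middle of its sorted values, which is why only the USA shows a non-zero median here.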

We can collapse this notion to a simple scatterplot of countries, with the vertical axis representing mass shooting deaths in the “peak year” for a given country as a share of its total, and the horizontal axis representing the sum of all mass shooting deaths. For the unlabeled cluster of ten countries in the upper left block, each country experienced only one episode of mass shooting deaths during the 2009-2015 period, so their maximum in a single year equals their total overall, hence a ratio of 1.00. Norway also has a ratio of 1.00, with a single incident with 69 deaths, and France is just below 1.00, with 150 of its 158 total deaths occurring in a single year (2015).

```
mass_shooting_deaths_perM_byYear$Max <-
do.call('pmax',c(mass_shooting_deaths_perM_byYear[,2:8]))
mass_shooting_deaths_perM_byYear <-
mass_shooting_deaths_perM_byYear %>%
mutate(Max2all = Max/Total)
( y1 <- ggplot(mass_shooting_deaths_perM_byYear,
aes(x = Total, y = Max2all, color = Max)) +
geom_point(size = 3) +
labs(title = "Concentration of Mass Shooting Deaths within Single Year",
subtitle = "as percentage of Total Mass Shooting Deaths (N = 17 countries)") +
xlab("Total Deaths from Mass Shootings, 2009-2015") +
ylab("Maximum Deaths in Single Year as Percent of Total") +
geom_hline(yintercept = 0.50, linetype="solid", color = "blue") +
geom_hline(yintercept = 0.75, linetype="dashed", color = "red") +
geom_vline(xintercept = 50, linetype="solid", color = "blue") +
geom_vline(xintercept = 175, linetype="dashed", color = "red") +
ylim(0, 1) +
geom_label_repel(data = subset(mass_shooting_deaths_perM_byYear, Max2all < 0.75),
aes(label = country)) +
geom_label_repel(data = subset(mass_shooting_deaths_perM_byYear, Total > 50),
aes(label = country)) )
```
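For a concrete feel for the ratio on the vertical axis, the sketch below computes it by hand for three of the countries. The helper name `peak_share` is ours, and the yearly values are copied from the table above.

```r
# Share of a country's total deaths that fell in its single worst year
peak_share <- function(yearly) max(yearly) / sum(yearly)

yearly_deaths <- list(
  France = c(0, 0, 0, 8, 0, 0, 150),
  Norway = c(0, 0, 69, 0, 0, 0, 0),
  USA    = c(38, 12, 18, 66, 16, 12, 37)
)

shares <- sapply(yearly_deaths, peak_share)
round(shares, 3)
```

A share near 1 means nearly all deaths came in one year; the lower the share, the more evenly the deaths are spread across the period.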

The following graph, using the ggplot2 package in R and its faceting option to display each country, helps reinforce the point. The dotted red line is set at 10 mass shooting deaths. For the other 16 countries the mass shooting deaths are consistently very low, year after year, but for a single incident in each of Norway and France. By contrast, the USA is volatile, with double-digit deaths from mass shootings every single year.

```
long_d <- mass_shooting_deaths_perM_byYear[ , -(9:13)]
colnames(long_d) <-
c("country", "09", "10", "11", "12", "13", "14", "15")
( ts1 <- long_d %>% gather(year, value, -country) %>% ggplot(aes(year, value)) +
geom_point(size = 3) +
geom_hline(yintercept = 10, linetype="dashed", color = "red") +
labs(title = "Mass Shooting Deaths over Time, by Country",
subtitle = "by Year (seven years: 2009-2015) and by Country (N = 17) \n
Dotted line set at 10 deaths: \n
11 countries < 10 every year, 5 countries > 10 one year (with 2>13), USA > 10 all 7 years") +
xlab("Years 2009 through 2015") +
ylab("Mass Shooting Deaths") +
theme(legend.position = "none",
panel.grid = element_blank(),
axis.text.x = element_blank()) + facet_wrap(~country) )
```

**From 17 Countries to 231**

We have taken the underlying data as given, and found ways to let the data speak. I believe most discerning and objective readers would agree that the Fox News display misrepresented what the data indicated. It is for others to decide if this was an honest, naïve view of the data, or deliberate torture and mutilation to force a confession.

We conclude with a shot at the underlying reason annual mass shooting deaths in double digits have become part and parcel of the American landscape. There may be other explanations, but opportunity seems foremost among them. Data on per capita civilian ownership of guns for each of 231 countries have been scraped from a website (5) and entered into R.

```
g <- read_html("https://en.wikipedia.org/wiki/Estimated_number_of_civilian_guns_per_capita_by_country")
( t1 <- html_nodes(g, "table") )
guns <- html_table(t1[[1]], fill = TRUE)
colnames(guns) <- c("ID", "country", "CivGunsPer100", "Region", "Subregion", "Popul", "CivGuns", "X8", "X9", "X10" , "X11")
guns <- na.omit(guns)
guns <- subset(guns, select = -c(ID, X8, X9, X10, X11) )
guns[c(227:231), "CivGuns"] <- as.numeric(0)
```

The following display relies upon data from the Small Arms Survey, and indicates that the USA has about 120 civilian guns per 100 citizens. By contrast, every African country (N=58) has fewer than 20, and every European country (N=41) has fewer than 40. We’ve shown how the USA stands out in previous graphs, with the most deaths over the period and consistent double-digit fatalities all seven years (compared to no more than two such years for any other country studied).

The graph below displays a boxplot of per capita civilian gun ownership for countries grouped by continent. The vertical extent of each box corresponds to the middle 50% of countries on a given continent, with a horizontal bar inside the box at the median level for each continent. The notches provide an approximate 95% confidence interval on the estimate of the population median, so continents whose notches do not overlap on the vertical axis have a statistically significant difference in medians. For example, Europe is just barely significantly higher than the Americas, and the Americas clearly are significantly higher than Africa. The points above and below the box represent the remaining 50% of countries on each continent, and are more likely to be considered outliers the further they have strayed from their box.

The greatest outlier in the world is the USA, with a value (120.5) roughly twelve times higher than the median for the Americas. The USA obsession with gun ownership dwarfs that of the next most gun-toting countries (Falkland Islands, Yemen, and New Caledonia). Of the 231 countries cited, these are the only four with civilian gun ownership exceeding 40 per 100 citizens. It’s not common for the USA to be compared with, let alone in some sense outperformed by, that set of countries. Regarding the primary cause of mass shooting deaths in the USA, perhaps it’s time to connect the dots:

**It’s the Guns, Stupid!**

```
library(ggrepel)
p1 <- ggplot(guns, aes(x = Region, y = CivGunsPer100)) +
geom_boxplot(notch = T) +
geom_dotplot(binaxis='y', stackdir='center', binwidth = 1)
p2 <- p1 + geom_text_repel(data = subset(guns, CivGunsPer100 > 40.167), aes(label = country))
p3 <- p2 + labs(title = "It's the Guns, Stupid",
subtitle = "Civilian Guns per 100 Citizens -- Only 3 of 230 countries have 1/3rd or more (>40) of the 120.5 Civilian Guns per 100 USA citizens") +
geom_hline(yintercept = 40.167, linetype="dashed", color = "red")
( p4 <- p3 + annotate(geom="text", x=2.2, y=120.5, label="120.5", color = "blue") +
annotate(geom="text", x=2.2, y=62.1, label="62.1", color = "blue") +
annotate(geom="text", x=3.2, y=52.8, label="52.8", color = "blue") +
annotate(geom="text", x=5.2, y=42.5, label="42.5", color = "blue") )
```
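For readers curious where the notches come from: base R’s `boxplot.stats()` places them at the median plus or minus 1.58 × IQR / √n, and the ggplot2 documentation describes the same rule for notched boxplots. The sketch below reproduces that calculation by hand on made-up numbers; the values are hypothetical and are not the Small Arms Survey data.

```r
# Hypothetical guns-per-100 values for one region (made-up numbers for illustration)
guns_per_100 <- c(0.5, 1.2, 2.0, 2.8, 3.5, 4.1, 5.0, 6.3, 7.7, 9.4, 12.0, 15.5, 21.0)

stats  <- boxplot.stats(guns_per_100)
hinges <- stats$stats[c(2, 4)]                   # lower and upper hinges (approximate quartiles)
half_width <- 1.58 * diff(hinges) / sqrt(stats$n)
notch  <- stats$stats[3] + c(-1, 1) * half_width # median +/- half-width

print(notch)        # notch limits computed by hand
print(stats$conf)   # the limits reported by boxplot.stats()
```

Because the half-width shrinks with the square root of the number of countries, a continent with few countries gets wide notches, which makes non-overlap harder to achieve.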

**Updates to 2019 …**

The count is at seven deaths in Odessa; it raises the mass shooting death toll to 38 from three separate events – in a single month (August, 2019), for a single country (USA). Recall that there were only three greater death tolls during the 2009-2015 period for any year, across all 17 countries including the USA. Is August, 2019 an anomaly? Have mass shooting deaths increased in the USA since 2015? Let’s update our data (6) and see.

The graph below certainly supports a claim that the problem has accelerated in the USA, as the grey bars are consistently taller. Is the visual impression statistically significant? That would require overwhelming evidence supporting such a claim, as the sample sizes (7 old; 4 new) are very small.

```
year <- as.character(2009:2019)
# USA deaths for 2009-2015 (row 17 of long_d) followed by the 2016-2019 updates
deaths <- c(unlist(long_d[17, 2:8], use.names = FALSE), 69, 117, 80, 66)
usa <- data.frame(year = factor(year, levels = year),
                  deaths = deaths,
                  new = c(rep("N", 7), rep("Y", 4)))
( t1 <- ggplot(usa, aes(year, deaths, fill = new)) +
geom_bar(stat = "identity") +
scale_x_discrete() )
```

**… indicate It’s Gotten Worse**

Nonetheless, let’s take a look. The boxplots below do not come close to overlapping, so there is no need to add notches. The substantial gap is highly suggestive of a significant difference between the old (2009-2015) and new (2016-2019) data, even though the 2019 figure covers only 8 months.

`(boot <- ggplot(usa, aes(new, as.numeric(deaths))) +
geom_boxplot() +
stat_sum(colour="darkgray",alpha=0.5) +
scale_size(breaks=1:2, range=c(3,6)) )`

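A distribution-free permutation test indicates that, even after accounting for the small sample sizes, the difference between the two periods is statistically significant at the alpha = 0.05 level (p = 0.011), narrowly missing the 0.01 level. Note that the call below uses the asymptotic approximation to the permutation distribution, which is the default in the coin package.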

```
library(coin)
independence_test(as.numeric(deaths) ~ as.factor(new),
data = usa)
## Asymptotic General Independence Test
## data: as.numeric(deaths) by as.factor(new) (N, Y)
## Z = -2.5552, p-value = 0.01061
## alternative hypothesis: two.sided
```
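With only 11 observations, the permutation distribution can also be enumerated exactly in base R rather than approximated. The sketch below is our own cross-check, not the coin call above: it uses the difference in group means as the test statistic (an assumption on our part) and tries every way of assigning 4 of the 11 yearly totals to the “new” group.

```r
# USA yearly deaths, taken from the table and the update above
deaths_old <- c(38, 12, 18, 66, 16, 12, 37)   # 2009-2015
deaths_new <- c(69, 117, 80, 66)              # 2016-2019 (2019 partial)

pooled   <- c(deaths_old, deaths_new)
observed <- mean(deaths_new) - mean(deaths_old)

# Every possible choice of which 4 of the 11 values are labelled "new"
splits <- combn(length(pooled), length(deaths_new))
perm_diffs <- apply(splits, 2, function(idx) {
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value: share of relabellings at least as extreme as what we observed
p_two_sided <- mean(abs(perm_diffs) >= abs(observed) - 1e-12)
c(n_splits = ncol(splits), p_two_sided = p_two_sided)
```

Enumerating all relabellings avoids leaning on a large-sample approximation, which is a reasonable precaution when the groups hold only seven and four observations.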

**Conclusion**

At the least, the reader should be wary of the flaw of averages, carefully considering which average (arithmetic, geometric, or trimmed mean, median, mode,…) is most appropriate for a given task. Beyond that, perhaps some inspiration has been provided for communicating data-driven results via informative visualizations. It’s likely too much to expect any readers to send an apology to President Obama for the derogatory and dismissive reactions from Fox News (7) and others (8) to comments he made while in office, such as: *“‘I say this every time we’ve got one of these mass shootings. This just doesn’t happen in other countries,’ claimed Obama. It is a claim that he has continually repeated over the years. Talk about being self-absorbed.”*

**Sources:**