Background

I analyze 2016 crime data from the Seattle and San Francisco open data portals.

First, I load the libraries required for the analysis. In the steps that follow, I clean the data, create some features, and then bring in additional geospatial data to guide the analysis.

All the code and data for this analysis can be viewed on my GitHub. Feel free to clone the repository and run your own analysis.

# Install only the packages that are not already installed. See:
# http://stackoverflow.com/a/4090208/1494702
req.pkg <- c("tidyverse", "ggmap", "lubridate", "randomcoloR", "leaflet", "rgdal", "sp")
install.diff <- req.pkg[!(req.pkg %in% installed.packages()[,"Package"])]
if(length(install.diff)) install.packages(install.diff)

require(tidyverse)
require(ggmap)
require(lubridate)
require(randomcoloR)
require(leaflet)
require(rgdal)
require(sp)

Next, I load the data files and preview a few rows:

file.name.sea <- "/Users/gauravgada/GitHub/viz-des/01-report/sea_2016.csv"
sea <- read.csv(file.name.sea)

file.name.sf <- "/Users/gauravgada/GitHub/viz-des/01-report/sf_2016.csv"
sf <- read.csv(file.name.sf)

knitr::kable(head(sea), format = "markdown")
(Table: first six rows of the Seattle data, with columns X, RMS.CDW.ID, General.Offense.Number, Offense.Code, Offense.Code.Extension, Offense.Type, Summary.Offense.Code, Summarized.Offense.Description, Date.Reported, Occurred.Date.or.Date.Range.Start, Occurred.Date.Range.End, Hundred.Block.Location, District.Sector, Zone.Beat, Census.Tract.2000, Longitude, Latitude, Location, Month, and Year.)
knitr::kable(head(sf), format = "markdown")
(Table: first six rows of the San Francisco data, with columns X.1, IncidntNum, Category, Descript, DayOfWeek, DateTime, PdDistrict, Resolution, Address, X, Y, Location, and PdId.)

Research questions

The questions I ask of the data are:

  1. How does the crime in Seattle and San Francisco vary over the course of the year?
  2. Which times of day are safer, and which are more dangerous?
  3. How are crimes geospatially distributed? Do they happen all over the city or are they concentrated in pockets?

I start by cleaning the data.

  1. I convert all the time columns into datetimes that R recognizes, which helps with the rest of the analysis.
  2. I generate some features by splitting columns on hyphens; e.g., in the Seattle dataset, I split the “Offense.Type” column on the hyphen.
sea <- sea %>%
  separate(Offense.Type, into = c("Offense.summ", "Offense.det1", "Offense.det2", "Offense.det3"), sep = "-", remove = F)
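One thing worth knowing about this separate() call: many Offense.Type values have fewer than four hyphen-separated pieces (“THEFT OF SERVICES” has none), so separate() warns and pads the missing columns with NA. If the warnings bother you, fill = "right" makes the padding explicit; an equivalent sketch:

# fill = "right" pads short splits with NA on the right instead of warning.
sea <- sea %>%
  separate(Offense.Type,
           into = c("Offense.summ", "Offense.det1", "Offense.det2", "Offense.det3"),
           sep = "-", remove = F, fill = "right")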

# http://stackoverflow.com/questions/29111017/convert-string-to-datetime-r
# http://stackoverflow.com/a/18503822/1494702
# http://strftime.org/

to.datetime.sea <- function(x) {
    # Seattle timestamps are already "YYYY-MM-DD HH:MM:SS", the default
    # format that as.POSIXct() parses without an explicit format string.
    return(as.POSIXct(x))
}

sea[c("Date.Reported", "Occurred.Date.or.Date.Range.Start", "Occurred.Date.Range.End")] <- lapply(sea[c("Date.Reported", "Occurred.Date.or.Date.Range.Start", "Occurred.Date.Range.End")], to.datetime.sea)

to.datetime.sf <- function(x) {
    # SF timestamps look like "4/28/16 23:31", so an explicit format is needed.
    date.fmt <- "%m/%d/%y %H:%M"
    return(as.POSIXct(x, format=date.fmt))
}

sf[c("DateTime")] <- lapply(sf[c("DateTime")], to.datetime.sf)

I sort and tally the crime types in Seattle and San Francisco, then filter to the top five crimes so the time-trend graphs don't get cluttered.

sea.sorted <- as.data.frame(sort(table(sea$Offense.summ), decreasing = T))

sf.sorted <- as.data.frame(sort(table(sf$Category), decreasing = T))
sf.top5 <- sf %>%
  filter(Category %in% sf.sorted$Var1[1:5])

sf.summ.by.hour <- sf.top5 %>%
  group_by(hour(DateTime), Category) %>%
  summarise(cnt = n())

sea.top5 <- sea %>%
  filter(Offense.summ %in% sea.sorted$Var1[1:5])

sea.summ.by.hour <- sea.top5 %>%
  group_by(hour(Date.Reported), Offense.summ) %>%
  summarise(cnt = n())

Next, I tidy the column names and add city labels to make the data easier to combine and plot.

colnames(sea.summ.by.hour)[1] = "Hour"
colnames(sf.summ.by.hour)[1] = "Hour"
colnames(sea.summ.by.hour)[2] = "Crime"
colnames(sf.summ.by.hour)[2] = "Crime"
sf.summ.by.hour$city = "SF"
sea.summ.by.hour$city = "SEA"

sf.summ.by.hour <- as.data.frame(sf.summ.by.hour)
sea.summ.by.hour <- as.data.frame(sea.summ.by.hour)
lapply(sf.summ.by.hour, class)
## $Hour
## [1] "integer"
## 
## $Crime
## [1] "factor"
## 
## $cnt
## [1] "integer"
## 
## $city
## [1] "character"
lapply(sea.summ.by.hour, class)
## $Hour
## [1] "integer"
## 
## $Crime
## [1] "character"
## 
## $cnt
## [1] "integer"
## 
## $city
## [1] "character"
# Crime is a factor in the SF summary but character in the Seattle one (see
# the class checks above); convert it so both frames have a consistent type.
sf.summ.by.hour$Crime <- as.character(sf.summ.by.hour$Crime)
summ.by.hour <- rbind(sf.summ.by.hour, sea.summ.by.hour)
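As an aside, the renaming, labelling, and type conversion above can be folded into the summarise pipeline itself; here is an equivalent sketch for the SF data (the Seattle version is analogous):

# Name the grouping columns inline, label the city, and drop the grouping,
# producing the same data frame as the step-by-step version above.
sf.summ.by.hour <- sf.top5 %>%
  group_by(Hour = hour(DateTime), Crime = as.character(Category)) %>%
  summarise(cnt = n()) %>%
  mutate(city = "SF") %>%
  as.data.frame()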

Seattle and San Francisco’s crime by hour.

The plots below show how crime varies by hour of day in the two cities. Theft is the most common crime in both. In Seattle it peaks around noon, while in San Francisco it peaks at 6 to 7 PM.

n <- 10
col_vector <- unname(distinctColorPalette(n))

ggplot(summ.by.hour) +
  geom_line(data = filter(summ.by.hour, city == "SF"), aes(Hour, cnt, group = Crime, color = Crime)) +
  scale_x_continuous(breaks = seq(0,23,1), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0,3500,500)) +
  geom_line(data = filter(summ.by.hour, city == "SEA"), aes(Hour, cnt, group = Crime, color = Crime)) +
  scale_color_manual(values=col_vector) +
  facet_wrap(~city, scales = "free_x") + theme_bw()

sf.summ.by.mnth <- sf.top5 %>%
  group_by(month(DateTime), Category) %>%
  summarise(cnt = n())

sea.summ.by.mnth <- sea.top5 %>%
  group_by(month(Date.Reported), Offense.summ) %>%
  summarise(cnt = n())

colnames(sea.summ.by.mnth)[1] = "Month"
colnames(sf.summ.by.mnth)[1] = "Month"
colnames(sea.summ.by.mnth)[2] = "Crime"
colnames(sf.summ.by.mnth)[2] = "Crime"
sf.summ.by.mnth$city = "SF"
sea.summ.by.mnth$city = "SEA"

sf.summ.by.mnth <- as.data.frame(sf.summ.by.mnth)
sea.summ.by.mnth <- as.data.frame(sea.summ.by.mnth)

# As with the hourly data, make Crime character in both frames before binding.
sf.summ.by.mnth$Crime <- as.character(sf.summ.by.mnth$Crime)

summ.by.mnth <- rbind(sf.summ.by.mnth, sea.summ.by.mnth)

Seattle and San Francisco’s crime by month.

Continuing with the time trends, crime in Seattle follows a strong monthly pattern, with relatively little crime from January to August and a build-up from September through December. San Francisco, on the other hand, shows no noticeable monthly trend.

ggplot(summ.by.mnth) +
  geom_line(data = filter(summ.by.mnth, city == "SF"), aes(Month, cnt, group = Crime, color = Crime)) +
  scale_x_continuous(breaks = seq(0,12,1), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0,6000,500)) +
  geom_line(data = filter(summ.by.mnth, city == "SEA"), aes(Month, cnt, group = Crime, color = Crime)) +
  scale_color_manual(values=col_vector) +
  facet_wrap(~city, scales = "free_x") + theme_bw()

Geospatial Analysis

For the geospatial analysis, I use neighborhood shapefiles kindly provided by Zillow. I load the shapefiles into a spatial dataframe and use leaflet to examine crime occurrences by neighborhood.

wa.map <- readOGR(dsn = "/Users/gauravgada/GitHub/viz-des/01-report/WA", layer = "ZillowNeighborhoods-WA")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/gauravgada/GitHub/viz-des/01-report/WA", layer: "ZillowNeighborhoods-WA"
## with 586 features
## It has 5 fields
sea.map <- subset(wa.map, City == "Seattle")

sea.coords <- data.frame(sea$Longitude, sea$Latitude)
colnames(sea.coords)[1] = "Longitude"
colnames(sea.coords)[2] = "Latitude"
coordinates(sea.coords) <- ~ Longitude + Latitude
proj4string(sea.coords) <- proj4string(sea.map)

sea$neighborhood <- over(sea.coords, sea.map)$Name
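One check worth doing here: over() returns NA for points that do not fall inside any neighborhood polygon, and those rows silently drop out of the per-neighborhood counts below.

# Cases whose coordinates fall outside every Seattle neighborhood polygon.
sum(is.na(sea$neighborhood))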

ca.map <- readOGR(dsn = "/Users/gauravgada/GitHub/viz-des/01-report/CA", layer = "ZillowNeighborhoods-CA")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/gauravgada/GitHub/viz-des/01-report/CA", layer: "ZillowNeighborhoods-CA"
## with 2051 features
## It has 5 fields
sf.map <- subset(ca.map, City == "San Francisco")

sf.coords <- data.frame(sf$X, sf$Y)
colnames(sf.coords)[1] = "Longitude"
colnames(sf.coords)[2] = "Latitude"
coordinates(sf.coords) <- ~ Longitude + Latitude
proj4string(sf.coords) <- proj4string(sf.map)

sf$neighborhood <- over(sf.coords, sf.map)$Name

# Count crimes per Seattle neighborhood and attach the counts to the map.
nb.hist <- as.data.frame(sort(table(sea$neighborhood)))

nb.hist <- nb.hist %>%
  filter(Freq > 0)

sea.map$inst[match(nb.hist$Var1, sea.map$Name)] = nb.hist$Freq

# Same again for San Francisco.
nb.hist <- as.data.frame(sort(table(sf$neighborhood)))

nb.hist <- nb.hist %>%
  filter(Freq > 0)

sf.map$inst[match(nb.hist$Var1, sf.map$Name)] = nb.hist$Freq
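A related caveat: neighborhoods with zero recorded crimes never appear in nb.hist, so their inst value stays NA and colorBin() renders them in its NA colour on the maps below.

# Neighborhoods left without a crime count in each map.
sum(is.na(sea.map$inst))
sum(is.na(sf.map$inst))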

Crime in Seattle by neighborhood

Crime is not evenly distributed across the city. Downtown was the crime capital among Seattle neighborhoods, closely followed by Capitol Hill, with the University District third. It would have been interesting to use neighborhood population data to compare per capita crime rates across neighborhoods, but I could not find that data.
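To put numbers behind this ranking, a quick look at the largest counts (recomputing for Seattle, since nb.hist now holds the San Francisco counts):

# sort(table(...)) is ascending, so the busiest neighborhoods are at the tail.
tail(as.data.frame(sort(table(sea$neighborhood))), 3)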

pal <- colorBin("YlOrRd", domain = sea.map$inst, bins = seq(0,9000,1000))

labels <- sprintf(
  "<strong>%s</strong><br/>%g cases",
  sea.map$Name, sea.map$inst
) %>% lapply(htmltools::HTML)

leaflet(sea.map, width = "100%") %>%
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
  addPolygons(stroke = FALSE,
              fillOpacity = 0.5,
              label = labels,
              fillColor = ~pal(inst),
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.7,
                bringToFront = TRUE))

Crime in San Francisco by neighborhood

Crime in San Francisco is similarly concentrated in a few pockets. The South of Market neighborhood sees the highest number of crimes, followed by the Mission and Downtown.

pal <- colorBin("YlOrRd", domain = sf.map$inst, bins = seq(0,18000,2000))

labels <- sprintf(
  "<strong>%s</strong><br/>%g cases",
  sf.map$Name, sf.map$inst
) %>% lapply(htmltools::HTML)

leaflet(sf.map, width = "100%") %>%
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
  addPolygons(stroke = FALSE,
              fillOpacity = 0.5,
              label = labels,
              fillColor = ~pal(inst),
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.7,
                bringToFront = TRUE))

Thanks for following along with my analysis! I hope you enjoyed it.