Background

I analyze the crime data for 2016 from the Seattle and San Francisco open data outlets.

First, I load the libraries required for our analysis. In the next steps, I clean the data, create some features and then use additional geospatial data to guide my analysis.

All the code and data for this analysis can be viewed on my GitHub. Feel free to clone this and run your own analysis.

# Load only things that do not exists. See below:
# http://stackoverflow.com/a/4090208/1494702
req.pkg <- c("tidyverse", "ggmap", "lubridate", "randomcoloR", "leaflet", "rgdal", "sp")
install.diff <- req.pkg[!(req.pkg %in% installed.packages()[,"Package"])]
if(length(install.diff)) install.packages(install.diff)

require(tidyverse)
require(ggmap)
require(lubridate)
require(randomcoloR)
require(leaflet)
require(rgdal)
require(sp)

Now we load the data files and preview some of the data:

file.name.sea <- "/Users/gauravgada/GitHub/viz-des/01-report/sea_2016.csv"
sea <- read.csv(file.name.sea)

file.name.sf <- "/Users/gauravgada/GitHub/viz-des/01-report/sf_2016.csv"
sf <- read.csv(file.name.sf)

knitr::kable(head(sea), format = "markdown")
X RMS.CDW.ID General.Offense.Number Offense.Code Offense.Code.Extension Offense.Type Summary.Offense.Code Summarized.Offense.Description Date.Reported Occurred.Date.or.Date.Range.Start Occurred.Date.Range.End Hundred.Block.Location District.Sector Zone.Beat Census.Tract.2000 Longitude Latitude Location Month Year
1 1032273 2016104093 X 47 PROPERTY FOUND X STOLEN PROPERTY 2016-03-25 23:17:00 2016-03-25 22:00:00 2016-03-25 22:30:00 12XX BLOCK OF S KING ST G G3 9000.2017 -122.3153 47.59834 (47.598335266, -122.315315247) 3 2016
2 1032274 201690505 2699 4 THEFT OF SERVICES 2699 THEFT OF SERVICES 2016-03-14 23:40:00 2016-03-14 23:40:00 NA 84XX BLOCK OF 54 AV S S S2 11800.1007 -122.2649 47.52818 (47.528175354, -122.264923096) 3 2016
3 1032275 2016108950 911 0 HOMICIDE-PREMEDITATED-GUN 900 HOMICIDE 2016-03-30 05:27:00 2016-03-30 05:27:00 NA 4XX BLOCK OF 2 AV S K K3 9200.2004 -122.3314 47.60107 (47.601074219, -122.331390381) 3 2016
4 1032276 2016105920 X 48 PROPERTY LOST X LOST PROPERTY 2016-03-27 15:13:00 2016-03-22 00:00:00 2016-03-26 00:30:00 12XX BLOCK OF S KING ST G G3 9000.2017 -122.3153 47.59834 (47.598335266, -122.315315247) 3 2016
5 1032277 201694739 2303 0 THEFT-SHOPLIFT 2300 SHOPLIFTING 2016-03-18 07:48:00 2016-03-18 07:48:00 NA 14XX BLOCK OF 2 AV M M1 8100.3003 -122.3384 47.60879 (47.608787537, -122.338447571) 3 2016
6 1032278 201622976 2605 0 FRAUD-CREDIT CARD 2600 FRAUD 2016-01-20 11:08:00 2016-01-20 07:00:00 NA 45XX BLOCK OF 8 AV NE U U2 5200.5006 -122.3196 47.66222 (47.662223816, -122.31955719) 1 2016
knitr::kable(head(sf), format = "markdown")
X.1 IncidntNum Category Descript DayOfWeek DateTime PdDistrict Resolution Address X Y Location PdId
1 140464217 WARRANTS WARRANT ARREST Thursday 4/28/16 23:31 PARK ARREST, BOOKED 7TH AV / HUGO ST -122.4642 37.76507 (37.7650658096524, -122.464236532783) 1.40464e+13
2 141059263 WARRANTS WARRANT ARREST Monday 4/25/16 14:59 BAYVIEW ARREST, BOOKED KEITH ST / SHAFTER AV -122.3889 37.72998 (37.7299809672996, -122.388856204292) 1.41059e+13
3 160000318 ROBBERY ATTEMPTED ROBBERY WITH BODILY FORCE Friday 1/1/16 1:10 TENDERLOIN NONE 300 Block of TAYLOR ST -122.4113 37.78554 (37.7855355791102, -122.411302813025) 1.60000e+13
4 160018048 VEHICLE THEFT STOLEN TRUCK Thursday 1/7/16 15:45 CENTRAL NONE 400 Block of POST ST -122.4093 37.78824 (37.788239899778, -122.409256096494) 1.60018e+13
5 160002988 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Friday 1/1/16 18:30 SOUTHERN NONE HOWARD ST / 5TH ST -122.4049 37.78150 (37.7814987213218, -122.404934413339) 1.60003e+13
6 140586736 WEAPON LAWS SWITCHBLADE KNIFE, POSSESSION Friday 5/6/16 9:17 INGLESIDE ARREST, BOOKED 900 Block of ELLSWORTH ST -122.4165 37.73301 (37.7330103867416, -122.416517427405) 1.40587e+13

Research questions

The questions that I ask of our data are -

  1. How does the crime in Seattle and San Francisco vary over the course of the year?
  2. What times of the day are more safe versus what times are more dangerous?
  3. How are crimes geospatially distributed? Do they happen all over the city or are they concentrated in pockets?

First I start by cleaning the data.

  1. I convert all the time data into datetime that is recognizable by R. This will help me with further analysis.
  2. I generate some features by splitting the data on hyphens. ex., in the Seattle dataset, I split the column “Offense.Type” on the hyphen.
sea <- sea %>%
  separate(Offense.Type, into = c("Offense.summ", "Offense.det1", "Offense.det2", "Offense.det3"), sep = "-", remove = F)

# http://stackoverflow.com/questions/29111017/convert-string-to-datetime-r
# http://stackoverflow.com/a/18503822/1494702
# http://strftime.org/

to.datetime.sea <- function(x) {
    return(as.POSIXct(x))
}

sea[c("Date.Reported", "Occurred.Date.or.Date.Range.Start", "Occurred.Date.Range.End")] <- lapply(sea[c("Date.Reported", "Occurred.Date.or.Date.Range.Start", "Occurred.Date.Range.End")], to.datetime.sea)

to.datetime.sf <- function(x) {
    date.fmt <- "%m/%d/%y %H:%M"
    return(as.POSIXct(x, format=date.fmt))
}

sf[c("DateTime")] <- lapply(sf[c("DateTime")], to.datetime.sf)

I sort and gather the top crime types in Seattle and San Francisco. Then I filter on the top-5 crimes to avoid making my time trend graph less cluttered.

sea.sorted <- as.data.frame(sort(table(sea$Offense.summ), decreasing = T))

sf.sorted <- as.data.frame(sort(table(sf$Category), decreasing = T))
sf.top5 <- sf %>%
  filter(Category %in% sf.sorted$Var1[0:5])

sf.summ.by.hour <- sf.top5 %>%
group_by(hour(DateTime), Category) %>%
summarise(cnt = n())

sea.top5 <- sea %>%
  filter(Offense.summ %in% sea.sorted$Var1[0:5])

sea.summ.by.hour <- sea.top5 %>%
group_by(hour(Date.Reported), Offense.summ) %>%
summarise(cnt = n())

We further manipulate the data to make it easier to plot.

colnames(sea.summ.by.hour)[1] = "Hour"
colnames(sf.summ.by.hour)[1] = "Hour"
colnames(sea.summ.by.hour)[2] = "Crime"
colnames(sf.summ.by.hour)[2] = "Crime"
sf.summ.by.hour$city = "SF"
sea.summ.by.hour$city = "SEA"

sf.summ.by.hour <- as.data.frame(sf.summ.by.hour)
sea.summ.by.hour <- as.data.frame(sea.summ.by.hour)
lapply(sf.summ.by.hour, class)
## $Hour
## [1] "integer"
## 
## $Crime
## [1] "factor"
## 
## $cnt
## [1] "integer"
## 
## $city
## [1] "character"
lapply(sea.summ.by.hour, class)
## $Hour
## [1] "integer"
## 
## $Crime
## [1] "character"
## 
## $cnt
## [1] "integer"
## 
## $city
## [1] "character"
# sf.summ.by.hour$summ <- as.character(sf.summ.by.hour$summ)
summ.by.hour <- rbind(sf.summ.by.hour, sea.summ.by.hour)

Seattle and San Francisco’s crime by hour.

The below plots show the crime variation by time in the two cities. We find that theft is the most popular crime in both cities. Theft in Seattle peaks during the noon, while it peaks at 6 to 7 PM in San Francisco.

n <- 10
col_vector <- unname(distinctColorPalette(n))

ggplot(summ.by.hour) +
  geom_line(data = filter(summ.by.hour, city == "SF"), aes(Hour, cnt, group = Crime, color = Crime)) +
  scale_x_continuous(breaks = seq(0,23,1), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0,3500,500)) +
  geom_line(data = filter(summ.by.hour, city == "SEA"), aes(Hour, cnt, group = Crime, color = Crime)) +
  scale_color_manual(values=col_vector) +
  facet_wrap(~city, scales = "free_x") + theme_bw()

sf.summ.by.mnth <- sf.top5 %>%
group_by(month(DateTime), Category) %>%
summarise(cnt = n())

sea.summ.by.mnth <- sea.top5 %>%
group_by(month(Date.Reported), Offense.summ) %>%
summarise(cnt = n())

colnames(sea.summ.by.mnth)[1] = "Month"
colnames(sf.summ.by.mnth)[1] = "Month"
colnames(sea.summ.by.mnth)[2] = "Crime"
colnames(sf.summ.by.mnth)[2] = "Crime"
sf.summ.by.mnth$city = "SF"
sea.summ.by.mnth$city = "SEA"

sf.summ.by.mnth <- as.data.frame(sf.summ.by.mnth)
sea.summ.by.mnth <- as.data.frame(sea.summ.by.mnth)

summ.by.mnth <- rbind(sf.summ.by.mnth, sea.summ.by.mnth)
sf.summ.by.mnth <- sf.top5 %>%
group_by(round((as.numeric(DateTime) - as.numeric(utc))/(60*60*24*30)), Category) %>%
summarise(cnt = n())

sea.summ.by.mnth <- sea.top5 %>%
group_by(round(as.numeric((Date.Reported) - as.numeric(utc))/(60*60*24*30)), Offense.summ) %>%
summarise(cnt = n())

colnames(sea.summ.by.mnth)[1] = "Month"
colnames(sf.summ.by.mnth)[1] = "Month"
colnames(sea.summ.by.mnth)[2] = "Crime"
colnames(sf.summ.by.mnth)[2] = "Crime"
sf.summ.by.mnth$city = "SF"
sea.summ.by.mnth$city = "SEA"

sf.summ.by.mnth <- as.data.frame(sf.summ.by.mnth)
sea.summ.by.mnth <- as.data.frame(sea.summ.by.mnth)

summ.by.mnth <- rbind(sf.summ.by.mnth, sea.summ.by.mnth)

Seattle and San Francisco’s crime by month.

In our further analysis of time trends, we see that crime in Seattle follows strong monthly trends with very little crime happening from January to August. Crime starts peaking in Seattle from September through December. San Francisco on the other hand, does not have any noticeable monthly crime trends.

ggplot(summ.by.mnth) +
  geom_line(data = filter(summ.by.mnth, city == "SF"), aes(Month, cnt, group = Crime, color = Crime)) +
  scale_x_continuous(breaks = seq(0,12,1), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0,6000,500)) +
  geom_line(data = filter(summ.by.mnth, city == "SEA"), aes(Month, cnt, group = Crime, color = Crime)) +
  scale_color_manual(values=col_vector) +
  facet_wrap(~city, scales = "free_x") + theme_bw()

Geospatial Analysis

For geospatial analysis, we thank Zillow for providing the neighborhood shapefiles. We load the shapefiles into a geospatial dataframe and use leaflet to analyze the crime occurences by neighborhood.

wa.map <- readOGR(dsn = "/Users/gauravgada/GitHub/viz-des/01-report/WA", layer = "ZillowNeighborhoods-WA")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/gauravgada/GitHub/viz-des/01-report/WA", layer: "ZillowNeighborhoods-WA"
## with 586 features
## It has 5 fields
sea.map <- subset(wa.map, City == "Seattle")

sea.coords <- data.frame(sea$Longitude, sea$Latitude)
colnames(sea.coords)[1] = "Longitude"
colnames(sea.coords)[2] = "Latitude"
coordinates(sea.coords) <- ~ Longitude + Latitude
proj4string(sea.coords) <- proj4string(sea.map)

sea$neighborhood <- over(sea.coords, sea.map)$Name

ca.map <- readOGR(dsn = "/Users/gauravgada/GitHub/viz-des/01-report/CA", layer = "ZillowNeighborhoods-CA")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/gauravgada/GitHub/viz-des/01-report/CA", layer: "ZillowNeighborhoods-CA"
## with 2051 features
## It has 5 fields
sf.map <- subset(ca.map, City == "San Francisco")

sf.coords <- data.frame(sf$X, sf$Y)
colnames(sf.coords)[1] = "Longitude"
colnames(sf.coords)[2] = "Latitude"
coordinates(sf.coords) <- ~ Longitude + Latitude
proj4string(sf.coords) <- proj4string(sf.map)

sf$neighborhood <- over(sf.coords, sf.map)$Name
nb.hist <- as.data.frame(sort(table(sea$neighborhood)))

nb.hist <- nb.hist %>%
  filter(Freq > 0)

sea.map$inst[match(nb.hist$Var1, sea.map$Name)] = nb.hist$Freq

nb.hist <- as.data.frame(sort(table(sf$neighborhood)))

nb.hist <- nb.hist %>%
  filter(Freq > 0)

sf.map$inst[match(nb.hist$Var1, sf.map$Name)] = nb.hist$Freq

Crime in Seattle by neighborhood

It seems that crime is not evenly distributed in the city. Downtown Seattle was the crime capital among the Seattle neighborhoods closely followed by Capitol Hill. University District comes third. It would have been interesting to have neighborhood population data to analyze per capita crime rates across these neighborhoods, but I couldn’t find that data.

pal <- colorBin("YlOrRd", domain = sea.map$inst, bins = seq(0,9000,1000))

labels <- sprintf(
  "<strong>%s</strong><br/>%g cases",
  sea.map$Name, sea.map$inst
) %>% lapply(htmltools::HTML)

leaflet(sea.map, width = "100%") %>%
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
  addPolygons(stroke = FALSE,
              fillOpacity = 0.5,
              label = labels,
              fillColor = ~pal(inst),
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.7,
                bringToFront = TRUE))

Crime in San Francisco by neighborhood

Similarly, in San Francisco as well, the crime is concentrated in a few pockets. South of Market neighborhood sees the highest number of crimes followed by Mission and Downtown.

pal <- colorBin("YlOrRd", domain = sf.map$inst, bins = seq(0,18000,2000))

labels <- sprintf(
  "<strong>%s</strong><br/>%g cases",
  sf.map$Name, sf.map$inst
) %>% lapply(htmltools::HTML)

leaflet(sf.map, width = "100%") %>%
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
  addPolygons(stroke = FALSE,
              fillOpacity = 0.5,
              label = labels,
              fillColor = ~pal(inst),
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.7,
                bringToFront = TRUE))

Thanks for following along with my analysis! I hope you enjoyed it.