Background

I analyze 2016 crime data from the Seattle and San Francisco open data portals.

First, I load the libraries required for the analysis. In the steps that follow, I clean the data, create some features, and then bring in additional geospatial data to guide the analysis.

All the code and data for this analysis can be viewed on my GitHub. Feel free to clone the repository and run your own analysis.

# Install only the packages that are not already installed. See:
# http://stackoverflow.com/a/4090208/1494702
req.pkg <- c("tidyverse", "ggmap", "lubridate", "randomcoloR", "leaflet", "rgdal", "sp")
install.diff <- req.pkg[!(req.pkg %in% installed.packages()[,"Package"])]
if(length(install.diff)) install.packages(install.diff)

require(tidyverse)
require(ggmap)
require(lubridate)
require(randomcoloR)
require(leaflet)
require(rgdal)
require(sp)

Next, I load the data files and preview a few rows:

file.name.sea <- "/Users/gauravgada/GitHub/viz-des/01-report/sea_2016.csv"
sea <- read.csv(file.name.sea)

file.name.sf <- "/Users/gauravgada/GitHub/viz-des/01-report/sf_2016.csv"
sf <- read.csv(file.name.sf)

knitr::kable(head(sea), format = "markdown")
(Table: first six rows of the Seattle data, with columns X, RMS.CDW.ID, General.Offense.Number, Offense.Code, Offense.Code.Extension, Offense.Type, Summary.Offense.Code, Summarized.Offense.Description, Date.Reported, Occurred.Date.or.Date.Range.Start, Occurred.Date.Range.End, Hundred.Block.Location, District.Sector, Zone.Beat, Census.Tract.2000, Longitude, Latitude, Location, Month, and Year.)
knitr::kable(head(sf), format = "markdown")
(Table: first six rows of the San Francisco data, with columns X.1, IncidntNum, Category, Descript, DayOfWeek, DateTime, PdDistrict, Resolution, Address, X, Y, Location, and PdId.)

Research questions

The questions I ask of the data are:

  1. How does the crime in Seattle and San Francisco vary over the course of the year?
  2. Which times of day are safer, and which are more dangerous?
  3. How are crimes geospatially distributed? Do they happen all over the city or are they concentrated in pockets?

I start by cleaning the data.

  1. I convert all the time columns into datetimes that R recognizes, which helps with the rest of the analysis.
  2. I generate some features by splitting columns on hyphens; e.g., in the Seattle dataset, I split the “Offense.Type” column on the hyphen.
sea <- sea %>%
  separate(Offense.Type, into = c("Offense.summ", "Offense.det1", "Offense.det2", "Offense.det3"), sep = "-", remove = F)
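One thing worth knowing about this separate() call: many Offense.Type values have fewer than four hyphen-separated pieces (“THEFT OF SERVICES” has none), so separate() warns and pads the missing columns with NA. If the warnings bother you, fill = "right" makes the padding explicit; an equivalent sketch:

# fill = "right" pads short splits with NA on the right instead of warning.
sea <- sea %>%
  separate(Offense.Type,
           into = c("Offense.summ", "Offense.det1", "Offense.det2", "Offense.det3"),
           sep = "-", remove = F, fill = "right")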

# http://stackoverflow.com/questions/29111017/convert-string-to-datetime-r
# http://stackoverflow.com/a/18503822/1494702
# http://strftime.org/

to.datetime.sea <- function(x) {
    # Seattle timestamps are already "YYYY-MM-DD HH:MM:SS", the default
    # format that as.POSIXct() parses without an explicit format string.
    return(as.POSIXct(x))
}

sea[c("Date.Reported", "Occurred.Date.or.Date.Range.Start", "Occurred.Date.Range.End")] <- lapply(sea[c("Date.Reported", "Occurred.Date.or.Date.Range.Start", "Occurred.Date.Range.End")], to.datetime.sea)

to.datetime.sf <- function(x) {
    # SF timestamps look like "4/28/16 23:31", so an explicit format is needed.
    date.fmt <- "%m/%d/%y %H:%M"
    return(as.POSIXct(x, format=date.fmt))
}

sf[c("DateTime")] <- lapply(sf[c("DateTime")], to.datetime.sf)

I sort and tally the crime types in Seattle and San Francisco, then filter to the top five crimes so the time-trend graphs don't get cluttered.

sea.sorted <- as.data.frame(sort(table(sea$Offense.summ), decreasing = T))

sf.sorted <- as.data.frame(sort(table(sf$Category), decreasing = T))
sf.top5 <- sf %>%
  filter(Category %in% sf.sorted$Var1[1:5])

sf.summ.by.hour <- sf.top5 %>%
  group_by(hour(DateTime), Category) %>%
  summarise(cnt = n())

sea.top5 <- sea %>%
  filter(Offense.summ %in% sea.sorted$Var1[1:5])

sea.summ.by.hour <- sea.top5 %>%
  group_by(hour(Date.Reported), Offense.summ) %>%
  summarise(cnt = n())

Next, I tidy the column names and add city labels to make the data easier to combine and plot.

colnames(sea.summ.by.hour)[1] = "Hour"
colnames(sf.summ.by.hour)[1] = "Hour"
colnames(sea.summ.by.hour)[2] = "Crime"
colnames(sf.summ.by.hour)[2] = "Crime"
sf.summ.by.hour$city = "SF"
sea.summ.by.hour$city = "SEA"

sf.summ.by.hour <- as.data.frame(sf.summ.by.hour)
sea.summ.by.hour <- as.data.frame(sea.summ.by.hour)
lapply(sf.summ.by.hour, class)
## $Hour
## [1] "integer"
## 
## $Crime
## [1] "factor"
## 
## $cnt
## [1] "integer"
## 
## $city
## [1] "character"
lapply(sea.summ.by.hour, class)
## $Hour
## [1] "integer"
## 
## $Crime
## [1] "character"
## 
## $cnt
## [1] "integer"
## 
## $city
## [1] "character"
# Crime is a factor in the SF summary but character in the Seattle one (see
# the class checks above); convert it so both frames have a consistent type.
sf.summ.by.hour$Crime <- as.character(sf.summ.by.hour$Crime)
summ.by.hour <- rbind(sf.summ.by.hour, sea.summ.by.hour)
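As an aside, the renaming, labelling, and type conversion above can be folded into the summarise pipeline itself; here is an equivalent sketch for the SF data (the Seattle version is analogous):

# Name the grouping columns inline, label the city, and drop the grouping,
# producing the same data frame as the step-by-step version above.
sf.summ.by.hour <- sf.top5 %>%
  group_by(Hour = hour(DateTime), Crime = as.character(Category)) %>%
  summarise(cnt = n()) %>%
  mutate(city = "SF") %>%
  as.data.frame()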

Seattle and San Francisco’s crime by hour.

The plots below show how crime varies by hour of day in the two cities. Theft is the most common crime in both. In Seattle it peaks around noon, while in San Francisco it peaks at 6 to 7 PM.

n <- 10
col_vector <- unname(distinctColorPalette(n))

ggplot(summ.by.hour) +
  geom_line(data = filter(summ.by.hour, city == "SF"), aes(Hour, cnt, group = Crime, color = Crime)) +
  scale_x_continuous(breaks = seq(0,23,1), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0,3500,500)) +
  geom_line(data = filter(summ.by.hour, city == "SEA"), aes(Hour, cnt, group = Crime, color = Crime)) +
  scale_color_manual(values=col_vector) +
  facet_wrap(~city, scales = "free_x") + theme_bw()

sf.summ.by.mnth <- sf.top5 %>%
  group_by(month(DateTime), Category) %>%
  summarise(cnt = n())

sea.summ.by.mnth <- sea.top5 %>%
  group_by(month(Date.Reported), Offense.summ) %>%
  summarise(cnt = n())

colnames(sea.summ.by.mnth)[1] = "Month"
colnames(sf.summ.by.mnth)[1] = "Month"
colnames(sea.summ.by.mnth)[2] = "Crime"
colnames(sf.summ.by.mnth)[2] = "Crime"
sf.summ.by.mnth$city = "SF"
sea.summ.by.mnth$city = "SEA"

sf.summ.by.mnth <- as.data.frame(sf.summ.by.mnth)
sea.summ.by.mnth <- as.data.frame(sea.summ.by.mnth)

# As with the hourly data, make Crime character in both frames before binding.
sf.summ.by.mnth$Crime <- as.character(sf.summ.by.mnth$Crime)

summ.by.mnth <- rbind(sf.summ.by.mnth, sea.summ.by.mnth)

Seattle and San Francisco’s crime by month.

Continuing with the time trends, crime in Seattle follows a strong monthly pattern, with relatively little crime from January to August and a build-up from September through December. San Francisco, on the other hand, shows no noticeable monthly trend.

ggplot(summ.by.mnth) +
  geom_line(data = filter(summ.by.mnth, city == "SF"), aes(Month, cnt, group = Crime, color = Crime)) +
  scale_x_continuous(breaks = seq(0,12,1), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0,6000,500)) +
  geom_line(data = filter(summ.by.mnth, city == "SEA"), aes(Month, cnt, group = Crime, color = Crime)) +
  scale_color_manual(values=col_vector) +
  facet_wrap(~city, scales = "free_x") + theme_bw()

Geospatial Analysis

For the geospatial analysis, I use neighborhood shapefiles kindly provided by Zillow. I load the shapefiles into a spatial dataframe and use leaflet to examine crime occurrences by neighborhood.

wa.map <- readOGR(dsn = "/Users/gauravgada/GitHub/viz-des/01-report/WA", layer = "ZillowNeighborhoods-WA")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/gauravgada/GitHub/viz-des/01-report/WA", layer: "ZillowNeighborhoods-WA"
## with 586 features
## It has 5 fields
sea.map <- subset(wa.map, City == "Seattle")

sea.coords <- data.frame(sea$Longitude, sea$Latitude)
colnames(sea.coords)[1] = "Longitude"
colnames(sea.coords)[2] = "Latitude"
coordinates(sea.coords) <- ~ Longitude + Latitude
proj4string(sea.coords) <- proj4string(sea.map)

sea$neighborhood <- over(sea.coords, sea.map)$Name
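One check worth doing here: over() returns NA for points that do not fall inside any neighborhood polygon, and those rows silently drop out of the per-neighborhood counts below.

# Cases whose coordinates fall outside every Seattle neighborhood polygon.
sum(is.na(sea$neighborhood))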

ca.map <- readOGR(dsn = "/Users/gauravgada/GitHub/viz-des/01-report/CA", layer = "ZillowNeighborhoods-CA")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/gauravgada/GitHub/viz-des/01-report/CA", layer: "ZillowNeighborhoods-CA"
## with 2051 features
## It has 5 fields
sf.map <- subset(ca.map, City == "San Francisco")

sf.coords <- data.frame(sf$X, sf$Y)
colnames(sf.coords)[1] = "Longitude"
colnames(sf.coords)[2] = "Latitude"
coordinates(sf.coords) <- ~ Longitude + Latitude
proj4string(sf.coords) <- proj4string(sf.map)

sf$neighborhood <- over(sf.coords, sf.map)$Name

# Count crimes per Seattle neighborhood and attach the counts to the map.
nb.hist <- as.data.frame(sort(table(sea$neighborhood)))

nb.hist <- nb.hist %>%
  filter(Freq > 0)

sea.map$inst[match(nb.hist$Var1, sea.map$Name)] = nb.hist$Freq

# Same again for San Francisco.
nb.hist <- as.data.frame(sort(table(sf$neighborhood)))

nb.hist <- nb.hist %>%
  filter(Freq > 0)

sf.map$inst[match(nb.hist$Var1, sf.map$Name)] = nb.hist$Freq
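A related caveat: neighborhoods with zero recorded crimes never appear in nb.hist, so their inst value stays NA and colorBin() renders them in its NA colour on the maps below.

# Neighborhoods left without a crime count in each map.
sum(is.na(sea.map$inst))
sum(is.na(sf.map$inst))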

Crime in Seattle by neighborhood

Crime is not evenly distributed across the city. Downtown was the crime capital among Seattle neighborhoods, closely followed by Capitol Hill, with the University District third. It would have been interesting to use neighborhood population data to compare per capita crime rates across neighborhoods, but I could not find that data.
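To put numbers behind this ranking, a quick look at the largest counts (recomputing for Seattle, since nb.hist now holds the San Francisco counts):

# sort(table(...)) is ascending, so the busiest neighborhoods are at the tail.
tail(as.data.frame(sort(table(sea$neighborhood))), 3)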

pal <- colorBin("YlOrRd", domain = sea.map$inst, bins = seq(0,9000,1000))

labels <- sprintf(
  "<strong>%s</strong><br/>%g cases",
  sea.map$Name, sea.map$inst
) %>% lapply(htmltools::HTML)

leaflet(sea.map, width = "100%") %>%
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
  addPolygons(stroke = FALSE,
              fillOpacity = 0.5,
              label = labels,
              fillColor = ~pal(inst),
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.7,
                bringToFront = TRUE))

Crime in San Francisco by neighborhood

Crime in San Francisco is similarly concentrated in a few pockets. The South of Market neighborhood sees the highest number of crimes, followed by the Mission and Downtown.

pal <- colorBin("YlOrRd", domain = sf.map$inst, bins = seq(0,18000,2000))

labels <- sprintf(
  "<strong>%s</strong><br/>%g cases",
  sf.map$Name, sf.map$inst
) %>% lapply(htmltools::HTML)

leaflet(sf.map, width = "100%") %>%
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
  addPolygons(stroke = FALSE,
              fillOpacity = 0.5,
              label = labels,
              fillColor = ~pal(inst),
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.7,
                bringToFront = TRUE))

Thanks for following along with my analysis! I hope you enjoyed it.