I analyze 2016 crime data from the Seattle and San Francisco open data portals.
First, I load the libraries required for the analysis. In the next steps, I clean the data, create some features, and then use additional geospatial data to guide the analysis.
All the code and data for this analysis can be viewed on my GitHub. Feel free to clone this and run your own analysis.
# Load only packages that are not already installed. See below:
# http://stackoverflow.com/a/4090208/1494702
req.pkg <- c("tidyverse", "ggmap", "lubridate", "randomcoloR", "leaflet", "rgdal", "sp")
install.diff <- req.pkg[!(req.pkg %in% installed.packages()[,"Package"])]
if(length(install.diff)) install.packages(install.diff)
library(tidyverse)
library(ggmap)
library(lubridate)
library(randomcoloR)
library(leaflet)
library(rgdal)
library(sp)
Now I load the two data files and preview the first few rows of each:
file.name.sea <- "/Users/gauravgada/GitHub/viz-des/01-report/sea_2016.csv"
sea <- read.csv(file.name.sea)
file.name.sf <- "/Users/gauravgada/GitHub/viz-des/01-report/sf_2016.csv"
sf <- read.csv(file.name.sf)
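Before previewing, a quick row count gives a sense of scale (a simple sanity check; counts omitted here):
# How many incidents does each city report for 2016?
nrow(sea)
nrow(sf)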
knitr::kable(head(sea), format = "markdown")
X | RMS.CDW.ID | General.Offense.Number | Offense.Code | Offense.Code.Extension | Offense.Type | Summary.Offense.Code | Summarized.Offense.Description | Date.Reported | Occurred.Date.or.Date.Range.Start | Occurred.Date.Range.End | Hundred.Block.Location | District.Sector | Zone.Beat | Census.Tract.2000 | Longitude | Latitude | Location | Month | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1032273 | 2016104093 | X | 47 | PROPERTY FOUND | X | STOLEN PROPERTY | 2016-03-25 23:17:00 | 2016-03-25 22:00:00 | 2016-03-25 22:30:00 | 12XX BLOCK OF S KING ST | G | G3 | 9000.2017 | -122.3153 | 47.59834 | (47.598335266, -122.315315247) | 3 | 2016 |
2 | 1032274 | 201690505 | 2699 | 4 | THEFT OF SERVICES | 2699 | THEFT OF SERVICES | 2016-03-14 23:40:00 | 2016-03-14 23:40:00 | NA | 84XX BLOCK OF 54 AV S | S | S2 | 11800.1007 | -122.2649 | 47.52818 | (47.528175354, -122.264923096) | 3 | 2016 |
3 | 1032275 | 2016108950 | 911 | 0 | HOMICIDE-PREMEDITATED-GUN | 900 | HOMICIDE | 2016-03-30 05:27:00 | 2016-03-30 05:27:00 | NA | 4XX BLOCK OF 2 AV S | K | K3 | 9200.2004 | -122.3314 | 47.60107 | (47.601074219, -122.331390381) | 3 | 2016 |
4 | 1032276 | 2016105920 | X | 48 | PROPERTY LOST | X | LOST PROPERTY | 2016-03-27 15:13:00 | 2016-03-22 00:00:00 | 2016-03-26 00:30:00 | 12XX BLOCK OF S KING ST | G | G3 | 9000.2017 | -122.3153 | 47.59834 | (47.598335266, -122.315315247) | 3 | 2016 |
5 | 1032277 | 201694739 | 2303 | 0 | THEFT-SHOPLIFT | 2300 | SHOPLIFTING | 2016-03-18 07:48:00 | 2016-03-18 07:48:00 | NA | 14XX BLOCK OF 2 AV | M | M1 | 8100.3003 | -122.3384 | 47.60879 | (47.608787537, -122.338447571) | 3 | 2016 |
6 | 1032278 | 201622976 | 2605 | 0 | FRAUD-CREDIT CARD | 2600 | FRAUD | 2016-01-20 11:08:00 | 2016-01-20 07:00:00 | NA | 45XX BLOCK OF 8 AV NE | U | U2 | 5200.5006 | -122.3196 | 47.66222 | (47.662223816, -122.31955719) | 1 | 2016 |
knitr::kable(head(sf), format = "markdown")
X.1 | IncidntNum | Category | Descript | DayOfWeek | DateTime | PdDistrict | Resolution | Address | X | Y | Location | PdId |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 140464217 | WARRANTS | WARRANT ARREST | Thursday | 4/28/16 23:31 | PARK | ARREST, BOOKED | 7TH AV / HUGO ST | -122.4642 | 37.76507 | (37.7650658096524, -122.464236532783) | 1.40464e+13 |
2 | 141059263 | WARRANTS | WARRANT ARREST | Monday | 4/25/16 14:59 | BAYVIEW | ARREST, BOOKED | KEITH ST / SHAFTER AV | -122.3889 | 37.72998 | (37.7299809672996, -122.388856204292) | 1.41059e+13 |
3 | 160000318 | ROBBERY | ATTEMPTED ROBBERY WITH BODILY FORCE | Friday | 1/1/16 1:10 | TENDERLOIN | NONE | 300 Block of TAYLOR ST | -122.4113 | 37.78554 | (37.7855355791102, -122.411302813025) | 1.60000e+13 |
4 | 160018048 | VEHICLE THEFT | STOLEN TRUCK | Thursday | 1/7/16 15:45 | CENTRAL | NONE | 400 Block of POST ST | -122.4093 | 37.78824 | (37.788239899778, -122.409256096494) | 1.60018e+13 |
5 | 160002988 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Friday | 1/1/16 18:30 | SOUTHERN | NONE | HOWARD ST / 5TH ST | -122.4049 | 37.78150 | (37.7814987213218, -122.404934413339) | 1.60003e+13 |
6 | 140586736 | WEAPON LAWS | SWITCHBLADE KNIFE, POSSESSION | Friday | 5/6/16 9:17 | INGLESIDE | ARREST, BOOKED | 900 Block of ELLSWORTH ST | -122.4165 | 37.73301 | (37.7330103867416, -122.416517427405) | 1.40587e+13 |
The questions I ask of the data are:

- How does crime in each city vary by hour of the day?
- How does crime vary by month of the year?
- Which neighborhoods see the most crime?

I start by cleaning the data.
sea <- sea %>%
  separate(Offense.Type, into = c("Offense.summ", "Offense.det1", "Offense.det2", "Offense.det3"), sep = "-", remove = FALSE)
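As an illustration (not part of the pipeline), here is what separate() produces for a single hypothetical offense string; types with fewer than four hyphen-separated parts get NA in the trailing columns, with a warning:
# Illustrative only: splitting one offense string into summary and detail parts
separate(data.frame(Offense.Type = "HOMICIDE-PREMEDITATED-GUN",
                    stringsAsFactors = FALSE),
         Offense.Type,
         into = c("Offense.summ", "Offense.det1", "Offense.det2", "Offense.det3"),
         sep = "-", remove = FALSE)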
# http://stackoverflow.com/questions/29111017/convert-string-to-datetime-r
# http://stackoverflow.com/a/18503822/1494702
# http://strftime.org/
# Seattle timestamps are in ISO-like "YYYY-MM-DD HH:MM:SS" form,
# which as.POSIXct parses with its default format
to.datetime.sea <- function(x) {
  return(as.POSIXct(x))
}
sea[c("Date.Reported", "Occurred.Date.or.Date.Range.Start", "Occurred.Date.Range.End")] <- lapply(sea[c("Date.Reported", "Occurred.Date.or.Date.Range.Start", "Occurred.Date.Range.End")], to.datetime.sea)
# San Francisco timestamps use "M/D/YY H:MM", so an explicit format is needed
to.datetime.sf <- function(x) {
  date.fmt <- "%m/%d/%y %H:%M"
  return(as.POSIXct(x, format = date.fmt))
}
sf[c("DateTime")] <- lapply(sf[c("DateTime")], to.datetime.sf)
I sort and tally the crime types in Seattle and San Francisco, then filter to the top five crimes in each city so that the time-trend graphs stay uncluttered.
sea.sorted <- as.data.frame(sort(table(sea$Offense.summ), decreasing = T))
sf.sorted <- as.data.frame(sort(table(sf$Category), decreasing = T))
sf.top5 <- sf %>%
  filter(Category %in% sf.sorted$Var1[1:5])
sf.summ.by.hour <- sf.top5 %>%
group_by(hour(DateTime), Category) %>%
summarise(cnt = n())
sea.top5 <- sea %>%
  filter(Offense.summ %in% sea.sorted$Var1[1:5])
sea.summ.by.hour <- sea.top5 %>%
group_by(hour(Date.Reported), Offense.summ) %>%
summarise(cnt = n())
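The five categories retained for each city can be inspected directly (output omitted here):
# The five most frequent crime categories per city
head(sf.sorted, 5)
head(sea.sorted, 5)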
I further reshape the data to make it easier to plot.
colnames(sea.summ.by.hour)[1] = "Hour"
colnames(sf.summ.by.hour)[1] = "Hour"
colnames(sea.summ.by.hour)[2] = "Crime"
colnames(sf.summ.by.hour)[2] = "Crime"
sf.summ.by.hour$city = "SF"
sea.summ.by.hour$city = "SEA"
sf.summ.by.hour <- as.data.frame(sf.summ.by.hour)
sea.summ.by.hour <- as.data.frame(sea.summ.by.hour)
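As an aside, the renaming above could also be written with dplyr::rename; a self-contained sketch for the SF table, shown for reference only:
# Alternative renaming with dplyr; backticks are needed because group_by()
# created a column literally named `hour(DateTime)`
sf.top5 %>%
  group_by(hour(DateTime), Category) %>%
  summarise(cnt = n()) %>%
  rename(Hour = `hour(DateTime)`, Crime = Category)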
lapply(sf.summ.by.hour, class)
## $Hour
## [1] "integer"
##
## $Crime
## [1] "factor"
##
## $cnt
## [1] "integer"
##
## $city
## [1] "character"
lapply(sea.summ.by.hour, class)
## $Hour
## [1] "integer"
##
## $Crime
## [1] "character"
##
## $cnt
## [1] "integer"
##
## $city
## [1] "character"
# Align column types before rbind: Crime is a factor in the SF table but a
# character vector in the Seattle table (see the class listings above)
sf.summ.by.hour$Crime <- as.character(sf.summ.by.hour$Crime)
summ.by.hour <- rbind(sf.summ.by.hour, sea.summ.by.hour)
The plot below shows how crime varies by hour of the day in the two cities. Theft is the most common crime in both; in Seattle it peaks around noon, while in San Francisco it peaks at 6 to 7 PM.
n <- 10
col_vector <- unname(distinctColorPalette(n))
# A single line layer suffices; facet_wrap() splits the two cities
ggplot(summ.by.hour) +
  geom_line(aes(Hour, cnt, group = Crime, color = Crime)) +
  scale_x_continuous(breaks = seq(0, 23, 1), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0, 3500, 500)) +
  scale_color_manual(values = col_vector) +
  facet_wrap(~city, scales = "free_x") +
  theme_bw()
# First pass: summarise the top-5 crimes by calendar month
sf.summ.by.mnth <- sf.top5 %>%
group_by(month(DateTime), Category) %>%
summarise(cnt = n())
sea.summ.by.mnth <- sea.top5 %>%
group_by(month(Date.Reported), Offense.summ) %>%
summarise(cnt = n())
colnames(sea.summ.by.mnth)[1] = "Month"
colnames(sf.summ.by.mnth)[1] = "Month"
colnames(sea.summ.by.mnth)[2] = "Crime"
colnames(sf.summ.by.mnth)[2] = "Crime"
sf.summ.by.mnth$city = "SF"
sea.summ.by.mnth$city = "SEA"
sf.summ.by.mnth <- as.data.frame(sf.summ.by.mnth)
sea.summ.by.mnth <- as.data.frame(sea.summ.by.mnth)
summ.by.mnth <- rbind(sf.summ.by.mnth, sea.summ.by.mnth)
# Re-bin by 30-day windows instead of calendar months; this overwrites the
# calendar-month summaries computed above. `utc` was not defined earlier, so
# it is assumed here to mark midnight on 2016-01-01 UTC (the bin origin).
utc <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")
sf.summ.by.mnth <- sf.top5 %>%
  group_by(round((as.numeric(DateTime) - as.numeric(utc)) / (60 * 60 * 24 * 30)), Category) %>%
  summarise(cnt = n())
sea.summ.by.mnth <- sea.top5 %>%
  group_by(round((as.numeric(Date.Reported) - as.numeric(utc)) / (60 * 60 * 24 * 30)), Offense.summ) %>%
  summarise(cnt = n())
colnames(sea.summ.by.mnth)[1] = "Month"
colnames(sf.summ.by.mnth)[1] = "Month"
colnames(sea.summ.by.mnth)[2] = "Crime"
colnames(sf.summ.by.mnth)[2] = "Crime"
sf.summ.by.mnth$city = "SF"
sea.summ.by.mnth$city = "SEA"
sf.summ.by.mnth <- as.data.frame(sf.summ.by.mnth)
sea.summ.by.mnth <- as.data.frame(sea.summ.by.mnth)
summ.by.mnth <- rbind(sf.summ.by.mnth, sea.summ.by.mnth)
Turning to monthly trends, crime in Seattle follows a strong seasonal pattern, with comparatively little crime from January through August and a steady rise from September through December. San Francisco, on the other hand, shows no noticeable monthly trend.
ggplot(summ.by.mnth) +
  geom_line(aes(Month, cnt, group = Crime, color = Crime)) +
  scale_x_continuous(breaks = seq(0, 12, 1), minor_breaks = NULL) +
  scale_y_continuous(breaks = seq(0, 6000, 500)) +
  scale_color_manual(values = col_vector) +
  facet_wrap(~city, scales = "free_x") +
  theme_bw()
For the geospatial analysis, thanks go to Zillow for providing the neighborhood shapefiles. I load the shapefiles into a spatial data frame with rgdal and use leaflet to analyze crime occurrences by neighborhood.
wa.map <- readOGR(dsn = "/Users/gauravgada/GitHub/viz-des/01-report/WA", layer = "ZillowNeighborhoods-WA")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/gauravgada/GitHub/viz-des/01-report/WA", layer: "ZillowNeighborhoods-WA"
## with 586 features
## It has 5 fields
sea.map <- subset(wa.map, City == "Seattle")
sea.coords <- data.frame(sea$Longitude, sea$Latitude)
colnames(sea.coords)[1] = "Longitude"
colnames(sea.coords)[2] = "Latitude"
# Promote the coordinates to SpatialPoints, adopt the map's CRS, and tag each
# incident with the neighborhood polygon it falls inside
coordinates(sea.coords) <- ~ Longitude + Latitude
proj4string(sea.coords) <- proj4string(sea.map)
sea$neighborhood <- over(sea.coords, sea.map)$Name
ca.map <- readOGR(dsn = "/Users/gauravgada/GitHub/viz-des/01-report/CA", layer = "ZillowNeighborhoods-CA")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/gauravgada/GitHub/viz-des/01-report/CA", layer: "ZillowNeighborhoods-CA"
## with 2051 features
## It has 5 fields
sf.map <- subset(ca.map, City == "San Francisco")
sf.coords <- data.frame(sf$X, sf$Y)
colnames(sf.coords)[1] = "Longitude"
colnames(sf.coords)[2] = "Latitude"
# Same point-in-polygon join for San Francisco
coordinates(sf.coords) <- ~ Longitude + Latitude
proj4string(sf.coords) <- proj4string(sf.map)
sf$neighborhood <- over(sf.coords, sf.map)$Name
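A useful sanity check (illustrative; output omitted) is how many incidents failed to fall inside any neighborhood polygon, since over() returns NA for those:
# Incidents outside every neighborhood polygon end up with an NA neighborhood
sum(is.na(sea$neighborhood))
sum(is.na(sf$neighborhood))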
# Count incidents per Seattle neighborhood and attach the counts to the map
nb.hist <- as.data.frame(sort(table(sea$neighborhood)))
nb.hist <- nb.hist %>%
  filter(Freq > 0)
sea.map$inst[match(nb.hist$Var1, sea.map$Name)] = nb.hist$Freq
# Repeat for San Francisco
nb.hist <- as.data.frame(sort(table(sf$neighborhood)))
nb.hist <- nb.hist %>%
  filter(Freq > 0)
sf.map$inst[match(nb.hist$Var1, sf.map$Name)] = nb.hist$Freq
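The counts behind the maps can also be read off directly; since sort() on a table is ascending, tail() gives the busiest neighborhoods (illustrative; output omitted):
# The three highest-crime neighborhoods in each city
tail(as.data.frame(sort(table(sea$neighborhood))), 3)
tail(as.data.frame(sort(table(sf$neighborhood))), 3)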
It seems that crime is not evenly distributed across the city. Downtown was the crime capital among Seattle neighborhoods, closely followed by Capitol Hill, with the University District third. It would have been interesting to bring in neighborhood population data and compare per capita crime rates, but I couldn't find that data.
pal <- colorBin("YlOrRd", domain = sea.map$inst, bins = seq(0,9000,1000))
labels <- sprintf(
"<strong>%s</strong><br/>%g cases",
sea.map$Name, sea.map$inst
) %>% lapply(htmltools::HTML)
leaflet(sea.map, width = "100%") %>%
addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
addPolygons(stroke = FALSE,
fillOpacity = 0.5,
label = labels,
fillColor = ~pal(inst),
highlight = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE))
Crime in San Francisco is likewise concentrated in a few pockets: the South of Market neighborhood sees the highest number of crimes, followed by Mission and Downtown.
pal <- colorBin("YlOrRd", domain = sf.map$inst, bins = seq(0,18000,2000))
labels <- sprintf(
"<strong>%s</strong><br/>%g cases",
sf.map$Name, sf.map$inst
) %>% lapply(htmltools::HTML)
leaflet(sf.map, width = "100%") %>%
addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
addPolygons(stroke = FALSE,
fillOpacity = 0.5,
label = labels,
fillColor = ~pal(inst),
highlight = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE))
Thanks for following along with my analysis! I hope you enjoyed it.