R语言大数据分析纽约市的311万条投诉统计可视化与时间序列分析
原文链接:http://tecdat.cn/?p=9800
介绍
本文并不表示R在数据分析方面比Python更好或更快速,我本人每天都使用两种语言。这篇文章只是提供了比较这两种语言的机会。
本文中的 数据 每天都会更新,我的文件版本更大,为4.63 GB。
CSV文件包含纽约市的311条投诉。它是纽约市开放数据门户网站中最受欢迎的数据集。
数据工作流程
install.packages("devtools")
library("devtools")
install_github("ropensci/plotly")
library(plotly)
需要创建一个帐户以连接到plotly API。或者,可以只使用默认的ggplot2图形。
set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key
使用dplyr在R中进行分析
假设已安装sqlite3(因此可通过终端访问)。
$ sqlite3 data.db # Create your database
$.databases # Show databases to make sure it works
$.mode csv
$.import <filename> <tablename>
# Where filename is the name of the csv & tablename is the name of the new database table
$.quit
将数据加载到内存中。
library(readr)
# data.table, selecting a subset of columns
time_data.table <- system.time(fread('/users/ryankelly/NYC_data.csv',
select = c('Agency', 'Created Date','Closed Date', 'Complaint Type', 'Descriptor', 'City'),
showProgress = T))
kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))
user.selfsys.selfelapseduser.childsys.childtime_data.table63.5881.95265.63300time_data.table_full205.5713.124208.88000time_readr277.7205.018283.02900
我将使用data.table读取数据。该 fread
函数大大提高了读取速度。
关于dplyr
默认情况下,dplyr查询只会从数据库中提取前10行。
library(dplyr) ## Will be used for pandas replacement
# Connect to the database
db <- src_sqlite('/users/ryankelly/data.db')
db
数据处理的两个最佳选择(除了R之外)是:
数据表
dplyr
预览数据
# Wrapped in a function for display purposes
head_ <- function(x, n = 5) kable(head(x, n))
head_(data)
AgencyCreatedDateClosedDateComplaintTypeDescriptorCityNYPD04/11/2015 02:13:04 AM Noise - Street/SidewalkLoud Music/PartyBROOKLYNDFTA04/11/2015 02:12:05 AM Senior Center ComplaintN/AELMHURSTNYPD04/11/2015 02:11:46 AM Noise - CommercialLoud Music/PartyJAMAICANYPD04/11/2015 02:11:02 AM Noise - Street/SidewalkLoud TalkingBROOKLYNNYPD04/11/2015 02:10:45 AM Noise - Street/SidewalkLoud Music/PartyNEW YORK
选择几列
ComplaintTypeDescriptorAgencyNoise - Street/SidewalkLoud Music/PartyNYPDSenior Center ComplaintN/ADFTANoise - CommercialLoud Music/PartyNYPDNoise - Street/SidewalkLoud TalkingNYPDNoise - Street/SidewalkLoud Music/PartyNYPD
ComplaintTypeDescriptorAgencyNoise - Street/SidewalkLoud Music/PartyNYPDSenior Center ComplaintN/ADFTANoise - CommercialLoud Music/PartyNYPDNoise - Street/SidewalkLoud TalkingNYPDNoise - Street/SidewalkLoud Music/PartyNYPDNoise - Street/SidewalkLoud TalkingNYPDNoise - CommercialLoud Music/PartyNYPDHPD Literature RequestThe ABCs of Housing - SpanishHPDNoise - Street/SidewalkLoud TalkingNYPDStreet ConditionPlate Condition - NoisyDOT
使用WHERE过滤行
ComplaintTypeDescriptorAgencyNoise - Street/SidewalkLoud Music/PartyNYPDNoise - CommercialLoud Music/PartyNYPDNoise - Street/SidewalkLoud TalkingNYPDNoise - Street/SidewalkLoud Music/PartyNYPDNoise - Street/SidewalkLoud TalkingNYPD
使用WHERE和IN过滤列中的多个值
ComplaintTypeDescriptorAgencyNoise - Street/SidewalkLoud Music/PartyNYPDNoise - CommercialLoud Music/PartyNYPDNoise - Street/SidewalkLoud TalkingNYPDNoise - Street/SidewalkLoud Music/PartyNYPDNoise - Street/SidewalkLoud TalkingNYPD
在DISTINCT列中查找唯一值
## City
## 1 BROOKLYN
## 2 ELMHURST
## 3 JAMAICA
## 4 NEW YORK
## 5
## 6 BAYSIDE
使用COUNT(*)和GROUP BY查询值计数
# dt[, .(No.Complaints = .N), Agency]
#setkey(dt, No.Complaints) # setkey index's the data
q <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(No.Complaints = n())
head_(q)
AgencyNo.Complaints3-1-122499ACS3AJC7ART3CAU8
使用ORDER和-排序结果


数据库中有多少个城市?
# dt[, unique(City)]
q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())
head(q)
## Number.of.Cities
## 1 1818
让我们来绘制10个最受关注的城市
CityNo.ComplaintsBROOKLYN2671085NEW YORK1692514BRONX1624292 766378STATEN ISLAND437395JAMAICA147133FLUSHING117669ASTORIA90570Jamaica67083RIDGEWOOD66411
用
UPPER
转换CITY格式。
CITYNo.ComplaintsBROOKLYN2671085NEW YORK1692514BRONX1624292 766378STATEN ISLAND437395JAMAICA147133FLUSHING117669ASTORIA90570JAMAICA67083RIDGEWOOD66411
投诉类型(按城市)
# Plot result
plt <- ggplot(q_f, aes(ComplaintType, No.Complaints, fill = CITY)) +
geom_bar(stat = 'identity') +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plt

第2部分时间序列运算
提供的数据不适合SQLite的标准日期格式。
在SQL数据库中创建一个新列,然后使用格式化的date语句重新插入数据 创建一个新表并将格式化日期插入原始列名。
使用时间戳字符串过滤SQLite行:YYYY-MM-DD hh:mm:ss
# dt[CreatedDate < '2014-11-26 23:47:00' & CreatedDate > '2014-09-16 23:45:00',
# .(ComplaintType, CreatedDate, City)]
q <- data %>% filter(CreatedDate < "2014-11-26 23:47:00", CreatedDate > "2014-09-16 23:45:00") %>%
select(ComplaintType, CreatedDate, City)
head_(q)
ComplaintTypeCreatedDateCityNoise - Street/Sidewalk2014-11-12 11:59:56BRONXTaxi Complaint2014-11-12 11:59:40BROOKLYNNoise - Commercial2014-11-12 11:58:53BROOKLYNNoise - Commercial2014-11-12 11:58:26NEW YORKNoise - Street/Sidewalk2014-11-12 11:58:14NEW YORK
使用strftime从时间戳中拉出小时单位
# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]
q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>%
select(ComplaintType, CreatedDate, City, hour)
head_(q)
ComplaintTypeCreatedDateCityhourNoise - Street/Sidewalk2015-11-04 02:13:04BROOKLYN02Senior Center Complaint2015-11-04 02:12:05ELMHURST02Noise - Commercial2015-11-04 02:11:46JAMAICA02Noise - Street/Sidewalk2015-11-04 02:11:02BROOKLYN02Noise - Street/Sidewalk2015-11-04 02:10:45NEW YORK02




汇总时间序列
首先,创建一个时间戳记四舍五入到前15分钟间隔的新列
# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]
q <- data %>%
mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%
select(CreatedDate, interval)
head_(q, 10)
CreatedDateinterval2015-11-04 02:13:042015-11-04 02:00:002015-11-04 02:12:052015-11-04 02:00:002015-11-04 02:11:462015-11-04 02:00:002015-11-04 02:11:022015-11-04 02:00:002015-11-04 02:10:452015-11-04 02:00:002015-11-04 02:09:072015-11-04 02:00:002015-11-04 02:05:472015-11-04 02:00:002015-11-04 02:03:432015-11-04 02:00:002015-11-04 02:03:292015-11-04 02:00:002015-11-04 02:02:172015-11-04 02:00:00
绘制2003年的结果



