[Experimental Data]
The data used in this experiment are the web clickstream records of a broadband carrier's users from 0:00 to 1:00 on November 11, 2014, comprising 31,487,488 records. The dataset is stored on HDFS at "/data/13/4/dpi/dpi.csv", with fields separated by tabs. It is also available in Hive as the table "bigdata_cases.dpi".
The fields are defined as follows:
Field Definition
srcip client IP address of the user
ad ID of the user's device
ts timestamp of the request, in milliseconds since 0:00 on January 1, 1970
url URL of the page currently requested
ref URL of the referring page, stored base64-encoded
ua the user's browser (user agent), stored base64-encoded
dstip IP address of the server being accessed
cookie the user's cookie, stored base64-encoded
The first 5 rows of the dataset:
student1@master:~$ hdfs dfs -cat /data/13/4/dpi/dpi.csv | head -5
1.171.51.162 none 1415638064719 http://www.zhiyoula.com/album/355986 aHR0cDovL3d3dy56aGl5b3VsYS5jb20v TW96aWxsYS81LjAgKGNvbXBhdGlibGU7IE1TSUUgOS4wOyBXaW5kb3dzIE5UIDYuMTsgV09XNjQ7IFRyaWRlbnQvNS4wOyBTTENDMjsgLk5FVCBDTFIgMi4wLjUwNzI3OyAuTkVUIENMUiAzLjUuMzA3Mjk7IC5ORVQgQ0xSIDMuMC4zMDcyOTsgTWVkaWEgQ2VudGVyIFBDIDYuMDsgTlAwNjsgLk5FVDQuMEM7IC5ORVQ0LjBFOyAzNjBTRSk= 180.166.52.52 UEhQU0VTU0lEPW1uczIyMW5rb200cG5iaml2NjhldGV1cjczOyBDTlpaREFUQTEwMDAwNTc5NzM9MTIzNzAwOTIzOC0xNDE1NjM2NDU5LWh0dHAlMjUzQSUyNTJGJTI1MkZ3ZWliby5jb20lMjUyRiU3QzE0MTU2MzY0NTk7IEhtX2x2dF9lMWU5YjY5MDNkMDFlNGI4NjMxOWRmMzY1ZWQxNTQwZD0xNDE1NDY4MzUyLDE0MTU0ODM2MDMsMTQxNTU0OTM2MywxNDE1NjM4MDIzOyBIbV9scHZ0X2UxZTliNjkwM2QwMWU0Yjg2MzE5ZGYzNjVlZDE1NDBkPTE0MTU2MzgwMjM= 20141111 00
1.193.64.174 none 1415635901445 http://www.jinjianginns.com/Url.aspx?sid=734&tn=title&gotourl=Default.html&utm_source=baidu&utm_medium=cpc&utm_campaign=baidupinpaizhuanqu&utm_content=title aHR0cDovL3d3dy5iYWlkdS5jb20vcz93ZD0lRTklOTQlQTYlRTYlQjElOUYlRTQlQjklOEIlRTYlOTglOUYlRTUlQUUlOTglRTclQkQlOTEmcnN2X3NwdD0xJmlzc3A9MSZmPTMmcnN2X2JwPTAmcnN2X2lkeD0yJmllPXV0Zi04JnRuPWJhaWR1aG9tZV9wZyZyc3ZfZW50ZXI9MSZyc3Zfc3VnMz03JnJzdl9zdWc0PTg4MyZyc3Zfc3VnMT02Jm9xPSVFOSU5NCVBNiVFNiVCMSU5RiZyc3Zfc3VnMj0xJnJzcD0xJmlucHV0VD01MDgxJl9fZWlzPTEmX19laXN0PTM4MQ== TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgTWF4dGhvbi80LjMuMS4yMDAwIENocm9tZS8zMC4wLjE1OTkuMTAxIFNhZmFyaS81MzcuMzY= 116.236.224.20 YmRzaGFyZV9maXJzdGltZT0xMzgyMTkzMzE5NTUxOyBhZ19maWQ9WUx6bE1Dd3dCUk9zeUJtRjsgX19hZ19jbV89MTQxMTkxNzQxMjI4NzsgSG1fbHZ0X2Q2MGJjZTQyNjY3NTM5OWIxZGY0NTcyOGI0OGYxMDdkPTE0MTE5MTUwNzksMTQxMTkzMDI3NCwxNDExOTMxMzM4LDE0MTE5OTUxNjM7IF9fdXRtYT0yMTc0ODU3MDkuMTM4NDQ5NTY2OS4xMzgzNzEzMDkzLjE0MTE5MzEzMzkuMTQxMTk5NTE2Ni4yMDsgX191dG16PTIxNzQ4NTcwOS4xNDExOTk1MTY2LjIwLjUudXRtY3NyPWJhaWR1fHV0bWNjbj1iYWlkdXBpbnBhaXpodWFucXV8dXRtY21kPWNwY3x1dG1jY3Q9dGl0bGU7IG9vb29vb289NWExOTk2M2Fvb29vb29vXzVhMTk5NjNh 20141111 00
1.193.64.174 none 1415635909805 http://www.jinjianginns.com/js/datetime/Utility/My97DatePicker/My97DatePicker.htm aHR0cDovL3d3dy5qaW5qaWFuZ2lubnMuY29tL0RlZmF1bHQuaHRtbD91dG1fc291cmNlPWJhaWR1JnV0bV9tZWRpdW09Y3BjJnV0bV9jYW1wYWlnbj1iYWlkdXBpbnBhaXpodWFucXUmdXRtX2NvbnRlbnQ9dGl0bGU= TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgTWF4dGhvbi80LjMuMS4yMDAwIENocm9tZS8zMC4wLjE1OTkuMTAxIFNhZmFyaS81MzcuMzY= 116.236.224.20 YmRzaGFyZV9maXJzdGltZT0xMzgyMTkzMzE5NTUxOyBhZ19maWQ9WUx6bE1Dd3dCUk9zeUJtRjsgX19hZ19jbV89MTQxMTkxNzQxMjI4NzsgSG1fbHZ0X2Q2MGJjZTQyNjY3NTM5OWIxZGY0NTcyOGI0OGYxMDdkPTE0MTE5MTUwNzksMTQxMTkzMDI3NCwxNDExOTMxMzM4LDE0MTE5OTUxNjM7IG9vb29vb289NWExOTk2M2Fvb29vb29vXzVhMTk5NjNhOyBBU1AuTkVUX1Nlc3Npb25JZD16ZDBnc3o1NW9ha2h6ZDQ1a3p6dHVoNDU7IENvb2tpZVRlc3Q9Y29va2lldGVzdDsgdHJhY2U9dHJhY2Vsb2dpZD02MDUxOTY4NjQmdHJhY2VpZD03MzQmY2xpY2tfaWQ9MDsgX191dG10PTE7IF9fdXRtYT0yMTc0ODU3MDkuMTM4NDQ5NTY2OS4xMzgzNzEzMDkzLjE0MTE5OTUxNjYuMTQxNTYzNTg3NC4yMTsgX191dG1iPTIxNzQ4NTcwOS4xLjEwLjE0MTU2MzU4NzQ7IF9fdXRtYz0yMTc0ODU3MDk7IF9fdXRtej0yMTc0ODU3MDkuMTQxNTYzNTg3NC4yMS42LnV0bWNzcj1iYWlkdXx1dG1jY249YmFpZHVwaW5wYWl6aHVhbnF1fHV0bWNtZD1jcGN8dXRtY2N0PXRpdGxl 20141111 00
1.198.181.118 none 1415638269814 http://m.ecpic.com.cn/index.shtml?cmpid=jl_mob_000015&app=jlfh NoDef TW96aWxsYS81LjAgKExpbnV4OyBVOyBBbmRyb2lkIDQuMC40OyB6aC1jbjsgWlRFIFU3OTUrIEJ1aWxkL0lNTTc2RCkgQXBwbGVXZWJLaXQvNTM0LjMwIChLSFRNTCwgbGlrZSBHZWNrbykgVmVyc2lvbi80LjAgTW9iaWxlIFNhZmFyaS81MzQuMzA= 116.228.143.243 X19hZ19jbV89MTQwNTM5MDY4MzE5OA== 20141111 00
1.206.172.59 none 1415636904699 http://xngz.xhedu.sh.cn/cms/app/info/doc/index.php/24944 aHR0cDovL3d3dy5iYWlkdS5jb20vcz90bj0wNjAwODAwNl8yX3BnJndkPSVCOSVBNCVENyVGNyVDRSVERSVENiVEOCVCNCVGMyVDQSVDMiVCOSVDQSVDNyVFOSVCRiVGNiVDQiVCNSVDMyVGNw== TW96aWxsYS80LjAgKGNvbXBhdGlibGU7IE1TSUUgOC4wOyBXaW5kb3dzIE5UIDUuMTsgVHJpZGVudC80LjAp 222.66.2.231 NoDef 20141111 00
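The ref, ua, and cookie fields are stored base64-encoded, with "NoDef" marking a missing value (as in the last two sample rows). A minimal Python sketch of decoding such a field; the sample value below is the ref field of the first row:

```python
import base64

def decode_field(value: str) -> str:
    """Base64-decode a DPI field; 'NoDef' marks a missing value."""
    if value == "NoDef":
        return ""
    return base64.b64decode(value).decode("utf-8")

# ref field of the first sample row: the referring page URL
print(decode_field("aHR0cDovL3d3dy56aGl5b3VsYS5jb20v"))  # http://www.zhiyoula.com/
```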
[Experiment Steps]
In the hint steps below, Step 1 performs data analysis and data preparation with Hive (all of its code runs on the big-data cluster), and Step 2 performs data visualization with R.
-
Data analysis and data preparation with Hive
a. Count the page views of all domains
Count the page views (PV) of each domain.
hive -e \
"select regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain, count(1) as PV
from bigdata_cases.dpi
group by regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
order by PV desc;" \
> 1.csv
The first 10 rows of the result:
domain pv
tmall.com 6279156
baidu.com 2674312
taobao.com 2547403
win007.com 1452202
qq.com 1420461
jd.com 847723
m.taobao.com 533678
cart.taobao.com 388944
buyer.trade.taobao.com 291111
yhd.com 266182
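The regexp_extract(parse_url(url, 'HOST'), '([a-z0-9]+).(.*)', 2) expression takes the URL host, lets group 1 swallow the first host label, and keeps the remainder (group 2) as the domain; note the separator dot is unescaped, so it matches any single character. A Python sketch of the same extraction:

```python
import re

# The same pattern the Hive query passes to regexp_extract; the middle '.'
# is unescaped, so it matches any single character, not only a literal dot.
DOMAIN_RE = re.compile(r"([a-z0-9]+).(.*)")

def extract_domain(host: str) -> str:
    """Return group 2 (e.g. 'tmall.com' for 'www.tmall.com'), like regexp_extract(..., 2)."""
    m = DOMAIN_RE.match(host)
    return m.group(2) if m else host

print(extract_domain("www.tmall.com"))  # tmall.com
print(extract_domain("www.baidu.com"))  # baidu.com
```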
b. Count the per-minute page views of the top-10 domains by PV
Count the per-minute page views of the 10 domains with the highest PV.
hive -e \
"select round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) as time, regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi a
inner join (
select regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain,count(1) as PV
from bigdata_cases.dpi
group by regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
order by PV desc limit 10) b on regexp_extract(parse_url(a.url, 'HOST'),'([a-z0-9]+).(.*)',2)=b.domain
where round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) between 0 and 60
group by round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000), regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
order by time, domain;" \
> 2.csv
The first 10 rows of the result:
time domain PV IP
0.0 baidu.com 22382 8460
0.0 buyer.trade.taobao.com 553 464
0.0 cart.taobao.com 15418 11964
0.0 jd.com 9199 5385
0.0 m.taobao.com 3573 3038
0.0 qq.com 15609 6656
0.0 taobao.com 30682 21880
0.0 tmall.com 56171 36787
0.0 win007.com 10917 1051
0.0 yhd.com 3325 1921
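The expression round((ts - 1000*unix_timestamp('2014-11-11 00:00:00'))/60000) converts the millisecond timestamp into a minute offset from midnight. A Python sketch of the same bucketing, assuming the cluster runs in UTC+8, so that unix_timestamp('2014-11-11 00:00:00') evaluates to 1415635200:

```python
# Epoch milliseconds for 2014-11-11 00:00:00, assuming the cluster time zone
# is UTC+8 (so unix_timestamp('2014-11-11 00:00:00') = 1415635200)
MIDNIGHT_MS = 1415635200 * 1000

def minute_bucket(ts_ms: int) -> int:
    """Minute offset from midnight, like round((ts - 1000*t0)/60000) in Hive."""
    return round((ts_ms - MIDNIGHT_MS) / 60000)

# ts of the first sample row (1415638064719) falls in minute 48 of the hour
print(minute_bucket(1415638064719))  # 48
```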
c. Count the per-minute page views of e-commerce websites
Count the per-minute page views of the major e-commerce websites.
hive -e \
"select round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) as time,
case when parse_url(url, 'HOST') like '%.taobao.com' then 'Taobao'
when parse_url(url, 'HOST') like '%.tmall.com' then 'Tmall'
when parse_url(url, 'HOST') like '%.jd.com' then 'JD'
when parse_url(url, 'HOST') like '%.yhd.com' then 'YHD'
when parse_url(url, 'HOST') like '%.amazon.cn' then 'Amazon'
when parse_url(url, 'HOST') like '%.vip.com' then 'VIP'
else 'Suning' end as website,
count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi a
where (parse_url(url, 'HOST') like '%.taobao.com' or parse_url(url, 'HOST') like '%.tmall.com' or parse_url(url, 'HOST') like '%.jd.com'
or parse_url(url, 'HOST') like '%.yhd.com' or parse_url(url, 'HOST') like '%.amazon.cn' or parse_url(url, 'HOST') like '%.vip.com'
or parse_url(url, 'HOST') like '%.suning.com')
and round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) between 0 and 60
group by round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000),
case when parse_url(url, 'HOST') like '%.taobao.com' then 'Taobao'
when parse_url(url, 'HOST') like '%.tmall.com' then 'Tmall'
when parse_url(url, 'HOST') like '%.jd.com' then 'JD'
when parse_url(url, 'HOST') like '%.yhd.com' then 'YHD'
when parse_url(url, 'HOST') like '%.amazon.cn' then 'Amazon'
when parse_url(url, 'HOST') like '%.vip.com' then 'VIP'
else 'Suning' end
order by time, website;" \
> 3.csv
The first 10 rows of the result:
time website PV IP
0.0 Amazon 267 202
0.0 JD 9591 5545
0.0 Suning 343 274
0.0 Taobao 55040 39080
0.0 Tmall 65845 41833
0.0 VIP 87 69
0.0 YHD 3328 1923
1.0 Amazon 143 111
1.0 JD 4742 3087
1.0 Suning 232 181
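The CASE expression labels each host by suffix; because the WHERE clause already restricts rows to the seven suffixes, anything not matching the first six must be a Suning host, hence the bare "else 'Suning'". A minimal Python equivalent of the classification:

```python
# Suffix-to-label pairs mirroring the CASE expression in the Hive query
SITES = [
    (".taobao.com", "Taobao"),
    (".tmall.com", "Tmall"),
    (".jd.com", "JD"),
    (".yhd.com", "YHD"),
    (".amazon.cn", "Amazon"),
    (".vip.com", "VIP"),
    (".suning.com", "Suning"),
]

def website(host):
    """Label an e-commerce host, or None if the WHERE filter would drop the row."""
    for suffix, label in SITES:
        if host.endswith(suffix):
            return label
    return None

print(website("detail.tmall.com"))  # Tmall
print(website("www.baidu.com"))     # None
```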
d. Count the search volume of Baidu keywords
Count the search volume of Baidu keywords, keeping only those searched at least 20 times. The keywords returned are still URL-encoded; they will be decoded later in R.
hive -e \
"select parse_url(url, 'QUERY', 'wd') as keyword, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi
where url like '%.baidu.%/%wd=%'
group by parse_url(url, 'QUERY', 'wd')
having PV>=20
order by PV desc;" \
> 4.csv
The first 10 rows of the result:
keyword PV IP
%E5%A4%A9%E6%B0%94 93396 1051
ip 23507 265
%E6%B7%98%E5%AE%9D 9372 4413
%E4%BA%AC%E4%B8%9C 8392 4076
%E5%A4%A9%E7%8C%AB 7032 3352
%E6%B7%98%E5%AE%9D%E7%BD%91 6183 2962
%E4%BA%AC%E4%B8%9C%E5%95%86%E5%9F%8E 4198 2154
%E6%94%AF%E4%BB%98%E5%AE%9D 3710 1832
ji 3613 2561
sh 3267 2126
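The keyword column is still percent-encoded, which is why step 2e later decodes it in R with URLdecode. The same decoding in Python, for the two top Chinese keywords above (assuming UTF-8 percent-encoding; some keywords from older clients are GBK-encoded instead and would decode differently):

```python
from urllib.parse import unquote

# '%E5%A4%A9%E6%B0%94' is the UTF-8 percent-encoding of 天气 ("weather"),
# '%E6%B7%98%E5%AE%9D' that of 淘宝 ("Taobao")
print(unquote("%E5%A4%A9%E6%B0%94"))  # 天气
print(unquote("%E6%B7%98%E5%AE%9D"))  # 淘宝
```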
e. Count the domains referring traffic to Taobao or Tmall
Count the domains that refer traffic to Taobao or Tmall, excluding Taobao and Tmall themselves (the query also drops referrers that are bare IP addresses).
hive -e \
"select regexp_extract(parse_url(cast(unbase64(ref) as string), 'HOST'),'([a-z0-9]+).(.*)',2) as domain,count(1) as PV
from bigdata_cases.dpi
where (parse_url(url, 'HOST') like '%.taobao.com' or parse_url(url, 'HOST') like '%.tmall.com')
and parse_url(cast(unbase64(ref) as string), 'HOST') not like '%.taobao.%'
and parse_url(cast(unbase64(ref) as string), 'HOST') not like '%.tmall.%'
and parse_url(cast(unbase64(ref) as string), 'HOST') not rlike '([0-9]+).([0-9]+).([0-9]+).([0-9]+)'
group by regexp_extract(parse_url(cast(unbase64(ref) as string), 'HOST'),'([a-z0-9]+).(.*)',2)
order by PV desc;" \
> 5.csv
The first 10 rows of the result:
domain PV
atm.youku.com 32317
360.cn 23525
baidu.com 22678
sandai.net 12695
zhe800.com 10371
hao123.com 8739
sogou.com 5836
sina.com.cn 5562
duba.com 5110
mi.com 4952
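Step e chains several operations per row: base64-decode ref, take the URL host, drop Taobao/Tmall and bare-IP referrers, and strip the first host label. A Python sketch of that per-row logic, reusing the ref value from the first sample row:

```python
import base64
import re
from urllib.parse import urlparse

def referrer_domain(ref_b64):
    """Referring domain, or None when the Hive WHERE filters would drop the row
    (Taobao/Tmall referrers and bare IP addresses)."""
    host = urlparse(base64.b64decode(ref_b64).decode("utf-8")).hostname or ""
    if ".taobao." in host or ".tmall." in host:
        return None
    # Same (unescaped-dot) pattern as the 'not rlike' filter in the query
    if re.search(r"([0-9]+).([0-9]+).([0-9]+).([0-9]+)", host):
        return None
    m = re.match(r"([a-z0-9]+).(.*)", host)
    return m.group(2) if m else host

print(referrer_domain("aHR0cDovL3d3dy56aGl5b3VsYS5jb20v"))  # zhiyoula.com
```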
f. Count the search volume of Baidu keywords referring traffic to Taobao or Tmall
Count the search volume of the Baidu keywords that refer traffic to Taobao or Tmall. The keywords returned are still URL-encoded; they will be decoded later in R.
hive -e \
"select parse_url(cast(unbase64(ref) as string), 'QUERY', 'wd') as keyword, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi
where (parse_url(url, 'HOST') like '%.taobao.com' or parse_url(url, 'HOST') like '%.tmall.com')
and cast(unbase64(ref) as string) like '%.baidu.%/%wd=%'
group by parse_url(cast(unbase64(ref) as string), 'QUERY', 'wd')
order by PV desc;" \
> 6.csv
The first 10 rows of the result:
keyword PV IP
%E5%A4%A9%E7%8C%AB 2576 1745
%E6%B7%98%E5%AE%9D 2378 1766
%E6%B7%98%E5%AE%9D%E7%BD%91 2181 1559
%E5%A4%A9%E7%8C%AB%E5%95%86%E5%9F%8E 673 470
%E8%81%9A%E5%88%92%E7%AE%97 395 268
%E6%B7%98%E5%AE%9D%E5%95%86%E5%9F%8E 262 174
taobao 239 175
%E6%B7%98%E5%AE%9D%E7%BD%91%E9%A6%96%E9%A1%B5 148 116
tmall 131 85
tianmao 110 69
-
Data visualization with R
a. Load the required packages
Load the required packages, after copying the result files produced by Hive to a path accessible to R, such as "D:\workspace\".
> library(ggplot2)
> library(wordcloud)
Loading required package: RColorBrewer
b. Plot a histogram of the page views of all domains
Plot a histogram of the page views of all domains, with PV on a log-scaled x-axis and the number of domains on a square-root-scaled y-axis.
> data1 <- read.table("D:/workspace/1.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data1) <- c("domain", "PV")
> ggplot(data1, aes(x = PV)) + geom_histogram(aes(fill = ..count..)) + scale_fill_gradient("Count",
+ trans = "sqrt", low = "green", high = "red") + scale_x_log10() + scale_y_sqrt()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
c. Plot the per-minute PV and unique-IP counts of the top-10 domains
Plot the per-minute page views and unique-IP counts of the 10 domains with the highest PV. The x-axis is the number of minutes past 0:00 on November 11, 2014, the y-axis is the PV on a log scale, and the line width encodes the average number of pages viewed per unique IP (PV/IP).
> data2 <- read.table("D:/workspace/2.csv", sep = "\t")
> names(data2) <- c("time", "domain", "PV", "IP")
> ggplot(data2, aes(x = time, y = PV, group = domain)) + geom_line(aes(colour = domain,
+ size = PV/IP)) + scale_y_log10()
d. Plot the per-minute PV and unique-IP counts of the 7 e-commerce websites
Plot the per-minute page views and unique-IP counts of the 7 e-commerce websites. The x-axis is the number of minutes past 0:00 on November 11, 2014, the y-axis is the PV on a log scale, and the line width encodes the average number of pages viewed per unique IP (PV/IP).
> data3 <- read.table("D:/workspace/3.csv", sep = "\t")
> names(data3) <- c("time", "website", "PV", "IP")
> ggplot(data3, aes(x = time, y = PV, group = website)) + geom_line(aes(colour = website,
+ size = PV/IP)) + scale_y_log10()
e. Plot a word cloud of the Baidu search keywords
Plot a word cloud of the Baidu search keywords; larger fonts indicate higher search frequency. Only the top 100 keywords by search volume are plotted.
> data4 <- read.table("D:/workspace/4.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data4) <- c("keyword", "PV", "IP")
> keywords <- iconv(sapply(data4$keyword[1:100], URLdecode, simplify = TRUE),
+ "UTF-8", "gbk")
> wordcloud(keywords, data4$IP[1:100], c(4, 1), random.order = FALSE, random.color = TRUE,
+ col = rainbow(10))
f. Plot a word cloud of the Baidu search keywords referring traffic to Taobao or Tmall
Plot a word cloud of the Baidu search keywords that refer traffic to Taobao or Tmall; larger fonts indicate higher search frequency. Only keywords searched from at least 5 distinct IPs are plotted (the min.freq = 5 argument).
> data6 <- read.table("D:/workspace/6.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data6) <- c("keyword", "PV", "IP")
> keywords <- iconv(sapply(data6$keyword, URLdecode, simplify = TRUE), "UTF-8",
+ "gbk")
> wordcloud(keywords, data6$IP, c(4, 1), 5, random.order = FALSE, random.color = TRUE,
+ col = rainbow(10))