用户上网行为分析-大数据项目

  • Post author:
  • Post category:其他


[实验数据]

本实验所用数据为某宽带运营商从2014年11月11日0时到1时的用户上网点击流数据,包含了31,487,488条记录。数据集已经存放在HDFS上,路径为“/data/13/4/dpi/dpi.csv”,各字段以制表符分隔。数据集还存放在了Hive上,表名为“bigdata_cases.dpi”。

各字段的定义为:

字段		定义
srcip	用户所在客户端IP
ad		用户上网设备号
ts		用户请求的时间戳,1970年1月1日0时至今的毫秒数
url		用户请求的当前页面URL
ref		用户请求的来源页面URL,存储的是base64编码后的值
ua		用户所用浏览器,存储的是base64编码后的值
dstip	用户访问的服务器IP
cookie	用户Cookie,存储的是base64编码后的值

数据集的前5行为:

student1@master:~$ hdfs dfs -cat /data/13/4/dpi/dpi.csv | head -5
1.171.51.162    none    1415638064719   http://www.zhiyoula.com/album/355986    aHR0cDovL3d3dy56aGl5b3VsYS5jb20v    TW96aWxsYS81LjAgKGNvbXBhdGlibGU7IE1TSUUgOS4wOyBXaW5kb3dzIE5UIDYuMTsgV09XNjQ7IFRyaWRlbnQvNS4wOyBTTENDMjsgLk5FVCBDTFIgMi4wLjUwNzI3OyAuTkVUIENMUiAzLjUuMzA3Mjk7IC5ORVQgQ0xSIDMuMC4zMDcyOTsgTWVkaWEgQ2VudGVyIFBDIDYuMDsgTlAwNjsgLk5FVDQuMEM7IC5ORVQ0LjBFOyAzNjBTRSk=    180.166.52.52   UEhQU0VTU0lEPW1uczIyMW5rb200cG5iaml2NjhldGV1cjczOyBDTlpaREFUQTEwMDAwNTc5NzM9MTIzNzAwOTIzOC0xNDE1NjM2NDU5LWh0dHAlMjUzQSUyNTJGJTI1MkZ3ZWliby5jb20lMjUyRiU3QzE0MTU2MzY0NTk7IEhtX2x2dF9lMWU5YjY5MDNkMDFlNGI4NjMxOWRmMzY1ZWQxNTQwZD0xNDE1NDY4MzUyLDE0MTU0ODM2MDMsMTQxNTU0OTM2MywxNDE1NjM4MDIzOyBIbV9scHZ0X2UxZTliNjkwM2QwMWU0Yjg2MzE5ZGYzNjVlZDE1NDBkPTE0MTU2MzgwMjM=    20141111    00
1.193.64.174    none    1415635901445   http://www.jinjianginns.com/Url.aspx?sid=734&tn=title&gotourl=Default.html&utm_source=baidu&utm_medium=cpc&utm_campaign=baidupinpaizhuanqu&utm_content=title    aHR0cDovL3d3dy5iYWlkdS5jb20vcz93ZD0lRTklOTQlQTYlRTYlQjElOUYlRTQlQjklOEIlRTYlOTglOUYlRTUlQUUlOTglRTclQkQlOTEmcnN2X3NwdD0xJmlzc3A9MSZmPTMmcnN2X2JwPTAmcnN2X2lkeD0yJmllPXV0Zi04JnRuPWJhaWR1aG9tZV9wZyZyc3ZfZW50ZXI9MSZyc3Zfc3VnMz03JnJzdl9zdWc0PTg4MyZyc3Zfc3VnMT02Jm9xPSVFOSU5NCVBNiVFNiVCMSU5RiZyc3Zfc3VnMj0xJnJzcD0xJmlucHV0VD01MDgxJl9fZWlzPTEmX19laXN0PTM4MQ==    TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgTWF4dGhvbi80LjMuMS4yMDAwIENocm9tZS8zMC4wLjE1OTkuMTAxIFNhZmFyaS81MzcuMzY= 116.236.224.20 YmRzaGFyZV9maXJzdGltZT0xMzgyMTkzMzE5NTUxOyBhZ19maWQ9WUx6bE1Dd3dCUk9zeUJtRjsgX19hZ19jbV89MTQxMTkxNzQxMjI4NzsgSG1fbHZ0X2Q2MGJjZTQyNjY3NTM5OWIxZGY0NTcyOGI0OGYxMDdkPTE0MTE5MTUwNzksMTQxMTkzMDI3NCwxNDExOTMxMzM4LDE0MTE5OTUxNjM7IF9fdXRtYT0yMTc0ODU3MDkuMTM4NDQ5NTY2OS4xMzgzNzEzMDkzLjE0MTE5MzEzMzkuMTQxMTk5NTE2Ni4yMDsgX191dG16PTIxNzQ4NTcwOS4xNDExOTk1MTY2LjIwLjUudXRtY3NyPWJhaWR1fHV0bWNjbj1iYWlkdXBpbnBhaXpodWFucXV8dXRtY21kPWNwY3x1dG1jY3Q9dGl0bGU7IG9vb29vb289NWExOTk2M2Fvb29vb29vXzVhMTk5NjNh    20141111    00
1.193.64.174    none    1415635909805   http://www.jinjianginns.com/js/datetime/Utility/My97DatePicker/My97DatePicker.htm   aHR0cDovL3d3dy5qaW5qaWFuZ2lubnMuY29tL0RlZmF1bHQuaHRtbD91dG1fc291cmNlPWJhaWR1JnV0bV9tZWRpdW09Y3BjJnV0bV9jYW1wYWlnbj1iYWlkdXBpbnBhaXpodWFucXUmdXRtX2NvbnRlbnQ9dGl0bGU= TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgTWF4dGhvbi80LjMuMS4yMDAwIENocm9tZS8zMC4wLjE1OTkuMTAxIFNhZmFyaS81MzcuMzY=   116.236.224.20  YmRzaGFyZV9maXJzdGltZT0xMzgyMTkzMzE5NTUxOyBhZ19maWQ9WUx6bE1Dd3dCUk9zeUJtRjsgX19hZ19jbV89MTQxMTkxNzQxMjI4NzsgSG1fbHZ0X2Q2MGJjZTQyNjY3NTM5OWIxZGY0NTcyOGI0OGYxMDdkPTE0MTE5MTUwNzksMTQxMTkzMDI3NCwxNDExOTMxMzM4LDE0MTE5OTUxNjM7IG9vb29vb289NWExOTk2M2Fvb29vb29vXzVhMTk5NjNhOyBBU1AuTkVUX1Nlc3Npb25JZD16ZDBnc3o1NW9ha2h6ZDQ1a3p6dHVoNDU7IENvb2tpZVRlc3Q9Y29va2lldGVzdDsgdHJhY2U9dHJhY2Vsb2dpZD02MDUxOTY4NjQmdHJhY2VpZD03MzQmY2xpY2tfaWQ9MDsgX191dG10PTE7IF9fdXRtYT0yMTc0ODU3MDkuMTM4NDQ5NTY2OS4xMzgzNzEzMDkzLjE0MTE5OTUxNjYuMTQxNTYzNTg3NC4yMTsgX191dG1iPTIxNzQ4NTcwOS4xLjEwLjE0MTU2MzU4NzQ7IF9fdXRtYz0yMTc0ODU3MDk7IF9fdXRtej0yMTc0ODU3MDkuMTQxNTYzNTg3NC4yMS42LnV0bWNzcj1iYWlkdXx1dG1jY249YmFpZHVwaW5wYWl6aHVhbnF1fHV0bWNtZD1jcGN8dXRtY2N0PXRpdGxl    20141111    00
1.198.181.118   none    1415638269814   http://m.ecpic.com.cn/index.shtml?cmpid=jl_mob_000015&app=jlfh  NoDef   TW96aWxsYS81LjAgKExpbnV4OyBVOyBBbmRyb2lkIDQuMC40OyB6aC1jbjsgWlRFIFU3OTUrIEJ1aWxkL0lNTTc2RCkgQXBwbGVXZWJLaXQvNTM0LjMwIChLSFRNTCwgbGlrZSBHZWNrbykgVmVyc2lvbi80LjAgTW9iaWxlIFNhZmFyaS81MzQuMzA=    116.228.143.243 X19hZ19jbV89MTQwNTM5MDY4MzE5OA==    20141111    00
1.206.172.59    none    1415636904699   http://xngz.xhedu.sh.cn/cms/app/info/doc/index.php/24944    aHR0cDovL3d3dy5iYWlkdS5jb20vcz90bj0wNjAwODAwNl8yX3BnJndkPSVCOSVBNCVENyVGNyVDRSVERSVENiVEOCVCNCVGMyVDQSVDMiVCOSVDQSVDNyVFOSVCRiVGNiVDQiVCNSVDMyVGNw==    TW96aWxsYS80LjAgKGNvbXBhdGlibGU7IE1TSUUgOC4wOyBXaW5kb3dzIE5UIDUuMTsgVHJpZGVudC80LjAp    222.66.2.231    NoDef   20141111    00

[实验步骤提示]

在以下提示步骤中,步骤1是用Hive做数据分析和数据准备,所有代码在大数据计算集群上执行,步骤2是用R语言做数据可视化。

  1. 用Hive做数据分析和数据准备

    a. 统计所有域名的页面访问量

    统计各域名的页面访问量。

     hive -e \
     "select regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain,count(1) as PV
     from bigdata_cases.dpi
     group by regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
     order by PV desc;"
    

1.csv

得到结果的前10行:

domain	pv
tmall.com	6279156
baidu.com	2674312
taobao.com	2547403
win007.com	1452202
qq.com	1420461
jd.com	847723
m.taobao.com	533678
cart.taobao.com	388944
buyer.trade.taobao.com	291111
yhd.com	266182

b. 统计页面访问量前10的域名每分钟的访问量

统计页面访问量前10的域名每分钟的访问量。

hive -e \
"select round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) as time, regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi a
inner join (
select regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain,count(1) as PV
from bigdata_cases.dpi
group by regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
order by PV desc limit 10) b on regexp_extract(parse_url(a.url, 'HOST'),'([a-z0-9]+).(.*)',2)=b.domain
where round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) between 0 and 60
group by round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000), regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
order by time, domain;" \

2.csv

得到结果的前10行:

time	domain	PV	IP
0.0	baidu.com	22382	8460
0.0	buyer.trade.taobao.com	553	464
0.0	cart.taobao.com	15418	11964
0.0	jd.com	9199	5385
0.0	m.taobao.com	3573	3038
0.0	qq.com	15609	6656
0.0	taobao.com	30682	21880
0.0	tmall.com	56171	36787
0.0	win007.com	10917	1051
0.0	yhd.com	3325	1921

c. 统计电商网站每分钟的访问量

统计电商网站每分钟的访问量。

hive -e \
"select round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) as time, 
case when parse_url(url, 'HOST') like '%.taobao.com' then 'Taobao'
when parse_url(url, 'HOST') like '%.tmall.com' then 'Tmall'
when parse_url(url, 'HOST') like '%.jd.com' then 'JD'
when parse_url(url, 'HOST') like '%.yhd.com' then 'YHD'
when parse_url(url, 'HOST') like '%.amazon.cn' then 'Amazon'
when parse_url(url, 'HOST') like '%.vip.com' then 'VIP'
else 'Suning' end as website,
count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi a
where (parse_url(url, 'HOST') like '%.taobao.com' or parse_url(url, 'HOST') like '%.tmall.com' or parse_url(url, 'HOST') like '%.jd.com' 
or parse_url(url, 'HOST') like '%.yhd.com' or parse_url(url, 'HOST') like '%.amazon.cn' or parse_url(url, 'HOST') like '%.vip.com'
or parse_url(url, 'HOST') like '%.suning.com')
and round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) between 0 and 60
group by round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000),
case when parse_url(url, 'HOST') like '%.taobao.com' then 'Taobao'
when parse_url(url, 'HOST') like '%.tmall.com' then 'Tmall'
when parse_url(url, 'HOST') like '%.jd.com' then 'JD'
when parse_url(url, 'HOST') like '%.yhd.com' then 'YHD'
when parse_url(url, 'HOST') like '%.amazon.cn' then 'Amazon'
when parse_url(url, 'HOST') like '%.vip.com' then 'VIP'
else 'Suning' end
order by time, website;" \

3.csv

得到结果的前10行:

time	website	PV	IP
0.0	Amazon	267	202
0.0	JD	9591	5545
0.0	Suning	343	274
0.0	Taobao	55040	39080
0.0	Tmall	65845	41833
0.0	VIP	87	69
0.0	YHD	3328	1923
1.0	Amazon	143	111
1.0	JD	4742	3087
1.0	Suning	232	181

d. 统计百度关键词的搜索量

统计百度关键词的搜索量,仅保留搜索量不少于20的关键词。这里返回的都是经过URL编码的关键词,之后会在R语言中进一步解析这些关键词。

hive -e \
"select parse_url(url, 'QUERY', 'wd') as keyword, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi
where url like '%.baidu.%/%wd=%'
group by parse_url(url, 'QUERY', 'wd')
having PV>=20
order by PV desc;" \

4.csv

得到结果的前10行:

keyword	PV	IP
%E5%A4%A9%E6%B0%94	93396	1051
ip	23507	265
%E6%B7%98%E5%AE%9D	9372	4413
%E4%BA%AC%E4%B8%9C	8392	4076
%E5%A4%A9%E7%8C%AB	7032	3352
%E6%B7%98%E5%AE%9D%E7%BD%91	6183	2962
%E4%BA%AC%E4%B8%9C%E5%95%86%E5%9F%8E	4198	2154
%E6%94%AF%E4%BB%98%E5%AE%9D	3710	1832
ji	3613	2561
sh	3267	2126

e. 统计给淘宝或天猫引流的域名

统计给淘宝或天猫引流的域名,排除淘宝和天猫自己。

hive -e \
"select regexp_extract(parse_url(cast(unbase64(ref) as string), 'HOST'),'([a-z0-9]+).(.*)',2) as domain,count(1) as PV
from bigdata_cases.dpi
where (parse_url(url, 'HOST') like '%.taobao.com' or parse_url(url, 'HOST') like '%.tmall.com')
and parse_url(cast(unbase64(ref) as string), 'HOST') not like '%.taobao.%'
and parse_url(cast(unbase64(ref) as string), 'HOST') not like '%.tmall.%'
and parse_url(cast(unbase64(ref) as string), 'HOST') not rlike '([0-9]+).([0-9]+).([0-9]+).([0-9]+)'
group by regexp_extract(parse_url(cast(unbase64(ref) as string), 'HOST'),'([a-z0-9]+).(.*)',2)
order by PV desc;" \

5.csv

得到结果的前10行:

domain	PV
atm.youku.com	32317
360.cn	23525
baidu.com	22678
sandai.net	12695
zhe800.com	10371
hao123.com	8739
sogou.com	5836
sina.com.cn	5562
duba.com	5110
mi.com	4952

f. 统计给淘宝或天猫引流的百度关键词搜索量

统计给淘宝或天猫引流的百度关键词搜索量量。这里返回的都是经过URL编码的关键词,之后会在R语言中进一步解析这些关键词。

hive -e \
"select parse_url(cast(unbase64(ref) as string), 'QUERY', 'wd') as keyword, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi
where (parse_url(url, 'HOST') like '%.taobao.com' or parse_url(url, 'HOST') like '%.tmall.com')
and cast(unbase64(ref) as string) like '%.baidu.%/%wd=%'
group by parse_url(cast(unbase64(ref) as string), 'QUERY', 'wd')
order by PV desc;" \

6.csv

得到结果的前10行:

keyword	PV	IP
%E5%A4%A9%E7%8C%AB	2576	1745
%E6%B7%98%E5%AE%9D	2378	1766
%E6%B7%98%E5%AE%9D%E7%BD%91	2181	1559
%E5%A4%A9%E7%8C%AB%E5%95%86%E5%9F%8E	673	470
%E8%81%9A%E5%88%92%E7%AE%97	395	268
%E6%B7%98%E5%AE%9D%E5%95%86%E5%9F%8E	262	174
taobao	239	175
%E6%B7%98%E5%AE%9D%E7%BD%91%E9%A6%96%E9%A1%B5	148	116
tmall	131	85
tianmao	110	69
  1. 用R语言做数据可视化

    a. 载入相关程序包

    载入相关程序包。将Hive输出的结果文件复制到R语言可访问的路径如“D:\workspace\”。

     > library(ggplot2)
     > library(wordcloud)
     Loading required package: RColorBrewer
    

b. 画出所有域名页面访问量的直方图

画出所有域名页面访问量的直方图,其中横坐标为对数尺度表示访问量,纵坐标为平方根尺度表示域名的数量。

> data1 <- read.table("D:/workspace/1.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data1) <- c("domain", "PV")
> ggplot(data1, aes(x = PV)) + geom_histogram(aes(fill = ..count..)) + scale_fill_gradient("Count", 
+     trans = "sqrt", low = "green", high = "red") + scale_x_log10() + scale_y_sqrt()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

c. 画出页面访问量前10的域名每分钟的访问量和独立IP数

画出页面访问量前10的域名每分钟的访问量和独立IP数,其中横坐标表示2014年11月11日0时过后的分钟数,纵坐标为对数尺度表示域名的访问量,线条粗细表示平均每个独立IP访问的页面数。

> data2 <- read.table("D:/workspace/2.csv", sep = "\t")
> names(data2) <- c("time", "domain", "PV", "IP")
> ggplot(data2, aes(x = time, y = PV, group = domain)) + geom_line(aes(colour = domain, 
+     size = PV/IP)) + scale_y_log10()

d. 画出7大电商网站每分钟的访问量和独立IP数

画出7大电商网站每分钟的访问量和独立IP数,其中横坐标表示2014年11月11日0时过后的分钟数,纵坐标为对数尺度表示域名的访问量,线条粗细表示平均每个独立IP访问的页面数。

> data3 <- read.table("D:/workspace/3.csv", sep = "\t")
> names(data3) <- c("time", "website", "PV", "IP")
> ggplot(data3, aes(x = time, y = PV, group = website)) + geom_line(aes(colour = website, 
+     size = PV/IP)) + scale_y_log10()

e. 画出百度搜索关键词的词云图

画出百度搜索关键词的词云图,字体越大说明搜索频率越高,仅画出搜索量前100的关键词。

> data4 <- read.table("D:/workspace/4.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data4) <- c("keyword", "PV", "IP")
> keywords <- iconv(sapply(data4$keyword[1:200], URLdecode, simplify = TRUE), 
+     "UTF-8", "gbk")
> wordcloud(keywords, data4$IP, c(4, 1), random.order = FALSE, random.color = TRUE, 
+     col = rainbow(10))

f. 画出向淘宝或天猫引流的百度搜索关键词的词云图

画出向淘宝或天猫引流的百度搜索关键词的词云图,字体越大说明搜索频率越高,仅画出搜索量大于1的关键词。

> data6 <- read.table("D:/workspace/6.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data6) <- c("keyword", "PV", "IP")
> keywords <- iconv(sapply(data6$keyword, URLdecode, simplify = TRUE), "UTF-8", 
+     "gbk")
> wordcloud(keywords, data6$IP, c(4, 1), 5, random.order = FALSE, random.color = TRUE, 
+     col = rainbow(10))



版权声明:本文为qq_43115606原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。