Fan Shiqing @Xiamen University

Setting up the experimental environment

Linux: Ubuntu 16.04
Java: 1.7.0_80
Hadoop: 2.7.1
Python: 2.7
PyCharm: 2019.1.2 (Community Edition)
matplotlib: 2.0.0
Spark: 2.1.0

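Downloading the dataset

The dataset contains profile information about the users who commented on the song 《同桌的你》 on a music platform. For each commenter it records the user ID, total number of posts, total number of accounts followed, total number of fans, region, personal introduction, age, and cumulative number of songs played. There are 4752 records in total; a sample is shown in the figure below: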

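Preprocessing the dataset

Convert the txt file to a csv file
Rename the attributes (column names) so they are easier to read and write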
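
The post does not show how the conversion was done, so the following is only a minimal sketch of the two steps above. It assumes the raw txt file is tab-separated and starts with its own header line. The function name `txt_to_csv`, the delimiter and the sample header are hypothetical. The new column names are the six that appear later in the analysis. The sketch is written against Python 3's `csv` module.

```python
import csv
import io

# Column names used later in the analysis; the names in the raw file are not shown in the post.
NEW_HEADER = ['ID', 'fans', 'province', 'city', 'age', 'songs']

def txt_to_csv(txt_file, csv_file, delimiter='\t', skip_original_header=True):
    """Copy delimiter-separated rows from txt_file to csv_file under NEW_HEADER.

    Returns the number of data rows written.
    """
    writer = csv.writer(csv_file, lineterminator='\n')
    writer.writerow(NEW_HEADER)
    lines = iter(txt_file)
    if skip_original_header:
        next(lines, None)  # drop the original header line (assumed to exist)
    count = 0
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue  # ignore blank lines
        writer.writerow(line.split(delimiter))
        count += 1
    return count

# In-memory files stand in for the real txt and csv files.
# The header line here is made up; the data row is the first sample row shown in this post.
source = io.StringIO('uid\tfollowers\tprov\ttown\tyears\tplayed\n'
                     '132708526\t0\t浙江省\t金华市\t19\t1142\n')
target = io.StringIO()
rows_written = txt_to_csv(source, target)
```

With real files, `source` and `target` would be replaced by file objects opened with an explicit encoding such as UTF-8.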

Analyzing the data with Spark

Read in the data and select the attributes that will be used

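from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, max  # note: this max shadows Python's built-in max
from pyspark.sql.types import IntegerType
sc = SparkContext()
sqlContext = SQLContext(sc)
# Read the CSV with a header row and let Spark infer the column types
data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('commenters.csv')
wanted = ['ID', 'fans', 'province', 'city', 'age', 'songs']  # renamed from `list` to avoid shadowing the built-in
data = data.select([column for column in data.columns if column in wanted])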

Inspect the data and display its schema

print("[First 10 rows]")
data.show(10)
print("[Schema]")
data.printSchema()

[First 10 rows]
19/05/28 14:18:41 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+---------+----+--------+----+----+-----+
|       ID|fans|province|city| age|songs|
+---------+----+--------+----+----+-----+
|132708526|   0|     浙江省| 金华市|  19| 1142|
|126403842|   7|      海外|  其它|   0| 1033|
|358013382|   3|     山东省| 烟台市|未知年龄|  232|
| 31322471|   8|     河南省| 郑州市|未知年龄|  341|
|398405071|   0|     安徽省| 合肥市|未知年龄|  141|
|321142743|   0|     山东省| 临沂市|未知年龄|   75|
|252135807|  13|     山东省| 烟台市|未知年龄|  704|
| 10729784|  11|     浙江省| 杭州市|   1| 5585|
|116980492|   8|     重庆市| 万州区|未知年龄|  174|
|118405606|   0|     陕西省| 汉中市|  27| 1343|
+---------+----+--------+----+----+-----+
only showing top 10 rows

[Schema]
root
 |-- ID: integer (nullable = true)
 |-- fans: integer (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- age: string (nullable = true)
 |-- songs: string (nullable = true)

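Cast the age and songs columns to integer type so that statistics can be computed on them. Values that are not numeric, such as 未知年龄 ("unknown age"), become null after the cast.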

data = data.withColumn("age", data["age"].cast(IntegerType()))
data = data.withColumn("songs", data["songs"].cast(IntegerType()))

[Data type conversion]
root
 |-- ID: integer (nullable = true)
 |-- fans: integer (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- songs: integer (nullable = true)
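
To make the effect of the cast on the age column concrete, here is a rough plain-Python analogy. It is not Spark's implementation and differs in edge cases. It only illustrates why non-numeric entries end up as null (None here). The helper name `cast_to_int` is hypothetical. The values are the age column of the 10 sample rows shown earlier.

```python
def cast_to_int(value):
    """Return value as an int if it is numeric text, otherwise None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

# Age column of the 10 sample rows, as strings before the cast
ages = ['19', '0', '未知年龄', '未知年龄', '未知年龄',
        '未知年龄', '未知年龄', '1', '未知年龄', '27']

converted = [cast_to_int(a) for a in ages]
unknown = sum(1 for a in converted if a is None)  # number of ages lost to None
```

In this sample, six of the ten ages become None, which is consistent with null being the most frequent value in the age statistics.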

Find the largest number of songs listened to by any commenter

data.select(max('songs')).show()

[Largest number of songs listened to by a commenter]
+----------+
|max(songs)|
+----------+
|     35180|
+----------+

Count the commenters on 《同桌的你》 by region

area = data.groupBy('province').count().orderBy(col("count").desc())

[Distribution of commenters by region]
+--------+-----+
|province|count|
+--------+-----+
|     广东省|  534|
|      海外|  271|
|     四川省|  263|
|     山东省|  263|
|     江苏省|  254|
|     河南省|  245|
|     北京市|  217|
|     浙江省|  207|
|      新疆|  205|
|     安徽省|  193|
|     湖南省|  188|
|     湖北省|  185|

...

|     青海省|   15|
|     台湾省|    9|
|      西藏|    5|
|      澳门|    5|
+--------+-----+
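
For readers less familiar with the groupBy / count / orderBy chain, the same idea can be sketched in plain Python with `collections.Counter`. This is an analogy on the 10 sample rows shown earlier, not a replacement for the Spark job. The variable names are chosen for illustration.

```python
from collections import Counter

# Province column of the 10 sample rows shown earlier in this post
provinces = ['浙江省', '海外', '山东省', '河南省', '安徽省',
             '山东省', '山东省', '浙江省', '重庆市', '陕西省']

# Counter plays the role of groupBy('province').count();
# most_common() returns (province, count) pairs sorted by count, largest first
ranking = Counter(provinces).most_common()
top_province, top_count = ranking[0]
```

On these 10 rows, 山东省 comes first with 3 commenters. The full table above is the same computation over all 4752 records.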

Count the commenters on 《同桌的你》 by age

age = data.groupBy('age').count().orderBy(col("count").desc())

[Top 10 commenter ages]
+----+-----+
| age|count|
+----+-----+
|null| 2451|
|  21|  245|
|  27|  228|
|  19|  228|
|  20|  215|
|  22|  204|
|  23|  163|
|  18|  154|
|  17|  153|
|  24|  115|
+----+-----+

Compute the average age of the commenters on 《同桌的你》 in each region

mean = data.groupBy('province').agg({"age": "mean"})

[Average age by region]
+--------+------------------+
|province|          avg(age)|
+--------+------------------+
|     北京市|22.432692307692307|
|      海外|19.030075187969924|
|     辽宁省| 22.69811320754717|
|     浙江省| 21.06451612903226|
|     内蒙古|             20.75|
|      新疆| 20.00943396226415|
|     海南省|20.272727272727273|

...

|     吉林省|            22.375|
|    未知地区|3850.4285714285716|
|     上海市| 22.50793650793651|
|      澳门|              27.0|
|     青海省|20.333333333333332|
|     江西省|19.904109589041095|
|     安徽省| 20.22826086956522|
|     江苏省|21.293233082706767|
|     云南省|22.295454545454547|
+--------+------------------+
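
The 未知地区 ("unknown region") row averages about 3850, and the sample rows include ages of 0 and 1, so some age values are probably placeholders or entry errors. The plain-Python sketch below uses the 10 sample rows to show how such values pull a group mean, and how a range filter changes it. The function name and the cut-offs of 10 and 100 are illustrative assumptions, not something the post applies.

```python
from collections import defaultdict

# (province, age) pairs from the 10 sample rows; None marks an unknown age after the cast
rows = [('浙江省', 19), ('海外', 0), ('山东省', None), ('河南省', None), ('安徽省', None),
        ('山东省', None), ('山东省', None), ('浙江省', 1), ('重庆市', None), ('陕西省', 27)]

def mean_age_by_province(rows, min_age=None, max_age=None):
    """Average the known ages per province, optionally keeping only ages in [min_age, max_age]."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for province, age in rows:
        if age is None:
            continue  # unknown ages do not contribute, like nulls in Spark's mean
        if min_age is not None and age < min_age:
            continue
        if max_age is not None and age > max_age:
            continue
        sums[province] += age
        counts[province] += 1
    # Provinces with no usable age are omitted here; Spark would report them with a null mean
    return {province: sums[province] / counts[province] for province in sums}

raw_means = mean_age_by_province(rows)
filtered_means = mean_age_by_province(rows, min_age=10, max_age=100)
```

In the sample, the single age of 1 drags the 浙江省 mean from 19.0 down to 10.0, which shows how sensitive per-region averages are to a few implausible values.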

Count the commenters on 《同桌的你》 by number of fans

fans = data.groupBy('fans').count().orderBy(col("count").desc())

[Commenter fan counts]
+----+-----+
|fans|count|
+----+-----+
|   0| 1163|
|   1|  714|
|   2|  494|
|   3|  351|
|   4|  283|
|   5|  205|
|   6|  173|
|   7|  149|
|   8|  110|
|   9|  101|
|  10|   84|
+----+-----+

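Visualization

The results of the analysis are plotted with matplotlib.
(During the experiment, Chinese text repeatedly failed to render, so pinyin and English labels are used for now.)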
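
The plotting code itself is not included in the post. As a hedged sketch of the workaround just described, the snippet below maps Chinese province names to Latin-script labels and then draws a bar chart. The label mapping and the function names are hypothetical. The counts are the top five rows of the province table above.

```python
# Top five rows of the province table, paired with hypothetical pinyin/English labels
PROVINCE_COUNTS = [('广东省', 534), ('海外', 271), ('四川省', 263),
                   ('山东省', 263), ('江苏省', 254)]
LABELS = {'广东省': 'Guangdong', '海外': 'Overseas', '四川省': 'Sichuan',
          '山东省': 'Shandong', '江苏省': 'Jiangsu'}

def to_plot_series(counts, labels):
    """Split (name, count) pairs into display labels and values, replacing Chinese names."""
    names = [labels.get(name, name) for name, _ in counts]
    values = [value for _, value in counts]
    return names, values

def plot_bar(names, values, path):
    """Draw a bar chart and save it to path."""
    import matplotlib
    matplotlib.use('Agg')  # render to a file; no display is needed
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    positions = list(range(len(names)))
    ax.bar(positions, values)
    ax.set_xticks(positions)
    ax.set_xticklabels(names, rotation=45)
    ax.set_xlabel('Province')
    ax.set_ylabel('Number of commenters')
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)

names, values = to_plot_series(PROVINCE_COUNTS, LABELS)
# plot_bar(names, values, 'province.png')  # uncomment to write the chart to disk
```

An alternative to relabelling is to point matplotlib's `font.sans-serif` rcParam at an installed CJK font. Whether such a font was available in the environment above is not stated.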

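Distribution of 《同桌的你》 listeners by region
The chart shows that Guangdong has far more listeners of this song than any other region.
Age distribution of 《同桌的你》 listeners
The chart shows that most listeners of this song are between 20 and 30 years old; there are also a great many 33-year-old listeners.
One contributing factor is that the audience of online music platforms falls mostly in this age range.

Average listener age of 《同桌的你》 by region
The average age is mostly between 17 and 27.
Number of fans per listener
Most listeners have fewer than 10 fans.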