Project Practice: Mining Hot Topics from Live News Headlines

This series collects my study notes for the Nanjing University course 用Python玩转数据, recorded mainly as mind maps.

8.2 Mining Hot Topics from Live News Headlines

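The course material uses regular expressions to scrape trending headlines from Sina News and draws a word cloud from them. Here the same analysis is done on trending news scraped from Toutiao (今日头条) instead.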

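  • Fetching the trending news

    Open https://www.toutiao.com/ch/news_hot/ in Chrome and look at the developer tools: the page loads its trending news as JSON from an API endpoint that can be called directly.

    [Image: chrome_f12]

    Analyzing the URLs of these requests shows the following.

    First request: https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0

    This returns the data together with a next max_behot_time value (read as data['next']['max_behot_time'] in the implementation). Substituting that value for max_behot_time and max_behot_time_tmp in the URL above fetches the next batch.

    Note: the URL shown in the Chrome tools is longer: https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true&as=A1258C7D00422B9&cp=5CD032F26B29AE1&_signature=z.t.3AAAkzMhvSCXMjx438.7f8. The trailing parameters dropped here (tadrequire, as, cp, _signature) carry browser-related information; if they are left in, the request from Python fails.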

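  • Word segmentation

    Segmentation uses jieba, which can be called directly as a third-party package. Remember to filter out stop words, and note that word frequencies have to be counted by hand.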

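  • Drawing the word cloud

    The word cloud is drawn with the third-party wordcloud library. The course material uses the default rendering, which tends to look plain. The result improves if the source image is loaded as a numpy array and its colors are used to recolor the generated cloud.

    Note that the image must not be too small, otherwise an exception may be raised.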
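Going back to the first step, the URL cleanup and paging can be sketched with the standard library alone. This is a minimal sketch, not the code used in the post: the helper names strip_browser_params and next_page_url are made up for illustration, and the set of dropped parameters is simply the difference between the two URLs quoted above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters present in the URL copied from Chrome but absent from the
# shortened URL in the text (assumed to be the ones that must be dropped).
BROWSER_PARAMS = {"tadrequire", "as", "cp", "_signature"}


def strip_browser_params(url, drop=BROWSER_PARAMS):
    """Return `url` without the query parameters named in `drop`."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def next_page_url(url, max_behot_time):
    """Return `url` with both cursor parameters set to `max_behot_time`."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["max_behot_time"] = str(max_behot_time)
    query["max_behot_time_tmp"] = str(max_behot_time)
    return urlunsplit(parts._replace(query=urlencode(query)))


copied = ("https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao"
          "&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true"
          "&as=A1258C7D00422B9&cp=5CD032F26B29AE1&_signature=z.t.3AAAkzMhvSCXMjx438.7f8")
first_url = strip_browser_params(copied)
print(first_url)
# 123 is a placeholder cursor; a real one comes from the previous response.
print(next_page_url(first_url, 123))
```

Parsing the query string instead of concatenating strings keeps the two cursor parameters in sync and makes it harder to leave a stray browser parameter in the request.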

Implementation:

# -*- coding: utf-8 -*-
# author:zhengk

import json
import os
import re

import jieba.posseg as pseg
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image as Image
import requests
from wordcloud import WordCloud, ImageColorGenerator


def get_news_hot(loop=5):
    """Fetch `loop` batches of trending news and save one title per line."""
    max_behot_time = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
    with open('hot_news.txt', 'w', encoding='utf-8') as f:
        for x in range(loop):
            url = "https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao" \
                  "&widen=1&max_behot_time=" + str(max_behot_time) + "&max_behot_time_tmp=" + str(max_behot_time)
            res = requests.get(url, headers=headers)
            if res.status_code == 200:
                data = json.loads(res.text)
                for news in data['data']:
                    # Write one title per line so readlines() can split them later
                    f.write(news['title'] + '\n')
                # The response carries the cursor for the next batch
                max_behot_time = data['next']['max_behot_time']


def extract_words():
    """Segment the saved titles, count the nouns, and draw the word cloud."""
    with open('hot_news.txt', 'r', encoding='utf-8') as f:
        news_subjects = f.readlines()

    with open('stopwords.txt', encoding='utf-8') as f:
        stop_words = set(line.strip() for line in f)

    # Keep only words whose part-of-speech flag contains a noun tag (n, nr, ns, ...)
    p = re.compile("n[a-z0-9]{0,2}")
    news_list = []

    for subject in news_subjects:
        if subject.isspace():
            continue

        for word, flag in pseg.cut(subject):
            if word not in stop_words and p.search(flag) is not None:
                news_list.append(word)

    # Count word frequencies by hand
    content = {}
    for item in news_list:
        content[item] = content.get(item, 0) + 1

    d = os.path.dirname(__file__)
    img = Image.open(os.path.join(d, "toutiao.jpg"))
    width = img.width / 80
    height = img.height / 80
    alice_coloring = np.array(img)
    my_wordcloud = WordCloud(background_color="white",
                             max_words=500, mask=alice_coloring,
                             max_font_size=200, random_state=42,
                             font_path=os.path.join(d, "PingFang.ttc"))
    my_wordcloud = my_wordcloud.generate_from_frequencies(content)

    # Recolor the words with colors taken from the source image
    image_colors = ImageColorGenerator(alice_coloring)
    plt.figure(figsize=(width, height))
    plt.imshow(my_wordcloud.recolor(color_func=image_colors))
    plt.axis("off")
    # Use subplots_adjust to control the outer margins of the figure
    plt.subplots_adjust(bottom=.01, top=.99, left=.01, right=.99)
    plt.savefig("jupiter_wordcloud_1.png")
    plt.show()


if __name__ == '__main__':
    get_news_hot(5)
    extract_words()
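
The filtering and counting inside extract_words can also be written with collections.Counter. The sketch below is an alternative formulation rather than the post's code: count_nouns is a hypothetical helper, and the sample pairs are hand-written stand-ins for what pseg.cut yields, not real jieba output, so the snippet runs without jieba installed.

```python
import re
from collections import Counter

# Same noun-flag pattern as in extract_words
NOUN_FLAG = re.compile(r"n[a-z0-9]{0,2}")


def count_nouns(tagged_words, stop_words):
    """Count words that carry a noun flag and are not stop words."""
    return Counter(
        word for word, flag in tagged_words
        if word not in stop_words and NOUN_FLAG.search(flag)
    )


# Hand-written (word, flag) pairs in the shape produced by pseg.cut
sample = [("北京", "ns"), ("的", "uj"), ("天气", "n"),
          ("北京", "ns"), ("发布", "v"), ("消息", "n")]
freq = count_nouns(sample, stop_words={"消息"})
print(freq.most_common(1))  # [('北京', 2)]
```

Counter is a dict subclass, so the result can be passed to generate_from_frequencies in the same way as the hand-built dictionary.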

Original image:

[Image: toutiao]

Final result:

[Image: wordcloud]
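
As background to the recoloring step: according to the wordcloud documentation, ImageColorGenerator colors each word with the mean color of the rectangle that the word covers in the source image. The sketch below illustrates only that averaging idea and is not the library's code. To keep it dependency-free, nested lists stand in for the numpy array; mean_color and the 2x2 image are made up for illustration.

```python
def mean_color(image, left, top, width, height):
    """Average the RGB values over a rectangular patch of `image`.

    `image` is a list of rows; each row is a list of (r, g, b) tuples.
    """
    pixels = [image[y][x]
              for y in range(top, top + height)
              for x in range(left, left + width)]
    if not pixels:
        raise ValueError("the patch is empty")
    n = len(pixels)
    # Integer mean per channel
    return tuple(sum(pixel[c] for pixel in pixels) // n for c in range(3))


# Hypothetical 2x2 image: red and blue on the top row, white below
image = [[(255, 0, 0), (0, 0, 255)],
         [(255, 255, 255), (255, 255, 255)]]
print(mean_color(image, 0, 0, 2, 1))  # (127, 0, 127)
print(mean_color(image, 0, 0, 2, 2))  # (191, 127, 191)
```

With a numpy array the same average is a slice followed by a mean over the pixel axis.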