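This series is a set of study notes for the Nanjing University course 《用Python玩转数据》 (Data Processing Using Python), recorded mainly as mind maps.
Advanced: small web-scraping projects (3 tasks)
Advanced version of the "mini crawler programming exercise": extract the first 50 short comments for a book and compute the average rating (star). Hint: some comments carry no rating.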
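import time

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    num = 1     # page number
    stars = []  # ratings collected so far
    # Book: 都挺好
    while len(stars) < 50:
        url = 'https://book.douban.com/subject/20492971/comments/hot?p=' + str(num)
        try:
            res = requests.get(url)
        except Exception as e:
            print(e)
            break
        if res.status_code != 200:
            # Stop instead of requesting the same page forever
            break
        soup = BeautifulSoup(res.content, 'html.parser')
        comments = soup.find_all('div', 'comment')
        if not comments:
            # No more comments on this page
            break
        for comment in comments:
            comment_text = comment.find('span', 'short').string
            # Skip short comments that carry no rating
            try:
                star = int(''.join(filter(str.isdigit, comment.find('span', 'user-stars').attrs['class'][1])))
            except Exception:
                continue
            stars.append(star)
            print('{}:{}:{}'.format(len(stars), star, comment_text))
            if len(stars) == 50:
                break
        num += 1
        time.sleep(5)
    if stars:
        # Divide by the number of ratings actually collected
        print('Average score of the {} rated short comments: {}'.format(len(stars), sum(stars) / len(stars)))

Scrape the Dow Jones component data from "http://money.cnn.com/data/dow30/" and output a list containing the ticker symbol, company name, and last traded price of each of the 30 companies.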
import re

import requests

if __name__ == '__main__':
    url = 'http://money.cnn.com/data/dow30/'
    res = requests.get(url)
    if res.status_code == 200:
        # Capture groups: ticker symbol, company name, last traded price
        pattern = re.compile(r'class="wsod_symbol">(.*?)</a>.*?<span.*?>(.*?)</span>.*?\n.*?class="wsod_aRight"><span.*?class="wsod_stream">(.*?)</span>')
        out_list = re.findall(pattern, res.text)
        print(out_list)

Scrape the data on the page (http://www.volleyball.world/en/vnl/2018/women/results-and-ranking/round1), including TEAMS and the TOTAL, WON, and LOST columns of MATCHES.
import re

import requests

if __name__ == '__main__':
    url = 'http://www.volleyball.world/en/vnl/2018/women/results-and-ranking/round1'
    res = requests.get(url)
    if res.status_code == 200:
        # Capture groups: team name, total matches, matches won, matches lost
        pattern = re.compile(r'href="/en/vnl/2018/women/teams.*?>(.*?)</a></figcaption>\s+</figure>\s+</td>\s+<td>(.*?)</td>\s+<td class="table-td-bold">(.*?)</td>\s+<td class="table-td-rightborder">(.*?)</td>')
        out_list = re.findall(pattern, res.text)
        print(out_list)
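Because a live page can change or become unavailable, it may help to check the regular expression offline first. The sketch below applies the pattern from the solution above to a small hand-written HTML fragment. The fragment, the team names and numbers in it, and the helper name `parse_rows` are all assumptions made for illustration; the fragment is shaped to match what the pattern expects and is not a copy of the real page.

```python
import re

# The pattern from the solution above, split across lines for readability.
ROW_PATTERN = re.compile(
    r'href="/en/vnl/2018/women/teams.*?>(.*?)</a></figcaption>\s+</figure>\s+</td>'
    r'\s+<td>(.*?)</td>'
    r'\s+<td class="table-td-bold">(.*?)</td>'
    r'\s+<td class="table-td-rightborder">(.*?)</td>'
)

# Hypothetical fragment with invented teams and numbers (not real results).
SAMPLE_HTML = '''
<td>
  <figure>
    <figcaption><a href="/en/vnl/2018/women/teams/aaa">Team A</a></figcaption>
  </figure>
</td>
<td>15</td>
<td class="table-td-bold">10</td>
<td class="table-td-rightborder">5</td>
<td>
  <figure>
    <figcaption><a href="/en/vnl/2018/women/teams/bbb">Team B</a></figcaption>
  </figure>
</td>
<td>15</td>
<td class="table-td-bold">7</td>
<td class="table-td-rightborder">8</td>
'''


def parse_rows(html):
    """Return (team, total, won, lost) tuples found in html."""
    return ROW_PATTERN.findall(html)


if __name__ == '__main__':
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

If the pattern returns an empty list on the real page, comparing the page source against a fragment like this one is a quick way to see which part of the markup no longer matches.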
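Basic programming exercises
1. Read an integer n (between 1 and 9) from the keyboard. From the integers 1 to 100, remove every number that contains the digit n or is divisible by n. For example, if n is 6, remove numbers containing 6 such as 6 and 16, as well as multiples of 6 such as 12 and 18. Output all remaining numbers, starting a new line after every 10 numbers.
Test data:
Enter the number: 6
Screen output:
1,2,3,4,5,7,8,9,10,11
13,14,15,17,19,20,21,22,23,25
27,28,29,31,32,33,34,35,37,38
39,40,41,43,44,45,47,49,50,51
52,53,55,57,58,59,70,71,73,74
75,77,79,80,81,82,83,85,87,88
89,91,92,93,94,95,97,98,99,100

Method 1: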
if __name__ == '__main__':
    # int() is safer than eval() for reading a number from the user
    n = int(input('Please input a number(1-9):'))
    out_string = ''
    if 1 <= n <= 9:
        count = 1
        for num in range(1, 101):
            # Skip multiples of n and numbers containing the digit n
            if num % n == 0 or str(n) in str(num):
                continue
            out_string = out_string + str(num) + ','
            # Print a line after every 10 kept numbers
            if count % 10 == 0:
                print(out_string[:-1])
                out_string = ''
            count += 1
        # Print the last, possibly shorter, line
        if out_string:
            print(out_string[:-1])
    else:
        print('Wrong Number! Please input a number between 1 and 9!')

Method 2:
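if __name__ == '__main__':
    # int() is safer than eval() for reading a number from the user
    n = int(input('Please input a number(1-9):'))
    out_string = ''
    if 1 <= n <= 9:
        # Keep numbers that are neither multiples of n nor contain the digit n
        # (named kept_nums so the built-in list is not shadowed)
        kept_nums = list(filter(lambda x: x % n != 0 and str(n) not in str(x), range(1, 101)))
        count = 1
        for num in kept_nums:
            out_string = out_string + str(num) + ','
            if count % 10 == 0:
                print(out_string[:-1])
                out_string = ''
            count += 1
        if out_string:
            print(out_string[:-1])
    else:
        print('Wrong Number! Please input a number between 1 and 9!')

2. Use a random function to generate 500 lines of random integers between 1 and 100 and store them in the file random.txt, then write a program that finds and outputs the mode of these integers. The mode is the number that occurs most often in a set of numbers.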
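import random

if __name__ == '__main__':
    with open('random.txt', 'w+') as f:
        for i in range(500):
            f.write(str(random.randint(1, 100)))
            f.write('\n')
        # Rewind so the numbers just written can be read back
        f.seek(0)
        nums = f.readlines()
    # Count how often each number occurs
    nums_dict = {}
    for num in nums:
        num = num.strip()
        if num in nums_dict:
            nums_dict[num] += 1
        else:
            nums_dict[num] = 1
    sorted_nums = sorted(nums_dict.items(), key=lambda d: d[1], reverse=True)
    # Several numbers may share the highest count, so print all of them
    max_count = sorted_nums[0][1]
    for k, v in sorted_nums:
        if v == max_count:
            print(k)

3. The file article.txt contains an English article (create it yourself and add some test text). Assuming the only punctuation marks in the article are ",", ".", "!", "?" and "…", write a program that finds and outputs the longest word.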
if __name__ == '__main__':
    with open('article.txt', 'r') as f:
        data = f.read()
    words = set()
    for word in data.split():
        # Strip the punctuation named in the task, including the "…" character
        word = word.strip(',.!?…')
        if word:
            words.add(word)
    result = sorted(words, key=len, reverse=True)
    if result:
        max_len = len(result[0])
        # Several words may share the maximum length, so print all of them
        for word in result:
            if len(word) == max_len:
                print(word)
            else:
                break
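An alternative sketch: extracting words with a regular expression avoids handling each punctuation mark by hand and also copes with punctuation that is not followed by a space. It assumes that a word consists only of ASCII letters and apostrophes, which goes beyond what the task states, and the function name `longest_words` is hypothetical.

```python
import re


def longest_words(text):
    """Return the longest words in text, in order of first appearance."""
    # Assumption: a word is a run of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return []
    max_len = max(len(w) for w in words)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(w for w in words if len(w) == max_len))


if __name__ == '__main__':
    sample = 'Hello, world! This is a wonderful… example, really.'
    print(longest_words(sample))
```

To use it on the exercise file, the contents of article.txt can be read with `open('article.txt').read()` and passed to the function.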
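Data representation programming problems
1. Count the characters in a string
Problem:
Define a function countchar() that counts, in alphabetical order, the occurrences of every letter in a string (uppercase input is allowed, and counting is case-insensitive).
Input format:
A string
Output format:
A list
Sample input: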
Hello, World!
Sample output:
[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 3, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
def countchar(string):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    count_list = [0] * len(alphabet)
    for char in string:
        try:
            # The index in the alphabet is the position in the result list
            count_list[alphabet.index(char.lower())] += 1
        except ValueError:
            # Not a letter a-z: ignore it
            continue
    return count_list


if __name__ == "__main__":
    string = input()
    print(countchar(string))

2. Find the pandigital numbers in the input
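Problem:
An n-digit number is called pandigital if it contains every digit from 1 to n exactly once; for example, the four-digit number 1324 is 1-to-4 pandigital. Read a group of integers from the keyboard and output the pandigital ones; if none is found, output "not found".
Input format:
Several digit strings separated by single commas
Output format:
The digit strings that satisfy the condition, one per line
Sample input:
1243,322,321,1212,2354
Sample output: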
1243
321
def pandigital(nums):
    lst = []
    for num in nums:
        # Sorted digits of an n-digit pandigital number are exactly 1..n
        origin_list = sorted(str(num))
        check_list = [str(x) for x in range(1, len(origin_list) + 1)]
        if origin_list == check_list:
            lst.append(num)
    return lst


if __name__ == "__main__":
    # Split the input on commas instead of calling eval() on it
    lst = pandigital([s.strip() for s in input().split(',')])
    if lst:
        for num in lst:
            print(num)
    else:
        print('not found')
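The same check can also be phrased without sorting. A minimal sketch, assuming each input is a string of digits; `is_pandigital` is a hypothetical helper name and not part of the solution above:

```python
def is_pandigital(digits):
    """Return True if digits uses each of 1..n exactly once, where n = len(digits)."""
    n = len(digits)
    # Only the digits 1-9 are available, so n must lie between 1 and 9.
    if not 1 <= n <= 9:
        return False
    # n characters whose set equals {1..n} means each digit appears exactly once.
    return set(digits) == {str(d) for d in range(1, n + 1)}


if __name__ == '__main__':
    candidates = '1243,322,321,1212,2354'.split(',')
    found = [c for c in candidates if is_pandigital(c)]
    print('\n'.join(found) if found else 'not found')
```

Comparing sets works here because the string has exactly n characters: if its set of characters equals the n digits 1 to n, no digit can be repeated or missing.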