Data Acquisition and Representation 2

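This series contains study notes for the Nanjing University course 《用Python玩转数据》 (literally "Playing with Data in Python"), recorded mainly as mind maps.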

Advanced: small crawler projects (3 items)

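  1. Advanced version of the "mini crawler programming exercise": extract the first 50 short comments on a book and compute the average rating (star). Hint: some comments do not include a rating.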

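    import time

    import requests
    from bs4 import BeautifulSoup


    if __name__ == '__main__':
        num = 1    # page number
        count = 0  # rated comments collected so far
        stars = []
        # 都挺好
        while count < 50:
            url = 'https://book.douban.com/subject/20492971/comments/hot?p=' + str(num)
            try:
                res = requests.get(url)
            except Exception as e:
                print(e)
                break
            if res.status_code != 200:
                # Stop instead of looping forever when the page cannot be fetched
                break
            # Name the parser explicitly so the result does not depend on what is installed
            soup = BeautifulSoup(res.content, 'html.parser')
            comments = soup.find_all('div', 'comment')
            if not comments:
                break  # no more pages
            for comment in comments:
                comment_text = comment.find('span', 'short').string
                # Skip short comments that have no rating
                try:
                    star = int(''.join(filter(str.isdigit, comment.find('span', 'user-stars').attrs['class'][1])))
                except Exception:
                    continue
                stars.append(star)
                count += 1
                print('{}:{}:{}'.format(count, star, comment_text))
                if count >= 50:
                    break
            num += 1
            time.sleep(5)  # pause between requests
        if stars:
            # Divide by the number of ratings actually collected, not a fixed 50
            print('Average rating of the {} rated short comments: {}'.format(len(stars), sum(stars) / len(stars)))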
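  2. Scrape the Dow Jones component stock data from "http://money.cnn.com/data/dow30/" and output a list containing the ticker symbol, company name, and most recent trade price of the 30 companies.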

    import re

    import requests


    if __name__ == '__main__':
        url = 'http://money.cnn.com/data/dow30/'
        res = requests.get(url)
        if res.status_code == 200:
            # Raw string, so the regex is not affected by Python string escapes.
            # Groups: ticker symbol, company name, last trade price.
            pattern = re.compile(r'class="wsod_symbol">(.*?)</a>.*?<span.*?>(.*?)</span>.*?\n.*?class="wsod_aRight"><span.*?class="wsod_stream">(.*?)</span>')
            out_list = pattern.findall(res.text)
            print(out_list)
  3. Scrape the data on the page http://www.volleyball.world/en/vnl/2018/women/results-and-ranking/round1 (including TEAMS and the TOTAL, WON, LOST columns of MATCHES).

    import re

    import requests


    if __name__ == '__main__':
        url = 'http://www.volleyball.world/en/vnl/2018/women/results-and-ranking/round1'
        res = requests.get(url)
        if res.status_code == 200:
            # Raw string, so \s reaches the regex engine unchanged.
            # Groups: team name, then the TOTAL, WON and LOST match columns.
            pattern = re.compile(r'href="/en/vnl/2018/women/teams.*?>(.*?)</a></figcaption>\s+</figure>\s+</td>\s+<td>(.*?)</td>\s+<td class="table-td-bold">(.*?)</td>\s+<td class="table-td-rightborder">(.*?)</td>')
            out_list = pattern.findall(res.text)
            print(out_list)
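
The rating step of project 1 can be separated from the network code and checked offline. What follows is only a minimal sketch: it assumes the rating is encoded in a CSS class such as `allstar40`, which is what the digit filter in the solution implies but is not confirmed by the notes. The helper names `parse_star` and `average_stars` are hypothetical.

```python
import re


def parse_star(class_names):
    """Return the rating encoded in a list of CSS classes, or None.

    Assumes a class shaped like 'allstar40' (an assumption about the markup).
    """
    for name in class_names:
        match = re.fullmatch(r'allstar(\d+)', name)
        if match:
            return int(match.group(1))
    return None


def average_stars(class_lists):
    """Average the ratings that are present, skipping comments without one."""
    stars = [s for s in map(parse_star, class_lists) if s is not None]
    return sum(stars) / len(stars) if stars else None


# Made-up class lists standing in for three scraped comments
samples = [
    ['user-stars', 'allstar40', 'rating'],
    ['user-stars', 'allstar50', 'rating'],
    ['comment-info'],  # a comment without a rating
]
print(average_stars(samples))
```

Dividing by the number of ratings found, rather than by a fixed 50, keeps the average correct when some comments carry no rating.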

Basic programming exercises

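  • 1. Read an integer n (between 1 and 9) from the keyboard. From the integers 1-100, remove every number that contains the digit n or is divisible by n. For example, if n is 6, remove numbers containing 6 such as 6 and 16, and multiples of 6 such as 12 and 18. Output all remaining numbers, starting a new line after every 10 numbers.

    Test data:

    Enter the number: 6
    Screen output: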
    1,2,3,4,5,7,8,9,10,11
    13,14,15,17,19,20,21,22,23,25
    27,28,29,31,32,33,34,35,37,38
    39,40,41,43,44,45,47,49,50,51
    52,53,55,57,58,59,70,71,73,74
    75,77,79,80,81,82,83,85,87,88
    89,91,92,93,94,95,97,98,99,100

    Method 1:

    if __name__ == '__main__':
        # int() instead of eval(): never evaluate raw keyboard input
        n = int(input('Please input a number(1-9):'))
        out_string = ''
        if 1 <= n <= 9:
            count = 1
            for num in range(1, 101):
                # Skip multiples of n and numbers containing the digit n
                if num % n == 0 or str(n) in str(num):
                    continue
                out_string = out_string + str(num) + ','
                # Print a line after every 10 kept numbers
                if count % 10 == 0:
                    print(out_string[:-1])
                    out_string = ''
                count += 1
            # Print whatever is left in a final, shorter line
            if out_string:
                print(out_string[:-1])
        else:
            print('Wrong Number! Please input a number between 1 and 9!')

    Method 2:

    if __name__ == '__main__':
        n = int(input('Please input a number(1-9):'))
        out_string = ''
        if 1 <= n <= 9:
            # Named nums so the built-in list is not shadowed
            nums = list(filter(lambda x: x % n != 0 and str(n) not in str(x), range(1, 101)))
            count = 1
            for num in nums:
                out_string = out_string + str(num) + ','
                if count % 10 == 0:
                    print(out_string[:-1])
                    out_string = ''
                count += 1
            if out_string:
                print(out_string[:-1])
        else:
            print('Wrong Number! Please input a number between 1 and 9!')
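  • 2. Use a random function to generate 500 lines of random integers between 1 and 100 and store them in the file random.txt, then write a program that finds and outputs the mode of these integers. The mode is the number that occurs most often in a set of numbers.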

    import random


    if __name__ == '__main__':
        # 'w+' allows writing the numbers and reading them back through the same handle
        with open('random.txt', 'w+') as f:
            for i in range(500):
                f.write(str(random.randint(1, 100)))
                f.write('\n')
            f.seek(0)
            nums = f.readlines()
        # Count how often each number occurs
        nums_dict = {}
        for num in nums:
            num = num.strip()
            if num in nums_dict:
                nums_dict[num] += 1
            else:
                nums_dict[num] = 1
        sorted_nums = sorted(nums_dict.items(), key=lambda d: d[1], reverse=True)
        # Several numbers may share the highest count
        max_num = sorted_nums[0][1]
        for k, v in sorted_nums:
            if v == max_num:
                print(k)
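  • 3. The file article.txt contains an English article (create it yourself and add test text). Assuming the only punctuation marks in the article are ",", ".", "!", "?" and "…", write a program that finds and outputs the longest word.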

    if __name__ == '__main__':
        with open('article.txt', 'r') as f:
            data = f.read()
        words = set()
        for word in data.split():
            # Strip the punctuation named in the task, including "…" and "..."
            word = word.strip(',.!?…')
            if word:
                words.add(word)
        result = sorted(words, key=len, reverse=True)
        if result:
            max_len = len(result[0])
            # Print every word that ties for the maximum length
            for word in result:
                if len(word) == max_len:
                    print(word)
                else:
                    break
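
The three exercises above mix keyboard and file I/O with the actual logic, which makes them awkward to check. As a rough sketch, the logic can be pulled into small pure functions. The function names are hypothetical, and `longest_words` swaps in a regular expression that treats runs of ASCII letters and apostrophes as words, which is an assumption rather than the approach used in the solution.

```python
import re
from collections import Counter


def surviving_numbers(n, upper=100):
    """Numbers in 1..upper that neither contain the digit n nor are multiples of n."""
    return [x for x in range(1, upper + 1)
            if x % n != 0 and str(n) not in str(x)]


def format_rows(numbers, per_row=10):
    """Comma-joined rows holding per_row numbers each (the last row may be shorter)."""
    return [','.join(map(str, numbers[i:i + per_row]))
            for i in range(0, len(numbers), per_row)]


def modes(values):
    """All values that occur most often; ties are returned together, sorted."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def longest_words(text):
    """Longest words in text; a word is a run of ASCII letters and apostrophes."""
    words = set(re.findall(r"[A-Za-z']+", text))
    if not words:
        return []
    max_len = max(map(len, words))
    return sorted(w for w in words if len(w) == max_len)


print('\n'.join(format_rows(surviving_numbers(6))))
print(modes([3, 7, 7, 3, 5]))
print(longest_words('Hello, world! Programming is fun...'))
```

With n = 6 the first call reproduces the seven rows shown in the test data for exercise 1.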

Data representation programming problems

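  • 1. Count the characters in a string

    Problem:

    Define a function countchar() that counts, in alphabetical order, the occurrences of every letter in a string (uppercase input is allowed, and counting is case-insensitive). The expected format is as follows:

    Input format:

    A string

    Output format:

    A list

    Sample input:

    Hello, World!

    Sample output: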

    [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 3, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]

    def countchar(string):
        # One slot per letter, a to z
        alphabet = 'abcdefghijklmnopqrstuvwxyz'
        count_list = [0 for x in range(len(alphabet))]
        for char in string:
            try:
                count_list[alphabet.index(char.lower())] += 1
            except ValueError:
                # Not a letter: ignore it
                continue
        return count_list


    if __name__ == "__main__":
        string = input()
        print(countchar(string))
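  • 2. Find the pandigital numbers in the input

    Problem:

    An n-digit number is called pandigital if it contains each of the digits 1 to n exactly once; for example, the four-digit number 1324 is 1-to-4 pandigital. Read a group of integers from the keyboard and output the pandigital ones; if none is found, output "not found".

    Input format:

    Several digit strings separated by single commas

    Output format:

    The digit strings that satisfy the condition, one per line

    Sample input:

    1243,322,321,1212,2354

    Sample output: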

    1243

    321

    def pandigital(nums):
        lst = []
        for num in nums:
            # str() lets the function accept ints as well as digit strings
            origin_list = sorted(str(num))
            # The digits 1..n, where n is the length of the number
            check_list = [str(x) for x in range(1, len(origin_list) + 1)]
            if origin_list and origin_list == check_list:
                lst.append(num)
        return lst


    if __name__ == "__main__":
        # Split the comma-separated input instead of passing it to eval()
        lst = pandigital(input().strip().split(','))
        if lst:
            for num in lst:
                print(num)
        else:
            print('not found')
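
Both problems in this section can also be expressed with standard-library containers. The following is a sketch of one possible alternative, not the course's reference solution: `countchar` uses `collections.Counter`, and the hypothetical helper `is_pandigital` compares sets instead of sorted lists.

```python
from collections import Counter
from string import ascii_lowercase


def countchar(text):
    """Occurrences of each letter a..z in text, ignoring case."""
    counts = Counter(text.lower())
    # A Counter returns 0 for letters that never occur
    return [counts[letter] for letter in ascii_lowercase]


def is_pandigital(digits):
    """True if the digit string uses each of 1..n exactly once, n being its length."""
    n = len(digits)
    # n distinct required digits in a string of length n means none is repeated
    return 0 < n <= 9 and set(digits) == {str(d) for d in range(1, n + 1)}


print(countchar('Hello, World!'))
found = [s for s in '1243,322,321,1212,2354'.split(',') if is_pandigital(s)]
print('\n'.join(found) if found else 'not found')
```

The set comparison works because a string of length n that contains all n required digits cannot contain any of them twice.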