Saving Web Pages as PDF with Selenium

Background

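Previously, I used Selenium to crawl the links to every article published by the WeChat official account "新世相" (see "Using Selenium to get all articles of a WeChat official account"). The next step is to fetch the articles themselves. Since the pages contain images, I decided to have the browser print each page to PDF, and also to save an image-free plain-text copy for later analysis.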

So how do you print a page to PDF with Selenium?

Approach

A search online turned up the following main options:

Use the third-party package pdfkit; see: https://www.cnblogs.com/silence-cc/p/9463227.html
Use Chrome's --print-to-pdf mode to export the requested HTML as a PDF; see: http://osask.cn/front/ask/view/1029784
Call the browser's print function with the JavaScript command window.print(); see: https://gitee.com/shinemic/codes/09y87ph6vf2c5zamwls3q48

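Here we go with the third option: it adapts reasonably well to different pages and makes progress easy to watch. If you want to hide the browser window, just add the --headless option.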

The steps are as follows:

Configure the chromedriver options

import json

from selenium import webdriver

# Preselect "Save as PDF" as the destination in Chrome's print preview
appState = {
    "recentDestinations": [
        {
            "id": "Save as PDF",
            "origin": "local"
        }
    ],
    "selectedDestinationId": "Save as PDF",
    "version": 2
}
profile = {
    'printing.print_preview_sticky_settings.appState': json.dumps(appState),
    'savefile.default_directory': './articles'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', profile)
# --kiosk-printing confirms the print preview automatically, without user interaction
chrome_options.add_argument('--kiosk-printing')

Here savefile.default_directory sets the directory where the articles are saved; configure it for your own setup.
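
As a minimal sketch of the configuration above, the prefs can be built by a small helper that also makes sure the output directory exists. The helper name build_print_prefs is hypothetical, and switching to an absolute path is my own assumption (a relative path depends on the working directory of the process), not something the original setup requires.

```python
import json
import os


def build_print_prefs(output_dir):
    """Return Chrome prefs that select "Save as PDF" and save into output_dir."""
    # Assumption: resolve to an absolute path so the result does not depend
    # on the current working directory.
    abs_dir = os.path.abspath(output_dir)
    os.makedirs(abs_dir, exist_ok=True)  # create the folder if it is missing

    app_state = {
        "recentDestinations": [{"id": "Save as PDF", "origin": "local"}],
        "selectedDestinationId": "Save as PDF",
        "version": 2,
    }
    return {
        "printing.print_preview_sticky_settings.appState": json.dumps(app_state),
        "savefile.default_directory": abs_dir,
    }
```

The returned dictionary can be passed to chrome_options.add_experimental_option('prefs', ...) in place of profile.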

Save the PDF
driver.get(url)
time.sleep(5)
# Save the PDF
temp_title = driver.title
driver.execute_script('window.print();')
When Chrome prints a page, the default file name is the page title, so we first store it with temp_title = driver.title.
Rename
os.rename('./articles/' + temp_title + '.pdf', './articles/' + title + '.pdf')
If you open several pages from the same site and save each as a PDF, pages that share a title will likely overwrite one another, so rename the PDF after each save.

Note: if the page fails to load properly, the title may be empty, and the rename will then raise an exception, so exception handling is required.
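
One way to handle this is sketched below. The helper names sanitize_filename and rename_pdf are hypothetical, and the set of characters replaced is an assumption about what is commonly illegal in file names; Chrome may also alter the title when it builds the file name, which this sketch does not account for.

```python
import logging
import os
import re


def sanitize_filename(name, fallback="untitled"):
    """Replace characters that are commonly illegal in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', "_", name).strip()
    return cleaned or fallback


def rename_pdf(directory, temp_title, title):
    """Rename '<temp_title>.pdf' to '<title>.pdf'; return True on success."""
    if not temp_title:
        # An empty page title means we cannot know which file Chrome wrote.
        logging.warning("Page title is empty, skipping rename for %r", title)
        return False
    src = os.path.join(directory, temp_title + ".pdf")
    dst = os.path.join(directory, sanitize_filename(title) + ".pdf")
    try:
        os.rename(src, dst)
        return True
    except OSError as e:
        # Covers a missing source file as well as other file-system errors.
        logging.warning("Could not rename %s: %s", src, e)
        return False
```

Returning False instead of raising lets the caller log the failure and move on to the next article.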

Implementation

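Following the approach above, add a sleep after opening the page and after triggering the print, so that the page has loaded and the PDF has been written before the rename. The implementation is as follows: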

import csv
import json
import logging
import os
import time

from selenium import webdriver
from selenium.webdriver.common.by import By


def get_articles():
    appState = {
        "recentDestinations": [
            {
                "id": "Save as PDF",
                "origin": "local"
            }
        ],
        "selectedDestinationId": "Save as PDF",
        "version": 2
    }
    profile = {
        'printing.print_preview_sticky_settings.appState': json.dumps(appState),
        'savefile.default_directory': './articles'
    }
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_experimental_option('prefs', profile)
    chrome_options.add_argument('--kiosk-printing')
    # Selenium 3 style; newer Selenium 4 releases expect a Service object instead of executable_path
    driver = webdriver.Chrome(executable_path='./chromedriver', options=chrome_options)
    driver.implicitly_wait(60)
    count = 1
    with open('articles.csv', newline='') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=';')
        for line in spamreader:
            try:
                title = line[0].split(';')[1]
                url = line[1]
                print("Downloading article " + str(count) + ", title: " + title)
                driver.get(url)
                time.sleep(5)
                # Save the PDF; Chrome names the file after the page title
                temp_title = driver.title
                driver.execute_script('window.print();')
                time.sleep(10)
                os.rename('./articles/' + temp_title + '.pdf', './articles/' + title + '.pdf')
                # Save the plain text of the article body
                content = driver.find_element(By.ID, 'js_article').text
                with open('./text/' + title + '.txt', 'w') as f:
                    f.write(content)
                count += 1
            except Exception as e:
                # Log the failure and continue with the next article
                logging.exception(e)
    driver.quit()
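
The fixed time.sleep(10) after printing is a guess at how long Chrome needs to write the file. As a hedged alternative, one could poll for the PDF instead. The helper wait_for_file below is hypothetical, and treating "same non-zero size on two consecutive polls" as "finished writing" is a heuristic assumption, not a guarantee.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.5):
    """Poll until `path` exists and its size stops changing, or until timeout."""
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            # Two consecutive polls with the same non-zero size suggest
            # that the file has been fully written.
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll_interval)
    return False
```

In the loop above, the sleep before os.rename could then become a call such as wait_for_file('./articles/' + temp_title + '.pdf'), skipping the rename when it returns False.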

Full code: https://github.com/keejo125/web_scraping_and_data_analysis/tree/master/weixin

If you know a better way, feel free to share it.