Scraping Qidian Novels with Python | Leo's Personal Blog

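Preface

A friend who was working on some copywriting recently asked for a way to download novels from a certain site, which no longer lets users download them. That is how this script came about: it scrapes a novel from the site and saves the scraped content as a .txt file.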

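Requirements Analysis

I thought of two possible solutions:

Use an HTTP request tool to fetch the chapter text from the page and write it to a text file, grab the link to the next chapter at the same time, and then issue a new request. Stop writing once no next-chapter button can be parsed.
Simulate a user operating the browser: read the current page's content and write it to a text file, then click the next-chapter button and continue. Stop writing once there is no next-chapter button.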
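
The first approach can be sketched roughly as follows. This is only an illustration, not the script used in this post. It swaps BeautifulSoup for the standard library's html.parser so that it has no third-party dependencies, and it assumes that the chapter text sits in span elements with class content-wrap and that the next-chapter link is an a element with id j_chapterNext (names borrowed from the Selenium script later in this post; whether the site serves them in static HTML is not verified here). ChapterParser, fetch_html, and crawl are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ChapterParser(HTMLParser):
    """Collects chapter text and the next-chapter link from one page.

    Assumed markup: text inside <span class="content-wrap"> and a link
    <a id="j_chapterNext" href="..."> pointing at the next chapter.
    """

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.next_href = None
        self._depth = 0  # greater than 0 while inside a content span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id") == "j_chapterNext":
            self.next_href = attrs.get("href")
        if tag == "span":
            if self._depth > 0:
                self._depth += 1  # nested span inside the content
            elif "content-wrap" in (attrs.get("class") or "").split():
                self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.text_parts.append(data)


def fetch_html(url):
    """Download a page and decode it as UTF-8 (assumed encoding)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_chapters=1000):
    """Follow next-chapter links from start_url and return the joined text."""
    chapters = []
    seen = set()
    url = start_url
    while url and url not in seen and len(chapters) < max_chapters:
        seen.add(url)
        parser = ChapterParser()
        parser.feed(fetch(url))
        chapters.append("".join(parser.text_parts))
        # Stop when the page has no next-chapter link
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return "\n".join(chapters)
```

Passing the fetch function as a parameter keeps the link-following loop separate from the network call, so the loop can be tried against a few saved pages before pointing it at a real site.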

Here we use the second approach.

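Tech Stack

The following are used:

Python: 3.7.7
selenium webdriver: 4.4.3
Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it.
BeautifulSoup (bs4): 4.11.1
BeautifulSoup is a Python library for extracting data from HTML or XML files.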

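Development Setup

Download the Browser Driver

Because Selenium simulates a user operating the browser, it has to go through a WebDriver driver to control the local browser, so we need to download a driver that matches the local browser's version. Google Chrome is used as the example here. Selenium also supports drivers for other browsers, which are not covered in this post.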

Check the Local Browser Version

Click Help, then About Google Chrome, and the following page appears.

Note down your browser version here.

Download the Driver

Once that is done, open the link below to download the Chrome driver: http://chromedriver.storage.googleapis.com/index.html
Pick the version closest to your local browser's version.
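
As a rough aid for picking a version, the sketch below filters a list of driver versions down to those that share the browser's major version. It assumes that "closest" here means the same leading version number; the function names and the sample version strings are illustrative only and are not taken from the download page.

```python
def major_version(version):
    """Return the leading component of a dotted version string as an int."""
    return int(version.strip().split(".")[0])


def matching_drivers(browser_version, available):
    """Return the driver versions whose major version equals the browser's."""
    wanted = major_version(browser_version)
    return [v for v in available if major_version(v) == wanted]


if __name__ == "__main__":
    # Illustrative values only; substitute your own browser version and
    # the versions listed on the driver download page.
    candidates = ["103.0.0.1", "104.0.0.1", "104.0.0.2", "105.0.0.1"]
    print(matching_drivers("104.0.0.9", candidates))
```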

Download chromedriver_win32.zip.

Once the download finishes, find the Python installation path:

C:\Users\liaoy>where python
C:\Users\liaoy\AppData\Local\Programs\Python\Python37\python.exe
C:\Users\liaoy\AppData\Local\Microsoft\WindowsApps\python.exe

Go to the Python root directory and copy chromedriver.exe, extracted from chromedriver_win32.zip, into it.
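
To confirm that the copied driver can actually be found, a small check along these lines may help. It assumes the Python root directory is on PATH (the where python output above suggests it is) and uses only the standard library; locate_driver is a hypothetical helper name.

```python
import shutil


def locate_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = locate_driver()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print("chromedriver found at", path)
```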

With that, the environment is set up. Now straight to the code.

Implementation

import time

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def print_text(page_source, note):
    if page_source is not None:
        # Parse the content of the visited page
        soup = BeautifulSoup(page_source, "html.parser")
        # Write the text of every span with class "content-wrap"
        for span in soup.find_all("span", class_="content-wrap"):
            note.write(span.get_text())


if __name__ == '__main__':
    options = webdriver.ChromeOptions()  # Create the browser options
    options.add_argument('--headless')  # Run in headless mode
    options.add_argument('ignore-certificate-errors')  # Work around the "Your connection is not private" error
    driver = webdriver.Chrome(executable_path='chromedriver',
                              options=options)  # Start Chrome; executable_path is the driver path
    driver.get("https://read.qidian.com/chapter/hbIfVTSpixsEGYrhBm4H8w2/UtU3j7SgUdnM5j8_3RRvhw2/")  # Open Qidian
    # Path of the text file to save
    Note = open('C:\\Users\\liaoy\\Desktop\\' + driver.title.split('_', 1)[0] + '.txt', mode='w', encoding='utf-8')
    while True:
        # Simulate scrolling in the browser
        for i in range(1, 5):
            driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(2)
        # Write the page content to the text file
        print_text(driver.page_source, Note)
        # Close the guide popup
        try:
            znBtn = driver.find_element(By.CSS_SELECTOR, "a.lbf-panel-close.lbf-icon.lbf-icon-close")
        except NoSuchElementException:
            znBtn = None
        if znBtn is not None:
            time.sleep(2)
            # Click the popup's close button
            webdriver.ActionChains(driver).move_to_element(znBtn).click(znBtn).perform()
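        # Switch to the login popup's iframe, if it is present
        try:
            loginFrame = driver.find_element(By.ID, "loginIfr")
        except NoSuchElementException:
            loginFrame = None
        if loginFrame is not None:
            driver.switch_to.frame(loginFrame)
            # Close the login popup
            try:
                closeBtn = driver.find_element(By.ID, "close")
            except NoSuchElementException:
                closeBtn = None
            if closeBtn is not None:
                time.sleep(2)
                # Click the popup's close button
                webdriver.ActionChains(driver).move_to_element(closeBtn).click(closeBtn).perform()
            # Switch back to the main page
            driver.switch_to.default_content()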
        # Go to the next page
        try:
            btn = driver.find_element(By.ID, "j_chapterNext")
        except NoSuchElementException:
            # No next-chapter button: finish writing the text file
            Note.close()
            # Quit the browser and leave the loop
            driver.quit()
            break
        # Click the next-page button
        webdriver.ActionChains(driver).move_to_element(btn).click(btn).perform()

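The code above uses try and except many times. When driver.find_element cannot find an element, it raises an exception and terminates the program. There are quite a few popups, and a given page sometimes shows them and sometimes does not, so the only option here was to catch the exceptions one by one to keep the program from being interrupted.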
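
One possible way to cut down on the repetition, offered as a sketch rather than as the method used in this post, is to wrap the lookup in a small helper. It relies on Selenium's find_elements, which returns an empty list instead of raising when nothing matches. find_or_none and click_if_present are hypothetical names, and the helper clicks the element directly instead of going through ActionChains as the script does.

```python
def find_or_none(driver, by, value):
    """Return the first element matching the locator, or None if absent.

    find_elements returns an empty list when nothing matches, so no
    NoSuchElementException has to be caught here.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


def click_if_present(driver, by, value):
    """Click the first matching element; return True if one was found."""
    element = find_or_none(driver, by, value)
    if element is None:
        return False
    element.click()
    return True


# Possible use inside the loop (By comes from selenium.webdriver.common.by):
#     click_if_present(driver, By.CSS_SELECTOR, "a.lbf-panel-close.lbf-icon.lbf-icon-close")
#     if not click_if_present(driver, By.ID, "j_chapterNext"):
#         break
```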