Build a Python Bot to Find Your Website's Dead Links

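Dead links and missing images frustrate visitors. Checking for them by hand is maddening! We'll build a bot that crawls a website for missing resources using nothing but the Python standard library.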

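Let's talk about design goals. We want to run a single command that checks an entire website for dead resources. That means some crawling is involved.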

$ python deadseeker.py 'https://healeycodes.com/'
> 404 - https://docs.python.org/3/library/missing.html
> 404 - https://github.com/microsoft/solitare2

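More specifically, the bot should parse every HTML tag on a given page, looking for href and src attributes. When it finds one, it should send a GET request and log any HTTP error code. When it finds a local page (e.g. /about/, /projects/), it should queue it to be scanned later. As we check links, we add them to a set so that each one is only checked once.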

Python gives us html.parser, a simple HTML and XHTML parser. Let's see how it works.

from html.parser import HTMLParser

# extend HTMLParser
class MyHTMLParser(HTMLParser):
    # override `handle_starttag`
    def handle_starttag(self, tag, attrs):
        print(f'Encountered a start tag: {tag}')
        print(f'And some attributes: {attrs}')

parser = MyHTMLParser()
parser.feed('<html><body><a href="https://google.com">Google</a></body></html>')

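This prints one pair of lines per start tag, ending with the link:

> Encountered a start tag: html
> And some attributes: []
> Encountered a start tag: body
> And some attributes: []
> Encountered a start tag: a
> And some attributes: [('href', 'https://google.com')]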

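That's the heaviest lifting taken care of. What about requests? Python has the urllib.request module, and we'll use urllib.request.urlopen to send GET requests. Most of the time you would reach for the third-party requests library, but our needs are small enough to get by with the standard library!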

Let's check whether Google is up. We expect an HTTP status code of 200 (OK).

>>> import urllib.request
>>> r = urllib.request.urlopen('https://google.com')
>>> r.status
200
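
One detail worth knowing before we lean on urlopen: it only returns a response when the request succeeds. For an error code such as 404 it raises urllib.error.HTTPError, and for failures such as a refused connection it raises urllib.error.URLError. Here is a minimal sketch of turning both cases into a return value. The helper name status_of and its opener parameter are assumptions made for illustration (the parameter is there so the function can be exercised without network access); neither is part of the final script.

```python
import urllib.error
import urllib.request


def status_of(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for `url`, or None if no response arrived.

    `status_of` and `opener` are hypothetical names used for illustration.
    """
    try:
        with opener(url) as res:
            return res.status
    except urllib.error.HTTPError as e:
        # the server answered, but with an error code (e.g. 404)
        return e.code
    except urllib.error.URLError:
        # no HTTP response at all (DNS failure, connection refused, ...)
        return None
```

HTTPError is a subclass of URLError, so it has to be caught first.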

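Some websites will return 403 (Forbidden) because our user agent gives us away as a bot. By default it looks like User-Agent: Python-urllib/3.7 (the suffix follows the Python version). We can get around this by sending a different user agent string. A responsible bot checks robots.txt first to learn the site's rules!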
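
The standard library can handle the robots.txt check as well, through urllib.robotparser. The sketch below is not part of the final script: the rules are a made-up robots.txt parsed from memory so that the example runs offline, and the agent name deadseeker is an assumption.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt content, parsed from memory so the example runs offline
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

robots = RobotFileParser()
robots.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the request
print(robots.can_fetch('deadseeker', 'https://example.com/about/'))
print(robots.can_fetch('deadseeker', 'https://example.com/private/notes.html'))
```

Against a live site you would call set_url() with the address of the site's robots.txt and then read() instead of parse().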

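We start with the imports, pulling in everything we need. We also store a reference to a user agent string. This one presents the client as Chrome 60 on 64-bit Windows 10.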

import sys
import urllib.error
from urllib import request, parse
from urllib.parse import urlparse, urljoin
from urllib.request import Request
from html.parser import HTMLParser
from collections import deque

search_attrs = {'href', 'src'}
agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'

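We also imported one data structure and declared another: a deque and a set. A deque (double-ended queue) is a "list-like container with fast appends and pops on either end". We'll use it as a simple queue, adding local pages as we find them and scanning them in first-in, first-out order. The set is simpler still: before checking a link, we test whether a request has already been sent for it. A list would work in both cases, but less efficiently.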
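
To see the two structures working together before any networking is involved, here is a small sketch that walks a made-up site held in a dictionary. The site mapping and the variable names are assumptions for illustration only.

```python
from collections import deque

# hypothetical site: each page maps to the local links found on it
site = {
    '/': ['/about/', '/projects/'],
    '/about/': ['/', '/projects/'],
    '/projects/': ['/about/'],
}

checked = {'/'}          # links we have already seen (seeded with the start page)
to_check = deque(['/'])  # pages waiting to be scanned
order = []               # the order in which pages were scanned

while to_check:
    page = to_check.pop()              # take from the right ...
    order.append(page)
    for link in site[page]:
        if link not in checked:        # constant-time membership test on a set
            checked.add(link)
            to_check.appendleft(link)  # ... add on the left: first in, first out

print(order)  # ['/', '/about/', '/projects/']
```

Every page is scanned exactly once, even though the pages link to each other in a loop.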

We extend HTMLParser into LinkParser, the core of our program. We use super() to call the parent constructor that we are overriding.

class LinkParser(HTMLParser):
    def __init__(self, home):
        super().__init__()
        self.home = home
        self.checked_links = set()
        self.pages_to_check = deque()
        self.pages_to_check.appendleft(home)
        self.scanner()

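When we create an instance of this class, we pass it the website's home page. We store it as self.home so we can use it to check whether the links we encounter are local pages. As you can see at the end of the constructor, we start scanning right away. But what does "scanning" mean?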

def scanner(self):
    # as long as we still have pages to parse
    while self.pages_to_check:

        # take the first page added
        page = self.pages_to_check.pop()

        # send a request to it using our custom header
        req = Request(page, headers={'User-Agent': agent})
        res = request.urlopen(req)

        # check that we're about to parse HTML (e.g. not CSS)
        if 'html' in res.headers.get('content-type', ''):
            with res as f:

                # read the HTML and assume that it's UTF-8
                body = f.read().decode('utf-8', errors='ignore')
                self.feed(body)
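
The scanner assumes UTF-8, which is a reasonable default. If you would rather honor the charset that the server declares, the response headers can report it. The following sketch uses a plain email.message.Message as a stand-in for res.headers so that it runs offline; the header value and the body are made up.

```python
from email.message import Message

# stand-in for `res.headers`; a real response carries an HTTPMessage,
# which is a subclass of email.message.Message
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'

content_type = headers.get_content_type()               # 'text/html'
charset = headers.get_content_charset(failobj='utf-8')  # 'iso-8859-1'

raw = 'café'.encode('iso-8859-1')  # hypothetical response body
body = raw.decode(charset, errors='ignore')
print(content_type, charset, body)
```

The failobj argument supplies the fallback when the header names no charset, so the UTF-8 assumption still applies in that case.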

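As feed parses the HTML, it encounters tags and calls handle_starttag, handle_endtag, and other methods. We override handle_starttag with our own method, which checks the attributes for the keys we are looking for. When it encounters <a href="http://google.com">Google</a> we want to extract the href value. Likewise, for <img src="/cute_dog.png"> we want to extract the src value.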

def handle_starttag(self, tag, attrs):
    for attr in attrs:
        # ('href', 'https://google.com')
        if attr[0] in search_attrs and attr[1] not in self.checked_links:
            self.checked_links.add(attr[1]) 
            self.handle_link(attr[1])
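
To try the attribute filtering on its own, without sending any requests, a stripped-down parser can simply collect the values. AttrCollector is a hypothetical helper written for this illustration; it is not part of the final script.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: collects href/src values instead of requesting them."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        for name, value in attrs:
            if name in {'href', 'src'} and value not in self.found:
                self.found.append(value)


collector = AttrCollector()
collector.feed('<a href="http://google.com">Google</a>'
               '<img src="/cute_dog.png">'
               '<a href="http://google.com">Google again</a>')
print(collector.found)  # ['http://google.com', '/cute_dog.png']
```

The repeated Google link is only recorded once, which is the same de-duplication that checked_links provides.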

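A reminder: to loop over an iterable, use for thing in things: and, in the block underneath, refer to each item through the first variable, thing. To check whether an item is in a set, use item in a_set, which returns a boolean. You can add to a set with a_set.add(item).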

def handle_link(self, link):
    # check for a relative link (e.g. /about/, /blog/)
    if not bool(urlparse(link).netloc):

        # fix if we need to, we can't send a request to `/about/`
        link = urljoin(self.home, link)

    # attempt to send a request, seeking the HTTP status code
    try:
        req = Request(link, headers={'User-Agent': agent})
        status = request.urlopen(req).getcode()

    # we're expecting errors (dead resources) so let's handle them
    except urllib.error.HTTPError as e:
        print(f'HTTPError: {e.code} - {link}')  # (e.g. 404, 501, etc)
    except urllib.error.URLError as e:
        print(f'URLError: {e.reason} - {link}')  # (e.g. conn. refused)

    # otherwise, we got a 200 (OK) or similar code!
    else:

        # remove this in production or we won't spot our errors
        print(f'{status} - {link}')

        # build a queue of local pages so we crawl the entire website;
        # only pages that responded are queued, because scanner() does
        # not catch HTTPError
        if link.startswith(self.home):
            self.pages_to_check.appendleft(link)
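
The relative-link check at the top of handle_link is easy to try in isolation. This sketch wraps the same two calls, urlparse and urljoin, in a hypothetical resolve function and runs it on a few sample links.

```python
from urllib.parse import urljoin, urlparse

home = 'https://healeycodes.com/'


def resolve(home, link):
    """Hypothetical helper: make `link` absolute if it has no host of its own."""
    if not urlparse(link).netloc:
        return urljoin(home, link)
    return link


for link in ['/about/', 'projects/', 'https://docs.python.org/3/']:
    print(link, '->', resolve(home, link))
```

The first two links come back prefixed with the home page, and the absolute URL is returned unchanged.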

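A complete resource crawler in under 50 lines of standard-library code. Hats off to Python. Many people consider its strong standard library one of the reasons for its popularity. The last step is to call our class, passing in the first argument after the script name.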

LinkParser(sys.argv[1])  # e.g. 'https://healeycodes.com/'
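
Running the script with no argument raises an IndexError at sys.argv[1]. If you want a friendlier message, a small guard is enough. The start_url helper below is an assumption for illustration and is not in the original script.

```python
import sys
from urllib.parse import urlparse


def start_url(argv):
    """Hypothetical helper: return the URL argument or exit with a usage message."""
    if len(argv) != 2 or urlparse(argv[1]).scheme not in ('http', 'https'):
        sys.exit(f'usage: python {argv[0]} <http(s) URL>')
    return argv[1]


# called with a literal list here; the script itself would pass sys.argv
print(start_url(['deadseeker.py', 'https://healeycodes.com/']))
```

In the script, the value returned by start_url(sys.argv) would be passed to LinkParser.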

Here is the repository with the final code. The tutorial comments have been stripped out. Enjoy!


Source: https://dev.to/healeycodes/build-a-python-bot-to-find-your-website-s-dead-links-563c