How to create a LinkedIn job scraper in Python with Crawlee

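Introduction

In this article, we will build a web application that scrapes job postings from LinkedIn using Crawlee and Streamlit.

We will use Crawlee for Python to create a LinkedIn job scraper that extracts the company name, job title, time of posting, and link to the job posting, based on user input received dynamically through the web application.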

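Note: One of our community members wrote this blog post as a contribution to the Crawlee blog. If you want to contribute posts like this to the Crawlee blog, please reach out to us on our Discord channel.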

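By the end of this tutorial, you will have a fully functional web application that you can use to scrape job postings from LinkedIn.

Let's get started.

Prerequisites

Let's start by creating a new Crawlee for Python project with this command: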

pipx run crawlee create linkedin-scraper

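Select PlaywrightCrawler in the terminal when Crawlee asks for the crawler type.

Once the installation finishes, Crawlee for Python will create boilerplate code for you. You can change the directory (cd) to the project folder and run this command to install the dependencies.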

poetry install

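We will begin editing the files Crawlee provides so that we can build our scraper.

Note: If you're enjoying this post, we would be very happy if you gave Crawlee for Python a star on GitHub before you continue!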

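Building the LinkedIn job scraper in Python with Crawlee

In this section, we will build the scraper using the Crawlee for Python package. To learn more about Crawlee, check out its documentation.

1. Inspecting the LinkedIn job search page

Open LinkedIn in your web browser and sign out of the website (if you are already logged in to a LinkedIn account). You should see LinkedIn's public, logged-out interface.

Navigate to the jobs section, search for a job and location of your choice, and copy the URL.

You should have something like this: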

https://www.linkedin.com/jobs/search?keywords=Backend%20Developer&location=Canada&geoId=101174742&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0

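We're going to focus on the search parameters, which is the part that comes after the '?'. The keywords and location parameters are the most important ones for us.

The job title the user supplies will go into the keywords parameter, while the location the user supplies will go into the location parameter. Lastly, the geoId parameter will be removed, while the other parameters stay constant.

We are going to make some changes to our main.py file. Copy and paste the code below into your main.py file.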

from urllib.parse import urlencode, urljoin

from crawlee.playwright_crawler import PlaywrightCrawler

from .routes import router

async def main(title: str, location: str, data_name: str) -> None:
    base_url = "https://www.linkedin.com/jobs/search"

    # URL encode the parameters
    params = {
        "keywords": title,
        "location": location,
        "trk": "public_jobs_jobs-search-bar_search-submit",
        "position": "1",
        "pageNum": "0"
    }

    encoded_params = urlencode(params)

    # Encode parameters into a query string
    query_string = '?' + encoded_params

    # Combine base URL with the encoded query string
    encoded_url = urljoin(base_url, "") + query_string

    # Initialize the crawler
    crawler = PlaywrightCrawler(
        request_handler=router,
    )

    # Run the crawler with the initial list of URLs
    await crawler.run([encoded_url])

    # Save the data in a CSV file
    output_file = f"{data_name}.csv"
    await crawler.export_data(output_file)
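
To see what the query-string step produces, here is a minimal standalone sketch that uses only the standard library. The helper name build_search_url is hypothetical and not part of the project; it simply mirrors the parameter dictionary from main. One detail worth knowing is that urlencode writes spaces as '+' rather than '%20'; both are standard encodings of a space in a query string.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.linkedin.com/jobs/search"


def build_search_url(title: str, location: str) -> str:
    """Hypothetical helper: build the job search URL from user input."""
    params = {
        "keywords": title,
        "location": location,
        "trk": "public_jobs_jobs-search-bar_search-submit",
        "position": "1",
        "pageNum": "0",
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'
    return f"{BASE_URL}?{urlencode(params)}"


search_url = build_search_url("Backend Developer", "Canada")
print(search_url)
```

Printing the URL before starting the crawler is a quick way to check that the user's input ends up in the right parameters.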

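Now that we have encoded the URL, the next step is to adjust the generated router to handle LinkedIn job postings.

2. Routing your crawler

We will use two handlers for the application:

Default handler

The default_handler handles the start URL.

Job listing

The job_listing handler extracts the details of an individual job.

The Playwright crawler will crawl the job search page and extract the links to all the job postings on that page.

When you inspect the job postings, you will find that the links sit inside a list (a ul element) with the class jobs-search__results-list. We will extract the links using a Playwright locator object and add them to the job_listing route for processing.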

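import re

from crawlee import Request
from crawlee.playwright_crawler import PlaywrightCrawlingContext
from crawlee.router import Router

# Note: these import paths are assumed to match the Crawlee version used in main.py;
# they differ between Crawlee releases.

router = Router[PlaywrightCrawlingContext]()

@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    """Default request handler."""

    # Select all the links for the job postings on the page
    hrefs = await context.page.locator('ul.jobs-search__results-list a').evaluate_all("links => links.map(link => link.href)")

    # Add all the links to the job listing route
    await context.add_requests(
        [Request.from_url(rec, label='job_listing') for rec in hrefs]
    )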
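
If the selector happens to return the same posting more than once, or the links carry tracking parameters, you may want to tidy the list before enqueuing it. The following is a minimal sketch of that idea; the helper normalize_job_links is hypothetical, and whether LinkedIn's links actually need this is an assumption, so inspect the extracted hrefs first.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_job_links(hrefs: list[str]) -> list[str]:
    """Hypothetical helper: strip query strings and fragments, drop duplicates."""
    seen: set[str] = set()
    unique: list[str] = []
    for href in hrefs:
        parts = urlsplit(href)
        # Keep only scheme, host and path; query and fragment are discarded
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if clean not in seen:
            seen.add(clean)
            unique.append(clean)
    return unique


links = normalize_job_links(
    [
        "https://www.linkedin.com/jobs/view/123?trk=a",
        "https://www.linkedin.com/jobs/view/123?trk=b",
        "https://www.linkedin.com/jobs/view/456",
    ]
)
print(links)
```

The cleaned list could then be passed to context.add_requests in place of hrefs.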

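Now that we have the job listings, the next step is to scrape their details.

We will extract each job's title, company name, time of posting, and the link to the job post. Open your browser's dev tools to find the CSS selector for each element.

After scraping each listing, we will clean up the stray whitespace and newlines in the extracted text and push the data to local storage using the context.push_data function.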

@router.handler('job_listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for job listings."""

    await context.page.wait_for_load_state('load')

    job_title = await context.page.locator('div.top-card-layout__entity-info h1.top-card-layout__title').text_content()

    company_name = await context.page.locator('span.topcard__flavor a').text_content()

    time_of_posting = await context.page.locator('div.topcard__flavor-row span.posted-time-ago__text').text_content()

    await context.push_data(
        {
            # Collapse runs of whitespace (including newlines) into single spaces and
            # trim the ends, so multi-word values stay readable.
            # Requires `import re` at the top of the file.
            'title': re.sub(r'\s+', ' ', job_title).strip(),
            'Company name': re.sub(r'\s+', ' ', company_name).strip(),
            'Time of posting': re.sub(r'\s+', ' ', time_of_posting).strip(),
            'url': context.request.loaded_url,
        }
    )
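
The cleaning step is easier to test when it lives in a small function of its own. The sketch below is one way to do that; clean_text is a hypothetical helper, not part of Crawlee or the original project. It collapses whitespace to single spaces rather than deleting it, so multi-word titles stay readable, and it tolerates a missing value, because Playwright's text_content() is typed as optional.

```python
import re
from typing import Optional


def clean_text(raw: Optional[str]) -> str:
    """Hypothetical helper: collapse whitespace runs and trim the ends."""
    if raw is None:
        # No text was returned for the element
        return ""
    return re.sub(r"\s+", " ", raw).strip()


print(clean_text("\n      Backend Developer\n    "))
```

Inside the handler, each extracted value could then be passed through clean_text before it goes into context.push_data.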

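3. Creating your application

For this project, we will use Streamlit for the web application. Before we proceed, create a new file named app.py in your project directory. Also make sure you have Streamlit installed in your global Python environment before continuing with this section.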

import streamlit as st
import subprocess

# Streamlit form for inputs 
st.title("LinkedIn Job Scraper")

with st.form("scraper_form"):
    title = st.text_input("Job Title", value="backend developer")
    location = st.text_input("Job Location", value="newyork")
    data_name = st.text_input("Output File Name", value="backend_jobs")

    submit_button = st.form_submit_button("Run Scraper")

if submit_button:

    # Run the scraping script with the form inputs
    command = f"""poetry run python -m linkedin-scraper --title "{title}"  --location "{location}" --data_name "{data_name}" """

    with st.spinner("Crawling in progress..."):
        # Execute the command and display the results
        result = subprocess.run(command, shell=True, capture_output=True, text=True)

        st.write("Script Output:")
        st.text(result.stdout)

        if result.returncode == 0:
            st.success(f"Data successfully saved in {data_name}.csv")
        else:
            st.error(f"Error: {result.stderr}")

The Streamlit web application takes the user's input and uses Python's subprocess module to run the Crawlee scraping script.
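
Because the form values come straight from the user, quotes or other shell metacharacters in a job title can break a command that is assembled as one string and run through the shell. A minimal sketch of an alternative, assuming the same module name and flags as above, is to pass the command as a list of arguments so each value is handed over as-is. The function names build_command and run_scraper are hypothetical.

```python
import subprocess


def build_command(title: str, location: str, data_name: str) -> list[str]:
    """Hypothetical helper: assemble the scraper command as an argument list."""
    return [
        "poetry", "run", "python", "-m", "linkedin-scraper",
        "--title", title,
        "--location", location,
        "--data_name", data_name,
    ]


def run_scraper(title: str, location: str, data_name: str) -> subprocess.CompletedProcess:
    """Run the scraper without a shell, so no quoting of the inputs is needed."""
    return subprocess.run(
        build_command(title, location, data_name),
        capture_output=True,
        text=True,
    )


print(build_command('senior "backend" developer', "new york", "backend_jobs"))
```

The result object exposes the same stdout, stderr, and returncode attributes that the Streamlit code reads.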

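4. Testing your app

Before we test the application, we need to make a small modification to the __main__ file so that it can accept command-line arguments.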

import asyncio
import argparse

from .main import main

def get_args():
    # ArgumentParser object to capture command-line arguments
    parser = argparse.ArgumentParser(description="Crawl LinkedIn job listings")


    # Define the arguments
    parser.add_argument("--title", type=str, required=True, help="Job title")
    parser.add_argument("--location", type=str, required=True, help="Job location")
    parser.add_argument("--data_name", type=str, required=True, help="Name for the output CSV file")


    # Parse the arguments
    return parser.parse_args()

if __name__ == '__main__':
    args = get_args()
    # Run the main function with the parsed command-line arguments
    asyncio.run(main(args.title, args.location, args.data_name))
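
To check the argument handling without launching a crawl, the parser can be exercised with an explicit list of arguments. This is a small sketch under the assumption that the parser is built the same way as in get_args; build_parser is a hypothetical name introduced only for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: the same parser as get_args, returned unparsed."""
    parser = argparse.ArgumentParser(description="Crawl LinkedIn job listings")
    parser.add_argument("--title", type=str, required=True, help="Job title")
    parser.add_argument("--location", type=str, required=True, help="Job location")
    parser.add_argument("--data_name", type=str, required=True, help="Name for the output CSV file")
    return parser


# Passing a list to parse_args bypasses sys.argv, which is handy for quick checks
args = build_parser().parse_args(
    ["--title", "backend developer", "--location", "newyork", "--data_name", "backend_jobs"]
)
print(args.title, args.location, args.data_name)
```

These are the same three values that the Streamlit form passes on the command line.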