13 open-source tools that make you 99% more likely to land any AI job 🪄✨

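I have been working in AI for a while now, since the days when the leading language models were BERT and T5. The field has made remarkable progress since then.

We now have better models, tools, frameworks, and machines.

If you are thinking about getting into AI, now is the best time. The ideal way in is to master the tools that put you ahead of the competition.

So I have compiled a list of coveted open-source software covering every aspect of AI development, from model training and monitoring to building AI agents.

If there is anything to add, feel free to leave a comment. Also, star the repositories and make meaningful contributions to them. That may be the best strategy for boosting the credibility of your résumé.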

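1. Composio 👑: Automate workflows by integrating popular applications with AI

The age of AI agents is here, and many Fortune 500 companies have started introducing agentic workflows. Automating complex workflows, however, is far from easy.

To connect AI models to external applications, you need dedicated tool sets. For example, to automate the various stages of software development, an AI model must be able to access GitHub, Jira, a code interpreter, a code indexer, the internet, and so on.

This is where Composio comes in.

It lets you integrate more than 100 production-ready tool sets, such as Gmail, Google Sheets, Jira, and Notion, to automate complex real-world workflows.

Here is how to get started with it.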

Python

pip install composio-core

Add the GitHub integration.

composio add github

Composio handles user authentication and authorization on your behalf.

Here is how to use the GitHub integration to star a repository.

from openai import OpenAI
from composio_openai import ComposioToolSet, Action

openai_client = OpenAI(api_key="******OPENAIKEY******")

# Initialise the Composio tool set
composio_toolset = ComposioToolSet(api_key="******COMPOSIO_API_KEY******")

# Get the pre-configured GitHub action for starring a repository
actions = composio_toolset.get_actions(actions=[Action.GITHUB_ACTIVITY_STAR_REPO_FOR_AUTHENTICATED_USER])

my_task = "Star a repo ComposioHQ/composio on GitHub"

# Create a chat completion request so the model can decide on the action
response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    tools=actions,  # Passing the actions we fetched earlier.
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": my_task},
    ],
)

# Execute the tool call the model selected
# (assumes the handle_tool_calls helper provided by composio_openai)
result = composio_toolset.handle_tool_calls(response)
print(result)

Run this Python script to have the agent carry out the given instruction.
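For intuition about what happens between the model's response and the executed action, the sketch below dispatches an OpenAI-style tool call to a local Python function. It is a simplified stand-in, not Composio's implementation: the `star_repo` function, the `TOOL_REGISTRY` dict, and the sample tool call are all hypothetical and exist only for illustration.

```python
import json


def star_repo(owner: str, repo: str) -> dict:
    """Hypothetical local tool; a real one would call the GitHub API."""
    return {"starred": f"{owner}/{repo}"}


# Map tool names (as advertised to the model) to Python callables
TOOL_REGISTRY = {"star_repo": star_repo}


def handle_tool_call(tool_call: dict) -> dict:
    """Look up the requested tool and call it with the model's JSON arguments."""
    function = tool_call["function"]
    tool = TOOL_REGISTRY[function["name"]]
    arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
    return tool(**arguments)


# Sample tool call shaped like an entry of `message.tool_calls` in a chat completion
sample_call = {
    "id": "call_123",
    "type": "function",
    "function": {
        "name": "star_repo",
        "arguments": '{"owner": "ComposioHQ", "repo": "composio"}',
    },
}

result = handle_tool_call(sample_call)
print(result)
```

A tool platform adds the parts this sketch leaves out, such as authentication, schemas for each tool, and error handling.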

JavaScript

You can install it using npm, yarn, or pnpm.

npm install composio-core

Define a method that lets users connect their GitHub account.

import { OpenAI } from "openai";
import { OpenAIToolSet } from "composio-core";

const toolset = new OpenAIToolSet({
  apiKey: process.env.COMPOSIO_API_KEY,
});

async function setupUserConnectionIfNotExists(entityId) {
  const entity = await toolset.client.getEntity(entityId);
  const connection = await entity.getConnection('github');

  if (!connection) {
      // This entity/user has not connected a GitHub account yet, so start the flow
      const newConnection = await entity.initiateConnection('github');
      console.log("Log in via: ", newConnection.redirectUrl);
      return newConnection.waitUntilActive(60);
  }

  return connection;
}

Add the required tools to the OpenAI SDK and pass the entity name to the executeAgent function.

async function executeAgent(entityName) {
  const entity = await toolset.client.getEntity(entityName)
  await setupUserConnectionIfNotExists(entity.id);

  const tools = await toolset.get_actions({ actions: ["github_activity_star_repo_for_authenticated_user"] }, entity.id);
  const instruction = "Star a repo ComposioHQ/composio on GitHub"

  const client = new OpenAI({ apiKey: process.env.OPEN_AI_API_KEY })
  const response = await client.chat.completions.create({
      model: "gpt-4-turbo",
      messages: [{
          role: "user",
          content: instruction,
      }],
      tools: tools,
      tool_choice: "auto",
  })

  console.log(response.choices[0].message.tool_calls);
  await toolset.handle_tool_call(response, entity.id);
}

executeAgent("joey")

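Run the code and let the agent do the work for you.

Composio works with well-known frameworks such as LangChain, LlamaIndex, and CrewAI.

For more information, visit the official documentation; for more complex examples, see the examples section of the repository.

Star the Composio repository ⭐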

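2. TRL by Hugging Face: Train transformer language models with reinforcement learning

You often need LLMs and diffusion models to behave in a specific way, for example by adding guardrails or making sure they follow human instructions. That is when you need TRL.

TRL, short for Transformer Reinforcement Learning and backed by Hugging Face, is a widely used open-source library that makes it easy to fine-tune and align language models.

It supports several model alignment methods, such as reinforcement learning with PPO (Proximal Policy Optimization), supervised fine-tuning (SFT), and DPO (Direct Preference Optimization).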
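To make the DPO objective a little more concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It is not TRL's implementation; the function name `dpo_loss`, the `beta` default, and the log-probability values are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin))."""
    # How much more the policy likes each answer than the frozen reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) is the same as log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Hypothetical sequence log-probabilities, made up for illustration
loss_untrained = dpo_loss(-12.0, -12.0, -12.0, -12.0)  # policy equals reference
loss_improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)   # policy favours the chosen answer
print(round(loss_untrained, 4), round(loss_improved, 4))
```

The loss falls as the policy assigns relatively more probability to the preferred answer than the reference model does, which is the behaviour preference tuning is after.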

Its simple, Pythonic interface makes it easy for beginners to get started quickly.

Install trl using pip.

pip install trl

Let's take a quick look at SFTTrainer, the class for supervised fine-tuning of LLMs.

# imports
from datasets import load_dataset
from trl import SFTTrainer

# get dataset
dataset = load_dataset("imdb", split="train")

# get trainer
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

# train
trainer.train()

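This code block creates an SFTTrainer instance using facebook/opt-350m. The train() method then starts training the model on the IMDB data.

See the examples section to learn more.

Star the trl repository ⭐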

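3. PyTorch Lightning: Build, train, and fine-tune models at scale

AI development would be hard to imagine without PyTorch, and PyTorch Lightning takes it a step further.

It is a general-purpose framework that helps you build and scale PyTorch-based deep learning projects, with tools for training, experimentation, and deployment across domains.

A few advantages of Lightning over plain PyTorch:

It makes PyTorch code more readable, structured, and user-friendly.
It reduces repetitive code with predefined training loops and utilities.
It simplifies training, experimentation, and deployment with less boilerplate.

Get started with Lightning using pip.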

pip install lightning

Define an autoencoder using a LightningModule.

import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L 

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

Load the MNIST data.

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

The Lightning Trainer "mixes" any LightningModule with any dataset and abstracts away all the engineering complexity needed for scale.

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
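For comparison, here is roughly the loop you would write by hand in plain PyTorch to get a similar result. It is a simplified sketch, not what Lightning runs internally: it uses random tensors instead of MNIST so it runs without a download, and it skips device placement, logging, and checkpointing, which the Trainer also takes care of.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data with MNIST's image shape (an assumption for this sketch)
images = torch.rand(64, 1, 28, 28)
labels = torch.zeros(64, dtype=torch.long)
loader = DataLoader(TensorDataset(images, labels), batch_size=16)

encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = optim.Adam(params, lr=1e-3)

max_epochs, limit_train_batches = 1, 100
losses = []
for epoch in range(max_epochs):
    for batch_idx, (x, _) in enumerate(loader):
        if batch_idx >= limit_train_batches:
            break
        x = x.view(x.size(0), -1)              # flatten each image
        x_hat = decoder(encoder(x))            # reconstruct it
        loss = nn.functional.mse_loss(x_hat, x)
        optimizer.zero_grad()                  # the bookkeeping Lightning hides
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

print(f"ran {len(losses)} training steps")
```

With a LightningModule, only the body of the inner loop up to the loss remains in your code, inside training_step.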

For more on Lightning, see the official documentation.

Star the Lightning AI repository ⭐

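4. Weights & Biases: Monitor every part of your machine learning pipeline

Suppose you want to fine-tune or train a model. You then have to keep track of many components, such as model hyperparameters, training and validation metrics, data preprocessing steps, model architecture versions, and experiment configurations.

Knowing whether the model you are training is heading in the right direction is crucial.

Wandb is one of the best open-source solutions for this. It lets you track metrics and collaborate with your teammates.

Get started with W&B in four steps:

First, sign up for a W&B account.
Second, install the W&B SDK with pip. Open your terminal and run the following command: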

pip install wandb

Third, log in to W&B (in Python, after import wandb):

wandb.login()

Fourth, use the example snippet below as a template for integrating W&B into your PyTorch Lightning script:

# This script needs these libraries to be installed:
#   torch, torchvision, pytorch_lightning

import wandb

import os
from torch import optim, nn, utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

class LitAutoEncoder(pl.LightningModule):
    def __init__(self, lr=1e-3, inp_size=28, optimizer="Adam"):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(inp_size * inp_size, 64), nn.ReLU(), nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, inp_size * inp_size)
        )
        self.lr = lr

        # save hyperparameters to self.hparams (auto-logged by wandb)
        self.save_hyperparameters()

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)

        # log metrics to wandb
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(lr=1e-3, inp_size=28)

# setup data
batch_size = 32
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# initialise the wandb logger and name your wandb project
wandb_logger = WandbLogger(project="my-awesome-project")

# add your batch size to the wandb config
wandb_logger.experiment.config["batch_size"] = batch_size

# pass wandb_logger to the Trainer
trainer = pl.Trainer(limit_train_batches=750, max_epochs=5, logger=wandb_logger)

# train the model
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

# [optional] Finish the wandb run, which is necessary for the notebook
wandb.finish()

You can watch the metrics live on the Wandb dashboard.

For more information, see the developer guide.

Star the Wandb repository ⭐

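5. MLflow: A machine learning lifecycle platform

MLflow is a comprehensive MLOps framework used across industries.

It lets you track the entire lifecycle of an AI model, from training and fine-tuning to deployment. MLflow provides a set of lightweight APIs that work with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, and so on), wherever you currently run your ML code (for example in notebooks, standalone applications, or the cloud).

Beyond AI models, it also lets you track and monitor AI agents built with LangChain, the OpenAI SDK, and similar tools.

It is an important tool for building complete end-to-end ML/AI pipelines.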

Star the MLflow repository ⭐

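6. pgvector: Open-source vector similarity search for Postgres

RAG applications rely on a vector database. Vector databases manage unstructured data in the form of high-dimensional vectors, or embeddings.

Many organizations already use Postgres to store structured data, which makes pgvector a natural choice of vector database for them.

Among the many options available, pgvector makes the most sense in the long run.

Install pgvector on Linux and Mac.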

Compile and install the extension (supports Postgres 12+)

cd /tmp
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo
See the installation notes if you run into issues

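You can also install it with Docker, Homebrew, PGXN, APT, Yum, pkg, or conda-forge. It comes preinstalled with Postgres.app and many hosted providers. There are also instructions for GitHub Actions.

Enable the extension (do this once in each database where you want to use it)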

CREATE EXTENSION vector;

Create a vector column with 3 dimensions

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Insert vectors

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Get the nearest neighbors by L2 distance

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

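Also supports inner product (<#>), cosine distance (<=>), and L1 distance (<+>, added in 0.7.0)

Note: <#> returns the negative inner product, since Postgres only supports ASC order index scans on operators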
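To see what the four operators compute, the following sketch reproduces them in plain Python for the sample vectors and query used in this section. It only illustrates the math; pgvector does this work inside Postgres, and the helper function names are made up for this example.

```python
import math


def l2_distance(a, b):
    """Equivalent of <-> (Euclidean distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """Equivalent of <#> (negated so that ascending order means most similar first)."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Equivalent of <=> (1 minus cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def l1_distance(a, b):
    """Equivalent of <+> (taxicab distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


items = [[1, 2, 3], [4, 5, 6]]
query = [3, 1, 2]

# Ascending sort mirrors ORDER BY embedding <op> query
nearest_by_l2 = sorted(items, key=lambda v: l2_distance(v, query))
nearest_by_inner_product = sorted(items, key=lambda v: negative_inner_product(v, query))
print(nearest_by_l2[0], nearest_by_inner_product[0])
```

The two orderings disagree here: [1, 2, 3] is closest by L2 distance, while [4, 5, 6] wins on inner product because its larger magnitude dominates. That is why the choice of operator should match how your embeddings were trained.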

Storing

Create a new table with a vector column

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Or add a vector column to an existing table

ALTER TABLE items ADD COLUMN embedding vector(3);

Insert vectors

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Or bulk load vectors using COPY (example)

COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);

Upsert vectors

INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
    ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;

Update vectors

UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;

Delete vectors

DELETE FROM items WHERE id = 1;

Querying

Get the nearest neighbors to a vector

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

For more on pgvector, see its repository.

Star the pgvector repository ⭐

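7. llama.cpp: LLM inference in C/C++

Many organizations want to self-host open-source LLMs. That requires a highly optimized and efficient inference engine.

llama.cpp makes the most sense here. Developed by Georgi Gerganov, it is one of the best open-source solutions for serving LLMs.

As the name suggests, it is written in C/C++, which makes it fast. It also supports almost all open-access models, such as Llama 3, Mistral, Gemma, and Nous Hermes.

See this guide for instructions on building llama.cpp yourself.

Star the llama.cpp repository ⭐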

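8. LangGraph: Build resilient language agents as graphs

LangGraph is one of the most capable frameworks for building efficient, reliable AI agents. As the name suggests, it models agents as graphs of nodes and edges, and those graphs may contain cycles.

It is an extension of LangChain, so it comes with a large community of AI developers.

Get started with it using pip.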

pip install -U langgraph

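If you want to build an agent or bot with LangGraph, check out our detailed blog post on building a Gmail and calendar assistant.

For more on LangGraph, visit the documentation.

Star the LangGraph repository ⭐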

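9. Pydantic: Data validation using Python type hints

This is easily one of the best things to happen to the Python ecosystem in recent years.

Pydantic's core value proposition is data validation.

From building resilient APIs to getting structured output from LLMs, Pydantic's popularity has grown substantially. Many companies use Pydantic, and even OpenAI has announced Pydantic support for getting structured output from LLMs.

Install Pydantic using pip.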

pip install pydantic

A small example.

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []

external_data = {'id': '123', 'signup_ts': '2017-06-01 12:22', 'friends': [1, '2', b'3']}
user = User(**external_data)
print(user)
#> User id=123 name='John Doe' signup_ts=datetime.datetime(2017, 6, 1, 12, 22) friends=[1, 2, 3]
print(user.id)
#> 123
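The structured-output use case mentioned earlier can be sketched in the same spirit. The example below validates two JSON strings that stand in for LLM replies, assuming Pydantic v2 (for model_validate_json); the Ticket model and both strings are hypothetical.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Ticket(BaseModel):
    title: str
    priority: int
    tags: List[str] = []


# Hypothetical raw strings standing in for LLM responses
good_reply = '{"title": "Login fails", "priority": 2, "tags": ["auth"]}'
bad_reply = '{"title": "Login fails", "priority": "urgent"}'

# A well-formed reply becomes a typed Python object
ticket = Ticket.model_validate_json(good_reply)
print(ticket.title, ticket.priority, ticket.tags)

# A malformed reply raises ValidationError naming the offending field
try:
    Ticket.model_validate_json(bad_reply)
    error_fields = []
except ValidationError as exc:
    error_fields = [err["loc"][0] for err in exc.errors()]
print(error_fields)
```

Catching the error gives you a natural place to retry the model call with the validation message included in the prompt.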

See the documentation to learn more.

Star the Pydantic repository ⭐

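10. FastAPI: A fast, simple, easy-to-use Python framework

FastAPI has also earned a lot of praise for its high performance and gentle learning curve.

Many AI companies use FastAPI as their main way of building APIs, whether to expose endpoints for model inference or to create web applications.

Mastering FastAPI will put you in a good position for both AI and API development.

It is built on Starlette, which makes it one of the fastest Python frameworks.

Get started with FastAPI using pip.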

pip install "fastapi[standard]"

Build a simple API.

from typing import Union

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": "World"}

@app.get("/items/{item_id}")
def read_item(item_id: int, q: Union[str, None] = None):
    return {"item_id": item_id, "q": q}