
# LLAMA 3.1 vs GPT4: Which is Smarter for Analytics?

2025-06-09


## Introduction

Middleware is a platform that enables engineering leaders to derive actionable insights from data and improve processes, making development teams more efficient. With the rapid progress in the AI space, we have been experimenting with integrating machine learning models across the product, hoping to extract actionable insights from data. After spending some time on it, we found that the open-source LLAMA and Mistral models we wanted to use, while decent, were less reliable than GPT4o on data-centric questions. So we decided to go the more sophisticated route of building RAG pipelines and using function calling.

Everything changed when we heard that Meta had dropped the LLAMA 3.1 models. The 70B and 405B models are undoubtedly among the best open-source models out there, going head-to-head with GPT4o. So we decided to build AI-powered DORA reports as an experiment and see how GPT4 and LLAMA 3.1 fare at data analysis and reasoning.

## Background

DORA metrics provide key insights into the performance and reliability of the software delivery process.


1. **Lead Time for Changes**
   - Lead time comprises first-commit-to-PR-open time, first-response time, rework time, merge time, and merge-to-deploy time.

2. **Deployment Frequency**
   - This metric measures how often code changes are deployed to production.

3. **Mean Time to Recovery (MTTR)**
   - MTTR measures how quickly a team restores service after a failure in production.
   - A team's average incident resolution time is used to calculate its MTTR.

4. **Change Failure Rate (CFR)**
   - CFR quantifies the percentage of changes that cause degraded service or outages in production, helping assess the stability and reliability of the deployment process.
   - CFR is calculated by linking incidents to deployments within a time interval; each deployment may have multiple incidents or none.

You can learn more about DORA metrics here. By leveraging advanced LLMs, we aim to automate the analysis of these metrics and give teams deeper, more actionable insights.

## Goals

- Integrate LLMs into Middleware to analyze DORA metrics.
- Compare the performance of different large language models on:
  - Mathematical accuracy: how well can the model compute the DORA score?
  - Data analysis: can the LLM analyze the input data and draw correct inferences?
  - Summarisation: how well can the model summarise the data?
  - Actionability: how well can the model propose an action plan based on the input data?

## Execution

### Data Processing: Middleware to the Rescue

- Middleware syncs all your data from different sources and computes DORA metrics for your team.
- Check out middlewarehq/middleware and set up the dev server using Docker.


### Model Integration: Fireworks AI and OpenAI

- We integrated the OpenAI GPT4o and LLAMA 3.1 (70B and 405B) models.
- The OpenAI models use the official OpenAI API under the hood, while the Fireworks AI API is used to integrate the 70B and 405B LLAMA 3.1 models.
- These AI analyses are powered by the AIAnalyticsService in the analytics server. The service can be extended to use more closed-source models from OpenAI or open-source models via Fireworks AI, as the sketch below illustrates.
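
To make the dispatch concrete, here is a minimal sketch of what such a service layer could look like. This is not Middleware's actual code: the model registry, the Fireworks model id, and the `analyze` helper are assumptions. Fireworks AI exposes an OpenAI-compatible endpoint, which is what lets a single client serve both providers.

```python
# A minimal sketch (not Middleware's actual AIAnalyticsService) of a dispatch
# layer that serves both OpenAI and Fireworks AI through one client.
from openai import OpenAI

# Hypothetical model registry: display name -> (base_url, model id).
MODELS = {
    "GPT4o": (None, "gpt-4o"),  # None -> default OpenAI endpoint
    "LLAMA_3.1_405B": (
        "https://api.fireworks.ai/inference/v1",  # OpenAI-compatible endpoint
        "accounts/fireworks/models/llama-v3p1-405b-instruct",  # assumed id
    ),
}

def analyze(model_name: str, token: str, prompt: str) -> str:
    """Send a curated analysis prompt to the chosen provider and return the text."""
    base_url, model_id = MODELS[model_name]
    client = OpenAI(api_key=token, base_url=base_url)
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```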

- Changes on the frontend introduced components and BFF logic that let users enter their token, select a large language model, and generate an AI report for their DORA metrics.

- Whenever a user tries to generate an AI analysis, the UI makes a POST request to the BFF API `internal/ai/dora_metrics` with all the pre-processed DORA metrics and trend data.

- Internally, this BFF API calls multiple analysis APIs with the DORA metrics and trend data, which in turn generate analyses based on the processed data and curated prompts.

- Finally, the analysis for each individual metric trend is fed back into the LLM for summarisation, and all the data is sent to the frontend. A sketch of the request that kicks this off follows.
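
As a rough illustration, the request from the UI could look like the following; the payload field names and the local route prefix are assumptions, not Middleware's actual contract.

```python
# A rough sketch (field names assumed) of the POST request the UI makes to the
# BFF to generate an AI DORA report.
import requests

payload = {
    "model": "LLAMA_3.1_405B",         # LLM chosen by the user
    "access_token": "<user-token>",    # user-supplied provider token
    "dora_metrics": {                  # pre-processed four key metrics
        "lead_time": 4000,
        "mean_time_to_recovery": 200000,
        "change_failure_rate": 20,
        "weekly_deployment_frequency": 2,
    },
    "trends": {"2024-01-01": {}, "2024-01-08": {}},  # weekly trend data
}

response = requests.post(
    "http://localhost:3000/api/internal/ai/dora_metrics",  # assumed local route
    json=payload,
    timeout=120,
)
report = response.json()  # per-metric analyses plus the overall summary
```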

More implementation details can be found in this pull request.

## Evaluation and Results: GPT4o vs LLAMA 3.1

We ran the DORA AI analysis for the month of July on the following open-source repositories: facebook/react, middlewarehq/middleware, meta-llama/llama, and facebookresearch/dora.

### Mathematical Accuracy

- Middleware generates a DORA performance score for teams based on the guidelines from dora.dev.
- To test the models' computational accuracy, we provided them with the four key metrics, prompted the LLMs to generate a DORA score, and compared the results with Middleware's (a sketch of such a prompt appears after the JSON below).
- The four keys were in JSON format:

    
    
```json
{
    "lead_time": 4000,
    "mean_time_to_recovery": 200000,
    "change_failure_rate": 20,
    "weekly_deployment_frequency": 2
}
```
    
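For illustration, a scoring prompt along these lines could be built from that JSON and sent through the `analyze` helper sketched earlier. The exact wording of our curated prompt is not shown in this post, so treat the phrasing as an assumption.

```python
# A minimal sketch (assumed wording, not our exact curated prompt) of turning
# the four keys into a scoring prompt for a model.
import json

metrics = {
    "lead_time": 4000,                  # seconds
    "mean_time_to_recovery": 200000,    # seconds
    "change_failure_rate": 20,          # percent
    "weekly_deployment_frequency": 2,
}

prompt = (
    "You are given a team's four key DORA metrics as JSON:\n"
    f"{json.dumps(metrics, indent=4)}\n"
    "Following the dora.dev performance guidelines, rate this team's "
    "software delivery performance as a DORA score out of 10 and explain "
    "your reasoning."
)

score_report = analyze("GPT4o", "<user-token>", prompt)
```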
- The actual DORA score for the repositories was around 5. While OpenAI's GPT4o predicted the score to be 4-5 most of the time, LLAMA 3.1 405B was off by a margin.

_DORA Metrics score: 5/10_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/saepp6t4su3j86fm1g3j.png)

_GPT 4o with DORA score 5/10_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9u7nln407p0rhhqkag71.png)

_LLAMA 3.1 with DORA Score 8/10 (incorrect)_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lwpladuhj66ij2s5j1l7.png)


GPT4o's DORA score was closer to the actual DORA score than LLAMA 3.1's in 9/10 cases, so GPT4o was more accurate than LLAMA 3.1 in this scenario.

### Data Analysis
- The trend data for the four key DORA metrics, calculated by Middleware, was fed to the LLMs as input along with different experimental prompts to ensure concrete data analysis.
- The trend data is usually a JSON object keyed by date strings representing week start dates, mapped to that week's metric data, as in the example below; a small flattening sketch follows it.
```json
{
    "2024-01-01": {
        ...
    },
    "2024-01-08": {
        ...
    }
}
```
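Before prompting, the weekly trend object can be flattened into lines the model reads reliably. A helper like this (hypothetical, not Middleware's code) illustrates the idea:

```python
# A small sketch of flattening the weekly trend JSON into prompt-friendly
# lines before handing it to the LLM; the helper and field names are assumed.
import json

def trend_to_prompt_lines(trend: dict) -> str:
    """Render {week_start: metric_data} as one line per week, oldest first."""
    lines = []
    for week_start in sorted(trend):
        lines.append(f"Week of {week_start}: {json.dumps(trend[week_start])}")
    return "\n".join(lines)

trend = {
    "2024-01-01": {"deployment_count": 3},  # illustrative values
    "2024-01-08": {"deployment_count": 5},
}
print(trend_to_prompt_lines(trend))
```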

- **Mapping Data**: Both models were on par at extracting data from the JSON and inferring it correctly. For example, both GPT and LLAMA mapped the correct data to the input weeks without errors or hallucinations.


     _Deployment Trends Summarised: GPT4o_
     ![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ey4jlh2o1nk5xkvg4tt0.png)


     _Deployment Trends Summarised: LLAMA 3.1 405B_
     ![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ziiymc6tl0l360uam8hs.png)


- **Extracting Inferences**: Both models were able to derive solid inferences from the data. 
  - LLAMA 3.1 identified the week with the maximum lead time, along with the reason for the high lead time.![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/evww5o0tg6bu4m941z6h.png)


  - This inference could be verified against the Middleware trend charts.![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lmu39pip49f0brsbd0ti.png)


  - GPT4o was also able to extract the week with the maximum lead time, along with the reason: high first-response time.![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74eepbadfzk24z0i5i80.png)


- **Data Presentation**: Data presentation has been hit or miss with LLMs. There are cases where GPT presents data better but lags behind LLAMA 3.1 in accuracy, and there have been cases, like the DORA score, where GPT did the math better.
  - LLAMA and GPT were both given the lead time value in seconds. LLAMA rounded it off closer to the actual value of 16.99 days, while GPT rounded it off to 17 days 2 hours but presented the data in a more detailed format. A sketch of the underlying conversion follows the screenshots below.

     _GPT4o_![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3dpwmlcscgehi47zlx0c.png)


     _LLAMA 3.1 405B_![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c3owjv94wjfrtxetf91a.png)
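
The conversion itself is simple arithmetic. The sketch below uses an input of 1,467,936 seconds, a value chosen to match the 16.99 days quoted above (an assumption, since the raw input is not shown in the post).

```python
# A quick sketch of the seconds-to-"days and hours" conversion the models were
# effectively asked to perform on the lead time value.
def humanize_seconds(total_seconds: int) -> str:
    """Convert a duration in seconds to an 'X days Y hours' string."""
    days, remainder = divmod(total_seconds, 86_400)  # 86,400 seconds per day
    hours = remainder // 3_600
    return f"{days} days {hours} hours"

lead_time_seconds = 1_467_936               # assumed raw value, ~16.99 days
print(lead_time_seconds / 86_400)           # 16.99 days
print(humanize_seconds(lead_time_seconds))  # '16 days 23 hours'
```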



### Actionability
<img width="100%" style="width:100%" src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExZXFmcmM2cno2c3liN3doeXJ6Z282NmxrZDN0ZGd3c2xta2RwOXp5eCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/jsrfOEfEHkHPFSNlir/giphy.gif">

- The models output similar actionables for improving the team's efficiency based on all the metrics.
- Example: both models identified the reason for the high lead time to be first-response time and suggested that the team use an alerting tool to avoid delayed PR reviews. The models also suggested better planning to avoid rework in a week where rework was high.

_GPT4o_![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tq9uaapz50z3dsom7jhd.png)

_LLAMA 3.1 405B_![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gbw7ovecvj3rc6fhykz3.png)


### Summarisation
To test the summarisation capabilities of the models, we asked each model to summarise every metric trend individually, then fed the outputs for all the trends back into the LLMs to get an overall summary, or, in Internet slang, a *DORA TLDR* for the team. A condensed sketch of this two-stage pipeline follows.
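
The sketch reuses the hypothetical `analyze` helper from the model-integration section; the prompt wording here is assumed, not our curated prompt.

```python
# A condensed sketch of the two-stage summarisation described above, reusing
# the hypothetical analyze() helper from the model-integration sketch.
def dora_tldr(model: str, token: str, trend_prompts: dict[str, str]) -> str:
    # Stage 1: summarise each metric's trend individually.
    per_metric = {
        metric: analyze(model, token, f"Summarise this {metric} trend:\n{data}")
        for metric, data in trend_prompts.items()
    }
    # Stage 2: feed all per-metric summaries back in for a team-level TLDR.
    combined = "\n\n".join(f"{m}: {s}" for m, s in per_metric.items())
    return analyze(
        model, token,
        "Write a short DORA TLDR for the team from these summaries:\n" + combined,
    )
```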

Both LLMs are similarly capable at summarising large amounts of data.

_LLAMA 3.1 405B_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ewsg3cgyqp3mikx1pb92.png)

_GPT4o_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iq6qgz104pacq7nhpyku.png)

## Conclusion
For a long time, LLAMA was trying to catch up with GPT in terms of data processing and analytical abilities. Our earlier experimentation with older LLAMA models led us to believe that GPT was way ahead, but the recent LLAMA 3.1 405B model is on par with GPT4o.

If you value your customers' data privacy and want to try the open-source LLAMA 3.1 models instead of GPT4, go ahead! The difference in performance will be negligible, and if you use self-hosted models you can ensure data privacy. Open-source LLMs have finally started to compete with their closed-source competitors.

Both LLAMA 3.1 and GPT4o are super capable of deriving inferences from processed data and making Middleware’s DORA metrics more actionable and digestible for engineering leaders, leading to more efficient teams.

## Future Work
This was an experiment to build an AI-powered DORA solution. In the future, we will focus on adding greater support for self-hosted or locally running LLMs in Middleware. Enhanced support for AI-powered action plans throughout the product using self-hosted LLMs, while ensuring data privacy, will be our goal for the coming months.

In the meantime, you can try out the AI DORA summary feature [here](https://github.com/middlewarehq/middleware/tree/ai-beta).

Source: https://dev.to/middleware/llama-31-vs-gpt4-which-is-smarter-for-analytics-11k0