- The actual DORA score for the repositories was around 5. While OpenAI's GPT4o predicted the score to be 4-5 most of the time, LLAMA 3.1 405B was off by a margin.
_DORA Metrics score: 5/10_

_GPT 4o with DORA score 5/10_

_LLAMA 3.1 with DORA Score 8/10 (incorrect)_

GPT4o's DORA score was closer to the actual DORA score than LLAMA 3.1's in 9 out of 10 cases, making GPT4o the more accurate model in this scenario.
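The accuracy comparison is straightforward to reproduce once you have the per-run scores. A minimal sketch with placeholder predictions (the values below are illustrative, not the actual run outputs):

```python
# Placeholder predictions for illustration only; not the actual run outputs.
actual = 5
gpt4o_preds = [5, 4, 5, 5, 4, 5, 5, 4, 5, 8]
llama_preds = [8, 7, 8, 6, 8, 6, 7, 8, 6, 5]

# Count the runs where GPT4o's prediction was strictly closer to the actual score.
gpt_closer = sum(
    abs(g - actual) < abs(l - actual)
    for g, l in zip(gpt4o_preds, llama_preds)
)
print(f"GPT4o closer in {gpt_closer}/{len(gpt4o_preds)} runs")  # 9/10 with these placeholders
```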
### Data Analysis
- The trend data for the four key DORA metrics, calculated by Middleware, was fed to the LLMs as input along with different experimental prompts to ground the data analysis.
- The trend data is usually a JSON object keyed by date strings, each representing a week's start date mapped to that week's metric data.
```json
{
  "2024-01-01": {
    ...
  },
  "2024-01-08": {
    ...
  }
}
```
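As a rough sketch of how such a payload might be handed to a model (assuming the OpenAI Python SDK; the actual experimental prompts were more elaborate, and the metric field shown is hypothetical):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Weekly trend data in the shape Middleware produces: week start date -> metrics.
# The "lead_time_seconds" field name is illustrative.
trend_data = {
    "2024-01-01": {"lead_time_seconds": 1_468_000},
    "2024-01-08": {"lead_time_seconds": 985_000},
}

prompt = (
    "You are analysing weekly DORA metric trends for an engineering team.\n"
    f"{json.dumps(trend_data, indent=2)}\n"
    "Summarise the trend and flag any outlier weeks."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```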
- **Mapping Data**: Both models were on par at extracting data from the JSON and interpreting it correctly. Example: both GPT and LLAMA mapped the correct data to the input weeks without errors or hallucinations.
_Deployment Trends Summarised: GPT4o_

_Deployment Trends Summarised: LLAMA 3.1 405B_

- **Extracting Inferences**: Both models were able to derive solid inferences from the data.
- LLAMA 3.1 identified the week with the maximum lead time, along with the reason for the high lead time.
- This inference could be verified against the Middleware trend charts.
- GPT4o also extracted the week with the maximum lead time and the reason for it: a high first-response time. Both inferences are easy to cross-check against the raw trend JSON, as sketched below.
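A minimal sketch of that cross-check, assuming a hypothetical `lead_time_seconds` field in each week's entry:

```python
# trend is the weekly JSON object shown above, e.g.
# {"2024-01-01": {"lead_time_seconds": 1468000}, ...}
def week_with_max_lead_time(trend: dict) -> str:
    """Return the week-start date whose lead time is highest."""
    return max(trend, key=lambda week: trend[week]["lead_time_seconds"])
```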
- **Data Presentation**: Data presentation has been hit or miss with LLMs. In some cases GPT presents data better but lags behind LLAMA 3.1 in accuracy, and there have been cases, like the DORA score, where GPT did the math better.
- LLAMA and GPT were both given the lead time value in seconds. LLAMA rounded the value closer to the actual 16.99 days, while GPT rounded it to 17 days 2 hours but presented the data in a more detailed format.
_GPT4o_
_LLAMA 3.1 405B_
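The unit conversion itself is easy to verify. A quick sketch, using an illustrative seconds value chosen to land near 16.99 days:

```python
lead_time_seconds = 1_468_000  # illustrative value, not from the experiment

# Decimal days, the rounding LLAMA produced:
print(f"{lead_time_seconds / 86_400:.2f} days")  # 16.99 days

# Days-and-hours breakdown, the format GPT chose:
days, remainder = divmod(lead_time_seconds, 86_400)
print(f"{days} days {remainder // 3_600} hours")  # 16 days 23 hours
```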
### Actionability
<img width="100%" style="width:100%" src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExZXFmcmM2cno2c3liN3doeXJ6Z282NmxrZDN0ZGd3c2xta2RwOXp5eCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/jsrfOEfEHkHPFSNlir/giphy.gif">
- The models output similar actionables for improving teams' efficiency based on all the metrics.
- Example: Both models identified the reason for the high lead time to be first-response time and suggested that the team use an alerting tool to avoid delayed PR reviews. The models also suggested better planning to avoid rework in a week where rework was high.
_GPT4o_
_LLAMA 3.1 405B_
### Summarisation
To test the summarisation capabilities of the models, we asked each model to summarise every metric trend individually, then fed the outputs for all the trends back into the LLMs to get a single summary, or in Internet slang, a *DORA TL;DR*, for the team.
The summarisation capability on large data is similar in both LLMs.
_LLAMA 3.1 405B_

_GPT4o_
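Structurally this is a simple two-stage, map-reduce style pipeline. A sketch of the flow, assuming a hypothetical `ask_llm` helper that wraps the chat-completion call for whichever model is under test:

```python
METRICS = ["deployment_frequency", "lead_time", "mean_time_to_restore", "change_failure_rate"]

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around the chat-completion call for either model."""
    raise NotImplementedError

def dora_tldr(trends: dict) -> str:
    # Stage 1: summarise each metric's weekly trend on its own.
    per_metric = {m: ask_llm(f"Summarise this {m} trend:\n{trends[m]}") for m in METRICS}
    # Stage 2: feed the four summaries back in for one combined TL;DR.
    combined = "\n\n".join(f"{m}: {summary}" for m, summary in per_metric.items())
    return ask_llm(f"Write a short DORA TL;DR for the team based on:\n{combined}")
```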

## Conclusion
For a long time, LLAMA was playing catch-up with GPT in terms of data processing and analytical ability. Our earlier experiments with older LLAMA models led us to believe that GPT was way ahead, but the recent LLAMA 3.1 405B model is on par with GPT4o.
If you value your customers' data privacy and want to try the open-source LLAMA 3.1 models instead of GPT4, go ahead! The difference in performance is negligible, and with self-hosted models you can keep your data private. Open-source LLMs have finally started to compete with their closed-source counterparts.
Both LLAMA 3.1 and GPT4o are super capable of deriving inferences from processed data and making Middleware’s DORA metrics more actionable and digestible for engineering leaders, leading to more efficient teams.
## Future Work
This was an experiment in building an AI-powered DORA solution. Going forward, we will focus on adding greater support for self-hosted or locally running LLMs from Middleware. Enhanced support for AI-powered action plans throughout the product, using self-hosted LLMs while ensuring data privacy, will be our goal for the coming months.
In the meantime, you can try out the AI DORA summary feature [here](https://github.com/middlewarehq/middleware/tree/ai-beta).
Middleware is an open-source tool designed to help engineering leaders measure and analyze the effectiveness of their teams using the DORA metrics. The DORA metrics are a set of four key values that provide insights into software delivery performance and operational efficiency.
They are:
- **Deployment Frequency**: The frequency of code deployments to production or an operational environment.
- **Lead Time for Changes**: The time it takes for a commit to make it into production.
- **Mean Time to Restore**: The time it takes to restore service after an incident or failure.
- **Change Failure Rate**: The percentage of deployments that result in failures or require remediation.
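For concreteness, here is one way those four values could be computed from raw deployment and incident records. This is an illustrative sketch over hypothetical record shapes, not Middleware's actual implementation:

```python
from datetime import timedelta

def dora_metrics(deployments: list[dict], incidents: list[dict], window_days: int) -> dict:
    """Illustrative only. Assumes hypothetical record shapes:
    deployments: [{"committed_at": datetime, "deployed_at": datetime, "failed": bool}, ...]
    incidents:   [{"started_at": datetime, "resolved_at": datetime}, ...]
    """
    n = len(deployments)
    lead_times = [d["deployed_at"] - d["committed_at"] for d in deployments]
    restore_times = [i["resolved_at"] - i["started_at"] for i in incidents]
    return {
        "deployment_frequency": n / window_days,  # deployments per day
        "lead_time_for_changes": sum(lead_times, timedelta()) / n if n else None,
        "mean_time_to_restore": sum(restore_times, timedelta()) / len(incidents) if incidents else None,
        "change_failure_rate": 100 * sum(d["failed"] for d in deployments) / n if n else None,  # percent
    }
```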