October 31, 2025

Swarm Inference by Fortytwo

At Fortytwo Research Lab, we’re announcing the benchmark results for our new AI architecture, Swarm Inference. Across key evaluation tests, including GPQA Diamond, AIME 2025, LiveCodeBench, and Humanity’s Last Exam, Swarm Inference outperformed OpenAI’s ChatGPT 5, Google Gemini 2.5 Pro, Anthropic Claude Opus 4.1, xAI Grok 4, and DeepSeek R1, demonstrating stronger reasoning abilities even under conditions where other frontier models fail.

Fortytwo Swarm Inference leads on key benchmarks

Swarm Inference operates through a network of interconnected models that answer as one. The network is run by a worldwide community of AI enthusiasts. Each node hosts a small language model (SLM) selected by its operator. The SLM can be fully custom-built, a fine-tuned version of an existing model, or a publicly available open-source model.

When a prompt is submitted, multiple nodes respond, the nodes rank one another's outputs, and the highest-quality answers are combined. To an outside observer, the network behaves like a single model, though in reality its output emerges from the coordination of hundreds of independent SLMs.

AIME 2024

Competition Math (score, %)

Fortytwo: 100
OpenAI ChatGPT 5 Thinking: 94.3
xAI Grok 4: 94.3
DeepSeek R1: 89.3
Google Gemini 2.5 Pro: 88.7
Anthropic Claude Opus 4.1: 75.7

AIME 2025

Competition Math (score, %)

Fortytwo: 96.6
Kimi K2 Thinking: 94.5
OpenAI ChatGPT 5 Thinking: 94.3
xAI Grok 4: 92.7
Google Gemini 2.5 Pro: 87.7
Anthropic Claude Opus 4.1: 80.3
DeepSeek R1: 76

LiveCodeBench (v5 Subset Only)

Competition Coding (score, %)

Fortytwo: 84.4
xAI Grok 4: 81.9
Google Gemini 2.5 Pro: 80.1
DeepSeek R1: 77
OpenAI ChatGPT 5 (high): 66.8
Anthropic Claude Opus 4.1: 65.4

MATH-500

Math Problems (score, %)

Fortytwo: 99.6
OpenAI ChatGPT 5 Thinking: 99.4
xAI Grok 4: 99
DeepSeek R1: 98.3
Google Gemini 2.5 Pro: 96.7
Anthropic Claude Opus 4.1: 91.9

GPQA Diamond

Hard Science (score, %)

xAI Grok 4: 87.7
Fortytwo: 85.9
OpenAI ChatGPT 5 Thinking: 85
Kimi K2 Thinking: 84.5
Google Gemini 2.5 Pro: 84.4
DeepSeek R1: 81
Anthropic Claude Opus 4.1: 81

Humanity's Last Exam

Frontier of human knowledge (score, %)

GPT 5 (high): 26.5
Fortytwo: 24.84
Kimi K2 Thinking: 23.9
xAI Grok 4: 23.9
Gemini 2.5 Pro: 21.1
(model label missing in source): 14.9
(model label missing in source): 11.9

All models were tested at pass@1 using raw prompts with no tool use. This is the standard setup: one attempt per question, scored as correct or incorrect.
The Kimi K2 Thinking scores were added on November 14, based on results reported in the model card.
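
As a minimal sketch of what a pass@1 score means, the function below computes the percentage of questions answered correctly on a single attempt. The function name and the sample outcomes are illustrative assumptions, not Fortytwo's evaluation harness.

```python
from typing import List


def pass_at_1(results: List[bool]) -> float:
    """Percentage of questions solved on the first and only attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)


# Hypothetical outcomes for an 8-question set: 7 correct, 1 wrong.
outcomes = [True] * 7 + [False]
print(f"{pass_at_1(outcomes):.1f}")  # 87.5
```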

Understanding beyond reasoning

To check accuracy under varied conditions, Fortytwo also ran additional benchmark tests that add extraneous context to the standard benchmark prompts. This method guards against simple recall of memorized benchmark data and checks whether a model truly understands the problem.

This type of testing resembles the 'extraneous information' problems used in university exams and olympiads, where the problem statement includes irrelevant details that are not needed to solve the task. Such problems help determine whether a student genuinely understands the essence of the problem or is simply applying familiar formulas mechanically.

Reasoning Resilience: Fortytwo vs. Grok 4

Without extraneous information

Answer the following question. The last line of your response should be in the following format: 'Answer: <LETTER>'.

Acetic acid is treated with <...> how many distinct hydrogen signals will be observable in the 1H NMR spectrum of 4? <...>

Z) 5

X) 10

C) 12

V) 8

Fortytwo: "Answer: V" (Correct)

Grok 4: "Answer: V" (Correct)

With extraneous information

Answer the following question. The last line of your response should be in the following format: 'Answer: <LETTER>'.

Also, some nonrelevant message: There is a cat on the roof. Maybe it is hungry?!

Acetic acid is treated with <...> how many distinct hydrogen signals will be observable in the 1H NMR spectrum of 4? <...>

Z) 5

X) 10

C) 12

V) 8

Fortytwo: "Answer: V" (Correct)

Grok 4: "Answer: X" (Incorrect)

Extraneous information (here, the "nonrelevant message" about the cat) is deliberately designed to be unrelated or confusing, testing whether a model can stay focused on the actual problem or becomes distracted by irrelevant context.
Reasoning models can be distracted easily and waste compute budget overthinking irrelevant information.
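
The perturbation in the example above can be sketched in a few lines: insert a distractor paragraph after the format instruction, then read the model's letter from the final 'Answer: <LETTER>' line. The helper names `add_extraneous` and `extract_answer` are hypothetical, and the exact placement of the distractor is an assumption based on the single example shown here.

```python
import re
from typing import Optional


def add_extraneous(prompt: str, distractor: str) -> str:
    """Insert a distractor paragraph after the first paragraph of the prompt."""
    head, sep, rest = prompt.partition("\n\n")
    if not sep:  # single-paragraph prompt: append the distractor instead
        return f"{prompt}\n\n{distractor}"
    return f"{head}\n\n{distractor}\n\n{rest}"


def extract_answer(response: str) -> Optional[str]:
    """Return the letter from a final line shaped like 'Answer: <LETTER>'."""
    lines = response.strip().splitlines()
    if not lines:
        return None
    match = re.fullmatch(r"Answer:\s*([A-Z])", lines[-1].strip())
    return match.group(1) if match else None


base = "Answer the following question.\n\nHow many distinct hydrogen signals?"
noisy = add_extraneous(
    base,
    "Also, some nonrelevant message: There is a cat on the roof. Maybe it is hungry?!",
)
print(noisy)
print(extract_answer("Some reasoning...\nAnswer: V"))  # V
```

Grading both the base and the perturbed prompt with the same extraction rule keeps the comparison fair: only the prompt changes between the two runs.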

Swarm Inference stays accurate in real-world scenarios

Fortytwo's Swarm Inference consistently demonstrates higher resilience to noise, prompt injections, and deliberately misleading inputs. In contrast, frontier AI models show steep declines in accuracy, often getting trapped in repetitive reasoning loops or distracted by irrelevant details.

By coordinating peer-ranked responses across diverse models, Swarm Inference maintains stable accuracy and delivers more reliable reasoning. This collaborative process allows intelligence to scale beyond the limits of individual models and opens a new path toward dependable, high-precision problem-solving.

Fortytwo leads on GPQA Diamond with extraneous information

Score, % (with extraneous information / without extraneous information)

Fortytwo: 85.78 / 85.9
xAI Grok 4: 79.5 / 87.7
OpenAI ChatGPT 5 Thinking: 83.8 / 84.4
Google Gemini 2.5 Pro: 83.2 / not shown
Anthropic Claude Opus 4.1: 74.45 / not shown
DeepSeek R1: 70.2 / not shown
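
To make the comparison concrete, the short sketch below computes the accuracy drop in percentage points for the Fortytwo and xAI Grok 4 rows of the chart above, using only the scores reported there. The function name is an illustrative assumption.

```python
# model: (score without extraneous information, score with it), in %
scores = {
    "Fortytwo": (85.9, 85.78),
    "xAI Grok 4": (87.7, 79.5),
}


def accuracy_drop(without: float, with_extraneous: float) -> float:
    """Drop in percentage points when extraneous information is added."""
    return round(without - with_extraneous, 2)


for model, (without, with_extraneous) in scores.items():
    print(f"{model}: -{accuracy_drop(without, with_extraneous):.2f} points")
# Fortytwo: -0.12 points
# xAI Grok 4: -8.20 points
```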

The model is the network

Fortytwo’s Swarm Inference shows that intelligence can emerge from a decentralized network of small, diverse models that rank, validate, and improve each other, forming intelligence greater than the sum of its parts.

Each node produces its own response

Nodes compare and evaluate responses

The swarm selects the most relevant answers

The merged result is returned as the collective output
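
The four steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not Fortytwo's implementation: the `Node` interface, the win-count ranking, the rule that a node does not judge its own answer, and the majority-vote merge are all hypothetical stand-ins.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical node interface; not Fortytwo's actual API."""
    name: str
    generate: Callable[[str], str]           # prompt -> answer
    prefer: Callable[[str, str, str], bool]  # (prompt, a, b) -> True if a beats b


def majority(answers: List[str]) -> str:
    """Placeholder merge: most common answer, ties go to the higher-ranked one."""
    return Counter(answers).most_common(1)[0][0]


def swarm_answer(prompt: str, nodes: List[Node], top_k: int = 2,
                 merge: Callable[[List[str]], str] = majority) -> str:
    # 1. Each node produces its own response.
    answers = [node.generate(prompt) for node in nodes]
    # 2. Nodes compare responses pairwise, skipping pairs that include their own.
    wins = [0] * len(answers)
    for judge_idx, judge in enumerate(nodes):
        for i, j in combinations(range(len(answers)), 2):
            if judge_idx in (i, j):
                continue
            if judge.prefer(prompt, answers[i], answers[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    # 3. The swarm selects the top_k responses by number of wins.
    ranked = sorted(range(len(answers)), key=lambda i: wins[i], reverse=True)
    best = [answers[i] for i in ranked[:top_k]]
    # 4. The merged result is returned as the collective output.
    return merge(best)


def make_node(name: str, answer: str) -> Node:
    # Toy node: always gives `answer` and prefers responses that match it.
    return Node(name, lambda prompt: answer, lambda prompt, a, b: a == answer)


nodes = [make_node("a", "8"), make_node("b", "8"), make_node("c", "10")]
print(swarm_answer("How many signals?", nodes))  # 8
```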

Mechanisms enabling Swarm Inference

Multiple specialized AI nodes independently produce answers to the same query, each bringing its unique domain expertise.

Nodes compare answers head-to-head instead of scoring them absolutely, ensuring that the strongest reasoning consistently wins.

Rankings are combined using a statistical model that gives more weight to proven, high-accuracy nodes.

Nodes maintain reputation by demonstrating real capability, verified through peer evaluation, preventing spam or fake identities.

Diversity across nodes filters out noise: when one is misled, others correct it, yielding accuracy beyond any single model.
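
One way to combine head-to-head comparisons while giving more weight to trusted nodes is a Bradley-Terry model fitted on reputation-weighted votes. The text above only says "a statistical model", so the choice of Bradley-Terry, the smoothing prior, and every name in this sketch are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Vote = Tuple[str, int, int]  # (judge name, winner index, loser index)


def aggregate(n_answers: int, votes: List[Vote], reputation: Dict[str, float],
              iters: int = 200, prior: float = 0.1) -> List[float]:
    """Fit Bradley-Terry strengths from reputation-weighted pairwise votes."""
    if n_answers < 2:
        raise ValueError("need at least two answers to compare")
    n = n_answers
    # wins[i][j] = total weight of votes saying answer i beats answer j.
    # A small prior keeps every strength positive and the updates well defined.
    wins = [[0.0 if i == j else prior for j in range(n)] for i in range(n)]
    for judge, winner, loser in votes:
        wins[winner][loser] += reputation[judge]
    p = [1.0 / n] * n
    for _ in range(iters):  # standard minorize-maximize update for Bradley-Terry
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


# Hypothetical reputations and votes: two low-reputation nodes prefer answer 1,
# while one high-reputation node prefers answer 0.
reputation = {"n1": 0.9, "n2": 0.3, "n3": 0.2}
votes = [("n1", 0, 1), ("n1", 0, 2), ("n1", 1, 2), ("n2", 1, 0), ("n3", 1, 0)]
strengths = aggregate(3, votes, reputation)
print(max(range(3), key=lambda i: strengths[i]))  # 0
```

With unweighted counting, answer 1 would beat answer 0 by two votes to one. Weighting by reputation reverses that outcome, which is the behavior the mechanism description implies.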

Interacting with Fortytwo

Public API

On-Chain Interface

Node Participation

What's Next

Fortytwo will continue to grow the network, enabling fully open participation by node operators, custom model providers, and data scientists. The team is pushing for even greater accuracy and intelligence across the swarm, with an API release planned later this year to rival frontier AI companies in the most demanding use cases: coding, deep research, and advanced reasoning.

Technical Report

Fortytwo: Swarm Inference with Peer-Ranked Consensus (available on arXiv)