
October 31, 2025

Swarm Inference by Fortytwo

At Fortytwo Research Lab, we’re announcing the benchmark results for our new AI architecture, Swarm Inference. Across key evaluation tests, including GPQA Diamond, AIME 2025, LiveCodeBench, and Humanity’s Last Exam, Swarm Inference outperformed OpenAI’s ChatGPT 5, Google Gemini 2.5 Pro, Anthropic Claude Opus 4.1, xAI Grok 4, and DeepSeek R1, demonstrating stronger reasoning abilities even under conditions where other frontier models fail.

Fortytwo Swarm Inference leads on key benchmarks

Swarm inference operates through a network of interconnected models, answering as one. The network is run by a community of AI enthusiasts worldwide. Each node hosts a small language model (SLM) selected by its operator. The SLM can be fully custom-built, a fine-tuned version of an existing model, or a publicly available open-source model.

When a prompt is introduced, multiple nodes respond, their outputs are ranked by each other, and the highest quality answers are combined. To the outside observer, the network behaves like a single model, though in reality it emerges from the coordination of hundreds of independent SLMs.
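Conceptually, a single swarm request can be pictured as the loop below: every participating node generates a candidate answer with its own SLM, the candidates are cross-ranked by peer nodes, and the top-ranked answers are merged into one response. The node interface and the merge step are simplified assumptions made for illustration; this is a sketch, not Fortytwo's actual protocol.

```python
# Conceptual sketch of one Swarm Inference round (simplified illustration,
# not the real protocol). Each node is assumed to expose:
#   generate(prompt) -> str
#   rank(prompt, candidates) -> list of candidate indices, best first.

def swarm_round(prompt, nodes, top_k=3):
    # 1. Every node answers independently with its own small language model.
    candidates = [node.generate(prompt) for node in nodes]

    # 2. Nodes cross-rank all candidates; a candidate earns more points the
    #    higher its peers place it (last place earns 0 points).
    scores = [0] * len(candidates)
    for node in nodes:
        ranking = node.rank(prompt, candidates)           # best candidate first
        for points, idx in enumerate(reversed(ranking)):
            scores[idx] += points

    # 3. The highest-scoring answers are merged into one collective output.
    best = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)[:top_k]
    return "\n\n".join(candidates[i] for i in best)
```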

AIME 2024

Competition Math (score, %):
Fortytwo: 100
OpenAI ChatGPT 5 Thinking: 94.3
xAI Grok 4: 94.3
DeepSeek R1: 89.3
Google Gemini 2.5 Pro: 88.7
Anthropic Claude Opus 4.1: 75.7

AIME 2025

Competition Math (score, %):
Fortytwo: 96.6
Kimi K2 Thinking: 94.5
OpenAI ChatGPT 5 Thinking: 94.3
xAI Grok 4: 92.7
Google Gemini 2.5 Pro: 87.7
Anthropic Claude Opus 4.1: 80.3
DeepSeek R1: 76

LiveCodeBench (v5 Subset Only)

Competition Coding (score, %):
Fortytwo: 84.4
xAI Grok 4: 81.9
Google Gemini 2.5 Pro: 80.1
DeepSeek R1: 77
OpenAI ChatGPT 5 (high): 66.8
Anthropic Claude Opus 4.1: 65.4

MATH-500

Math Problems (score, %):
Fortytwo: 99.6
OpenAI ChatGPT 5 Thinking: 99.4
xAI Grok 4: 99
DeepSeek R1: 98.3
Google Gemini 2.5 Pro: 96.7
Anthropic Claude Opus 4.1: 91.9

GPQA Diamond

Hard Science (score, %):
xAI Grok 4: 87.7
Fortytwo: 85.9
OpenAI ChatGPT 5 Thinking: 85
Kimi K2 Thinking: 84.5
Google Gemini 2.5 Pro: 84.4
DeepSeek R1: 81
Anthropic Claude Opus 4.1: 81

Humanity's Last Exam

Frontier of human knowledge (score, %):
OpenAI GPT 5 (high): 26.5
Fortytwo: 24.84
Kimi K2 Thinking: 23.9
xAI Grok 4: 23.9
Google Gemini 2.5 Pro: 21.1
DeepSeek R1: 14.9
Anthropic Claude Opus 4.1: 11.9

All models were tested at pass@1 using raw prompts with no tool usage, i.e. standard testing with a single attempt per question.
The Kimi K2 Thinking scores were added on November 14, based on results reported in its model card.
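For readers unfamiliar with the protocol, the sketch below illustrates pass@1 scoring under these rules: each question is sent once as a raw prompt, the single completion is graded against the reference answer, and accuracy is the fraction solved on that first attempt. The answer extraction and grading here are simplified stand-ins, not the harness used for these results.

```python
# Simplified pass@1 scoring sketch (illustrative; not the actual evaluation harness).
# `model` is assumed to expose generate(prompt) -> str; each dataset item is a dict
# with a 'prompt' and a reference 'answer'. Grading is naive exact matching on the
# final "Answer: ..." line.

def extract_answer(completion: str) -> str:
    """Return the value after the last 'Answer:' marker, or the whole completion."""
    for line in reversed(completion.strip().splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return completion.strip()

def pass_at_1(model, dataset):
    """dataset: list of {'prompt': str, 'answer': str}; one attempt per item, no tools."""
    correct = sum(
        extract_answer(model.generate(item["prompt"])) == item["answer"]
        for item in dataset
    )
    return correct / len(dataset)   # fraction answered correctly on the first try
```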

Understanding beyond reasoning

To assess the models' accuracy under varied conditions, Fortytwo also ran additional benchmark tests in which extraneous context was added alongside the standard benchmark prompts. This approach prevents simple recall of memorized benchmark data and checks whether a model truly understands the problem.

This mirrors a technique used in university exams and olympiads known as 'extraneous information' problems: the problem statement includes additional irrelevant details that are not required to solve the task. Such problems help determine whether a student genuinely understands the essence of the problem or is merely applying familiar formulas mechanically.
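As a rough sketch of how such perturbed prompts can be constructed, the snippet below prepends an irrelevant distractor sentence to a benchmark question while leaving the question, options, and answer untouched. The distractor list and the item schema are assumptions made for illustration; they are not the exact perturbations Fortytwo used.

```python
import random

# Illustrative construction of an "extraneous information" benchmark variant.
# The distractor sentences and item fields are assumptions for demonstration only.
DISTRACTORS = [
    "Also, some nonrelevant message: There is a cat on the roof. Maybe it is hungry?!",
    "Unrelated note: the cafeteria switched to oat milk last Tuesday.",
    "Side remark with no bearing on the task: water boils at a lower temperature at altitude.",
]

def add_extraneous_info(item: dict, rng: random.Random) -> dict:
    """Return a copy of the benchmark item with an irrelevant sentence injected
    before the actual question; the question, options, and answer stay intact."""
    return {**item, "prompt": f"{rng.choice(DISTRACTORS)}\n\n{item['prompt']}"}

# Toy usage mirroring the chemistry example shown below.
example = {
    "prompt": "Acetic acid is treated with <...> 1H NMR spectrum of 4?\nZ) 5\nX) 10\nC) 12\nV) 8",
    "answer": "V",
}
print(add_extraneous_info(example, random.Random(0))["prompt"])
```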

Reasoning Resilience: Fortytwo vs. Grok 4

Without extraneous information

Answer the following question. The last line of your response should be in the following format: 'Answer: <LETTER>'.


Acetic acid is treated with <...> how many distinct hydrogen signals will be observable in the 1H NMR spectrum of 4? <...>


Z) 5

X) 10

C) 12

V) 8

Fortytwo:"Answer:V"Correct
Grok 4:"Answer:V"Correct

With extraneous information

Answer the following question. The last line of your response should be in the following format: 'Answer: <LETTER>'.


Also, some nonrelevant message: There is a cat on the roof. Maybe it is hungry?!


Acetic acid is treated with <...> how many distinct hydrogen signals will be observable in the 1H NMR spectrum of 4? <...>


Z) 5

X) 10

C) 12

V) 8

Fortytwo:"Answer: V"Correct
Grok 4:"Answer: X"Incorrect

The extraneous information (the "cat on the roof" sentence above) is deliberately designed to be unrelated or confusing, testing whether a model can stay focused on the actual problem or is distracted by irrelevant context.
Reasoning models get distracted easily and waste compute budget overthinking irrelevant information.

Swarm Inference stays accurate in real-world scenarios

Fortytwo's Swarm Inference consistently demonstrates higher resilience to noise, prompt injections, and deliberately misleading inputs. In contrast, frontier AI models show steep declines in accuracy, often getting trapped in repetitive reasoning loops or distracted by irrelevant details.

By coordinating peer-ranked responses across diverse models, Swarm Inference maintains stable accuracy and delivers more reliable reasoning. This collaborative process allows intelligence to scale beyond the limits of individual models and opens a new path toward dependable, high-precision problem-solving.

Fortytwo leads on GPQA Diamond with extraneous information

Score, % (with extraneous information / without extraneous information):
Fortytwo: 85.78 / 85.9
xAI Grok 4: 79.5 / 87.7
OpenAI ChatGPT 5 Thinking: 83.8 / 84.4
Google Gemini 2.5 Pro: 83.2 / 85
Anthropic Claude Opus 4.1: 74.45 / 81
DeepSeek R1: 70.2 / 81

The model is the network

Fortytwo’s Swarm Inference shows that intelligence can emerge from a decentralized network of small, diverse models that rank, validate, and improve each other, forming intelligence greater than the sum of its parts.

1. Each node produces its own response.

2. Nodes compare and evaluate responses.

3. The swarm selects the most relevant answers.

4. The merged result is returned as the collective output.

Mechanisms enabling Swarm Inference

1. Multiple specialized AI nodes independently produce answers to the same query, each bringing its unique domain expertise.

2. Nodes compare answers head-to-head instead of scoring them absolutely, ensuring that the strongest reasoning consistently wins.

3. Rankings are combined using a statistical model that gives more weight to proven, high-accuracy nodes (see the sketch after this list).

4. Nodes maintain reputation by demonstrating real capability, verified through peer evaluation, which prevents spam and fake identities.

5. Diversity across nodes filters out noise: when one node is misled, others correct it, yielding accuracy beyond any single model.
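The ranking and weighting in points 2 and 3 can be pictured with the sketch below: peer nodes compare candidate answers pairwise, each comparison is weighted by the judging node's reputation, and the answer with the highest weighted win share is selected. This is an illustrative reputation-weighted pairwise-voting scheme under assumed data structures, not Fortytwo's actual statistical model.

```python
from collections import defaultdict

# Illustrative sketch of reputation-weighted pairwise ranking.
# `judgments` maps a judging node to its pairwise preferences between answers;
# `reputation` holds each judge's historical accuracy weight. Both structures are
# assumptions made for demonstration.

def aggregate(answers, judgments, reputation):
    """answers: {answer_id: text}
    judgments: {judge_id: {(a, b): winner_id for each compared pair (a, b)}}
    reputation: {judge_id: float weight; higher = historically more accurate}
    Returns the answer id with the highest reputation-weighted win share."""
    wins = defaultdict(float)
    total = defaultdict(float)
    for judge, prefs in judgments.items():
        w = reputation.get(judge, 1.0)
        for (a, b), winner in prefs.items():
            wins[winner] += w        # weighted win for the preferred answer
            total[a] += w            # both answers took part in a weighted comparison
            total[b] += w
    # Win share = weighted wins / weighted comparisons the answer appeared in.
    return max(answers, key=lambda a: wins[a] / total[a] if total[a] else 0.0)

# Toy usage: three candidate answers, two judges with different reputations.
answers = {"a1": "42", "a2": "41", "a3": "forty-two"}
judgments = {
    "node_A": {("a1", "a2"): "a1", ("a1", "a3"): "a1", ("a2", "a3"): "a3"},
    "node_B": {("a1", "a2"): "a2", ("a1", "a3"): "a1", ("a2", "a3"): "a2"},
}
reputation = {"node_A": 0.9, "node_B": 0.4}
print(aggregate(answers, judgments, reputation))   # -> "a1"
```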

Interacting with Fortytwo

Public API

On-Chain Interface

Node Participation

What's Next

Fortytwo will continue to grow the network, enabling fully open participation by node operators, custom model providers, and data scientists. The team is pushing for even greater accuracy and intelligence across the swarm, with an API release planned later this year to rival frontier AI companies in the most demanding use cases: coding, deep research, and advanced reasoning.