AI Humanizer Benchmark Methodology
The data behind AI2Human's bypass rate claims — published for transparency and reproducibility.
Last updated: January 31, 2026
Key Finding
In our January 2026 internal benchmark (n=2,400 tests, Heavy mode, 500-word academic samples), AI2Human achieved 99% bypass rates on GPTZero, 99% on Turnitin's AI Writing Indicator, and 97% on Originality.ai.
Test Design
| Parameter | Value |
|---|---|
| Sample size | 2,400 independent tests |
| Mode tested | Heavy (maximum transformation) |
| Input length | 500 words (academic prose) |
| Input source | ChatGPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro |
| Test period | January 6–31, 2026 |
| Detectors | GPTZero, Turnitin AI Writing Indicator, Originality.ai |
Pass/Fail Definition
A test is counted as a pass (bypassed) if the detector's reported AI score meets the following per-detector threshold:
- GPTZero: <20% "AI generated" probability
- Turnitin AI Writing Indicator: 0% or "No AI writing detected"
- Originality.ai: <20% AI score
These thresholds match the practical pass standards used by universities and educators based on published institutional guidance as of January 2026.
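The pass/fail rule above can be sketched as a small classifier. This is an illustrative sketch only: the detector keys, the `is_pass` helper, and the score representation (a 0–1 fraction) are assumptions for demonstration, not any detector's real API.

```python
# Illustrative pass/fail rule from the thresholds above.
# Scores are expressed as fractions (0.20 == 20%).
THRESHOLDS = {
    "gptzero": 0.20,      # pass if AI probability < 20%
    "turnitin": 0.0,      # pass only at exactly 0% / "No AI writing detected"
    "originality": 0.20,  # pass if AI score < 20%
}

def is_pass(detector: str, ai_score: float) -> bool:
    """Return True if the detector score counts as a bypass."""
    threshold = THRESHOLDS[detector]
    if threshold == 0.0:
        return ai_score == 0.0  # Turnitin requires an exact zero
    return ai_score < threshold

print(is_pass("gptzero", 0.12))      # True  (12% < 20%)
print(is_pass("turnitin", 0.01))     # False (not exactly 0%)
print(is_pass("originality", 0.25))  # False (25% >= 20%)
```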
Results — January 2026
| Detector | Tests Run | Passed | Bypass Rate |
|---|---|---|---|
| GPTZero | 800 | 792 | 99.0% |
| Turnitin AI Writing Indicator | 800 | 792 | 99.0% |
| Originality.ai | 800 | 776 | 97.0% |
| Combined (all three) | 2,400 | 2,360 | 98.3% |
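The per-detector and combined rates in the table follow directly from the raw pass counts; a quick sketch to reproduce them:

```python
# Recompute the table's bypass rates from the raw counts (tests run, passed).
results = {
    "GPTZero": (800, 792),
    "Turnitin AI Writing Indicator": (800, 792),
    "Originality.ai": (800, 776),
}

for detector, (tests, passed) in results.items():
    print(f"{detector}: {passed / tests:.1%}")

total_tests = sum(t for t, _ in results.values())
total_passed = sum(p for _, p in results.values())
print(f"Combined: {total_passed / total_tests:.1%}")  # 98.3%
```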
Methodology Notes
Input generation: Source texts were generated by prompting ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro with 20 distinct academic topics across 5 disciplines (history, biology, economics, literature, engineering). Each model generated 80 texts per discipline at 500 words, for 400 texts per model and 1,200 total source texts.
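The sample plan above multiplies out as a quick sanity check (the variable names here are illustrative, not part of any test harness):

```python
# Sample plan: 3 models x 5 disciplines x 80 texts per model per discipline.
models = ["ChatGPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Pro"]
disciplines = ["history", "biology", "economics", "literature", "engineering"]
texts_per_model_per_discipline = 80

per_model = len(disciplines) * texts_per_model_per_discipline
total_source_texts = len(models) * per_model
print(per_model, total_source_texts)  # 400 1200
```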
Humanization: Each source text was humanized once using AI2Human's Heavy mode with Standard style. No cherry-picking — all outputs were tested regardless of quality.
Testing: Each humanized text was submitted to each of the three detectors via their standard web interfaces. Tests were run during January 2026 using the detector versions active during that period.
Limitations: Results apply specifically to Heavy mode at ~500 words. Light and Medium modes produce lower bypass rates. Detector model versions change over time — bypass rates may vary as detectors update. Results are for academic prose; other content types were not benchmarked in this round.
Bias disclosure: This benchmark was conducted internally by AI2Human. We publish the methodology here to allow independent verification. Third-party audits are welcome.
Historical Performance
| Period | GPTZero | Turnitin | Originality.ai | Notes |
|---|---|---|---|---|
| Q3 2024 | 91% | 88% | 85% | Initial model, n=400 |
| Q4 2024 | 95% | 93% | 91% | Model v2, n=800 |
| Q1 2025 | 97% | 96% | 94% | Model v3, GPTZero update |
| Q3 2025 | 98% | 98% | 96% | Model v4, Turnitin update |
| Jan 2026 | 99% | 99% | 97% | Model v5, n=2,400 |
Earlier benchmarks used smaller sample sizes and may have applied different pass/fail thresholds. The January 2026 run is the current reference benchmark.
Try the Tool That Produced These Results
5 free humanizations — no account required. Heavy mode available immediately.
Start Humanizing Free

Have questions about methodology? Contact us
