AI Humanizer Benchmark Methodology

The data behind AI2Human's bypass rate claims — published for transparency and reproducibility.

Last updated: January 31, 2026

Key Finding

In our January 2026 internal benchmark (n=2,400 tests, Heavy mode, 500-word academic samples), AI2Human achieved 99% bypass rates on GPTZero, 99% on Turnitin's AI Writing Indicator, and 97% on Originality.ai.

Test Design

Sample size

2,400 independent tests

Mode tested

Heavy (maximum transformation)

Input length

500 words (academic prose)

Input source

ChatGPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro

Test period

January 6–31, 2026

Detectors

GPTZero, Turnitin AI Writing Indicator, Originality.ai

Pass/Fail Definition

A test is counted as a pass (bypassed) if the detector's result meets the following per-detector criteria:

  • GPTZero: <20% "AI generated" probability
  • Turnitin AI Writing Indicator: 0% or "No AI writing detected"
  • Originality.ai: <20% AI score

These thresholds are chosen to match the practical pass standards applied by universities and educators, based on institutional guidance published as of January 2026.
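As an illustration, the pass/fail rule above can be expressed as a small check. This is a sketch only: the detector keys and the `verdict` field are assumptions for illustration, not any detector's actual API.

```python
from typing import Optional

def is_bypass(detector: str, ai_score: float, verdict: Optional[str] = None) -> bool:
    """Return True if a detector result counts as a pass under the thresholds above.

    ai_score: the detector's AI probability as a percentage (0-100).
    verdict: an optional textual verdict (illustrative field, used for Turnitin).
    """
    if detector == "gptzero":
        return ai_score < 20.0        # <20% "AI generated" probability
    if detector == "turnitin":
        # 0% score or the textual "No AI writing detected" verdict
        return ai_score == 0.0 or verdict == "No AI writing detected"
    if detector == "originality":
        return ai_score < 20.0        # <20% AI score
    raise ValueError(f"unknown detector: {detector}")
```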

Results — January 2026

Detector                         Tests Run   Passed   Bypass Rate
GPTZero                                800      792         99.0%
Turnitin AI Writing Indicator          800      792         99.0%
Originality.ai                         800      776         97.0%
Combined (all three)                 2,400    2,360         98.3%
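The per-detector and combined rates in the table follow directly from the pass counts; a minimal check of the arithmetic:

```python
# Pass counts per detector, taken from the results table above.
results = {
    "GPTZero": (800, 792),
    "Turnitin AI Writing Indicator": (800, 792),
    "Originality.ai": (800, 776),
}

for name, (run, passed) in results.items():
    print(f"{name}: {passed / run:.1%}")  # 99.0%, 99.0%, 97.0%

total_run = sum(run for run, _ in results.values())
total_passed = sum(passed for _, passed in results.values())
print(f"Combined: {total_passed / total_run:.1%}")  # prints "Combined: 98.3%"
```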

Methodology Notes

Input generation: Source texts were generated by prompting ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro with 20 distinct academic topics across 5 disciplines (history, biology, economics, literature, engineering). Each model generated 80 texts of 500 words per discipline, for 400 texts per model and 1,200 source texts in total.
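The sampling design described above (3 models × 5 disciplines × 80 texts each) can be enumerated directly. The model and discipline names come from the text; everything else here is illustrative scaffolding, not the actual harness.

```python
# Enumerate the full test matrix: one entry per source text.
models = ["ChatGPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Pro"]
disciplines = ["history", "biology", "economics", "literature", "engineering"]
TEXTS_PER_MODEL_PER_DISCIPLINE = 80

samples = [
    (model, discipline, i)
    for model in models
    for discipline in disciplines
    for i in range(TEXTS_PER_MODEL_PER_DISCIPLINE)
]

per_model = len(disciplines) * TEXTS_PER_MODEL_PER_DISCIPLINE
print(per_model)      # prints 400 (texts per model)
print(len(samples))   # prints 1200 (total source texts)
```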

Humanization: Each source text was humanized once using AI2Human's Heavy mode with Standard style. No cherry-picking — all outputs were tested regardless of quality.

Testing: Each humanized text was submitted to each of the three detectors via their standard web interfaces. Tests were run during January 2026 using the detector versions active during that period.

Limitations: Results apply specifically to Heavy mode at ~500 words. Light and Medium modes produce lower bypass rates. Detector model versions change over time — bypass rates may vary as detectors update. Results are for academic prose; other content types were not benchmarked in this round.

Bias disclosure: This benchmark was conducted internally by AI2Human. We publish the methodology here to allow independent verification. Third-party audits are welcome.

Historical Performance

Period     GPTZero   Turnitin   Originality.ai   Notes
Q3 2024    91%       88%        85%              Initial model, n=400
Q4 2024    95%       93%        91%              Model v2, n=800
Q1 2025    97%       96%        94%              Model v3, GPTZero update
Q3 2025    98%       98%        96%              Model v4, Turnitin update
Jan 2026   99%       99%        97%              Model v5, n=2,400

Earlier benchmarks used smaller sample sizes and, in some cases, different pass/fail thresholds, so figures across periods are not directly comparable. January 2026 is the current reference benchmark.

Try the Tool That Produced These Results

5 free humanizations — no account required. Heavy mode available immediately.

Start Humanizing Free

Have questions about methodology? Contact us