AI Humanizer Benchmark Methodology
The data behind AI2Human's bypass rate claims — published for transparency and reproducibility.
Last updated: January 31, 2026
Key Finding
In our January 2026 internal benchmark (n=2,400 tests, Heavy mode, 500-word academic samples), AI2Human achieved 99% bypass rates on GPTZero, 99% on Turnitin's AI Writing Indicator, and 97% on Originality.ai.
Test Design
| Parameter | Value |
|---|---|
| Sample size | 2,400 independent tests |
| Mode tested | Heavy (maximum transformation) |
| Input length | 500 words (academic prose) |
| Input source | ChatGPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro |
| Test period | January 6–31, 2026 |
| Detectors | GPTZero, Turnitin AI Writing Indicator, Originality.ai |
Pass/Fail Definition
A test is counted as a pass (bypassed) if the detector's reported AI score meets the following per-detector threshold:
- GPTZero: <20% "AI generated" probability
- Turnitin AI Writing Indicator: 0% or "No AI writing detected"
- Originality.ai: <20% AI score
These thresholds match the practical pass standards used by universities and educators based on published institutional guidance as of January 2026.
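The pass/fail rule above can be sketched as a small classifier. This is an illustrative sketch only: the detector keys, the `is_pass` helper, and the score representation (a 0–1 fraction) are assumptions for demonstration, not any detector's real API.

```python
# Illustrative pass/fail rule from the thresholds above.
# Scores are expressed as fractions (0.20 == 20%).
THRESHOLDS = {
    "gptzero": 0.20,      # pass if AI probability < 20%
    "turnitin": 0.0,      # pass only at exactly 0% / "No AI writing detected"
    "originality": 0.20,  # pass if AI score < 20%
}

def is_pass(detector: str, ai_score: float) -> bool:
    """Return True if the detector score counts as a bypass."""
    threshold = THRESHOLDS[detector]
    if threshold == 0.0:
        return ai_score == 0.0  # Turnitin requires an exact zero
    return ai_score < threshold

print(is_pass("gptzero", 0.12))      # True  (12% < 20%)
print(is_pass("turnitin", 0.01))     # False (not exactly 0%)
print(is_pass("originality", 0.25))  # False (25% >= 20%)
```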
Results — January 2026
| Detector | Tests Run | Passed | Bypass Rate |
|---|---|---|---|
| GPTZero | 800 | 792 | 99.0% |
| Turnitin AI Writing Indicator | 800 | 792 | 99.0% |
| Originality.ai | 800 | 776 | 97.0% |
| Combined (all three) | 2,400 | 2,360 | 98.3% |
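The per-detector and combined rates in the table follow directly from the raw pass counts; a quick sketch to reproduce them:

```python
# Recompute the table's bypass rates from the raw counts (tests run, passed).
results = {
    "GPTZero": (800, 792),
    "Turnitin AI Writing Indicator": (800, 792),
    "Originality.ai": (800, 776),
}

for detector, (tests, passed) in results.items():
    print(f"{detector}: {passed / tests:.1%}")

total_tests = sum(t for t, _ in results.values())
total_passed = sum(p for _, p in results.values())
print(f"Combined: {total_passed / total_tests:.1%}")  # 98.3%
```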
Methodology Notes
Input generation: Source texts were generated by prompting ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro with 20 distinct academic topics across 5 disciplines (history, biology, economics, literature, engineering). Each model generated 80 texts per discipline at 500 words, for 400 texts per model and 1,200 total source texts.
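The sample plan above multiplies out as a quick sanity check (the variable names here are illustrative, not part of any test harness):

```python
# Sample plan: 3 models x 5 disciplines x 80 texts per model per discipline.
models = ["ChatGPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Pro"]
disciplines = ["history", "biology", "economics", "literature", "engineering"]
texts_per_model_per_discipline = 80

per_model = len(disciplines) * texts_per_model_per_discipline
total_source_texts = len(models) * per_model
print(per_model, total_source_texts)  # 400 1200
```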
Humanization: Each source text was humanized once using AI2Human's Heavy mode with Standard style. No cherry-picking — all outputs were tested regardless of quality.
Testing: Each humanized text was submitted to each of the three detectors via their standard web interfaces. Tests were run during January 2026 using the detector versions active during that period.
Limitations: Results apply specifically to Heavy mode at ~500 words. Light and Medium modes produce lower bypass rates. Detector model versions change over time — bypass rates may vary as detectors update. Results are for academic prose; other content types were not benchmarked in this round.
Bias disclosure: This benchmark was conducted internally by AI2Human. We publish the methodology here to allow independent verification. Third-party audits are welcome.
Historical Performance
| Period | GPTZero | Turnitin | Originality.ai | Notes |
|---|---|---|---|---|
| Q3 2024 | 91% | 88% | 85% | Initial model, n=400 |
| Q4 2024 | 95% | 93% | 91% | Model v2, n=800 |
| Q1 2025 | 97% | 96% | 94% | Model v3, GPTZero update |
| Q3 2025 | 98% | 98% | 96% | Model v4, Turnitin update |
| Jan 2026 | 99% | 99% | 97% | Model v5, n=2,400 |
Earlier benchmarks used smaller sample sizes and may have applied different pass/fail thresholds. The January 2026 run is the current reference benchmark.
Try the Tool That Produced These Results
5 free humanizations — no account required. Heavy mode available immediately.
Start Humanizing Free

Have questions about methodology? Contact us
