How crawlers read robots.txt
Well-behaved bots fetch `/robots.txt` before crawling a host. The file is plain text: `User-agent` groups, `Allow` / `Disallow` path rules, `Sitemap` URLs, and an optional `Crawl-delay`, which is not part of RFC 9309 and which Googlebot ignores. It lives at the root of the host it governs: `https://example.com/robots.txt`, not tucked under `/blog/`.
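To make the file layout concrete, here is a minimal sketch that feeds a sample file to Python's standard `urllib.robotparser`. The sample rules, the `ExampleBot` name, and the example.com URLs are illustrative assumptions, and this parser is only one interpretation of the format; real crawlers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file contents: one group for all bots plus a sitemap line.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# "ExampleBot" is a made-up crawler name; it falls back to the * group.
blog_allowed = parser.can_fetch("ExampleBot", "https://example.com/blog/post")
admin_allowed = parser.can_fetch("ExampleBot", "https://example.com/admin/users")
delay = parser.crawl_delay("ExampleBot")   # requires Python 3.6+
sitemaps = parser.site_maps()              # requires Python 3.8+

print(blog_allowed, admin_allowed, delay, sitemaps)
```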
Rules are prefix matches on URL paths, not regular expressions. RFC 9309 and the major search engines recognize only two special characters: `*` matches any run of characters and `$` anchors the end of the path. `Disallow: /` means “do not fetch anything on this host.” That is correct for a staging mirror; it is catastrophic on production if you forgot to swap files during deploy.
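A rough sketch of how path matching behaves, assuming the `*` and `$` special characters described in RFC 9309. The `rule_matches` helper is hypothetical and skips percent-encoding normalization, so treat it as an illustration rather than a reference matcher.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Plain patterns are prefix matches. '*' matches any run of characters
    and a trailing '$' anchors the end of the path. Callers should skip
    empty patterns, since an empty Disallow value blocks nothing.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, which gives prefix matching.
    return re.match(regex, path) is not None

print(rule_matches("/", "/anything/at/all"))       # deny-all pattern hits every path
print(rule_matches("/admin", "/administrator"))    # prefix match, maybe wider than intended
print(rule_matches("/*.pdf$", "/files/report.pdf"))
```

The second call is the one that surprises people: `/admin` without a trailing slash also covers `/administrator`.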
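Google may still index a URL that is disallowed if other sites link to it: robots blocks crawling, not necessarily listing. For sensitive content you need auth, `noindex`, or both, not robots alone. A crawler can only see `noindex` on a page it is allowed to fetch, so do not disallow a URL you are relying on `noindex` to remove.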
Mistakes we keep seeing
Copy-pasting a “block everything” template from an old project is the classic. So is leaving `Disallow: /api/` on a site where your entire app routes through `/api/` because of a framework quirk. Another favorite: two `User-agent: *` groups with conflicting rules. RFC 9309 says crawlers should merge them, but some parsers honor only the first, so the result depends on who is reading the file.
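Two of these mistakes are easy to catch mechanically. The following is a minimal lint sketch, assuming Python 3.9+; `lint_robots` and its warning strings are hypothetical, and the group parsing is deliberately simplified.

```python
def lint_robots(text: str) -> list[str]:
    """Flag a blanket deny-all and duplicate 'User-agent: *' groups."""
    warnings = []
    star_groups = 0
    current_agents = []
    in_rules = False  # True once the current group has seen a rule line

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            if value == "*":
                star_groups += 1
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in current_agents:
                warnings.append("blanket 'Disallow: /' applies to all crawlers")

    if star_groups > 1:
        warnings.append("multiple 'User-agent: *' groups; parsers may disagree")
    return warnings

bad = "User-agent: *\nDisallow: /\n\nUser-agent: *\nDisallow: /tmp/\n"
for warning in lint_robots(bad):
    print(warning)
```

Running a check like this in CI against the production artifact is one way to stop a staging file from shipping.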
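Trailing spaces and wrong line endings rarely matter, but typos in `User-agent` names do: a misspelled name matches no crawler, and `User-agent: Googlebot` applies only to that bot. A blanket `*` group is what most people want for global defaults. Keep in mind that a crawler with its own named group follows that group and ignores `*` entirely, so repeat any shared rules there. For compliant crawlers the order of groups in the file does not matter.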
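Group selection can be checked with the standard library parser. This is a sketch with made-up file contents and a made-up `ExampleBot`; it shows a named group replacing the `*` group for that bot, and a misspelled name silently matching nothing.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_text: str, user_agent: str, url: str) -> bool:
    """Parse robots_text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

# Googlebot has its own group, so the /internal/ rule under * does not reach it.
GROUPED = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /internal/
"""

# "Googelbot" is a deliberate typo: no real crawler matches this group.
TYPO = """\
User-agent: Googelbot
Disallow: /private/
"""

googlebot_internal = allowed(GROUPED, "Googlebot", "https://example.com/internal/report")
other_internal = allowed(GROUPED, "ExampleBot", "https://example.com/internal/report")
googlebot_private = allowed(TYPO, "Googlebot", "https://example.com/private/page")

print(googlebot_internal, other_internal, googlebot_private)
```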
Forgetting the `Sitemap:` line does not block indexing, but it slows discovery of new URLs. After a redesign, we always regenerate sitemap and robots together so Search Console stops guessing.
Build robots.txt deliberately
Start from intent: allow all public marketing pages, disallow admin paths, staging hosts, and raw export endpoints. Write that down before touching syntax.
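Once the intent is written down, it can be kept as data and rendered into the file. The sketch below assumes Python 3.9+ and a single `*` group; `build_robots` and the paths are hypothetical and are not how any particular generator works.

```python
def build_robots(disallow: list[str], sitemaps: list[str], user_agent: str = "*") -> str:
    """Render one User-agent group plus Sitemap lines from a list of intents."""
    for path in disallow:
        if not path.startswith("/"):
            raise ValueError(f"robots.txt paths must start with '/': {path!r}")

    lines = [f"User-agent: {user_agent}"]
    if disallow:
        lines += [f"Disallow: {path}" for path in disallow]
    else:
        lines.append("Disallow:")  # empty value: nothing is disallowed
    if sitemaps:
        lines.append("")
        lines += [f"Sitemap: {url}" for url in sitemaps]
    return "\n".join(lines) + "\n"

# Example intent: public pages open, admin and raw exports closed.
print(build_robots(["/admin/", "/export/"], ["https://example.com/sitemap.xml"]))
```

Keeping the path list in version control next to the generated file makes the diff on each deploy easy to review.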
The Robots.txt Generator on DroidXP outputs standards-style groups with presets (allow all, disallow all, common private paths) plus custom lines. Everything runs locally — paste into your deploy artifact, diff in git, ship.
After deploy, verify with Search Console’s robots.txt report (which replaced the older robots.txt tester) and a real `curl https://yoursite.com/robots.txt`. We have caught CDN caches serving an old deny-all file days after the repo was fixed.
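The post-deploy check can also be scripted. This is a minimal sketch using only the standard library; `fetch_robots`, `blocks_everything`, and the `robots-check/0.1` user agent string are assumptions made for illustration.

```python
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

def fetch_robots(base_url: str, timeout: float = 10.0) -> str:
    """Download /robots.txt from the live host, as a crawler would see it."""
    request = Request(
        base_url.rstrip("/") + "/robots.txt",
        headers={"User-Agent": "robots-check/0.1"},  # hypothetical UA string
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

def blocks_everything(robots_text: str, user_agent: str = "*") -> bool:
    """Return True if user_agent is not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(user_agent, "/")
```

A deploy step could call `blocks_everything(fetch_robots("https://yoursite.com"))` and fail the pipeline when it returns `True`. Because the request goes through the CDN, it sees the file that is actually being served rather than the one in the repo.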
robots.txt in a larger SEO habit
Pair robots with an XML sitemap and sensible canonical tags. Robots tells crawlers where not to spend budget; sitemaps highlight what you want discovered. They solve different problems.
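Because the two files solve different problems, they can contradict each other: a sitemap may advertise URLs that robots.txt forbids. A small consistency check, sketched here with the standard library and made-up sample data, lists those conflicts. `blocked_sitemap_urls` is a hypothetical helper, and it handles a plain `urlset` only, not sitemap index files.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by the sitemaps.org XML schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def blocked_sitemap_urls(robots_text: str, sitemap_xml: str, user_agent: str = "*") -> list:
    """Return sitemap URLs that robots_text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

robots = "User-agent: *\nDisallow: /admin/\n"
sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/admin/login</loc></url>"
    "</urlset>"
)
conflicts = blocked_sitemap_urls(robots, sitemap)
print(conflicts)
```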
When you migrate domains, update robots and sitemap on both hosts during the redirect window. Old host deny-all plus forgotten 301s is a recipe for a quiet quarter.
Treat robots.txt like firewall config: small file, high impact, deserves a checklist on every launch.