How to Block Pages With robots.txt

Blocking pages with robots.txt requires exact syntax and careful placement, and common mistakes, such as blocking crawling when the real goal is to block indexing, often backfire. This guide shows the correct directives, testing methods, and when to use alternatives like noindex meta tags instead.

What robots.txt Actually Controls

A robots.txt file tells crawlers which URLs they're allowed to request. It doesn't remove pages from search results or prevent indexing. If Google already indexed a page before you added a Disallow rule, that URL can remain in the index, typically as a bare link with little or no description, sometimes for months. This confuses site owners who expect immediate removal.

The file must sit at the domain root (example.com/robots.txt) because crawlers request only that location and ignore copies in subdirectories. Parameters and query strings in the file path don't work. If you run a subdomain like blog.example.com, that subdomain needs its own robots.txt at blog.example.com/robots.txt.

robots.txt is plain text, case-sensitive on the URL paths, and publicly visible to anyone. Blocking something here doesn't hide it; it just tells compliant bots not to fetch it. Malicious scrapers ignore the file entirely.

Syntax and Directive Order

Each block starts with a User-agent line naming the bot, followed by Disallow or Allow lines that specify paths. Google's crawler is Googlebot; Bingbot covers Microsoft search. An asterisk means all bots. Directives apply only to the User-agent block above them, so structure matters.

Disallow: /admin/ blocks everything under that folder. Disallow: /search?query= blocks URLs starting with that parameter string. Major crawlers such as Googlebot and Bingbot accept wildcard asterisks mid-path, like /category/*/temp/, but robots.txt does not support regular expressions, so keep patterns simple: simpler rules are easier to debug.

Allow lines create exceptions inside broader Disallow rules.
If you block /reports/ but want /reports/public/ crawled, place Allow: /reports/public/ before the Disallow line in the same block. Google applies the most specific (longest) matching rule regardless of order, but listing the exception first also works for simpler parsers that stop at the first match. Leaving Disallow blank (just the colon) means allow everything, which is rarely useful but valid.

Comment lines start with a hash. Blank lines separate blocks. Trailing slashes matter because rules are prefix matches: /admin blocks every path that begins with those characters, including /admin.html and /administrator/, while /admin/ blocks only the directory and everything under it.

Common Blocking Scenarios and Correct Patterns

Blocking staging or test folders is straightforward: Disallow: /staging/ and Disallow: /dev/ prevent crawlers from wasting budget on incomplete content. For internal search results, Disallow: /search covers URLs like /search?q=keyword, keeping infinite parameter combinations out of the crawl queue.

E-commerce filter URLs, such as /shop?color=red&size=large, create massive duplicate content. Disallow: /*? blocks all query strings sitewide, but that's too aggressive if you rely on parameters for tracking or product pages. Instead, point filtered URLs at the main listing with rel=canonical (Google retired the URL Parameters tool in Search Console in 2022), and reserve robots.txt for truly junk patterns.

PDF admin docs or login portals are worth blocking when they contain sensitive workflows but aren't confidential enough to require authentication. Disallow: /admin-docs/ works, but pair it with password protection or IP whitelisting for real security. Blocking crawl doesn't guarantee privacy.

Canadian bilingual sites sometimes serve duplicate English and French content under /en/ and /fr/ paths. Don't block one language version in robots.txt. Use hreflang tags and canonical signals instead, so Google knows both are legitimate and target different audiences.

When robots.txt Is the Wrong Tool

If you need a page removed from search results, robots.txt alone won't do it. A noindex meta tag in the HTML head or an X-Robots-Tag HTTP header tells Google not to index the page, but only once the page has been crawled.
For complete removal, combine noindex with allowing the crawl so Google can read the directive.

Blocking crawl with Disallow prevents Google from seeing the noindex tag, which means already-indexed pages stay indexed. This is the most common mistake. The correct sequence: add noindex to the page, allow crawling in robots.txt, wait for Google to recrawl and drop the URL, then optionally block the crawl later.

Password-protected pages or content behind authentication don't need robots.txt blocking because bots can't access them anyway. Overusing Disallow here just adds clutter to the file.

Some CMSs auto-generate robots.txt or handle crawl directives through plugins. Check whether editing the file manually conflicts with those systems, especially on WordPress or Shopify where plugin settings might override your changes.

Testing and Deployment Without Breaking Crawl Budget

Test a draft against sample URLs before you publish. Google retired the standalone robots.txt Tester in Search Console in late 2023; the robots.txt report that replaced it shows the file Google last fetched and any parsing problems, and the URL Inspection tool reports whether a live URL is blocked. To check an unpublished draft, run it through a robots.txt validator or parser library with a handful of sample URLs. This catches typos like missing slashes or incorrect wildcard placement before they block your homepage by accident.

After deploying the file, monitor crawl stats in Search Console. A sudden drop in crawled pages might mean you blocked something critical. Check the Page indexing report (formerly Coverage) for newly excluded URLs and compare against your intended blocks. If high-value pages disappear, revert the change and retest.

Changes take effect when Google next fetches the file, and Google generally caches robots.txt for up to 24 hours, so the old rules can linger for a while. If you accidentally block the whole site with Disallow: /, fix it fast and request re-indexing for key pages through the URL Inspection tool.

For large sites, stage robots.txt changes on a development domain first. Test with a small set of paths, verify crawl behavior over a few days, then roll out to production.
This avoids sitewide crawl disasters during high-traffic periods or product launches.

Balancing Crawl Efficiency and Indexation Goals

Blocking low-value pages preserves crawl budget for content that drives traffic. Infinite calendar pages, session ID parameters, or printer-friendly versions waste bot resources without adding search visibility. Identify these through server logs or analytics, then block the patterns that consume crawl volume without producing conversions.

Avoid blocking entire categories unless you're certain they have no search value. A /news/archive/ section might look like clutter, but older articles often rank for long-tail queries. Audit traffic and rankings before removing access.

Some Canadian sites block U.S.-focused subfolders when they only serve Canadian customers, but this throws away potential referral traffic or backlinks from cross-border searches. Consider allowing crawl and using geo-targeting signals such as hreflang, or rel=canonical, to consolidate ranking signals instead.

Regularly review your robots.txt as the site evolves. New features, CMS updates, or marketing campaigns can introduce URL patterns you didn't anticipate. Quarterly audits catch accidental blocks before they compound into lost rankings or revenue.

Frequently asked questions

Does blocking a page in robots.txt remove it from Google search results?

No. robots.txt only stops crawling; it doesn't remove pages already in the index. To deindex a page, add a noindex meta tag or X-Robots-Tag header and allow crawling so Google can see the directive. Disallowing crawl on an indexed page can actually prevent Google from reading the noindex tag, leaving the URL visible in search.

Can I block specific search engines while allowing others?

Yes. Use separate User-agent blocks for each bot. For example, User-agent: Googlebot followed by Disallow: /private/ blocks Google from that folder, while User-agent: Bingbot with Allow: / permits Bing.
Bots only obey directives under their own User-agent heading, so you can set different rules per crawler.

What happens if I accidentally block my whole site with Disallow: /?

Crawling stops once Google picks up the new file, and over time your pages can lose their descriptions and slip or drop from search results as Google respects the block. Fix the file as soon as you notice, then use Google Search Console's URL Inspection tool to request re-indexing for critical pages. Rankings may take days to weeks to fully recover depending on crawl frequency.

How do I block URLs with parameters like query strings?

Use Disallow: /*? to block all URLs containing a question mark, which covers any parameter. For more precision, target specific parameters like Disallow: /search?q= or Disallow: /*sessionid=. Be cautious with wildcard rules: test thoroughly to avoid blocking legitimate product or category pages that use parameters.

Do I need a separate robots.txt for a subdomain?

Yes. Each subdomain is treated as a separate site and requires its own robots.txt at the root. If you run shop.example.com, place the file at shop.example.com/robots.txt. The main domain's robots.txt at example.com/robots.txt has no effect on the subdomain.

Can robots.txt protect sensitive information from appearing in search?

No. It's not a security measure. Anyone can read your robots.txt file, and malicious bots ignore it. For truly sensitive content, use password protection, IP whitelisting, or authentication. robots.txt is only for managing cooperative search engine crawlers, not for hiding confidential data.
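
Checking a draft file against sample URLs, as recommended in the testing section, can be partly automated. The following is a minimal sketch, assuming Python's standard-library urllib.robotparser; the draft file, the paths in it, and the check() helper are hypothetical and exist only for illustration. This parser applies the first matching rule in file order and treats paths as plain prefixes, so it does not reproduce Google's wildcard or longest-match behavior. Treat it as a smoke test, not a replica of Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft file; every path here is an illustrative assumption.
DRAFT = """\
# Generic rules for all crawlers
User-agent: *
Allow: /reports/public/
Disallow: /reports/
Disallow: /admin/
Disallow: /staging/

# Googlebot gets its own group and ignores the generic one
User-agent: Googlebot
Disallow: /private/
"""


def check(draft, agent, urls):
    """Return {url: True if `agent` may fetch it under `draft`}."""
    parser = RobotFileParser()
    parser.parse(draft.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}


SAMPLE_URLS = [
    "https://example.com/",
    "https://example.com/reports/public/q3.html",
    "https://example.com/reports/internal.html",
    "https://example.com/admin/login",
    "https://example.com/private/notes.html",
]

if __name__ == "__main__":
    for agent in ("Bingbot", "Googlebot"):
        print(agent)
        for url, allowed in check(DRAFT, agent, SAMPLE_URLS).items():
            print("  ", "allowed" if allowed else "blocked", url)
```

Under this parser, a crawler that falls back to the generic group (Bingbot here) is kept out of /reports/internal.html but can still fetch /reports/public/q3.html, because the Allow line comes first. Googlebot has its own group, so only the /private/ rule applies to it and the generic /admin/ rule does not. Including the homepage in the sample list is a cheap guard against an accidental sitewide block.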
