Building an XML sitemap for a large site (10,000+ URLs) requires choosing the right generation method, managing file-size limits, prioritizing crawl budget, and maintaining automation as your inventory grows. This tutorial walks through tooling decisions, index sitemap structure, and the tradeoffs between plugin solutions and custom builds.
XML sitemaps have hard limits: 50,000 URLs per file and a maximum uncompressed size of 50 MB (52,428,800 bytes). A typical URL entry with lastmod, changefreq, and priority tags consumes roughly 150-250 bytes, so a full 50,000-URL file usually lands around 7.5-12.5 MB and the URL count is the limit you hit first. The size ceiling becomes the binding constraint only when entries are much larger, for example when verbose paths, long parameter strings, or extra per-URL annotations push the average toward 1 KB. Large ecommerce catalogs, news archives, multilingual sites, and any domain with pagination or faceted navigation quickly exceed these boundaries. The solution is a sitemap index file, a parent XML document that lists multiple child sitemaps, but implementing this correctly means rethinking your generation workflow. You cannot rely on real-time, on-publish hooks when you have tens of thousands of URLs; the server overhead becomes prohibitive. Instead, treat sitemap generation as a batch job that runs on a schedule, writes static XML files to disk or object storage, and separates high-priority pages from deep archive content. This architectural shift is the core difference between small-site and large-site sitemap strategies.
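The two limits can be checked with quick arithmetic before choosing a file layout. The sketch below is only an estimate: the function name and the average entry size you pass in are assumptions, and real entry sizes should be measured from your own output.

```python
import math

MAX_URLS_PER_FILE = 50_000
MAX_BYTES_PER_FILE = 52_428_800  # 50 MB, measured on the uncompressed file


def child_files_needed(url_count: int, avg_entry_bytes: int) -> int:
    """Estimate how many child sitemaps are needed to respect both limits.

    avg_entry_bytes is an assumed average size of one <url> entry.
    """
    by_count = math.ceil(url_count / MAX_URLS_PER_FILE)
    by_size = math.ceil(url_count * avg_entry_bytes / MAX_BYTES_PER_FILE)
    # Whichever limit demands more files is the binding one.
    return max(by_count, by_size)
```

With 120,000 URLs at about 250 bytes each the URL count dominates and three files are needed; with 50,000 URLs at about 2,000 bytes each the size limit dominates and two files are needed.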
WordPress plugins like Yoast SEO and Rank Math can generate sitemap indexes automatically, and they handle most sites comfortably up to about 20,000 URLs. Beyond that threshold, you may start seeing PHP memory exhaustion, gateway timeouts, and incomplete XML output. If your hosting environment limits execution time to 30 or 60 seconds, the plugin may silently fail partway through a large post-type query. For sites above 30,000 URLs, consider generating sitemaps outside the CMS request cycle: use a Node.js or Python script that queries your database directly, batches URLs into 10,000-entry child sitemaps, and writes them to a CDN or static file host. Alternatively, headless CMSs like Contentful or Sanity expose content APIs that a custom sitemap script or route can query. The tradeoff is maintenance burden: custom scripts require version control, error handling, and monitoring, but you gain full control over query optimization, URL filtering, and priority assignment. In the Canadian context, bilingual sites often benefit from separate fr-CA and en-CA child sitemaps, which keeps hreflang annotations easier to audit and tilts the decision toward custom builds.
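The batching step of such a custom script can be sketched in Python as follows. This is a minimal illustration, not a full generator: `chunked` and `build_urlset` are hypothetical helper names, and the input is assumed to be (URL, lastmod) pairs already fetched from your database.

```python
from itertools import islice
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_urlset(entries):
    """Render one child sitemap from (loc, lastmod) pairs.

    lastmod is assumed to be a W3C date string such as "2024-01-31", or None.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<urlset xmlns="{SITEMAP_NS}">',
    ]
    for loc, lastmod in entries:
        lines.append("  <url>")
        # escape() turns &, < and > into entities so the XML stays well-formed.
        lines.append(f"    <loc>{escape(loc)}</loc>")
        if lastmod:
            lines.append(f"    <lastmod>{lastmod}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"
```

A caller would loop over `chunked(rows, 10_000)` and write each `build_urlset(batch)` result to its own numbered file. Building the XML as strings keeps memory use proportional to one batch rather than the whole inventory.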
Organize child sitemaps by content type and update frequency, not arbitrary chunks. A typical large-site index might include sitemap-posts.xml, sitemap-products.xml, sitemap-categories.xml, and sitemap-pages.xml. If your product catalog alone exceeds 50,000 SKUs, split it further by category or brand: sitemap-products-electronics.xml, sitemap-products-apparel.xml. This segmentation lets you regenerate each file on its own schedule (daily for products, weekly for blog posts, monthly for static pages) and makes it easier to troubleshoot coverage issues in Search Console. You can still set changefreq and priority per file for other consumers, but Google's documentation states that it ignores both values and relies on lastmod when it is consistently accurate, so invest in correct lastmod dates first. Google's sitemap guidance also indicates that the position of a URL within a file does not affect how it is used, so order entries for your own convenience (for example, newest lastmod first) rather than as a crawl signal. Compress each child sitemap with gzip to reduce bandwidth, but note that the 50 MB limit applies to the uncompressed file, so compression does not buy extra headroom. Most servers can serve .xml.gz files as static files; reference the .gz URL in your index.
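Grouping by content type and compressing the result might look like the sketch below. The function names and the (content_type, url) input shape are assumptions for illustration. The size check runs on the uncompressed bytes, because that is how the sitemap protocol defines the 50 MB limit.

```python
import gzip
from collections import defaultdict

MAX_BYTES_PER_FILE = 52_428_800  # limit applies to the uncompressed file


def group_by_type(records):
    """Map (content_type, url) pairs to {child sitemap filename: [urls]}."""
    groups = defaultdict(list)
    for content_type, url in records:
        groups[f"sitemap-{content_type}.xml"].append(url)
    return dict(groups)


def gzip_sitemap(xml_text: str) -> bytes:
    """Return gzip-compressed bytes, refusing oversized uncompressed input."""
    data = xml_text.encode("utf-8")
    if len(data) > MAX_BYTES_PER_FILE:
        raise ValueError("uncompressed sitemap exceeds 50 MB; split it further")
    return gzip.compress(data)
```

The compressed bytes can then be written to a path such as sitemap-products.xml.gz on disk or in object storage.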
Real-time sitemap updates, meaning regenerating the entire index every time a post is published, create dangerous load spikes on large sites. Instead, schedule a cron job or task runner (GitHub Actions, AWS Lambda, Google Cloud Functions) to rebuild sitemaps once daily during off-peak hours. Query your database for all public URLs, filter out noindex pages and canonicalized duplicates, batch the results into child sitemaps, and write them to disk with accurate lastmod values. Google deprecated its sitemap ping endpoint in 2023, so do not rely on pinging; submit the index once in Search Console, reference it with a Sitemap line in robots.txt, and let Google refetch it on its own schedule. If you use a CDN like Cloudflare or Fastly, upload the XML files to your origin and purge the cache so the updated versions propagate immediately. For very large inventories with frequent stock changes, consider a hybrid model: regenerate product sitemaps nightly, but leave evergreen content sitemaps on a weekly cycle. This reduces processing time and keeps your server responsive. Monitor execution logs for memory spikes, timeout errors, or incomplete writes, and set up alerting so you know immediately if a scheduled job fails.
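The filtering step of the nightly job can be sketched as a small generator. The `Page` record and its field names are hypothetical stand-ins for whatever your database query returns.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Page:
    # Hypothetical row shape; adapt the fields to your own schema.
    url: str
    is_public: bool
    noindex: bool
    canonical_url: Optional[str]  # None means the page is self-canonical
    lastmod: str


def indexable(pages: Iterable[Page]) -> Iterator[Page]:
    """Yield only public, indexable, self-canonical pages, each URL once."""
    seen = set()
    for page in pages:
        if not page.is_public or page.noindex:
            continue
        # Skip pages whose canonical points at a different URL.
        if page.canonical_url and page.canonical_url != page.url:
            continue
        if page.url in seen:
            continue
        seen.add(page.url)
        yield page
```

Because it is a generator, the filter can sit between a streaming database cursor and the batching step without holding the full inventory in memory, apart from the set of URLs already seen.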
Pagination URLs and faceted navigation paths can balloon your sitemap into hundreds of thousands of near-duplicate entries. Decide early whether to include these. If your paginated series has a view-all page that the component pages canonicalize to, list only the view-all URL; otherwise list only the first page and leave page two onward out of the sitemap. Google announced in 2019 that it no longer uses rel=next/prev as an indexing signal, so do not rely on that markup alone to consolidate a series. For faceted filters such as color, size, and price range, include only the most valuable combinations (typically top-level category + one high-demand filter) and noindex the rest. Avoid listing URL parameters that don't change content meaningfully: session IDs, tracking tokens, sort orders. Use your robots.txt to disallow parameter-heavy paths if they create infinite crawl loops, and keep those URLs out of your sitemap entirely. Do not apply both controls to the same URL, though: a noindex directive is only seen when the page can be crawled, so a robots.txt disallow hides it. This discipline prevents your sitemap from misleading Googlebot into wasting crawl budget on low-value pages and keeps file sizes manageable.
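These exclusion rules can be applied mechanically before URLs reach the sitemap. In this sketch the ignored parameter names and the `page` pagination parameter are assumptions; substitute the ones your site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; replace with your site's own.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
PAGINATION_PARAM = "page"


def normalize(url: str) -> str:
    """Drop ignored query parameters and any fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def is_deep_pagination(url: str) -> bool:
    """True for page two onward of a paginated series."""
    for key, value in parse_qsl(urlsplit(url).query):
        if key == PAGINATION_PARAM and value.isdigit() and int(value) > 1:
            return True
    return False


def sitemap_candidates(urls):
    """Return normalized, de-duplicated URLs, skipping deep pagination."""
    seen, result = set(), []
    for url in urls:
        if is_deep_pagination(url):
            continue
        clean = normalize(url)
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result
```

Normalizing before de-duplicating means a sorted or session-tagged variant collapses into the clean URL instead of appearing as a second entry.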
Before submitting your sitemap index to Google Search Console, validate the XML syntax with an online checker or a local linter to catch malformed tags or encoding issues. Submit the index file URL, usually yoursite.com/sitemap_index.xml, in the Sitemaps report for the property that matches your canonical protocol and host (HTTPS vs HTTP, www vs non-www); a domain property covers all of these variants at once. Google will fetch the index, discover the child sitemaps, and begin crawling the listed URLs. Check back after a few days to review the Sitemaps report and the Page indexing report (formerly Coverage): together they show how many URLs were discovered and indexed, plus reasons for exclusion such as 404s or noindex conflicts. Set up a monthly review cadence to compare submitted versus indexed URL counts and investigate discrepancies. If you add new content types or launch a subdirectory, append a new child sitemap to your index; Google picks it up the next time it fetches the index. Large sites evolve constantly, so treat your sitemap as a living document that reflects your current URL inventory and priority structure, not a one-time build artifact.
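A local pre-submission check can be as small as the sketch below, which uses only the Python standard library. It tests well-formedness, the root element, the protocol limits, and absolute URLs; it is not a full schema validation, and `check_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000
MAX_BYTES = 52_428_800  # 50 MB uncompressed


def check_sitemap(xml_text: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    data = xml_text.encode("utf-8")
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    problems = []
    valid_roots = {f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex"}
    if root.tag not in valid_roots:
        problems.append(f"unexpected root element: {root.tag}")
    if len(data) > MAX_BYTES:
        problems.append("file exceeds 50 MB uncompressed")

    # Collect every <loc> in the sitemap namespace.
    locs = [el.text.strip() for el in root.iter(f"{{{SITEMAP_NS}}}loc") if el.text]
    if len(locs) > MAX_ENTRIES:
        problems.append(f"{len(locs)} entries exceeds the 50,000 limit")
    for loc in locs:
        if not loc.startswith(("http://", "https://")):
            problems.append(f"not an absolute URL: {loc}")
    return problems
```

Running this over every generated file at the end of the batch job, and failing the job on a non-empty result, catches truncated writes before Google fetches them.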
The official limit is 50,000 URLs or 50 MB uncompressed per file. In practice, once you approach 30,000-40,000 URLs, server memory and execution-time constraints often force you to split into multiple child sitemaps managed by an index file, even if you are technically under the cap.
Be selective. Exclude low-value pages like paginated archives beyond page one, noindexed utility pages, and parameter-heavy faceted URLs. Focus on canonical, indexable URLs that you actively want Googlebot to discover and rank. Quality over quantity improves crawl budget efficiency.
A sitemap index is a parent XML file that lists the locations of multiple child sitemap files. Each child sitemap contains up to 50,000 URLs. The index itself does not list individual URLs—it only points to other sitemaps. You submit the index to Search Console, and Googlebot fetches the children automatically.
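As an illustration, an index file can be rendered with a few lines of Python. The `build_index` helper and the example URLs are hypothetical; the element names (sitemapindex, sitemap, loc, lastmod) come from the sitemap protocol.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_index(children):
    """Render a sitemap index from (child_sitemap_url, lastmod) pairs.

    lastmod is assumed to be a W3C date string, or None to omit the tag.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<sitemapindex xmlns="{SITEMAP_NS}">',
    ]
    for loc, lastmod in children:
        lines.append("  <sitemap>")
        lines.append(f"    <loc>{escape(loc)}</loc>")
        if lastmod:
            lines.append(f"    <lastmod>{lastmod}</lastmod>")
        lines.append("  </sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_index([
        ("https://example.com/sitemap-posts.xml.gz", "2024-01-31"),
        ("https://example.com/sitemap-products.xml.gz", "2024-02-01"),
    ]))
```

Each child's lastmod here should reflect when that child file last changed, which lets crawlers skip children that have not been updated.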
Daily regeneration works well for sites with frequent inventory changes or fresh content. Run the job during off-peak hours to avoid server load. For mostly static catalogs, weekly regeneration is sufficient. Avoid regenerating on every publish event; batch updates are more efficient at scale.
Yoast SEO and Rank Math can start struggling above 20,000-30,000 URLs due to PHP memory limits and execution timeouts. For sites beyond that size, a custom script that queries the database directly and writes static XML files is more reliable and gives you finer control over batching and priority.
For bilingual sites, separate child sitemaps for en-CA and fr-CA content make it easier to manage hreflang tags and troubleshoot indexing issues in Search Console. List each language sitemap in your index file, and ensure every URL entry includes the correct hreflang annotations pointing to its translated counterparts and to itself.
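In a sitemap, hreflang annotations are expressed as xhtml:link elements inside each url entry, with the xhtml namespace declared on the urlset. The sketch below shows one way to render them; the helper names and example URLs are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def url_entry(loc, alternates):
    """Render one <url> with hreflang alternates.

    alternates maps an hreflang code to a URL and should include the page itself.
    """
    lines = ["  <url>", f"    <loc>{escape(loc)}</loc>"]
    for lang, href in sorted(alternates.items()):
        # quoteattr() adds the surrounding quotes and escapes attribute values.
        lines.append(
            f'    <xhtml:link rel="alternate" hreflang={quoteattr(lang)} href={quoteattr(href)}/>'
        )
    lines.append("  </url>")
    return "\n".join(lines)


def build_urlset_with_alternates(pages):
    """pages: iterable of (loc, alternates) pairs for one language's child sitemap."""
    body = "\n".join(url_entry(loc, alts) for loc, alts in pages)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}" xmlns:xhtml="{XHTML_NS}">\n'
        f"{body}\n"
        "</urlset>\n"
    )
```

The same alternates mapping is reused for the en-CA entry in one child sitemap and the fr-CA entry in the other, which keeps the annotations reciprocal.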