XML sitemaps seem straightforward until configuration errors tank your indexing. This breakdown covers the structural, scope, and protocol-level mistakes that cause crawl waste, missed pages, and indexing delays—with specific fixes for each.
The most frequent XML sitemap mistake is listing URLs you've explicitly told search engines not to index. This includes pages with noindex meta tags, canonical tags pointing elsewhere, or 301/302 redirects. When Google fetches your sitemap and encounters a redirect, it has to follow the chain—burning crawl budget before reaching the actual destination. If the destination isn't in the sitemap, you've created confusion about which version matters.
Canonical mismatches are especially common in e-commerce and bilingual Canadian sites. A Toronto retailer might generate separate URLs for /en/product and /fr/product, then canonicalize both to /en/product, but include all variants in the sitemap. Google sees mixed signals: the sitemap says both matter, the canonical says otherwise. The fix is simple but requires discipline—audit your sitemap against your canonical declarations and meta robots tags. Only the canonical, indexable version belongs in the XML file. For redirects, remove the old URL entirely and list only the final destination.
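The audit described above can be sketched as a small filter over crawl data. This is a minimal illustration, not a definitive implementation: the PageRecord fields and the example.com URLs are hypothetical, and it assumes you already have status, canonical, and noindex information for each URL from your own crawler or CMS export.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """One crawled URL. All field names are hypothetical."""
    url: str
    status: int                # HTTP status observed by your crawler
    canonical: Optional[str]   # href of rel="canonical", or None if absent
    noindex: bool              # True if meta robots or X-Robots-Tag says noindex


def sitemap_eligible(page: PageRecord) -> bool:
    """Keep only 200-status, indexable pages that are their own canonical."""
    if page.status != 200:
        return False  # redirects and errors never belong in the sitemap
    if page.noindex:
        return False
    if page.canonical is not None and page.canonical != page.url:
        return False  # canonicalized elsewhere
    return True


def filter_sitemap_urls(pages: list[PageRecord]) -> list[str]:
    return [p.url for p in pages if sitemap_eligible(p)]


pages = [
    PageRecord("https://example.com/en/product", 200, "https://example.com/en/product", False),
    PageRecord("https://example.com/fr/product", 200, "https://example.com/en/product", False),
    PageRecord("https://example.com/old-product", 301, None, False),
    PageRecord("https://example.com/cart", 200, None, True),
]
print(filter_sitemap_urls(pages))
```

In this sample only /en/product survives: the /fr variant is canonicalized away, the old URL redirects, and the cart page is noindexed.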
Faceted navigation generates combinatorial URL explosion. A Vancouver furniture site with filters for color, material, price, and room type can produce thousands of URLs from a few dozen actual products. Most of those filtered views deliver thin, duplicate content that shouldn't be indexed—yet they often auto-populate into XML sitemaps because the CMS treats every routable URL as sitemap-worthy.
The core issue is failing to distinguish between crawlable-for-discovery and worthy-of-indexing. You might want Googlebot to crawl filtered pages to find product links, but you don't want those filter combinations competing in search results. The solution involves two layers: use robots meta noindex on filter pages, and exclude them from your sitemap. For platforms like Shopify or WordPress with automatic sitemap generation, you'll need a plugin or custom script to enforce exclusion rules. On larger sites, implement a whitelist approach—only add URLs that match approved patterns, rather than blacklisting the infinite tail of parameter combinations.
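A whitelist of this kind might look like the following sketch. The path patterns are assumptions chosen for illustration (loosely modelled on a /products and /collections URL scheme), and the rule of rejecting any URL with a query string is one possible policy, not a universal one.

```python
import re
from urllib.parse import urlsplit

# Hypothetical approved path patterns; adjust to your own URL scheme.
ALLOWED_PATTERNS = [
    re.compile(r"/"),
    re.compile(r"/products/[a-z0-9-]+"),
    re.compile(r"/collections/[a-z0-9-]+"),
]


def is_allowed(url: str) -> bool:
    """Return True only if the URL matches an approved pattern."""
    parts = urlsplit(url)
    if parts.query:
        # Assumed policy: filter and sort parameters never enter the sitemap.
        return False
    return any(p.fullmatch(parts.path) for p in ALLOWED_PATTERNS)


candidates = [
    "https://example.com/products/oak-table",
    "https://example.com/collections/sofas?color=grey&material=linen",
    "https://example.com/search",
]
print([u for u in candidates if is_allowed(u)])
```

Because unmatched URLs are rejected by default, a new filter parameter added later stays out of the sitemap without anyone having to update a blocklist.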
The lastmod tag tells search engines when a page meaningfully changed, helping them prioritize recrawls. Many sites either omit it entirely, update it on every build regardless of content changes, or set it to a CMS save-date that reflects trivial edits. All three patterns degrade the signal's value.
When lastmod updates for every deploy—common with static site generators or headless CMS builds—Google learns the timestamp is noise and starts ignoring it. Conversely, if you never update lastmod even when rewriting a page, search engines have no hint to recrawl. The correct approach: update lastmod only when content, schema markup, or on-page elements actually change. If you're running a news site or blog, timestamp accuracy matters for freshness ranking. If you're a Montreal B2B agency with mostly static service pages, lastmod changes should be rare and deliberate. Some CMSs tie lastmod to the last editorial save; override this with logic that excludes minor tweaks like fixing a typo or adjusting CSS classes. For pages that never change, either omit lastmod or set it once and freeze it.
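One way to decouple lastmod from deploys and CMS saves is to fingerprint the meaningful content and change the date only when the fingerprint changes. The sketch below assumes you can extract the main content of each page as a string and store the previous hash and date somewhere; the function names are hypothetical.

```python
import hashlib
from datetime import date
from typing import Optional, Tuple


def content_fingerprint(main_content: str) -> str:
    """Hash the content after collapsing whitespace, so reformatting alone is ignored."""
    normalized = " ".join(main_content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_lastmod(
    stored_hash: Optional[str],
    stored_lastmod: Optional[str],
    main_content: str,
    today: date,
) -> Tuple[str, str]:
    """Return (hash, lastmod); lastmod moves only when the content hash changes."""
    new_hash = content_fingerprint(main_content)
    if stored_hash == new_hash and stored_lastmod is not None:
        return stored_hash, stored_lastmod  # rebuild or re-save, nothing changed
    return new_hash, today.isoformat()


h, lastmod = next_lastmod(None, None, "Our services:  web design", date(2024, 1, 10))
# A later deploy with identical content keeps the original date.
h, lastmod = next_lastmod(h, lastmod, "Our services: web design\n", date(2024, 2, 1))
print(lastmod)
```

A hash like this filters out rebuilds and whitespace changes, but a one-character typo fix still changes it. Excluding edits that small would need an extra threshold or an editorial flag on top.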
Sites with more than fifty thousand URLs require a sitemap index, an XML file that points to multiple child sitemaps. The sitemap protocol caps each file at fifty thousand URLs and 50 MB uncompressed. Common errors include exceeding either limit, referencing child sitemaps that return 404s or sit behind authentication walls, and failing to declare the index itself in robots.txt.
Another pitfall: creating a sitemap index but also submitting individual child sitemaps directly in Search Console. This creates ambiguity—Google may crawl both, or prioritize one and ignore updates to the other. The clean structure is robots.txt points to the index, the index points to children, and you submit only the index URL in Search Console. Each child sitemap should be logically segmented—by content type, language, or update frequency—not arbitrarily chunked by URL count. A bilingual Canadian site might split /en and /fr into separate child sitemaps, making it easier to track indexing per language. Ensure every child sitemap URL is publicly accessible and returns a proper XML content-type header. If you're using gzip compression for child sitemaps, verify the index references the .gz URLs and that your server sends the correct encoding headers.
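The index-and-children structure can be sketched with the standard library. The file naming scheme (sitemap-en-1.xml and so on) and the example.com base URL are assumptions for illustration. Segmentation by language comes first, and splitting by URL count only happens when a segment exceeds the limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemap protocol


def plan_child_sitemaps(urls_by_segment, base_url, max_urls=MAX_URLS_PER_FILE):
    """Map each child sitemap URL to the page URLs it should contain."""
    plan = {}
    for segment, urls in urls_by_segment.items():
        for n, start in enumerate(range(0, len(urls), max_urls), start=1):
            # Hypothetical naming scheme: sitemap-<segment>-<n>.xml
            plan[f"{base_url}/sitemap-{segment}-{n}.xml"] = urls[start:start + max_urls]
    return plan


def build_index(child_urls):
    """Serialize a sitemap index that references every child sitemap."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child in child_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = child
    return ET.tostring(root, encoding="unicode")


plan = plan_child_sitemaps(
    {"en": ["https://example.com/en/services"], "fr": ["https://example.com/fr/services"]},
    "https://example.com",
)
print(build_index(list(plan)))
```

Only the index URL produced here would go into robots.txt and Search Console. The child files are reached through it.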
Blocking your sitemap file itself via robots.txt is surprisingly common, especially after site migrations or staging-to-production deploys where overly restrictive rules carry over. If robots.txt disallows /sitemap.xml, search engines can't fetch it—period. Similarly, placing your sitemap behind authentication or a paywall makes it invisible to crawlers.
Another variant: the sitemap is accessible, but it lists URLs that are disallowed in robots.txt. Google will fetch the sitemap, see the URLs, then refuse to crawl them because of the disallow rule. This defeats the sitemap's purpose entirely. The fix requires cross-checking your robots.txt disallow directives against every URL in your sitemap. For Canadian bilingual sites, ensure both /en and /fr sitemaps are allowed, and that language-specific subdirectories aren't accidentally blocked. After any robots.txt change, check Search Console's robots.txt report (which replaced the retired robots.txt tester) to confirm Google can fetch and parse the file, then resubmit your sitemap. If you're using IP whitelisting or bot-detection services, make sure Googlebot's user-agent and IP ranges are exempted. Blocking verified Googlebot is an own-goal that kills indexing.
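The cross-check between robots.txt and the sitemap can be approximated with Python's standard urllib.robotparser. Treat this as a rough first pass: that module uses simple prefix matching and does not implement Google's wildcard (* and $) handling or longest-match precedence, so results can differ from Googlebot's for complex rule sets. The robots.txt content and URLs below are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemap_urls(robots_txt: str, urls, user_agent: str = "Googlebot"):
    """Return the sitemap URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt left over from a staging environment.
robots_txt = """\
User-agent: *
Disallow: /fr/
Disallow: /cart
"""

sitemap_urls = [
    "https://example.com/en/services",
    "https://example.com/fr/services",
    "https://example.com/sitemap.xml",
]
print(blocked_sitemap_urls(robots_txt, sitemap_urls))
```

Including the sitemap's own URL in the list, as above, also catches the case where the sitemap file itself is disallowed.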
The priority and changefreq tags are widely misunderstood and largely ignored by Google, yet many SEOs still agonize over them. Priority is relative within your own site—setting every page to 1.0 renders the signal meaningless. Changefreq is a hint, not a directive, and Google has publicly stated it mostly disregards it in favor of actual observed change patterns.
Despite their limited impact, misusing these tags can signal poor sitemap hygiene. If your homepage is set to priority 0.3 and a random archive page to 1.0, it suggests either misconfiguration or lack of editorial oversight. The pragmatic approach: set priority based on genuine site hierarchy—homepage and key landing pages at 0.8-1.0, supporting pages at 0.5-0.7, ancillary content lower—and leave changefreq out entirely unless you have a legitimate daily or weekly update cycle, like a news site or event calendar. For most Canadian business sites, omitting changefreq is cleaner than guessing. Focus your energy on accurate lastmod timestamps and correct URL inclusion instead of tweaking priority decimals that move no needles.
Standard XML sitemaps list URLs, but Google supports extended schemas for images, videos, and news. Many sites with rich media never implement these extensions, leaving discoverability on the table. An Ottawa real estate site with property galleries should use image sitemap tags to declare each photo's URL; Google has deprecated the older caption, title, and license tags, so the image location is the part that counts. A video production company in Toronto should mark up video duration, thumbnail, and upload date.
The mistake isn't just omission—it's also sloppy implementation. Common errors include pointing to thumbnail URLs instead of full-resolution images, listing videos that autoplay or require Flash, and failing to update video sitemaps when content is removed. For image sitemaps, each image must be crawlable—if it's blocked by robots.txt or embedded via JavaScript without fallback, the sitemap entry is wasted. For video, the content URL must be the actual playable file or an embed URL that Googlebot can parse. If you're using a CDN, ensure the media URLs in the sitemap resolve correctly and aren't geofenced. Bilingual Canadian sites should also consider whether to duplicate media entries across language-specific sitemaps or centralize them—there's no single right answer, but inconsistency across /en and /fr sitemaps will cause indexing asymmetry.
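As a rough sketch of the image extension, the following builds a urlset in which each page lists its full-resolution image URLs. It emits only image:loc, on the understanding that Google's current documentation treats the other image tags as deprecated. The property and CDN URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG = "http://www.google.com/schemas/sitemap-image/1.1"

# Register prefixes so the output uses a default namespace and "image:".
ET.register_namespace("", SM)
ET.register_namespace("image", IMG)


def build_image_sitemap(images_by_page):
    """Build a sitemap whose <url> entries carry <image:image> children."""
    urlset = ET.Element(f"{{{SM}}}urlset")
    for page_url, image_urls in images_by_page.items():
        url_el = ET.SubElement(urlset, f"{{{SM}}}url")
        ET.SubElement(url_el, f"{{{SM}}}loc").text = page_url
        for image_url in image_urls:
            image_el = ET.SubElement(url_el, f"{{{IMG}}}image")
            # Point at the full-resolution file, not a thumbnail.
            ET.SubElement(image_el, f"{{{IMG}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_image_sitemap({
    "https://example.com/en/listings/123": [
        "https://cdn.example.com/full/kitchen.jpg",
        "https://cdn.example.com/full/garden.jpg",
    ],
})
print(xml_text)
```

Generating the image list from the same source that renders the gallery is one way to keep the sitemap in step when photos are added or removed.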
Noindex pages should not be listed in your sitemap. Including them creates a conflict: you're telling Google the page matters enough to list, but not to index. This wastes crawl budget and confuses priority signals. Only include URLs you want indexed. If a page needs to be crawlable for link equity but not indexed, omit it from the sitemap and rely on internal links for discovery instead.
Update lastmod only when the page content, structured data, or key on-page elements actually change. Updating lastmod on every build or deploy trains Google to ignore the signal. For static service pages, lastmod might change once a year or less. For blogs or news content, it should reflect genuine editorial updates. Trivial changes like CSS tweaks or analytics code updates don't warrant a lastmod refresh.
Priority and changefreq do not affect rankings. Google's sitemap documentation states that it ignores both values, so setting every page to priority 1.0 accomplishes nothing. These tags won't hurt you if used sensibly, but obsessing over them is wasted effort. Focus on accurate URL inclusion, proper canonicals, and correct lastmod values instead.
Multiple sitemaps are fine, and they're recommended for large or segmented sites. Use a sitemap index to reference all child sitemaps, then submit only the index URL in Search Console. Submitting both the index and individual children creates ambiguity. Segmenting by content type, language, or update frequency makes indexing trends easier to track and helps you isolate issues faster.
When a sitemap lists redirected or broken URLs, Google will attempt to crawl them, discover the redirect or error, and either follow the chain or mark the URL as problematic. This wastes crawl budget and dilutes the sitemap's credibility. Over time, repeated errors can cause Google to trust your sitemap less. Clean out dead URLs and redirects immediately, and automate validation to catch new errors before they accumulate.
For bilingual sites, separate child sitemaps per language, referenced by a single sitemap index, are the cleaner option. This lets you track indexing per language in Search Console and makes it easier to isolate issues in /en versus /fr. Ensure each child sitemap includes only URLs for that language, and that hreflang tags are correctly declared on the pages themselves. A single mixed sitemap works functionally but obscures language-level indexing data.