Search engines run automated crawlers (Googlebot, Bingbot, etc.) that fetch web pages by following links. Well-behaved crawlers obey robots.txt and respect crawl-rate limits. AI engines run additional crawlers of their own (GPTBot, PerplexityBot, etc.) for model training and live retrieval.
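To make the robots.txt handshake concrete, here is a minimal sketch of the check a polite crawler performs before fetching, using Python's standard-library parser. The crawler names are real; example.com and the page path are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's live robots.txt.
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

page = "https://example.com/blog/some-post"
for agent in ("Googlebot", "Bingbot", "GPTBot", "PerplexityBot"):
    verdict = "may fetch" if robots.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict}")

# A declared Crawl-delay is exposed too (None when the site sets none).
print(robots.crawl_delay("Googlebot"))
```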
Crawled pages are parsed, processed, and stored in a massive index: essentially a structured database of every page the engine knows about, with metadata on content, links, and quality signals. Indexing is not the same as ranking; being indexed is only the entry ticket to ranking.
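At its core the index is an inverted index: a map from each term to the documents that contain it, alongside per-document metadata. A toy sketch follows; the field names are illustrative, not any engine's real schema.

```python
from collections import defaultdict

# term -> set of document ids that contain it
index: dict[str, set[int]] = defaultdict(set)
# doc id -> metadata about the stored page
metadata: dict[int, dict] = {}

def add_document(doc_id: int, url: str, text: str) -> None:
    """Process a crawled page and store it in the index."""
    metadata[doc_id] = {"url": url, "length": len(text.split())}
    for term in text.lower().split():
        index[term].add(doc_id)

add_document(1, "https://example.com/a", "generative engine optimization guide")
add_document(2, "https://example.com/b", "search engine crawling basics")

print(index["engine"])  # {1, 2}: both pages are indexed for this term
```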
When a user submits a query, the engine retrieves matching pages from the index and ranks them using hundreds of signals — relevance, quality, authority, freshness, user behavior. The top results are returned in milliseconds.
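A toy ranker shows the shape of that step. Real engines combine hundreds of signals with learned, proprietary weights; the three signals and hand-picked weights here are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    relevance: float  # query-document match (e.g. a BM25-style score), 0..1
    authority: float  # link-based quality signal, 0..1
    freshness: float  # recency decay, 0..1

def score(p: Page) -> float:
    # Illustrative weights; actual weights are learned and proprietary.
    return 0.6 * p.relevance + 0.3 * p.authority + 0.1 * p.freshness

candidates = [
    Page("https://example.com/old-authority", relevance=0.7, authority=0.9, freshness=0.2),
    Page("https://example.com/fresh-match", relevance=0.9, authority=0.4, freshness=0.9),
]
for p in sorted(candidates, key=score, reverse=True):
    print(f"{score(p):.2f}  {p.url}")
```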
AI engines add a fourth step: when a query triggers an AI answer, the engine retrieves multiple passages from the index, synthesizes them into a coherent answer, and presents it with citations. This is the layer Generative Engine Optimization targets.
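A sketch of that layer, under heavy assumptions: retrieve_passages() stands in for index retrieval, and synthesize() is a placeholder for whatever model the engine actually runs. The point is the shape of the pipeline: retrieve, synthesize, cite.

```python
def retrieve_passages(query: str, k: int = 3) -> list[dict]:
    # Stand-in for index retrieval; a real engine pulls ranked passages.
    return [
        {"url": "https://example.com/a", "text": "Passage one..."},
        {"url": "https://example.com/b", "text": "Passage two..."},
    ][:k]

def synthesize(query: str, passages: list[dict]) -> str:
    # Placeholder: a production system prompts an LLM with the passages
    # and requires it to ground each claim in one of them.
    return " ".join(f"{p['text']} [{i}]" for i, p in enumerate(passages, 1))

def answer(query: str) -> str:
    passages = retrieve_passages(query)
    body = synthesize(query, passages)
    sources = "\n".join(f"[{i}] {p['url']}" for i, p in enumerate(passages, 1))
    return f"{body}\n\nSources:\n{sources}"

print(answer("what is generative engine optimization"))
```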
Crawling is fetching the page; indexing is processing and storing it. A page can be crawled but never indexed, a common outcome when its quality signals are weak.
Indexing takes anywhere from hours to weeks, depending on site authority and how the URL is submitted. Submitting through Google Search Console is the fastest route.
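One concrete piece of that workflow is the sitemap you submit in Search Console. A minimal generator sketch, with placeholder URLs and dates:

```python
import xml.etree.ElementTree as ET

# Build a minimal sitemap.xml for submission in Search Console.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)

for loc, lastmod in [
    ("https://example.com/", "2025-01-15"),
    ("https://example.com/guide", "2025-01-10"),
]:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```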
Most AI engines layer on top of existing search indexes: Google's for Gemini, Bing's for ChatGPT and Copilot. Perplexity and a few others maintain their own.
Blocking AI crawlers does not directly hurt search visibility: training-only bots such as GPTBot are separate from the search crawlers, so you can block them if your business model requires it.
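A sketch of such a policy, verified with Python's standard-library parser: it blocks GPTBot site-wide while leaving search crawlers untouched.

```python
from urllib.robotparser import RobotFileParser

# Block a training-only crawler; allow everyone else.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

for agent in ("Googlebot", "Bingbot", "GPTBot"):
    allowed = rp.can_fetch(agent, "https://example.com/")
    print(agent, "->", "allowed" if allowed else "blocked")
# Googlebot -> allowed, Bingbot -> allowed, GPTBot -> blocked
```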