The numbers in the headline are real — drawn from an audit we ran on a client site. They had 83 pages indexed by Google. When we looked at what those pages actually were, 53 of them were tags, categories, author archives, old draft previews, parameter URLs, and thin content pages that had accumulated over years of WordPress use.
Google was dutifully crawling and indexing all 83 pages. But it was spreading its attention — and its assessment of the site's overall quality — across pages that offered nothing. The 30 pages that actually mattered were being diluted by the 53 that didn't.
This is one of the most common and most underestimated problems in technical SEO. And it's almost entirely fixable.
What crawl budget actually means
Google doesn't have unlimited time to spend on your website. It allocates a crawl budget to each site, a rough limit on how many pages it will crawl in a given period, based on factors like how popular and frequently updated your pages are, how quickly your server responds, and the perceived quality of the site.
If you have 83 pages indexed and 53 of them are low-quality, Google is spending a large portion of its crawl budget on pages that return a poor signal. Over time, this depresses Google's overall assessment of your site's quality — and can actively suppress the pages you actually want to rank.
Every low-quality page Google indexes on your site costs you twice — once in wasted crawl budget, and again in the diluted quality signal it sends about your domain as a whole.
The six types of pages that cause the most damage
Tag and category archives: WordPress creates a new indexed page for every tag and category you use. A site with 50 tags has 50 thin archive pages that typically contain nothing Google hasn't already seen on your actual posts. Almost always worth noindexing.
Author archives: If your site has one author, your author archive page is a duplicate of your blog index. If it has multiple authors, the pages are typically thin. WordPress generates these automatically; they need to be turned off in your SEO plugin.
Paginated archives: Page 2, page 3, page 4 of your blog listing. These are mostly duplicate content: the same posts at a different URL. Google doesn't need to index every page of your pagination. It's usually safe to noindex everything past page 1, provided the posts themselves stay reachable through your sitemap and internal links.
Thin, outdated posts: Posts from five or eight years ago that are 200 words long and no longer accurate. These actively drag down your content quality signal. The decision is binary: update and expand them into something genuinely useful, or noindex and eventually delete them.
Parameter URLs: URLs with query strings (?sort=price, ?ref=newsletter, ?session=xyz). These are often duplicate pages under different URLs. If Google is indexing these, you have a canonical tag problem. Fix the canonicals and disallow the parameter patterns in robots.txt; Search Console's old URL Parameters tool has been retired, so robots.txt is now the right place for this.
Internal search results: Some sites accidentally allow their internal search result pages to be indexed. These are completely dynamic, have no consistent content, and are worthless to Google. Check for /search?q= or /results/ patterns in your Search Console coverage report. A short script can triage most of these page types in bulk; see the sketch below.
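This triage is tedious by hand. Below is a minimal Python sketch that buckets a list of indexed URLs (exported from Search Console or a crawler) into the categories above. It assumes WordPress-default URL patterns; the example.com.au URLs and the patterns themselves are placeholders to adapt to your own site. Thin, outdated posts can't be caught by URL alone, so those still need a manual content review.

```python
# Hypothetical triage script: bucket indexed URLs into the junk-page
# categories described above. Patterns assume WordPress defaults and
# should be adjusted to your own site's URL structure.
import re
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

RULES = [
    ("tag/category archive", re.compile(r"/(tag|category)/")),
    ("author archive",       re.compile(r"/author/")),
    ("paginated archive",    re.compile(r"/page/\d+/?$")),
    ("internal search",      re.compile(r"/(search|results)\b")),
]

def classify(url: str) -> str:
    parsed = urlparse(url)
    for label, pattern in RULES:
        if pattern.search(parsed.path):
            return label
    if parse_qs(parsed.query):  # any query string => parameter URL
        return "parameter URL"
    return "probably fine"     # thin/outdated posts need manual review

if __name__ == "__main__":
    # Paste in an export from Search Console or a crawler.
    urls = [
        "https://example.com.au/tag/news/",
        "https://example.com.au/author/admin/",
        "https://example.com.au/blog/page/3/",
        "https://example.com.au/products/?sort=price",
        "https://example.com.au/services/",
    ]
    buckets = defaultdict(list)
    for url in urls:
        buckets[classify(url)].append(url)
    for label, members in sorted(buckets.items()):
        print(f"{label}: {len(members)}")
        for member in members:
            print(f"  {member}")
```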
How to find your own problem pages
Step 1 — Google Search Console coverage report
Go to Search Console and open the Pages report (previously called Coverage). Look at the "Indexed" list and scan for page types that shouldn't be there — tags, categories, author pages, paginated archives, parameter URLs. This gives you the full picture of what Google has actually indexed.
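If you want to spot-check specific URLs programmatically, the Search Console URL Inspection API can report each URL's coverage state. Here's a hedged sketch using the google-api-python-client bindings; the service-account file name and the example property and URLs are placeholders, and the API is quota-limited, so treat it as a spot-check tool rather than a full-site audit.

```python
# Sketch: ask Google how it sees individual URLs via the Search Console
# URL Inspection API. Assumes a service account with access to the
# property; "service-account.json" and the URLs are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)
service = build("searchconsole", "v1", credentials=creds)

def coverage_state(url: str, site: str) -> str:
    body = {"inspectionUrl": url, "siteUrl": site}
    result = service.urlInspection().index().inspect(body=body).execute()
    status = result["inspectionResult"]["indexStatusResult"]
    return status.get("coverageState", "unknown")

for url in ["https://example.com.au/tag/news/"]:
    print(url, "->", coverage_state(url, "https://example.com.au/"))
```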
Step 2 — site: search
Search Google for site:yourdomain.com.au and look at the results. Are tag pages appearing? Author pages? Old content you've forgotten about? This is a quick manual way to see what Google considers indexable on your site.
Step 3 — crawl your site
Tools like Screaming Frog (free up to 500 URLs) will crawl your site and show you every URL it finds, including ones that may not appear in Search Console. This often surfaces hidden parameter URLs, old redirects, and legacy pages you didn't know were still accessible.
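If you'd rather script a basic crawl yourself, here's a rough sketch of the same idea: a same-domain crawler built on requests and BeautifulSoup (the library choices are my assumption; any HTTP and HTML-parsing pair would do), capped at 500 URLs to mirror the free Screaming Frog limit. Its output feeds straight into the triage script above.

```python
# Minimal same-domain crawler sketch (a rough stand-in for a dedicated
# crawler). Collects every internal URL it can reach; start URL is a
# placeholder. Add a delay between requests for politeness on live sites.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, limit: int = 500) -> set[str]:
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # strip fragments
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

for url in sorted(crawl("https://example.com.au/")):
    print(url)
```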
How to fix it — the three tools
Noindex meta tag: Adding a noindex tag to a page tells Google to stop indexing it. This is the gentlest option — the page still exists and is accessible to users, but Google removes it from the index over time. In RankMath or Yoast, this is a one-click toggle on any page or post type.
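Once you've toggled noindex, it's worth verifying the directive is actually being served, since plugins and caching layers can disagree. Here's a small sketch (the URL is a placeholder) that checks both places Google looks: the X-Robots-Tag response header and the meta robots tag.

```python
# Sketch: confirm a page actually serves a noindex directive, either via
# the X-Robots-Tag HTTP header or the meta robots tag in the HTML.
import requests
from bs4 import BeautifulSoup

def is_noindexed(url: str) -> bool:
    resp = requests.get(url, timeout=10)
    # Header form: X-Robots-Tag: noindex
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        return True
    # Meta form: <meta name="robots" content="noindex, follow">
    soup = BeautifulSoup(resp.text, "html.parser")
    meta = soup.find("meta", attrs={"name": "robots"})
    return bool(meta and "noindex" in meta.get("content", "").lower())

print(is_noindexed("https://example.com.au/tag/news/"))
```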
Robots.txt disallow: Tells Google not to even crawl certain URL patterns. Use this for parameter URLs and other patterns rather than individual pages. Important: a robots.txt disallow doesn't remove already-indexed pages, and it also stops Google from recrawling them and seeing a noindex tag. For pages already in the index, apply noindex first, wait for Google to recrawl and drop them, and only then add the disallow rule.
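Before shipping a disallow rule, you can sanity-check what it blocks with Python's standard-library robots.txt parser. One caveat, noted in the comments: the stdlib parser doesn't implement Google's wildcard extensions, so verify wildcard rules in Search Console's robots.txt report instead. The domain and test URLs are placeholders.

```python
# Sketch: check what robots.txt actually blocks for Googlebot before
# relying on it. Caveat: urllib.robotparser follows the original spec and
# does NOT understand Google's * and $ wildcard extensions, so test
# literal path prefixes here and check wildcard rules in Search Console.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com.au/robots.txt")
rp.read()

for url in [
    "https://example.com.au/search?q=widgets",      # should be blocked
    "https://example.com.au/products/?sort=price",  # should be blocked
    "https://example.com.au/services/",             # should stay crawlable
]:
    verdict = "ALLOWED" if rp.can_fetch("Googlebot", url) else "BLOCKED"
    print(f"{verdict:7} {url}")
```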
301 redirect or delete: For content that genuinely has no value and you don't want to keep, delete the page and redirect the URL to the most relevant page on your site. Don't just delete without redirecting: that leaves 404 errors, which waste crawl budget too and throw away any link equity the old URL had earned.
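If you're redirecting a batch of deleted URLs, a quick script can confirm each one actually returns a single clean 301 to the page you intended, rather than a 404, a 302, or a redirect chain. The old-to-new mapping below is a placeholder for your own list.

```python
# Sketch: verify that deleted URLs 301 to their intended targets.
# Anything that isn't a single clean 301 gets flagged for review.
# Note: Location may be relative on some servers; normalise with
# urllib.parse.urljoin if your comparisons fail on relative paths.
import requests

REDIRECTS = {
    "https://example.com.au/old-post/": "https://example.com.au/new-guide/",
}

for old, expected in REDIRECTS.items():
    resp = requests.get(old, allow_redirects=False, timeout=10)
    target = resp.headers.get("Location", "")
    ok = resp.status_code == 301 and target == expected
    flag = "OK " if ok else "BAD"
    print(f"{flag} {resp.status_code} {old} -> {target or '(no redirect)'}")
```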
On the client site mentioned at the start of this article, after noindexing 53 low-quality pages and cleaning up the crawl architecture, core page rankings improved measurably within 8 weeks. Google was now spending its crawl budget on 30 genuinely useful pages — and its quality assessment of the domain improved accordingly.
What this has to do with AI visibility
The same quality signal that affects Google rankings also affects AI citation. If your site is bloated with low-quality, duplicated, or thin content, AI systems building a model of your domain will pick up on that. A leaner, better-structured site with clear topical focus is more likely to be seen as an authoritative source worth citing.
Indexing hygiene isn't just an SEO technical fix — it's foundational to the kind of domain quality that both Google and AI systems reward.
Book an online session and we'll walk through your Search Console data, identify the pages dragging down your domain quality, and give you a clear action list.