The numbers in the headline are real — drawn from an audit we ran on a client site. They had 83 pages indexed by Google. When we looked at what those pages actually were, 53 of them were tags, categories, author archives, old draft previews, parameter URLs, and thin content pages that had accumulated over years of WordPress use.

Google was dutifully crawling and indexing all 83 pages. But it was spreading its attention — and its assessment of the site's overall quality — across pages that offered nothing. The 30 pages that actually mattered were being diluted by the 53 that didn't.

This is one of the most common and most underestimated problems in technical SEO. And it's almost entirely fixable.

What crawl budget actually means

Google doesn't have unlimited time to spend on your website. It allocates a crawl budget to each site — a rough limit on how many pages it will crawl in a given period — based on how much crawling your server can handle and how much of your content Google considers worth crawling, which is shaped by the perceived quality of your site.

If you have 83 pages indexed and 53 of them are low-quality, Google is spending a large portion of its crawl budget on pages that return a poor signal. Over time, this depresses Google's overall assessment of your site's quality — and can actively suppress the pages you actually want to rank.

The core problem

Every low-quality page Google indexes on your site costs you twice — once in wasted crawl budget, and again in the diluted quality signal it sends about your domain as a whole.

The typical breakdown of a bloated site

Typical page breakdown — 83-page WordPress site

Tag pages              18 pages
Category archives      12 pages
Thin / old content     15 pages
Author archives         8 pages
Core pages (useful)    30 pages

The six types of pages that cause the most damage

Should be noindexed
Tag & category archives

WordPress creates a new indexed page for every tag and category you use. A site with 50 tags has 50 thin archive pages that typically contain nothing Google hasn't already seen on your actual posts. Almost always worth noindexing.

Should be noindexed
Author archive pages

If your site has one author, your author archive page is a duplicate of your blog index. If it has multiple authors, the pages are typically thin. WordPress generates these automatically — they need to be turned off in your SEO plugin.

Should be noindexed or removed
Paginated archives

Page 2, page 3, page 4 of your blog listing. These pages carry little standalone value: the same listing sliced across different URLs. Google doesn't need to index every page of your pagination, and it's usually safe to noindex everything past page 1. Keep the deeper pages crawlable, though, so Google can still discover the posts they link to.

Needs review
Old thin content

Posts from five or eight years ago that are 200 words long and no longer accurate. These actively drag down your content quality signal. The decision is: update and expand them to something genuinely useful, or noindex and eventually delete them.

Should be excluded
Parameter URLs

URLs with query strings — ?sort=price, ?ref=newsletter, ?session=xyz. These are often duplicate pages with different URLs. If Google is indexing these, you have a canonical tag problem. Fix the canonicals, and disallow the parameter patterns in robots.txt if Google keeps crawling them (Search Console's old URL Parameters tool was retired in 2022, so it can no longer be done there).
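A correct canonical on a parameter URL points back at the clean version of the page. A sketch, with illustrative URLs:

```html
<!-- In the <head> of https://example.com.au/shop/?sort=price -->
<link rel="canonical" href="https://example.com.au/shop/" />
```

With the canonical in place, Google consolidates signals from the parameter variants onto the clean URL.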

Needs review
Search result pages

Some sites accidentally allow their internal search result pages to be indexed. These are completely dynamic, have no consistent content, and are worthless to Google. Check for /search?q= or /results/ patterns in your Search Console coverage report.

How to find your own problem pages

Step 1 — Google Search Console coverage report

Go to Search Console and open the Pages report (previously called Coverage). Look at the "Indexed" list and scan for page types that shouldn't be there — tags, categories, author pages, paginated archives, parameter URLs. This gives you the full picture of what Google has actually indexed.

Step 2 — site: search

Search Google for site:yourdomain.com.au and look at the results. Are tag pages appearing? Author pages? Old content you've forgotten about? This is a quick manual way to see what Google considers indexable on your site.

Step 3 — crawl your site

Tools like Screaming Frog (free up to 500 URLs) will crawl your site and show you every URL it finds, including ones that may not appear in Search Console. This often surfaces hidden parameter URLs, old redirects, and legacy pages you didn't know were still accessible.
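Once you have an exported URL list (from Screaming Frog or Search Console), a short script can bucket the URLs into the problem categories above. A minimal sketch — the patterns assume default WordPress URL structures and the example URLs are illustrative, so adjust both for your site:

```python
from urllib.parse import urlparse, parse_qs

def classify(url: str) -> str:
    """Rough heuristics for default WordPress URL structures."""
    parsed = urlparse(url)
    path = parsed.path
    if parse_qs(parsed.query):
        return "parameter URL"
    if "/tag/" in path:
        return "tag archive"
    if "/category/" in path:
        return "category archive"
    if "/author/" in path:
        return "author archive"
    if "/page/" in path:
        return "paginated archive"
    if "/search/" in path:
        return "search results"
    return "core page"

# Example run over a handful of exported URLs
urls = [
    "https://example.com.au/services/",
    "https://example.com.au/tag/seo/",
    "https://example.com.au/blog/page/3/",
    "https://example.com.au/shop/?sort=price",
]
for url in urls:
    print(f"{classify(url):18} {url}")
```

Counting the buckets gives you the same kind of breakdown shown in the table earlier, and a ready-made worklist for the fixes below.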

How to fix it — the three tools

Noindex meta tag: Adding a noindex tag to a page tells Google to stop indexing it. This is the gentlest option — the page still exists and is accessible to users, but Google drops it from the index once it recrawls the page. In Rank Math or Yoast, this is a one-click toggle on any page or post type.
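Under the hood, the plugin toggle simply emits a robots meta tag in the page's head:

```html
<!-- Drop this page from the index, but keep following its links -->
<meta name="robots" content="noindex, follow" />
```

"noindex, follow" is the usual choice for archives: the page leaves the index, but Google still follows its links to your actual posts.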

Robots.txt disallow: Tells Google not to crawl certain URL patterns at all. Use this for parameter URLs and other patterns rather than individual pages. Important: disallowing in robots.txt doesn't remove pages that are already indexed, and because Googlebot can no longer fetch a disallowed page, it will never see a noindex tag on it. For pages already in the index, apply noindex first, wait for Google to recrawl and drop them, and only add the disallow after that.
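A sketch of robots.txt rules for the patterns discussed in this article — the paths are illustrative, so check them against your own URL structure before deploying:

```txt
User-agent: *
# Internal search result pages (WordPress uses ?s= for search)
Disallow: /*?s=
Disallow: /search/
# Tracking and sorting parameters that create duplicate URLs
Disallow: /*?ref=
Disallow: /*?sort=
```

Google honours the * wildcard in robots.txt, but not every crawler does; verify the patterns in Search Console's robots.txt report before relying on them.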

301 redirect or delete: For content that genuinely has no value and you don't want to keep, delete the page and redirect the URL to the most relevant page on your site. Don't just delete without redirecting — that creates 404 errors which waste crawl budget too.
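On an Apache host (typical for WordPress), a 301 can be a one-line rule in .htaccess — the paths here are illustrative:

```apache
# Permanently redirect a deleted thin post to the closest relevant live page
Redirect 301 /2017/old-thin-post/ /blog/
```

The Redirection plugin, or a return 301 rule on nginx, achieves the same thing if you can't edit .htaccess.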

Real result

On the client site mentioned at the start of this article, after noindexing 53 low-quality pages and cleaning up the crawl architecture, core page rankings improved measurably within 8 weeks. Google was now spending its crawl budget on 30 genuinely useful pages — and its quality assessment of the domain improved accordingly.

What this has to do with AI visibility

The same quality signal that affects Google rankings also affects AI citation. If your site is bloated with low-quality, duplicated, or thin content, AI systems building a model of your domain will pick up on that. A leaner, better-structured site with clear topical focus is more likely to be seen as an authoritative source worth citing.

Indexing hygiene isn't just an SEO technical fix — it's foundational to the kind of domain quality that both Google and AI systems reward.

Want us to audit your site?

Book an online session and we'll walk through your Search Console data, identify the pages dragging down your domain quality, and give you a clear action list.
