# 02 · Crawlability & Indexing

> **Target:** [Unstop.com](https://unstop.com)
> **Focus:** Site architecture, sitemap, `robots.txt`, canonicals, redirects, duplicate URLs

---

## Prompt used

> *Review the site structure, sitemap, robots.txt notes, canonicals, redirects, and duplicate URLs. Tell me what may block crawling or indexing and what should be fixed first.*

---

## Observed state (public signals)

| Area | Observation |
|---|---|
| `robots.txt` | Present at `/robots.txt`; references a `sitemap.xml`. Some `Disallow:` rules cover auth + internal paths, which is appropriate. |
| Sitemap(s) | Main sitemap exists but is not index-based (no split by type — opportunities, articles, companies). |
| Canonicals | Many listing URLs self-canonicalize even when carrying filter/sort parameters, risking index bloat. |
| URL patterns | `/hackathons`, `/internships`, `/competitions`, `/jobs` exist as category hubs but share many listings. |
| Redirects | Expired opportunities stay at their original URLs with "closed" status rather than redirecting or being `noindex`-ed. |
| Pagination | Uses `?page=` params; no visible `rel="next"`/`rel="prev"` markup (Google no longer uses it as an indexing signal, but crawlable pagination links remain important). |

---

## What is likely blocking or diluting indexing

### A. Index bloat from expired opportunities
Every closed hackathon / internship remains an indexable page. This:
- Splits link equity across thousands of low-engagement URLs
- May signal to Google that much of the site's content is short-lived
- Pushes evergreen pages (hubs, guides) deeper in the crawl tree
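
A minimal sketch of how the expiry policy from the prioritized fixes (deadline + 30 days, then `noindex, follow`) could be expressed in code. The function name, the `removed` flag, and the return shape are hypothetical and not taken from Unstop's codebase.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

# Grace period from the audit's fix list: deadline + 30 days.
GRACE_PERIOD = timedelta(days=30)


def indexing_directive(
    deadline: date, today: date, removed: bool = False
) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, robots meta content) for an opportunity page.

    `removed` is a hypothetical flag meaning the listing was deleted outright.
    """
    if removed:
        # Deleted listings: answer 410 Gone; there is no page to annotate.
        return 410, None
    if today > deadline + GRACE_PERIOD:
        # Closed for longer than the grace period: keep the page live,
        # but ask search engines to drop it from the index.
        return 200, "noindex, follow"
    # Open or recently closed: leave the page indexable.
    return 200, "index, follow"
```

For example, a listing with a deadline of 1 January stays indexable through 31 January and switches to `noindex, follow` from 1 February.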

### B. Parameter-driven duplication
Filters (`?location=`, `?duration=`, `?category=`) generate crawlable combinations. Without an explicit canonical to the clean URL and consistent parameter handling, Google spends crawl budget on near-duplicate pages instead of new listings.
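
One way to derive the canonical target for a filtered listing URL is to strip every parameter that does not define a distinct page. The sketch below assumes that only `page` deserves its own URL; the `KEEP_PARAMS` allow-list and the function name are illustrative, not part of the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only pagination yields a distinct page; filters and
# sort/view options fold back into the clean hub URL.
KEEP_PARAMS = {"page"}


def canonical_url(url: str) -> str:
    """Drop filter/sort parameters and the fragment from a listing URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in KEEP_PARAMS
    ]
    # An empty query string is omitted entirely by urlunsplit.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

The returned value is what would go into `<link rel="canonical">` on the filtered page, so `/internships?location=delhi&sort=new` would point back to `/internships`.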

### C. Sitemap without segmentation
A single monolithic `sitemap.xml` makes it hard to:
- Isolate priority sections (hubs, guides)
- Diagnose indexing issues per content type in Search Console
- Control submission freshness (opportunities change daily, blog rarely)

### D. Thin pages getting indexed
Category pages with fewer than ~5 live listings are still indexable and compete for rankings with near-empty content, sending weak signals for category intent.

---

## Prioritized fixes

| # | Fix | Why first | Effort |
|---|---|---|---|
| 1 | **Split sitemap** into per-type files (e.g. `sitemap-hubs.xml`, `sitemap-opportunities.xml`, `sitemap-articles.xml`) behind a sitemap index | Faster diagnosis + targeted freshness | Low |
| 2 | **Expired opportunities → `noindex, follow`** after the deadline + 30 days; keep page live for backlinks/history | Removes largest bloat source | Medium |
| 3 | **Self-canonical on clean URL only;** parameterized filter URLs canonicalize to base hub | Consolidates equity | Medium |
| 4 | **Block non-indexable parameters** (`?sort=`, `?view=`) in `robots.txt`; leave filter URLs that rely on a canonical tag crawlable, since blocked pages are never fetched | Crawl budget | Low |
| 5 | **Add `lastmod`** honestly per URL (not site-wide today's date — a frequent anti-pattern) | Signals real freshness | Low |
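| 6 | **410 for removed listings:** opportunities that are deleted outright (as opposed to closed pages kept live under fix 2) should return 410 rather than a soft 404 | Faster de-indexing | Low |
| 7 | **Internal search pages** (`/search?q=`) should serve `noindex` while still crawlable, then be `Disallow`-ed once they have dropped out of the index (a `noindex` on a blocked page is never seen) | Eliminates infinite crawl space | Low |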
| 8 | **Audit redirect chains** (often 3+ hops after category renames) and flatten to single 301s | Preserves link equity | Medium |
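
For fix 8, flattening can be done offline on an exported redirect map. The sketch below assumes the redirects are available as a simple source-to-target dictionary; the function name and the `max_hops` limit are illustrative.

```python
from typing import Dict


def flatten_redirects(redirects: Dict[str, str], max_hops: int = 10) -> Dict[str, str]:
    """Map every source path to its final destination (one hop each)."""
    flat = {}
    for source in redirects:
        target = redirects[source]
        seen = {source}
        hops = 1
        # Follow the chain until the target is no longer itself redirected.
        while target in redirects:
            if target in seen or hops >= max_hops:
                raise ValueError(
                    f"redirect loop or overly long chain starting at {source}"
                )
            seen.add(target)
            target = redirects[target]
            hops += 1
        flat[source] = target
    return flat
```

Given a chain `/a → /b → /c → /d`, every source ends up pointing straight at `/d`, which is the set of single 301s to deploy. Loops raise an error instead of being silently rewritten.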

---

## Suggested `robots.txt` pattern (illustrative)

```txt
User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?view=
Disallow: /*&view=
Disallow: /my/
Disallow: /account/
Allow: /

Sitemap: https://unstop.com/sitemap.xml
```
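
Wildcard rules are easy to get subtly wrong: a pattern such as `/*?sort=` only matches when `sort` directly follows the `?`, not when it appears later in the query string. The hand-rolled matcher below is a rough sketch of Google-style pattern matching (`*` wildcard, `$` end anchor) for testing patterns against sample URLs. It is not a full robots.txt parser and ignores Allow/Disallow precedence.

```python
import re


def rule_matches(pattern: str, path_and_query: str) -> bool:
    """Check one robots.txt path pattern against a path plus query string."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # Patterns always match from the start of the path.
    return re.match(regex, path_and_query) is not None
```

Running `rule_matches("/*?sort=", "/internships?location=delhi&sort=new")` returns `False`, which is why a companion `/*&sort=` rule is needed to cover parameters that are not first in the query string.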

---

## Suggested sitemap index (illustrative)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://unstop.com/sitemap-hubs.xml</loc></sitemap>
  <sitemap><loc>https://unstop.com/sitemap-opportunities.xml</loc></sitemap>
  <sitemap><loc>https://unstop.com/sitemap-articles.xml</loc></sitemap>
  <sitemap><loc>https://unstop.com/sitemap-companies.xml</loc></sitemap>
  <sitemap><loc>https://unstop.com/sitemap-colleges.xml</loc></sitemap>
</sitemapindex>
```
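
An index like this is usually generated rather than hand-written, so each child sitemap can carry an honest `lastmod`. A minimal sketch using Python's standard library follows; the function name and the dates in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url: str, children: dict) -> str:
    """Build a sitemap index.

    `children` maps a child sitemap file name to the date (ISO 8601)
    on which that file's contents last changed.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for filename, lastmod in children.items():
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/{filename}"
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")
```

Calling `build_sitemap_index("https://unstop.com", {"sitemap-hubs.xml": "2024-05-01"})` yields a `<sitemapindex>` with one `<sitemap>` entry. The `lastmod` value should come from the newest real change inside that child file, not from the time the index was generated.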

---

## Measurement

After rollout, track in Search Console:
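- **Page indexing report:** expect a *drop* in the "Indexed" count (a good thing: less bloat), with "Excluded by 'noindex' tag" rising as expired pages drop out
- **Crawl stats:** the share of crawl requests going to priority sections should rise; the Crawl Stats report is not split by sitemap, so confirm per section in server logs
- **Not-indexed reasons:** "Crawled – currently not indexed" should shrink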

---

### Further reading

- [Google — Manage your sitemap](https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview)
- [Google — Canonicalization](https://developers.google.com/search/docs/crawling-indexing/canonicalization/overview)
- [Ahrefs — Crawl budget](https://ahrefs.com/blog/crawl-budget/)

Back to: [Full Site Audit](./01-full-site-audit.md) · Next: [On-page SEO →](./03-on-page-seo.md)