Robots.txt is a file that tells search engine crawlers which pages they may and may not crawl. It is important for crawl budget optimization and for keeping crawlers away from non-public areas, although it is not a security measure.
What Is Robots.txt?
Location: yoursite.com/robots.txt
Purpose: Guide search engine crawlers
Format: Plain text file
Standard: Robots Exclusion Protocol
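Because the file always lives at the root of a host, its address can be derived from any page URL on that host. The helper below is a minimal sketch of that rule; `robots_txt_url` is a hypothetical name, and `yoursite.com` is the placeholder domain used throughout this article. Note that each subdomain has its own robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yoursite.com/blog/post?id=1"))
# https://yoursite.com/robots.txt
print(robots_txt_url("https://shop.yoursite.com/cart/"))
# https://shop.yoursite.com/robots.txt
```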
Basic Syntax
# Comment line
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://yoursite.com/sitemap.xml
Directives
| Directive | Function |
|---|---|
| User-agent | Target crawler |
| Disallow | Block path |
| Allow | Permit path (overrides a shorter, less specific Disallow) |
| Sitemap | Sitemap location |
| Crawl-delay | Wait between requests (nonstandard; Googlebot ignores it) |
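To see how a crawler might read these directives, here is a small sketch using Python's standard `urllib.robotparser` module. The rules mirror the basic syntax example; the `Crawl-delay: 5` line and the page paths are assumptions added for illustration. `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Sample rules based on the basic syntax example; yoursite.com is a placeholder.
ROBOTS_TXT = """\
# Comment line
User-agent: *
Disallow: /private/
Allow: /public/
Crawl-delay: 5
Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a generic crawler may fetch each URL.
private_ok = parser.can_fetch("*", "https://yoursite.com/private/report.html")
public_ok = parser.can_fetch("*", "https://yoursite.com/public/index.html")
delay = parser.crawl_delay("*")
sitemaps = parser.site_maps()

print(private_ok)  # False
print(public_ok)   # True
print(delay)       # 5
print(sitemaps)    # ['https://yoursite.com/sitemap.xml']
```

This parser is a convenient approximation. It does not understand `*` or `$` wildcards and applies rules in file order, so results can differ from what Googlebot does on more complex files.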
Robots.txt Examples
Basic (Allow All)
User-agent: *
Disallow:
Sitemap: https://yoursite.com/sitemap.xml
Block Specific Folder
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Sitemap: https://yoursite.com/sitemap.xml
Block All Crawlers
User-agent: *
Disallow: /
Different Rules per Bot
User-agent: Googlebot
Disallow: /nogoogle/
User-agent: Bingbot
Disallow: /nobing/
User-agent: *
Disallow: /private/
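One detail of per-bot rules is easy to miss: a crawler that finds a group with its own name follows only that group and does not also apply the `*` group. The sketch below checks this with Python's standard `urllib.robotparser`; the page paths and the `OtherBot` name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogoogle/

User-agent: Bingbot
Disallow: /nobing/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so only /nogoogle/ is off limits to it.
googlebot_nogoogle = parser.can_fetch("Googlebot", "/nogoogle/page")
googlebot_private = parser.can_fetch("Googlebot", "/private/page")
# A crawler without a named group falls back to the '*' group.
other_private = parser.can_fetch("OtherBot", "/private/page")

print(googlebot_nogoogle)  # False
print(googlebot_private)   # True
print(other_private)       # False
```

If `/private/` should be blocked for Googlebot and Bingbot as well, the `Disallow: /private/` line has to be repeated inside each of their groups.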
WordPress Standard
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
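The `Allow: /wp-admin/admin-ajax.php` line works because major crawlers pick the most specific (longest) matching rule, not the first one in the file. The function below is a simplified sketch of that precedence, assuming plain prefix rules without wildcards; `is_allowed` and the `(directive, prefix)` tuple format are hypothetical names chosen for this example.

```python
def is_allowed(rules, path):
    """Longest matching prefix wins; Allow wins when two matches tie in length."""
    best_length = -1
    allowed = True  # No matching rule means the path may be crawled.
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # An empty Disallow matches nothing.
        is_allow = directive == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


wordpress_rules = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(wordpress_rules, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(wordpress_rules, "/wp-admin/options.php"))     # False
print(is_allowed(wordpress_rules, "/blog/hello-world/"))        # True
```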
E-commerce
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Sitemap: https://yoursite.com/sitemap.xml
Pattern Matching
Wildcards
# Block all PDF files
User-agent: *
Disallow: /*.pdf$
# Block every URL that contains a query string
Disallow: /*?
# Block a specific parameter
Disallow: /*?ref=
End of URL ($)
# Only block .pdf files
Disallow: /*.pdf$
# This blocks /file.pdf
# But allows /file.pdf/page
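The two special characters can be understood by translating a rule into a regular expression: `*` becomes "any sequence of characters" and a trailing `$` anchors the end of the URL. The following is a rough sketch of that translation, not a full crawler implementation; `pattern_matches` is a hypothetical helper name.

```python
import re


def pattern_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern (with * and a trailing $) against a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where the pattern had '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # Rules always match from the start of the path, hence re.match.
    return re.match(regex, path) is not None


print(pattern_matches("/*.pdf$", "/file.pdf"))        # True
print(pattern_matches("/*.pdf$", "/file.pdf/page"))   # False
print(pattern_matches("/*.pdf", "/file.pdf/page"))    # True
print(pattern_matches("/*?ref=", "/page?ref=abc"))    # True
```

The third call shows why the `$` matters: without it, the rule is a prefix-style match and also catches URLs that merely contain `.pdf`.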
Common Mistakes
- Blocking CSS/JS (hurts rendering)
- Blocking images (hurts image SEO)
- Typos in syntax
- Wrong file location
- Using noindex in robots.txt (doesn't work)
- Blocking the sitemap
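Some of these mistakes can be caught automatically before deploying. The sketch below is a minimal linter that flags misspelled directives and `noindex` lines; `lint_robots` and the `KNOWN_DIRECTIVES` set are assumptions for this example, limited to the directives covered in this article.

```python
# Directives covered in this article; anything else is reported.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text: str):
    """Return a list of (line_number, message) for suspicious lines."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # Drop comments and whitespace.
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field == "noindex":
            problems.append((number, "noindex is not supported in robots.txt"))
        elif field not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{field}'"))
    return problems


sample = """\
User-agent: *
Disalow: /admin/
Noindex: /old/
Sitemap: https://yoursite.com/sitemap.xml
"""

for number, message in lint_robots(sample):
    print(number, message)
# 2 unknown directive 'disalow'
# 3 noindex is not supported in robots.txt
```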
Correct Approach
# Allow CSS and JS for rendering
User-agent: *
Allow: /wp-includes/js/
Allow: /wp-content/themes/
Allow: /wp-content/plugins/
Disallow: /wp-admin/
Testing Robots.txt
Google Search Console
- Settings > robots.txt report (the standalone robots.txt Tester has been retired)
- Use URL Inspection to test a specific URL
- Check if blocked/allowed
Screaming Frog
- Configuration > robots.txt
- Test custom robots.txt
- See blocked URLs
Manual Check
curl https://yoursite.com/robots.txt
Robots.txt vs Noindex
Robots.txt:
- Controls crawling
- Doesn't prevent indexing
- File-based
Noindex:
- Controls indexing
- Page still crawled
- Meta tag/header
Best Practice:
- Use robots.txt for crawl efficiency
- Use noindex to prevent indexing
- Don't block pages you want noindexed
Important Notes
Blocking Doesn’t Mean Private
Warning:
Robots.txt is public
Anyone can read it
Not a security measure
For sensitive content:
- Password protection
- Server-side auth
- Not just robots.txt
Blocked Pages Can Still Index
If page has backlinks:
- URL may still appear in search
- Just without description
- Shows "blocked by robots.txt"
To truly prevent indexing:
- Use noindex tag
- Don't block crawling
- Let Google see the noindex
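A quick way to apply this rule is to compare the list of pages that carry a noindex tag against the robots.txt rules and report any page the crawler cannot reach. This is a rough sketch using Python's standard `urllib.robotparser`, which only handles simple prefix rules; `find_hidden_noindex` and the example paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def find_hidden_noindex(robots_txt, noindex_paths, user_agent="Googlebot"):
    """Return noindex pages that the crawler is blocked from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A blocked page is never fetched, so its noindex tag is never seen.
    return [path for path in noindex_paths if not parser.can_fetch(user_agent, path)]


robots_txt = """\
User-agent: *
Disallow: /old/
"""

conflicts = find_hidden_noindex(robots_txt, ["/old/page.html", "/blog/post.html"])
print(conflicts)  # ['/old/page.html']
```

Any path in the result needs its `Disallow` rule removed so the noindex tag can take effect.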
Best Practices Checklist
- Place at root domain
- Include sitemap location
- Test before deploying
- Don't block important resources
- Use for crawl efficiency
- Review and update regularly
- Don't rely on it for security
- Don't block a page and expect its noindex to work
Conclusion
Robots.txt is a tool for controlling crawling, not indexing or security. Use it wisely to optimize crawl budget and guide crawlers to the content that matters.