When I first started scraping, I hit a confusing problem. My spider would visit a page, I could see the request in the logs, but my parse() method never got called. No data. No errors. Just... nothing.
After hours of debugging, I discovered the truth: the page was returning a 404. And Scrapy, by default, silently drops anything that isn’t a 200 response.
This behavior makes sense once you understand it, but nobody explains it clearly to beginners. Let me fix that right now.
The Big Secret: Scrapy Only Handles 200 Responses
Here’s what the documentation doesn’t emphasize enough:
By default, Scrapy only passes responses with status codes between 200 and 299 to your spider.
Everything else is either handled behind the scenes or dropped silently:
- 301 redirects? Followed automatically by the redirect middleware; your callback only ever sees the final destination, never the 3xx itself.
- 302 redirects? Same: followed automatically.
- 404 not found? Dropped.
- 403 forbidden? Dropped.
- 500 server error? Retried a couple of times by the retry middleware, then dropped.
Either way, your parse() method never sees these responses themselves. They disappear into the void.
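Want to see it for yourself? Here's a minimal sketch; it points at httpbin.org/status/404 purely because that endpoint reliably answers with a 404 (my choice of test URL, not part of any real project):

import scrapy

class DefaultBehaviorSpider(scrapy.Spider):
    """Demo: with default settings, parse() never runs for the 404 below."""
    name = 'default_behavior'
    start_urls = ['https://httpbin.org/status/404']

    def parse(self, response):
        # You will not see this log line for the 404 unless you opt in
        # (see handle_httpstatus_list below)
        self.logger.info(f'parse() got {response.status} from {response.url}')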
Why Does Scrapy Do This?
Think about it. Most of the time, you only care about successful responses. If a page returns 404, there’s nothing to scrape. If it’s a 500 error, the server is broken.
Scrapy assumes you don’t want to waste time processing error pages. It’s protecting you from bad data.
But sometimes you DO want to handle these responses. Maybe you want to:
- Log which pages are missing (404s)
- Handle redirects manually (301, 302)
- Detect when you’re being blocked (403)
- Retry server errors (500, 503)
That’s where handle_httpstatus_list comes in.
Understanding HTTP Status Codes (Quick Refresh)
Before we dive in, let’s quickly review status codes:
2xx: Success
- 200: OK, everything worked
- 201: Created (used in APIs)
3xx: Redirection
- 301: Moved Permanently
- 302: Found (temporary redirect)
- 304: Not Modified (cached)
4xx: Client Error
- 400: Bad Request
- 401: Unauthorized
- 403: Forbidden
- 404: Not Found
- 429: Too Many Requests (rate limited!)
5xx: Server Error
- 500: Internal Server Error
- 502: Bad Gateway
- 503: Service Unavailable
- 504: Gateway Timeout
By default, Scrapy only processes 2xx responses.
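If it helps to see those buckets as code, here is a tiny, purely illustrative helper (the function is mine, not part of Scrapy):

def describe_status(status: int) -> str:
    """Map an HTTP status code to the categories listed above."""
    if 200 <= status < 300:
        return 'success'
    if 300 <= status < 400:
        return 'redirection'
    if 400 <= status < 500:
        return 'client error'
    if 500 <= status < 600:
        return 'server error'
    return 'unknown'

print(describe_status(429))  # 'client error' (and a hint to slow down)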
Handling Specific Status Codes (handle_httpstatus_list)
Method 1: Spider-Level (All Requests)
Add this to your spider class:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    # Handle 404 and 500 responses
    handle_httpstatus_list = [404, 500]

    def parse(self, response):
        if response.status == 404:
            self.logger.warning(f'Page not found: {response.url}')
            # Do something with 404s
        elif response.status == 500:
            self.logger.error(f'Server error: {response.url}')
            # Do something with 500s
        else:
            # Normal 200 response
            yield {'url': response.url, 'title': response.css('h1::text').get()}
Now your spider receives 404 and 500 responses. You can check response.status and handle them appropriately.
Method 2: Per-Request (Specific URLs)
Sometimes you only want to handle certain codes for specific requests:
def parse(self, response):
    # This request will handle 404s
    yield scrapy.Request(
        'https://example.com/might-not-exist',
        callback=self.parse_page,
        meta={'handle_httpstatus_list': [404]}
    )

def parse_page(self, response):
    if response.status == 404:
        self.logger.info("Page doesn't exist, that's ok")
    else:
        # Process normal response
        yield {'data': response.css('.content::text').get()}
The meta={'handle_httpstatus_list': [404]} tells Scrapy to pass 404s to the callback for just this request.
Method 3: Settings (Project-Wide)
Set it in settings.py:
# settings.py
HTTPERROR_ALLOWED_CODES = [404, 403, 500]
Now ALL spiders in your project handle these codes by default.
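Related: if you want every status code passed through project-wide, the HttpError middleware also has an allow-all switch. Use it sparingly, because you will receive every error page too:

# settings.py
# Pass ALL responses to spiders, whatever their status code
HTTPERROR_ALLOW_ALL = True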
Handling ALL Status Codes (handle_httpstatus_all)
Sometimes you want to handle every possible status code:
def parse(self, response):
    yield scrapy.Request(
        'https://example.com/anything',
        callback=self.parse_any,
        meta={'handle_httpstatus_all': True}
    )

def parse_any(self, response):
    self.logger.info(f'Got status {response.status} from {response.url}')

    if 200 <= response.status < 300:
        # Success: process it like a normal page
        yield {'url': response.url, 'title': response.css('title::text').get()}
    elif 300 <= response.status < 400:
        # Redirect
        self.logger.info(f'Redirect to: {response.headers.get("Location")}')
    elif 400 <= response.status < 500:
        # Client error
        self.logger.warning(f'Client error: {response.status}')
    elif 500 <= response.status < 600:
        # Server error
        self.logger.error(f'Server error: {response.status}')
Warning: Use handle_httpstatus_all carefully. You’ll get EVERYTHING, including redirects that Scrapy normally handles automatically.
Real Example: Handling Missing Pages
Let’s say you’re scraping product pages, but some products have been deleted (404):
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com']
    handle_httpstatus_list = [404]

    def parse(self, response):
        # Get product links
        for link in response.css('.product a::attr(href)').getall():
            yield response.follow(link, callback=self.parse_product)

    def parse_product(self, response):
        if response.status == 404:
            # Product doesn't exist anymore
            yield {
                'url': response.url,
                'status': 'deleted',
                'found': False
            }
        else:
            # Product exists, scrape it
            yield {
                'url': response.url,
                'status': 'active',
                'found': True,
                'name': response.css('h1::text').get(),
                'price': response.css('.price::text').get()
            }
Now you can track which products have been deleted!
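To turn that into a report, export the items to a file when you run the spider. The filename is just an example, and -O (overwrite) needs Scrapy 2.1+; older versions only have -o (append):

# Writes both 'active' and 'deleted' items to one JSON file
scrapy crawl products -O product_report.json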
Real Example: Detecting Rate Limiting
Websites often return 429 (Too Many Requests) when you’re scraping too fast:
import scrapy

class RateLimitSpider(scrapy.Spider):
    name = 'ratelimit'
    start_urls = ['https://example.com']
    handle_httpstatus_list = [429]

    custom_settings = {
        'DOWNLOAD_DELAY': 1  # Start with 1 second delay
    }

    def parse(self, response):
        if response.status == 429:
            # We're being rate limited!
            self.logger.warning('Rate limited! Slowing down...')

            # Drop global concurrency to 1 (this pokes an internal
            # downloader attribute, so treat it as a hack)
            self.crawler.engine.downloader.total_concurrency = 1

            # The server may tell us how long to back off
            retry_after = int(response.headers.get('Retry-After', 60))
            self.logger.info(f'Server asked us to wait {retry_after} seconds')

            # Re-queue this request; it must be yielded, because a plain
            # `return request` is swallowed inside a generator callback
            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
                priority=10  # High priority
            )
            return

        # Normal processing
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
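A gentler alternative to hand-rolled throttling is Scrapy's built-in AutoThrottle extension, combined with letting the retry middleware re-issue 429 responses. Here is a minimal settings sketch; the numbers are arbitrary starting points, not recommendations from any benchmark:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1        # initial delay between requests (seconds)
AUTOTHROTTLE_MAX_DELAY = 60         # back off up to a minute if the site struggles
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Treat rate limiting and server errors as retryable
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
RETRY_TIMES = 5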
Real Example: Handling Redirects
By default, Scrapy follows redirects automatically. But sometimes you want to handle them manually:
import scrapy

class RedirectSpider(scrapy.Spider):
    name = 'redirects'
    start_urls = ['https://example.com/old-page']

    # Handle redirect status codes
    handle_httpstatus_list = [301, 302]

    custom_settings = {
        'REDIRECT_ENABLED': False  # Disable automatic redirect following
    }

    def parse(self, response):
        if response.status in [301, 302]:
            # Manual redirect handling
            new_url = response.headers.get('Location').decode('utf-8')
            self.logger.info(f'Redirect: {response.url} -> {new_url}')

            # Track the redirect
            yield {
                'original_url': response.url,
                'redirect_type': response.status,
                'new_url': new_url
            }

            # Follow manually if needed
            yield response.follow(new_url, callback=self.parse_final)
        else:
            # Normal page
            yield {'url': response.url, 'title': response.css('title::text').get()}

    def parse_final(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'is_final': True
        }
The Critical Difference: View Page Source vs Inspect Element
This is huge and trips up almost every beginner. Let me explain.
What You See in Inspect Element
When you right-click a page and choose "Inspect Element" or "Inspect," you see the DOM (Document Object Model). This is the HTML AFTER:
- JavaScript has run
- Content has loaded dynamically
- AJAX requests have completed
- React/Vue/Angular has rendered
- Infinite scroll has loaded more items
This is NOT what Scrapy sees.
What Scrapy Actually Sees
Scrapy downloads the raw HTML. It doesn’t run JavaScript. It doesn’t wait for AJAX. It sees the page BEFORE any JavaScript execution.
To see what Scrapy sees, you need to view the page source.
How to View Page Source
Method 1: Right-Click Menu
- Right-click on the page
- Choose "View Page Source" (NOT "Inspect")
- This opens a new tab with raw HTML
Method 2: Keyboard Shortcut
- Windows/Linux: Ctrl + U
- Mac: Cmd + Option + U
Method 3: URL Bar
- Add view-source: before the URL
- Example: view-source:https://example.com
The Problem This Solves
Here’s a real scenario:
You inspect a product page and see:
<div class="price">$29.99</div>
You write this selector:
response.css('.price::text').get()
But it returns None. Why?
You view page source and discover:
<div class="price"></div>
<script>
// Price loads via JavaScript
loadPrice();
</script>
The price isn’t in the HTML! It’s loaded by JavaScript. Scrapy can’t see it because Scrapy doesn’t run JavaScript.
Real Example: JavaScript-Loaded Content
Let’s say you’re scraping a product list. Inspect Element shows:
<div class="products">
  <div class="product">Product 1</div>
  <div class="product">Product 2</div>
  <div class="product">Product 3</div>
</div>
But when you view page source, you see:
<div class="products">
  <!-- Products loaded by JavaScript -->
</div>
<script src="loadProducts.js"></script>
Your selector won’t work! The products aren’t in the HTML Scrapy downloads.
Solutions:
- Use scrapy-playwright or scrapy-selenium (renders JavaScript)
- Find the API endpoint the JavaScript calls
- Extract data from <script> tags if the data is embedded there (see the sketch below)
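Here's a rough sketch of that third option. It assumes the page embeds its data as JSON in a <script> tag with the id product-data; both the id and the JSON shape are invented for illustration:

import json

import scrapy

class EmbeddedJsonSpider(scrapy.Spider):
    name = 'embedded_json'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Pull the raw JSON text out of the (hypothetical) script tag
        raw = response.css('script#product-data::text').get()
        if not raw:
            self.logger.warning(f'No embedded data found on {response.url}')
            return

        data = json.loads(raw)
        # Assumed shape: {"products": [{"name": ..., "price": ...}, ...]}
        for product in data.get('products', []):
            yield {
                'name': product.get('name'),
                'price': product.get('price'),
            }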
Testing What Scrapy Sees
Use Scrapy shell to see exactly what Scrapy downloads:
scrapy shell "https://example.com"
Then check:
# See the raw HTML
>>> print(response.text)
# Try your selectors
>>> response.css('.price::text').get()
If your selector returns None but works in the browser, the content is JavaScript-loaded.
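Two more quick checks from inside the shell, continuing the $29.99 example from earlier: search the raw HTML for the value you expect, and open what Scrapy actually downloaded in your browser with the shell's built-in view() helper:

# Is the text anywhere in the raw HTML at all?
>>> '29.99' in response.text
False

# Open the downloaded (JavaScript-free) page in your browser
>>> view(response)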
Complete Example: Production-Ready Response Handling
Here’s a spider that handles responses properly:
import scrapy
import time

class RobustSpider(scrapy.Spider):
    name = 'robust'
    start_urls = ['https://example.com/products']

    # Handle various status codes
    handle_httpstatus_list = [404, 403, 429, 500, 503]

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 8,
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408]
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'success': 0,
            'not_found': 0,
            'forbidden': 0,
            'rate_limited': 0,
            'server_error': 0
        }

    def parse(self, response):
        # Check status before processing
        if response.status != 200:
            yield from self.handle_error(response)
            return

        # Normal processing
        self.stats['success'] += 1
        for product in response.css('.product'):
            yield response.follow(
                product.css('a::attr(href)').get(),
                callback=self.parse_product,
                errback=self.handle_failure
            )

    def parse_product(self, response):
        # Check status
        if response.status != 200:
            yield from self.handle_error(response)
            return

        # Scrape product
        yield {
            'url': response.url,
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'status': 'active'
        }

    def handle_error(self, response):
        """Handle non-200 responses"""
        if response.status == 404:
            self.stats['not_found'] += 1
            self.logger.warning(f'404 Not Found: {response.url}')
            yield {
                'url': response.url,
                'status': 'deleted',
                'error': '404'
            }

        elif response.status == 403:
            self.stats['forbidden'] += 1
            self.logger.error(f'403 Forbidden: {response.url}')
            # Might be blocked, slow down
            # (pausing and sleeping blocks the whole crawler for 10 seconds)
            self.crawler.engine.pause()
            time.sleep(10)
            self.crawler.engine.unpause()

        elif response.status == 429:
            self.stats['rate_limited'] += 1
            self.logger.warning(f'429 Rate Limited: {response.url}')
            # Re-queue with lower priority
            yield scrapy.Request(
                response.url,
                callback=self.parse_product,
                dont_filter=True,
                priority=0
            )

        elif response.status >= 500:
            self.stats['server_error'] += 1
            self.logger.error(f'{response.status} Server Error: {response.url}')
            # By the time a 5xx reaches the spider, the retry middleware
            # has already given up on it

    def handle_failure(self, failure):
        """Handle request failures (network errors, etc.)"""
        self.logger.error(f'Request failed: {failure}')

    def closed(self, reason):
        """Log statistics when spider closes"""
        self.logger.info('=' * 60)
        self.logger.info('SPIDER STATISTICS')
        self.logger.info(f'Success: {self.stats["success"]}')
        self.logger.info(f'Not Found (404): {self.stats["not_found"]}')
        self.logger.info(f'Forbidden (403): {self.stats["forbidden"]}')
        self.logger.info(f'Rate Limited (429): {self.stats["rate_limited"]}')
        self.logger.info(f'Server Errors (5xx): {self.stats["server_error"]}')
        self.logger.info('=' * 60)
This spider:
- Handles multiple error codes
- Tracks statistics
- Slows down when blocked
- Re-queues rate-limited requests
- Logs everything properly
Common Mistakes
Mistake #1: Not Checking Status
# WRONG (assumes all responses are 200)
def parse(self, response):
    yield {'title': response.css('h1::text').get()}

# RIGHT (checks status first)
def parse(self, response):
    if response.status == 200:
        yield {'title': response.css('h1::text').get()}
    else:
        self.logger.warning(f'Got status {response.status}')
Mistake #2: Using Inspect Element Instead of View Source
# You see this in Inspect Element
response.css('.dynamically-loaded::text').get()
# Returns None because content isn't in page source!
# Always check view-source: first
Mistake #3: Forgetting to Add Status Code to List
handle_httpstatus_list = [404]

def parse(self, response):
    # This never runs for 500 errors!
    if response.status == 500:
        self.logger.error('Server error')
If you want to handle 500s, add them to the list!
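The fix is simply to extend the list (remember that 500s are also retried by the retry middleware before they ever reach you):

# Now parse() actually receives both 404s and 500s
handle_httpstatus_list = [404, 500]

def parse(self, response):
    if response.status == 500:
        self.logger.error('Server error')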
Quick Reference
Spider-Level Handling
class MySpider(scrapy.Spider):
    handle_httpstatus_list = [404, 403, 500]
Per-Request Handling
yield scrapy.Request(
    url,
    meta={'handle_httpstatus_list': [404]}
)
Handle All Codes
yield scrapy.Request(
    url,
    meta={'handle_httpstatus_all': True}
)
Settings
# settings.py
HTTPERROR_ALLOWED_CODES = [404, 403, 500]
Check Status
def parse(self, response):
    if response.status == 200:
        ...  # Success
    elif response.status == 404:
        ...  # Not found
    elif response.status >= 500:
        ...  # Server error
View Page Source
- Right-click → "View Page Source"
- Ctrl + U (Windows/Linux) or Cmd + Option + U (Mac)
- view-source:https://example.com
Summary
Key takeaways:
- Scrapy only handles 200-299 responses by default
- Use handle_httpstatus_list to handle specific codes
- Use handle_httpstatus_all to handle everything
- Always check response.status before processing
- View Page Source, not Inspect Element (critical!)
- Page source shows what Scrapy sees
- Inspect Element shows what the browser renders
Start checking status codes in your spiders. View page source before writing selectors. Your scraping life will get much easier.
Happy scraping! 🕷️