I’ll never forget the first time I tried scraping a modern website. My selectors worked perfectly in the browser’s inspector. But when I ran my spider, everything returned None.
I checked my CSS selectors ten times. I tried XPath. Still nothing. I was going crazy.
Then I viewed the page source (Ctrl+U) and realized: the content wasn’t there. The HTML was nearly empty. Everything was loaded by JavaScript after the page rendered.
That’s when I learned: Scrapy doesn’t run JavaScript. It only sees the initial HTML. Let me show you how to handle JavaScript-heavy sites.
The Problem: Scrapy Doesn’t Run JavaScript
When you visit a website in a browser:
- Browser downloads HTML
- Browser runs JavaScript
- JavaScript fetches data (AJAX)
- JavaScript builds the page
- You see the final result
When Scrapy visits the same site:
- Scrapy downloads HTML
- Scrapy stops here
- JavaScript never runs
- Dynamic content never loads
- Your selectors find nothing
How to Detect JavaScript-Heavy Sites
Test 1: View Page Source vs Inspect Element
In your browser:
- Right-click → "Inspect Element"
- Find the element you want to scrape
- Note its HTML structure
Then:
- Press Ctrl+U (or Cmd+Option+U on Mac)
- Search for the same content
- Is it there?
If NO: Content is JavaScript-loaded.
If YES: Content is in HTML, Scrapy will work.
Test 2: Use Scrapy Shell
scrapy shell "https://example.com"
>>> response.css('.product-name::text').get()
None # Uh oh, JavaScript site!
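A quick sanity check in the same shell session: search the raw HTML for text you saw in the browser (here 'Product 1' stands in for whatever text you inspected):

>>> 'Product 1' in response.text
False  # The text isn't in the raw HTML at all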
Test 3: Disable JavaScript in Browser
- Open Chrome DevTools (F12)
- Press Ctrl+Shift+P
- Type "disable javascript"
- Select "Disable JavaScript"
- Refresh the page
If the page is now empty or broken, it’s JavaScript-heavy.
Solution 1: Find the API (Best Approach)
JavaScript-heavy sites load data from APIs. Find these APIs and scrape them directly.
How to Find APIs
Step 1: Open Network Tab
- Open DevTools (F12)
- Click "Network" tab
- Filter by "XHR" or "Fetch"
- Refresh the page
Step 2: Look for JSON Responses
Watch the network requests. Look for:
- Requests to /api/ paths
- Requests returning JSON
- Requests with product data
Step 3: Click on Interesting Requests
Click a request → "Preview" tab
If you see your data in JSON format, you found it!
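Before writing a spider, you can replay the request in scrapy shell to confirm the endpoint works outside the browser. A sketch using the example endpoint from the next section (the output is what its sample data would give):

scrapy shell
>>> from scrapy import Request
>>> req = Request('https://example.com/api/products?page=1&limit=20',
...               headers={'Accept': 'application/json'})
>>> fetch(req)
>>> response.json()['products'][0]['name']
'Product 1'

Chrome DevTools also lets you right-click a request → Copy → Copy as cURL, which shows every header the browser sent.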
Example: Scraping the API Directly
Let’s say you find this API:
https://example.com/api/products?page=1&limit=20
Returns:
{
  "products": [
    {"id": 1, "name": "Product 1", "price": 29.99},
    {"id": 2, "name": "Product 2", "price": 39.99}
  ],
  "total": 100
}
Your Spider:
import scrapy
import json

class ApiSpider(scrapy.Spider):
    name = 'api'

    def start_requests(self):
        # Scrape the API directly, not the webpage
        url = 'https://example.com/api/products?page=1&limit=20'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)  # or response.json() on Scrapy 2.2+
        for product in data['products']:
            yield {
                'name': product['name'],
                'price': product['price']
            }

        # Pagination: pull the current page number out of the URL
        current_page = int(response.url.split('page=')[1].split('&')[0])
        total = data['total']
        items_per_page = 20
        if current_page * items_per_page < total:
            next_page = current_page + 1
            next_url = f'https://example.com/api/products?page={next_page}&limit=20'
            yield scrapy.Request(next_url, callback=self.parse)
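To try it, assuming the spider above is saved as api_spider.py, you can run it standalone and dump the items to a JSON file:

scrapy runspider api_spider.py -O products.json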
Benefits:
- Much faster (no rendering)
- Clean JSON data (no HTML parsing)
- More reliable
- Often has more data than the webpage
What the docs don’t tell you:
- APIs often have rate limiting (go slower)
- Some APIs require authentication (check headers)
- APIs might have different pagination than the website
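A sketch of handling the first two points, assuming a hypothetical api_token copied from the request headers in DevTools; DOWNLOAD_DELAY and AutoThrottle are standard Scrapy settings:

# settings.py — slow down for rate-limited APIs
DOWNLOAD_DELAY = 1            # At least 1 second between requests
AUTOTHROTTLE_ENABLED = True   # Back off automatically if the server slows down

# In the spider — send the same headers the browser sent
api_token = '...'  # Hypothetical: copy from DevTools → Network → Headers
yield scrapy.Request(
    url,
    headers={'Authorization': f'Bearer {api_token}'},
    callback=self.parse
)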
Solution 2: Scrapy-Playwright (Modern Browser Automation)
When you can’t find the API, use Scrapy-Playwright to render JavaScript.
Installation
pip install scrapy-playwright
playwright install
Basic Setup
settings.py:
# Enable Playwright for HTTP and HTTPS
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Required: scrapy-playwright only works with the asyncio reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Optional: 'chromium' (default), 'firefox', or 'webkit'
PLAYWRIGHT_BROWSER_TYPE = 'chromium'

# Launch options
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,   # Run without a visible browser window
    'timeout': 60000,   # 60-second launch timeout, in milliseconds
}
Basic Spider
import scrapy

class PlaywrightSpider(scrapy.Spider):
    name = 'playwright'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com',
            meta={'playwright': True}  # Enable Playwright for this request
        )

    def parse(self, response):
        # JavaScript has run! The rendered content is now in the response
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
Wait for Elements to Load
from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Wait for a specific element to appear
                PageMethod('wait_for_selector', '.product-name'),
            ]
        }
    )
Wait for Network Idle
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Wait until there have been no network requests for 500ms
                PageMethod('wait_for_load_state', 'networkidle'),
            ]
        }
    )
Scrolling (For Infinite Scroll)
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Scroll down several times, pausing for content to load
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
                PageMethod('wait_for_timeout', 2000),
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
                PageMethod('wait_for_timeout', 2000),
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
            ]
        }
    )
Click Buttons
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Click the "Load More" button, then wait for the new items
                PageMethod('click', 'button.load-more'),
                PageMethod('wait_for_selector', '.new-products'),
            ]
        }
    )
Screenshots (Debugging)
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Save a full-page screenshot for debugging
                PageMethod('screenshot', path='page.png', full_page=True),
            ]
        }
    )
Solution 3: Scrapy-Selenium (Older but Still Works)
Selenium has been around longer and has more examples online.
Installation
pip install scrapy-selenium
Download ChromeDriver from: https://chromedriver.chromium.org/
Setup
settings.py:
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # Run without a visible browser

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
Basic Spider
import scrapy
from scrapy_selenium import SeleniumRequest

class SeleniumSpider(scrapy.Spider):
    name = 'selenium'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://example.com',
            callback=self.parse
        )

    def parse(self, response):
        # JavaScript has run!
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
Wait for Elements
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def start_requests(self):
    yield SeleniumRequest(
        url='https://example.com',
        callback=self.parse,
        wait_time=10,  # Wait up to 10 seconds
        wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'product'))
    )
Execute JavaScript
import time
from scrapy.selector import Selector

def parse(self, response):
    driver = response.meta['driver']

    # Scroll to the bottom of the page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

    # Give the new content a moment to load
    time.sleep(2)

    # The original response is now stale; re-parse the live page source
    sel = Selector(text=driver.page_source)
    for product in sel.css('.product'):
        yield {...}
Solution 4: Splash (Lightweight Rendering)
Splash is a lightweight JavaScript rendering service.
Installation
# Run Splash with Docker
docker run -p 8050:8050 scrapinghub/splash
Setup
settings.py:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Basic Spider
import scrapy
from scrapy_splash import SplashRequest

class SplashSpider(scrapy.Spider):
    name = 'splash'

    def start_requests(self):
        yield SplashRequest(
            url='https://example.com',
            callback=self.parse,
            args={'wait': 2}  # Let JavaScript run for 2 seconds
        )

    def parse(self, response):
        for product in response.css('.product'):
            yield {...}
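For anything beyond a fixed wait, Splash also accepts a Lua script via its execute endpoint. A minimal sketch that scrolls once before returning the rendered HTML (splash:go, splash:wait, splash:runjs, and splash:html are part of Splash's Lua API):

script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)
    splash:runjs('window.scrollTo(0, document.body.scrollHeight)')
    splash:wait(2)
    return splash:html()
end
"""

yield SplashRequest(
    url='https://example.com',
    callback=self.parse,
    endpoint='execute',
    args={'lua_source': script}
)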
Comparison: Which Solution to Use?
Use API Scraping When:
- You can find the API endpoints
- API has all the data you need
- You want maximum speed
- You want clean JSON data
Pros: fast, reliable, clean data. Cons: you have to find the API, and it might need authentication.
Use Scrapy-Playwright When:
- Modern sites (2020+)
- Need full browser features
- Complex JavaScript interactions
- Want the most capable, actively maintained option
Pros: modern, fast, feature-rich. Cons: requires installing Playwright and its browsers.
Use Scrapy-Selenium When:
- Older sites
- Need Selenium-specific features
- More examples available online
- Already familiar with Selenium
Pros: mature, lots of examples. Cons: slower and more resource-heavy.
Use Splash When:
- Want lightweight rendering
- Already using Scrapinghub services
- Need something between Scrapy and full browsers
Pros: lightweight, runs as a separate service. Cons: extra infrastructure and a learning curve.
Real-World Example: Infinite Scroll Site
Many modern sites use infinite scroll. Here’s how to handle it:
With Playwright
import scrapy
from scrapy.http import HtmlResponse

class InfiniteScrollSpider(scrapy.Spider):
    name = 'infinite'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                'playwright_include_page': True,  # Keep the page object
            },
            callback=self.parse
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        # Scroll and wait, ten times
        for i in range(10):
            # Scroll to the bottom
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            # Wait for new content to load
            await page.wait_for_timeout(2000)

        # Grab the fully loaded HTML, then close the page
        content = await page.content()
        await page.close()

        # Parse with Scrapy selectors
        new_response = HtmlResponse(
            url=response.url,
            body=content.encode('utf-8'),
            encoding='utf-8'
        )
        for product in new_response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
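One caveat with playwright_include_page: if the request errors out, parse never runs and the page is never closed, which leaks browser pages. The scrapy-playwright docs recommend pairing it with an errback; a sketch:

def start_requests(self):
    yield scrapy.Request(
        'https://example.com/products',
        meta={'playwright': True, 'playwright_include_page': True},
        callback=self.parse,
        errback=self.close_page_on_error,
    )

async def close_page_on_error(self, failure):
    # Release the browser page even when the request fails
    page = failure.request.meta['playwright_page']
    await page.close()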
Performance Considerations
JavaScript Rendering is SLOW
Normal Scrapy:
- 10-50 pages per second
With Playwright/Selenium:
- 1-5 pages per second
Rendering JavaScript is 10-50x slower!
Optimization Strategies
1. Only render when necessary
def start_requests(self):
    for url in self.start_urls:
        # Check whether this page needs JavaScript (heuristic sketched below)
        if self.needs_javascript(url):
            yield scrapy.Request(url, meta={'playwright': True})
        else:
            yield scrapy.Request(url)  # Normal, fast request
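needs_javascript isn't defined above; a minimal heuristic, assuming you keep a hand-maintained set of domains you've already confirmed (via Ctrl+U) need rendering:

from urllib.parse import urlparse

# Hypothetical list: domains where view-source came up empty
JS_HEAVY_DOMAINS = {'app.example.com', 'shop.example.net'}

def needs_javascript(self, url):
    return urlparse(url).netloc in JS_HEAVY_DOMAINS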
2. Use concurrent browsers
# settings.py
PLAYWRIGHT_MAX_CONTEXTS = 4  # Up to 4 browser contexts in parallel
3. Cache rendered pages
HTTPCACHE_ENABLED = True # Don't re-render during development
4. Prefer APIs
Always try to find APIs first. They’re much faster.
Common Mistakes
Mistake #1: Not Waiting Long Enough
# BAD (content might not have loaded yet)
meta={'playwright': True}

# GOOD (wait for the content you need)
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_selector', '.products-loaded')
    ]
}
Mistake #2: Forgetting to Enable Playwright
# BAD (won't render JavaScript)
yield scrapy.Request(url)
# GOOD
yield scrapy.Request(url, meta={'playwright': True})
Mistake #3: Using Rendering for Everything
# BAD (slow for no reason)
for url in urls:
    yield scrapy.Request(url, meta={'playwright': True})

# GOOD (only render when needed)
for url in urls:
    if 'product' in url:
        yield scrapy.Request(url, meta={'playwright': True})
    else:
        yield scrapy.Request(url)
Quick Decision Tree
Is content in page source (Ctrl+U)?
├─ Yes → Use normal Scrapy
└─ No → JavaScript content
         │
         Can you find the API?
         ├─ Yes → Scrape the API directly (BEST)
         └─ No → You need a browser
                  │
                  Modern site (2020+)?
                  ├─ Yes → Use Scrapy-Playwright
                  └─ No → Use Scrapy-Selenium
Summary
JavaScript content isn’t in initial HTML:
- Scrapy doesn’t run JavaScript
- Need special tools to render pages
Four solutions:
- Find API (best, fastest)
- Scrapy-Playwright (modern, recommended)
- Scrapy-Selenium (older, still works)
- Splash (lightweight alternative)
Always try APIs first:
- 10-50x faster
- Cleaner data
- More reliable
When rendering JavaScript:
- Wait for content to load
- Only render when necessary
- Use concurrent browsers
- Cache during development
Key insight:
- View page source (Ctrl+U) to check if content is there
- Inspect Element shows final result after JavaScript
- These are different!
Start by checking page source. If content is there, use normal Scrapy. If not, find the API or use Playwright.
Happy scraping! 🕷️