Web Scraping Not Working? Complete Troubleshooting Guide
*Published: March 2026 · Author: IPCloud Team*
---
Your scraper was running fine yesterday. Now it's returning empty responses, throwing 403s, getting hit with CAPTCHAs, or just hanging indefinitely. You've checked the code three times and nothing looks wrong.
The problem usually isn't your code.
Web scraping failures in 2026 almost always trace back to one of four root causes: IP blocking, anti-bot detection, site structure changes, or misconfigured proxy infrastructure. This guide walks through every major failure mode, how to diagnose it, and exactly how to fix it — with specific solutions for both proxy-based and non-proxy issues.
---
Table of Contents
1. Quick Diagnostic Checklist
2. HTTP Error Code Breakdown
3. IP Blocking & Banning
4. CAPTCHA and Bot Detection
5. JavaScript-Rendered Content
6. Rate Limiting
7. Session & Cookie Issues
8. Proxy Configuration Problems
9. Site Structure Changes
10. Headers and Fingerprinting
11. Geo-Blocking
12. Residential vs Datacenter: When Your Proxy Type Is the Problem
13. Tools to Diagnose Scraping Issues
14. Prevention: Building Scrapers That Don't Break
15. FAQ
---
Quick Diagnostic Checklist {#diagnostic}
Before diving deep, run through this fast triage list:
- [ ] Does the target URL load in a browser? If not, the site may be down — not your scraper.
- [ ] Does the request succeed without a proxy? Isolates proxy vs. scraper issues instantly.
- [ ] What HTTP status code are you getting? See the breakdown below.
- [ ] Has the page's HTML structure changed? Your selectors may be broken by a site update.
- [ ] Are you sending the correct headers? Missing `User-Agent` is one of the most common causes of instant blocks.
- [ ] Is the content JavaScript-rendered? Static scrapers return an almost-empty body on SPAs and dynamic pages.
- [ ] Is your IP blacklisted? Test at whatismyipaddress.com and check against known blacklist databases.
- [ ] Are your proxies rotating correctly? A misconfigured rotation endpoint hammers the same IP repeatedly.
- [ ] Is your request rate too aggressive? Even legitimate traffic gets rate-limited past certain thresholds.
If you can locate your failure in this list, jump directly to that section below.
---
HTTP Error Code Breakdown {#error-codes}
The status code your scraper receives is the fastest way to identify what's going wrong.
403 Forbidden
The server received your request and explicitly refused it. This is the most common scraping block. Causes:
- Your IP is flagged or blacklisted
- Missing or incorrect request headers
- User-Agent identified as a bot
- Referrer check failing
- Geographic restriction enforced
Fix: Rotate your IP, add proper headers, check User-Agent strings, and verify geo-targeting if the site is region-specific.
429 Too Many Requests
You've exceeded the server's rate limit. The response usually includes a `Retry-After` header telling you when to try again.

Fix: Add delays between requests, implement exponential backoff, reduce concurrency, and rotate IPs to distribute request load across multiple exit nodes.
404 Not Found
Either the URL has changed, the page was removed, or the site is serving different URLs to different user agents or IP ranges.

Fix: Verify the URL manually in a browser. If the structure changed, update your URL patterns. If you're being served different content than a real browser, your request headers need work.
503 Service Unavailable / 502 Bad Gateway
The server is overloaded or your request was dropped by a reverse proxy or load balancer. This can also be a soft block — some anti-bot systems return 503 instead of 403.

Fix: Check if the site is experiencing genuine downtime. If only your scraper gets 503 while browsers work fine, it's a bot detection measure — rotate IP and improve headers.
407 Proxy Authentication Required
Your proxy credentials are wrong or your proxy provider's authentication is failing.

Fix: Verify your proxy username, password, and endpoint format. Check that your IP whitelist (if using IP auth) includes your current outbound IP.
Empty Response / Connection Timeout
The server is silently dropping your connection without responding. Common on well-protected targets.

Fix: Rotate IP immediately, reduce request frequency, and switch to residential proxies if using datacenter.
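The breakdown above can be folded into a small triage helper. This is an illustrative sketch, not a library API; the hint strings paraphrase the fixes in this section:

```python
# Hypothetical triage helper: map a status code to the likely root cause
# covered in this section. Messages are paraphrases for logging, nothing more.
BLOCK_HINTS = {
    403: "IP flagged or headers look robotic: rotate IP, fix headers",
    429: "rate limited: back off and spread load across more IPs",
    404: "URL changed or removed: verify it manually in a browser",
    407: "proxy auth failing: check credentials and IP whitelist",
    502: "dropped by a reverse proxy: retry with backoff",
    503: "overloaded, or a soft bot block: compare with a real browser",
}

def diagnose(status_code):
    # A timeout / silent drop reaches us with no status code at all
    if status_code is None:
        return "connection dropped silently: rotate IP and slow down"
    return BLOCK_HINTS.get(status_code, "unhandled status %d" % status_code)
```

Logging `diagnose(response.status_code)` on every failed request turns a vague "scraper is broken" report into an actionable category.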
---
IP Blocking & Banning {#ip-blocking}
IP blocking is the most common cause of scraping failures. When a target site's bot detection flags your IP, you'll typically see 403 responses, empty bodies, or redirects to block pages — often within minutes of starting a new session.
Why IPs Get Blocked
Volume: Too many requests from a single IP in a short window. Even 10 requests per minute can trigger blocks on well-protected sites.
ASN reputation: Datacenter IPs from known cloud providers (AWS, DigitalOcean, Hetzner) are pre-flagged on many targets. The site doesn't even need to analyze your behavior — it blocks the entire ASN range.
Behavioral patterns: Scraping produces inhuman request patterns — perfectly consistent timing, no mouse events, no CSS/font resource loading, identical viewport dimensions. Modern detection correlates these signals.
Shared IP history: If you're on shared datacenter proxies, other users may have already burned those IPs on the same target. You inherit the block.
How to Fix IP Blocks
Rotate your exit IPs. Use rotating proxies so each request (or session) comes from a different IP. IPCloud's rotating residential proxies give you access to 10M+ IPs with automatic rotation per request or on a configurable interval.
Switch to residential proxies. If your datacenter IPs are being blocked, the ASN itself may be flagged. Residential IPs from real household connections bypass this entirely — they carry real ISP signals that anti-bot systems treat as legitimate user traffic.
Implement backoff on block signals. When your scraper detects a 403 or block response, immediately flag that IP and pull a fresh one. Don't retry with the same IP.
Slow down. Counter-intuitive when you want speed, but reducing requests per IP per minute dramatically lowers your block rate. Pair this with higher concurrency across more IPs for net throughput.
Use sticky sessions wisely. Don't stick to the same IP longer than necessary. For stateless scraping, rotate every request. For session-based workflows, rotate between logical sessions, not within them.
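The rotate-on-block rule above can be sketched as a small wrapper. The `fetch` callable and the IP pool here are placeholders, assuming your real proxy client exposes something equivalent:

```python
import random

BLOCK_SIGNALS = {403, 429}

def fetch_with_rotation(fetch, url, ip_pool, max_attempts=5):
    """Try fresh IPs until one isn't blocked.

    `fetch(url, ip)` is a caller-supplied function returning a status code;
    the pool and fetcher are illustrative stand-ins for a real proxy client.
    """
    burned = set()
    for _ in range(max_attempts):
        candidates = [ip for ip in ip_pool if ip not in burned]
        if not candidates:
            break
        ip = random.choice(candidates)
        status = fetch(url, ip)
        if status in BLOCK_SIGNALS:
            burned.add(ip)  # flag this IP; never retry with it
            continue
        return ip, status
    return None, None
```

The key property is that a blocked IP goes into `burned` immediately, so no retry ever reuses it within the run.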
---
CAPTCHA and Bot Detection {#captcha}
Modern CAPTCHA and bot detection has evolved well beyond the "identify traffic lights" phase. In 2026, the major bot detection platforms — Cloudflare, DataDome, Akamai Bot Manager, PerimeterX — use behavioral fingerprinting that runs before you ever see a CAPTCHA challenge.
Types of Bot Detection You'll Hit
IP reputation scoring: Cloudflare and similar systems score incoming IPs based on historical behavior, ASN, and shared reputation databases. Datacenter IPs score poorly by default.
TLS fingerprinting: Your scraper's TLS handshake (the cryptographic negotiation before any HTTP data is sent) has a unique fingerprint based on cipher suites, extensions, and ordering. Most HTTP libraries have distinct fingerprints that differ from browsers.
HTTP/2 fingerprinting: Similar to TLS — the way your client implements HTTP/2 framing, stream prioritization, and header compression has a detectable signature.
Browser fingerprinting (canvas, WebGL, font metrics): When a site can execute JavaScript, it can collect dozens of device signals — canvas rendering, WebGL renderer, installed fonts, screen dimensions, battery status. These are checked against expected browser profiles.
Behavioral analysis: Real users scroll, hover, move mouse, and interact with pages in irregular patterns. Scrapers request URLs directly with no interaction signals, at machine-consistent timing, which looks anomalous.
Solutions by Detection Type
For IP reputation blocks: Residential proxies are the primary fix. Real household IPs carry legitimate ISP reputation that bypasses pre-emptive IP scoring.
For TLS fingerprinting: Use a scraping library that mimics browser TLS fingerprints. In Python, `curl_cffi` with Chrome impersonation is the current standard. In Node.js, Playwright or Puppeteer handle this automatically. Avoid bare `requests` or `axios` on fingerprinted targets.
For browser fingerprinting: Use a headless browser (Playwright, Puppeteer) with stealth plugins that patch the common detection vectors — `navigator.webdriver`, canvas noise, consistent viewport sizes.
For behavioral detection: Add randomized delays between requests (not fixed intervals), simulate mouse movements in headless browser contexts, and vary request patterns. Tools like Playwright's human emulation plugins or Undetected ChromeDriver help here.
For CAPTCHA walls you can't avoid: Third-party CAPTCHA solving services (2captcha, CapMonster, NopeCHA) can integrate into your pipeline. For large-scale operations, this adds cost — another reason to prevent blocks before they trigger CAPTCHAs.
---
JavaScript-Rendered Content {#javascript}
One of the most common "my scraper broke" moments: you fetch a URL, parse the response, and your target data just isn't there. The HTML looks right in DevTools but your scraper gets an almost-empty body.
The page is rendering its content via JavaScript after the initial load. Your static scraper is getting the pre-render skeleton, not the populated page.
How to Identify JS-Rendered Content
1. Open the page in Chrome, right-click → View Page Source (not Inspect — View Source)
2. If the content you're trying to scrape isn't in the raw HTML, it's JS-rendered
3. Alternatively, use `curl` to fetch the URL and check if the data is present
Fixes for JS-Rendered Pages
Use a headless browser. Playwright (Python or Node.js) and Puppeteer (Node.js) launch a real Chromium instance, execute JavaScript, and give you the fully-rendered DOM. This handles virtually all JS rendering cases.
```python
# Playwright example - Python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://target-site.com/products")
    page.wait_for_selector(".product-card")  # Wait for content to load
    content = page.content()
    browser.close()
```
Wait for the right element. JS-rendered content loads asynchronously. Use `wait_for_selector()` or `wait_for_load_state("networkidle")` rather than a fixed sleep — this is both faster and more reliable.
Check for API calls instead. Open DevTools → Network tab → XHR/Fetch. Many sites load their data via JSON APIs that your scraper can hit directly — much faster and simpler than rendering the full page. Look for calls returning JSON data after the page loads.
Intercept network requests. Playwright lets you intercept and capture API responses that the page makes internally:
```python
def handle_response(response):
    if "api/products" in response.url:
        print(response.json())

page.on("response", handle_response)
page.goto("https://target-site.com/products")
```
---
Rate Limiting {#rate-limiting}
Rate limiting and IP blocking are often confused, but they behave differently. Rate limiting is typically a per-IP threshold that resets after a time window. IP blocking is a persistent ban that doesn't reset automatically.
Identifying Rate Limiting
- 429 status code with a `Retry-After` header - Responses slow down progressively before failing - Requests succeed for a while, then fail in bursts, then succeed again - The issue resolves if you wait a few minutes (unlike a hard IP ban)
Solutions
Distribute requests across more IPs. With IPCloud's rotating residential proxies, each request can come from a different IP, making per-IP rate limits irrelevant at the infrastructure level.
Add jitter to your request timing. Replace fixed delays with randomized ones. Instead of `sleep(1)`, use `sleep(random.uniform(0.8, 3.5))`. Fixed intervals are a behavioral red flag.
Implement exponential backoff. When you hit a rate limit, don't hammer the endpoint with retries. Back off: wait 2s, then 4s, then 8s. Most rate limits reset within 60 seconds.
```python
import time
import random

import requests

def fetch_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        # `proxies` dict configured as shown in the proxy section below
        response = requests.get(url, proxies=proxies)
        if response.status_code == 429:
            wait = (2 ** attempt) + random.uniform(0, 1)  # exponential backoff plus jitter
            time.sleep(wait)
            continue
        return response
    return None
```
Respect `robots.txt` crawl delay directives. Some sites specify acceptable crawl rates in their robots.txt. Respecting these (when appropriate for your use case) prevents automatic rate limiting from being applied.
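Python's standard library can read that directive directly. A minimal sketch using `urllib.robotparser` on an inline robots.txt body; in a real run you would point it at the live file with `set_url()` and `read()` instead:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; normally fetched from the target site
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Crawl-delay in seconds for our user agent, or None if unspecified
delay = rp.crawl_delay("*")
```

Feed the returned delay into your request scheduler (with jitter on top, as above) instead of hard-coding a sleep.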
---
Session & Cookie Issues {#session}
Some sites require active sessions to serve content. If your scraper doesn't handle cookies correctly, you'll get login redirects, empty content, or inconsistent responses.
Common Session-Related Failures
Login-gated content without proper cookie handling. If you've authenticated in a previous request but your scraper doesn't persist cookies, subsequent requests will land on the login page.
Session tokens that expire mid-run. Long scraping jobs that don't refresh authentication tokens will fail after the session expires.
CSRF tokens on form submissions. If you're scraping content that requires form submission, many sites embed per-session CSRF tokens. Your scraper needs to parse and replay these.
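A minimal sketch of the parse step for such tokens. The field name and the regex (which assumes `value` appears after `name` inside the tag) are illustrative; inspect the actual form in DevTools first:

```python
import re

def extract_csrf(html, field_name="csrf_token"):
    """Pull a hidden CSRF input's value out of a form body.

    A regex sketch: the field name varies per site ("_token",
    "authenticity_token", ...), and attribute order is assumed.
    """
    pattern = (
        r'<input[^>]*name=["\']' + re.escape(field_name)
        + r'["\'][^>]*value=["\']([^"\']+)["\']'
    )
    match = re.search(pattern, html)
    return match.group(1) if match else None
```

The extracted token then goes back in the form data of the subsequent POST, e.g. `session.post(url, data={"csrf_token": token, ...})`.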
Fixes
Use a session object. In Python's `requests` library, use a `Session` object rather than individual `requests.get()` calls. Sessions automatically persist cookies across requests:
```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 ..."})

# Login
session.post("https://site.com/login", data={"user": "...", "pass": "..."})

# All subsequent requests carry the session cookies
response = session.get("https://site.com/protected-page")
```

In Playwright, persist browser context. Save and reload browser storage state to avoid re-authenticating:
```python
# Save state after login
context.storage_state(path="auth.json")

# Reuse state in future runs
context = browser.new_context(storage_state="auth.json")
```

Don't rotate IPs mid-session. If a site ties session validity to the originating IP, rotating IPs mid-session will invalidate your cookies. Use sticky sessions from IPCloud for session-sensitive workflows.
---
Proxy Configuration Problems {#proxy-config}
Proxy misconfiguration is responsible for a surprising number of "my scraper is broken" reports. The scraper code is fine — the proxy setup is wrong.
Most Common Proxy Config Issues
Wrong proxy format. The proxy string format must match exactly what your library expects. The standard format is:
```
http://username:password@proxy-host:port
```
A missing `http://` prefix, incorrect port, or malformed credentials will silently fail in some libraries.
HTTP proxy used for HTTPS requests. If your target is HTTPS, you need to configure the `https` proxy key, not just `http`:
```python
proxies = {
    "http": "http://user:pass@host:port",
    "https": "http://user:pass@host:port",  # ← Don't forget this
}
response = requests.get("https://target.com", proxies=proxies)
```
IP whitelist not updated. If you're using IP-based authentication (rather than username/password), your proxy provider whitelist must include your current outbound IP. If your own IP changed (common on residential ISPs), authentication fails silently.
Rotation endpoint not actually rotating. Some proxy setups use a gateway that requires specific session parameters to trigger rotation. If you're hitting the same IP on every request, check whether your endpoint includes the rotation flag (e.g., `session=random` or similar, depending on your provider).
SSL certificate verification conflicts. When routing through a proxy, SSL certificate chains can behave differently. Don't disable SSL verification as a fix — instead, configure your proxy correctly or add the provider's certificate to your trust store.
Testing your proxy configuration:
```python
import requests

proxies = {
    "http": "http://user:pass@proxy.ipcloud.io:port",
    "https": "http://user:pass@proxy.ipcloud.io:port",
}

# This endpoint returns your exit IP
r = requests.get("https://api.ipify.org?format=json", proxies=proxies)
print(r.json())  # Should show proxy exit IP, not your real IP
```

Run this before your main scrape to confirm the proxy is working and routing through the expected IP.
---
Site Structure Changes {#site-changes}
If your scraper was working perfectly and then broke with no changes on your end, the most likely culprit is a site update. Modern web applications deploy continuously, and a single CSS class rename or DOM restructure will break every CSS selector in your scraper.
Diagnosing Structure Changes
1. Open the target URL in a browser 2. Inspect the element you're trying to scrape 3. Compare the current selector/class to what your code expects 4. Check if the data has moved to a different location in the DOM
Building More Resilient Scrapers
Target semantic attributes over styling classes. Classes like `product-card__title--v2` change frequently. Prefer `data-*` attributes, `id` attributes, ARIA labels, and structural patterns (e.g., the first heading inside a product container) that are less likely to change with design updates.
Use multiple fallback selectors. If your primary selector fails, try alternates before throwing an error:
```python
def get_price(soup):
    selectors = [
        "span.price-current",
        "[data-testid='product-price']",
        ".product__price span",
    ]
    for selector in selectors:
        element = soup.select_one(selector)
        if element:
            return element.get_text(strip=True)
    return None
```
Monitor for structure changes. Run a lightweight check that alerts you when the expected selector returns `None` rather than silently collecting empty data for hours.
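One way to implement that check is a post-run health function over the records your scraper produced. The field names and the 20% threshold here are illustrative assumptions to tune for your pipeline:

```python
def check_run_health(records, required_fields=("title", "price")):
    """Flag a run where a required field came back empty too often.

    `records` is the list of dicts a scrape produced; a spike in missing
    values for one field usually means its selector broke.
    """
    if not records:
        return ["no records scraped at all"]
    alerts = []
    for field in required_fields:
        missing = sum(1 for r in records if not r.get(field))
        if missing / len(records) > 0.2:  # >20% empty: selector likely broke
            alerts.append(f"{field}: {missing}/{len(records)} missing")
    return alerts
```

Wire the returned alerts into whatever notification channel you already use, so a broken selector pages you instead of quietly filling your dataset with `None`.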
Check the site's API first. Before scraping HTML, check whether the site exposes a public API or loads its data from an internal JSON endpoint. API responses are far more stable than HTML structure.
---
Headers and Fingerprinting {#headers}
Sending requests with missing or incorrect HTTP headers is the fastest way to get blocked. Default headers from `requests`, `axios`, and similar libraries are well-known to bot detection systems and often trigger immediate blocks.
Critical Headers to Set
User-Agent: The most checked header. Use a realistic, current browser User-Agent:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36
```
Don't use outdated UA strings. Chrome's major version updates frequently — an old UA is a detection signal.
Accept-Language: Real browsers send this. `en-US,en;q=0.9` is a safe default for English-language targets.
Accept-Encoding: `gzip, deflate, br` — match what Chrome sends.
Referer: For multi-step scraping, include a realistic referrer. If you're scraping a product page, the referer might be the category page. A direct request to a deep URL with no referrer looks robotic.
Accept: `text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8`
A minimal working header set:
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}
```
For high-protection targets, use `curl_cffi` in Python to impersonate a real browser's TLS fingerprint — header matching alone isn't enough if TLS is fingerprinted.
---
Geo-Blocking {#geo-blocking}
Some sites serve different content, redirect to local domains, or block access entirely based on the geographic location of the requesting IP. This isn't always obvious — you might receive a response, but it's the wrong regional version of the page, or content is missing that only appears in specific markets.
Diagnosing Geo-Blocks
- The page loads but redirects to a different URL (e.g., `site.com` → `site.com/en-gb/`)
- Content you expect to find is absent (pricing, products, features)
- You receive a "not available in your region" message
- A VPN or manual test from the target country succeeds
Fix: Geo-Targeted Proxies
IPCloud's residential proxies support country, state, city, and ISP-level targeting across 195+ countries. Set your exit country to match your target's expected audience:
```python
# IPCloud geo-targeted endpoint format
proxy = "http://user-country-US:password@residential.ipcloud.io:port"

# City-level targeting
proxy = "http://user-country-US-city-NewYork:password@residential.ipcloud.io:port"
```
For price comparison across regions, run parallel scrapers with proxies set to each target country — you'll get the locally-served pricing rather than a redirected or blocked response.
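Building on the endpoint format above, a small helper can generate one geo-targeted proxy string per market. The host, port, and `user-country-XX[-city-Name]` scheme mirror the examples in this guide; verify them against your provider's actual documentation:

```python
def geo_proxy(username, password, country, city=None,
              host="residential.ipcloud.io", port=8000):
    """Build a geo-targeted proxy URL.

    The parameter scheme follows the endpoint format shown in this guide;
    host and port here are placeholder assumptions.
    """
    user = f"{username}-country-{country}"
    if city:
        user += f"-city-{city}"
    return f"http://{user}:{password}@{host}:{port}"

# One proxy per target market for parallel regional scrapes
proxies_by_region = {c: geo_proxy("user", "pass", c) for c in ("US", "GB", "DE")}
```

Each regional worker then uses its own entry from `proxies_by_region`, so every market sees a locally plausible exit IP.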
---
Residential vs Datacenter: When Your Proxy Type Is the Problem {#proxy-type}
A significant proportion of scraping failures aren't code problems at all — they're proxy type mismatches. If you're using datacenter proxies on a target that actively blocks them, no amount of header tweaking or timing adjustment will fix the underlying issue.
Signs Your Proxy Type Is Wrong
- You get blocked within the first 1–3 requests on every new IP
- The site shows Cloudflare challenge pages or DataDome bot screens immediately
- You can access the site fine in a browser but never through your proxy
- Switching to a fresh datacenter IP only delays the block by seconds
All of these point to ASN-level blocking — the site is refusing your entire IP range, not just individual addresses.
The Fix
Switch to residential proxies. Residential IPs from real household connections carry ISP reputation signals that bypass ASN-level blocks. IPCloud's residential network provides IPs from legitimate consumer ISPs in 195+ countries — these IPs are indistinguishable from real user traffic at the network level.
For the hardest targets (luxury retail, ticket platforms, financial data), mobile proxies (real 4G/5G device IPs) provide the highest trust level available.
---
Tools to Diagnose Scraping Issues {#tools}
For IP and proxy testing:
- `https://api.ipify.org?format=json` — confirms your current exit IP
- `https://ipinfo.io/json` — shows IP, org, city, and ASN
- MXToolbox Blacklist Check — checks if your IP is on major blacklists
- IPQualityScore — proxy/VPN/bot score for any IP
For request inspection:
- `https://httpbin.org/get` — echoes back your headers, IP, and request details
- Chrome DevTools → Network tab — see exactly what a real browser sends
- Wireshark — deep packet inspection if you suspect TLS issues
For scraper testing:
- `curl -v` — verbose raw HTTP request from the command line, great for isolating header issues
- Postman or Insomnia — GUI HTTP client for testing requests interactively
- reqbin.com — online request tester with header control
For monitoring and alerting:
- Sentry or custom logging — catch when selectors return `None` unexpectedly
- Uptime monitoring on your scraping jobs — get alerted to failure immediately rather than hours later
---
Prevention: Building Scrapers That Don't Break {#prevention}
Fixing scraping failures reactively is expensive. The better approach is building scrapers that are resilient from the start.
Use a proxy pool from day one. Even if your current target is permissive, building your scraper around a proxy provider like IPCloud means you can instantly upgrade proxy type and scale when you hit protected targets.
Implement automatic retry with IP rotation. Every scraper should have a retry handler that detects block signals (403, 429, empty body on non-empty pages) and automatically rotates to a fresh IP before retrying.
Log everything. Track status codes, response sizes, and selector success rates per run. Anomaly detection on these metrics will surface problems before they silently corrupt your dataset.
Use separate proxy sessions per target domain. If you're scraping multiple sites from the same proxy pool, configure domain-specific session handling. A burned IP on one target shouldn't affect operations on another.
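One minimal way to sketch that isolation is a per-domain session registry, where `make_session` is a caller-supplied factory (for instance, one `requests.Session` plus its own proxy allocation per domain):

```python
from urllib.parse import urlsplit

class DomainSessions:
    """Keep a separate session slot per target domain, so a burned IP
    on one site never bleeds into requests against another.

    Sketch only: `make_session(domain)` is a placeholder factory you
    supply, returning whatever session object your stack uses.
    """

    def __init__(self, make_session):
        self._make = make_session
        self._pool = {}

    def for_url(self, url):
        domain = urlsplit(url).netloc
        if domain not in self._pool:
            self._pool[domain] = self._make(domain)
        return self._pool[domain]
```

Usage is one lookup per request: `sessions.for_url(url)` always returns the same session for the same domain and a fresh one for a new domain.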
Version your scrapers. When a target site changes its structure, you need to know which version of your scraper was in production at the time. Git tagging your scraper releases makes diagnosis much faster.
Test in production conditions. Always validate your scraper with your actual proxy infrastructure before scaling. A scraper that works on your local IP may behave differently when routing through a proxy due to header differences, SSL behavior, or geographic targeting.
---
FAQ {#faq}
Why does my scraper work in a browser but not in code?
Browsers send dozens of headers, maintain cookies, execute JavaScript, and have a consistent TLS fingerprint that anti-bot systems recognize as legitimate. Your script sends only the headers you've explicitly set, has no JS execution, and uses a different TLS stack. Use `httpbin.org/get` to compare what your script sends vs. what a browser sends and bridge the gap.
Why do I keep getting CAPTCHAs even with residential proxies?
Residential IPs reduce the probability of CAPTCHA challenges dramatically, but they don't eliminate them. If you're still hitting CAPTCHAs with residential proxies, the issue is likely behavioral — your request patterns are too robotic (fixed timing, no interaction signals, no resource loading). Switch to a headless browser with human emulation or integrate a CAPTCHA solving service.
How many concurrent threads can I run before getting blocked?
This varies completely by target. Some sites tolerate hundreds of concurrent requests from different IPs; others block after 5 requests per minute per IP. Start conservatively at 1–2 requests per IP per minute and scale up while monitoring block rates. With IPCloud's rotating residential pool, you can maintain high throughput while keeping per-IP request rates low.
My proxies were working fine yesterday and now nothing works. What happened?
Most likely causes: the site deployed a bot detection update, your IP pool was burned by concurrent users on shared proxies (switch to dedicated or residential), or the site's HTML structure changed. Run the diagnostic checklist at the top of this guide to isolate which category your failure falls into.
Can I use the same proxy pool for scraping multiple different sites?
Yes, but configure domain-level session management. Use IPCloud's rotating proxies to ensure that IPs burned on one target don't poison requests to other targets. For high-volume multi-target operations, consider separate proxy allocations per domain.
Is web scraping legal?
Scraping publicly accessible data is generally legal in most jurisdictions based on established case law (hiQ v. LinkedIn, Ryanair v. PR Aviation). However, legality depends on data type, jurisdiction, and how data is used. Always comply with a site's terms of service and avoid scraping personal data without a legal basis. IPCloud's infrastructure is designed for legitimate data collection use cases.
What's the most reliable proxy type for high-protection e-commerce sites?
Rotating residential proxies with sticky sessions for multi-step flows. For the absolute hardest targets (sneaker drops, ticket releases), mobile proxies provide the highest trust level as they route through real 4G/5G device IPs. IPCloud offers both options — start with 100MB free to test on your target before committing.
---
The Fix Is Usually Simpler Than You Think
Most web scraping failures come down to three things: your IP is blocked, your headers look robotic, or the site changed its structure. Work through the checklist at the top of this guide and you'll isolate the cause in minutes rather than hours.
For the majority of persistent blocking issues, the fix is switching proxy type or improving your rotation strategy. IPCloud's residential proxies eliminate the ASN-level blocking that takes down most datacenter-based scrapers, and the 2-minute dashboard setup means you can be running on fresh residential IPs before your next test run.
Get 100MB of Residential Proxies Free — No Credit Card
---
IPCloud provides enterprise-grade proxy infrastructure for web scraping, data extraction, ad verification, and automation. 10M+ residential IPs · Python, JavaScript & Node.js APIs.
Use multiple fallback selectors. If your primary selector fails, try alternates before throwing an error:
```python def get_price(soup): selectors = [ "span.price-current", "[data-testid='product-price']", ".product__price span" ] for selector in selectors: element = soup.select_one(selector) if element: return element.get_text(strip=True) return None ```
Monitor for structure changes. Run a lightweight check that alerts you when the expected selector returns `None` rather than silently collecting empty data for hours.
Check the site's API first. Before scraping HTML, check whether the site exposes a public API or loads its data from an internal JSON endpoint. API responses are far more stable than HTML structure.
---
Headers and Fingerprinting {#headers}
Sending requests with missing or incorrect HTTP headers is the fastest way to get blocked. Default headers from `requests`, `axios`, and similar libraries are well-known to bot detection systems and often trigger immediate blocks.
Critical Headers to Set
User-Agent: The most checked header. Use a realistic, current browser User-Agent:
``` Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 ```
Don't use outdated UA strings. Chrome's major version updates frequently — an old UA is a detection signal.
Accept-Language: Real browsers send this. `en-US,en;q=0.9` is a safe default for English-language targets.
Accept-Encoding: `gzip, deflate, br` — match what Chrome sends.
Referer: For multi-step scraping, include a realistic referrer. If you're scraping a product page, the referer might be the category page. A direct request to a deep URL with no referrer looks robotic.
Accept: `text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,/;q=0.8`
A minimal working header set:
```python headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,/;q=0.8", "Accept-Language": "en-US,en;q=0.9", "Accept-Encoding": "gzip, deflate, br", "Connection": "keep-alive", "Upgrade-Insecure-Requests": "1", } ```
For high-protection targets, use `curl_cffi` in Python to impersonate a real browser's TLS fingerprint — header matching alone isn't enough if TLS is fingerprinted.
---
Geo-Blocking {#geo-blocking}
Some sites serve different content, redirect to local domains, or block access entirely based on the geographic location of the requesting IP. This isn't always obvious — you might receive a response, but it's the wrong regional version of the page, or content is missing that only appears in specific markets.
Diagnosing Geo-Blocks
- The page loads but redirects to a different URL (e.g., `site.com` → `site.com/en-gb/`) - Content you expect to find is absent (pricing, products, features) - You receive a "not available in your region" message - A VPN or manual test from the target country succeeds
Fix: Geo-Targeted Proxies
IPCloud's residential proxies support country, state, city, and ISP-level targeting across 195+ countries. Set your exit country to match your target's expected audience:
```python
# IPCloud geo-targeted endpoint format (country-level)
proxy = "http://user-country-US:password@residential.ipcloud.io:port"

# City-level targeting
proxy = "http://user-country-US-city-NewYork:password@residential.ipcloud.io:port"
```

For price comparison across regions, run parallel scrapers with proxies set to each target country — you'll get the locally-served pricing rather than a redirected or blocked response.
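As a minimal sketch of that parallel approach using `requests` — the credentials, port, and country list below are placeholders, not real IPCloud account values:

```python
import requests

COUNTRIES = ["US", "GB", "DE"]  # target markets to compare

def proxy_for(country: str) -> dict:
    # Placeholder credentials and port; the URL follows the
    # user-country-XX endpoint format shown above.
    url = f"http://user-country-{country}:password@residential.ipcloud.io:8000"
    return {"http": url, "https": url}

def fetch_regional(target_url: str) -> dict:
    """Fetch the same URL through an exit IP in each target country."""
    results = {}
    for country in COUNTRIES:
        try:
            resp = requests.get(target_url, proxies=proxy_for(country), timeout=15)
            results[country] = resp.text
        except requests.RequestException:
            results[country] = None  # record the failure, keep going
    return results
```

Each country's response can then be parsed independently, so a geo-block in one market doesn't stall the others.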
---
Residential vs Datacenter: When Your Proxy Type Is the Problem {#proxy-type}
A significant proportion of scraping failures aren't code problems at all — they're proxy type mismatches. If you're using datacenter proxies on a target that actively blocks them, no amount of header tweaking or timing adjustment will fix the underlying issue.
Signs Your Proxy Type Is Wrong
- You get blocked within the first 1–3 requests on every new IP
- The site shows Cloudflare challenge pages or DataDome bot screens immediately
- You can access the site fine in a browser but never through your proxy
- Switching to a fresh datacenter IP only delays the block by seconds
All of these point to ASN-level blocking — the site is refusing your entire IP range, not just individual addresses.
The Fix
Switch to residential proxies. Residential IPs from real household connections carry ISP reputation signals that bypass ASN-level blocks. IPCloud's residential network provides IPs from legitimate consumer ISPs in 195+ countries — these IPs are indistinguishable from real user traffic at the network level.
For the hardest targets (luxury retail, ticket platforms, financial data), mobile proxies (real 4G/5G device IPs) provide the highest trust level available.
---
Tools to Diagnose Scraping Issues {#tools}
For IP and proxy testing:
- `https://api.ipify.org?format=json` — confirms your current exit IP
- `https://ipinfo.io/json` — shows IP, org, city, and ASN
- MXToolbox Blacklist Check — checks if your IP is on major blacklists
- IPQualityScore — proxy/VPN/bot score for any IP

For request inspection:
- `https://httpbin.org/get` — echoes back your headers, IP, and request details
- Chrome DevTools → Network tab — see exactly what a real browser sends
- Wireshark — deep packet inspection if you suspect TLS issues

For scraper testing:
- `curl -v` — verbose raw HTTP request from the command line, great for isolating header issues
- Postman or Insomnia — GUI HTTP clients for testing requests interactively
- reqbin.com — online request tester with header control

For monitoring and alerting:
- Sentry or custom logging — catch when selectors return `None` unexpectedly
- Uptime monitoring on your scraping jobs — get alerted to failures immediately rather than hours later
---
Prevention: Building Scrapers That Don't Break {#prevention}
Fixing scraping failures reactively is expensive. The better approach is building scrapers that are resilient from the start.
Use a proxy pool from day one. Even if your current target is permissive, building your scraper around a proxy provider like IPCloud means you can instantly upgrade proxy type and scale when you hit protected targets.
Implement automatic retry with IP rotation. Every scraper should have a retry handler that detects block signals (403, 429, empty body on non-empty pages) and automatically rotates to a fresh IP before retrying.
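A sketch of such a handler, assuming a rotating-endpoint proxy URL (where each new connection gets a fresh exit IP) and the `requests` library:

```python
import time
import requests

BLOCK_SIGNALS = {403, 429}  # status codes that usually mean "blocked"

def fetch_with_rotation(url: str, rotating_proxy: str, max_retries: int = 4):
    """Retry on block signals; each retry reconnects, which a rotating
    endpoint turns into a fresh exit IP."""
    proxies = {"http": rotating_proxy, "https": rotating_proxy}
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, proxies=proxies, timeout=15)
        except requests.RequestException:
            resp = None
        # Treat an empty body on a page that should have content as a block too.
        if resp is not None and resp.status_code not in BLOCK_SIGNALS and resp.text.strip():
            return resp
        time.sleep(2 ** attempt)  # back off before retrying on a new IP
    return None
```

The exponential backoff matters as much as the rotation: retrying instantly on the same burned fingerprint just accelerates the block.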
Log everything. Track status codes, response sizes, and selector success rates per run. Anomaly detection on these metrics will surface problems before they silently corrupt your dataset.
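A minimal sketch of per-run metric tracking — the 10% and 20% thresholds here are illustrative starting points, not universal values:

```python
from collections import Counter

class RunStats:
    """Track status codes and selector hits so block patterns surface early."""

    def __init__(self):
        self.status = Counter()
        self.selector_misses = 0
        self.total = 0

    def record(self, status_code: int, selector_found: bool):
        self.total += 1
        self.status[status_code] += 1
        if not selector_found:
            self.selector_misses += 1

    def anomalies(self) -> list[str]:
        out = []
        if self.total and self.status[403] / self.total > 0.1:
            out.append("403 rate above 10%: likely IP blocking")
        if self.total and self.selector_misses / self.total > 0.2:
            out.append("selector miss rate above 20%: site structure may have changed")
        return out
```

Feed `anomalies()` into whatever alerting channel you already use, so a silent selector break pages someone instead of corrupting a week of data.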
Use separate proxy sessions per target domain. If you're scraping multiple sites from the same proxy pool, configure domain-specific session handling. A burned IP on one target shouldn't affect operations on another.
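One way to sketch this is a pool of `requests.Session` objects keyed by domain — note the sticky-session URL format below is illustrative, not IPCloud's documented syntax:

```python
import requests

class DomainSessions:
    """Keep one proxy session per target domain so an IP burned on one
    site never carries over to another."""

    def __init__(self, proxy_template: str):
        # e.g. "http://user-session-{name}:password@residential.ipcloud.io:8000"
        # (hypothetical format; use your provider's real session syntax)
        self.template = proxy_template
        self._sessions: dict[str, requests.Session] = {}

    def for_domain(self, domain: str) -> requests.Session:
        if domain not in self._sessions:
            session = requests.Session()
            proxy = self.template.format(name=domain.replace(".", "-"))
            session.proxies = {"http": proxy, "https": proxy}
            self._sessions[domain] = session
        return self._sessions[domain]
```

Each session also keeps its own cookie jar, which is exactly the isolation you want between targets.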
Version your scrapers. When a target site changes its structure, you need to know which version of your scraper was in production at the time. Git tagging your scraper releases makes diagnosis much faster.
Test in production conditions. Always validate your scraper with your actual proxy infrastructure before scaling. A scraper that works on your local IP may behave differently when routing through a proxy due to header differences, SSL behavior, or geographic targeting.
---
FAQ {#faq}
Why does my scraper work in a browser but not in code?
Browsers send dozens of headers, maintain cookies, execute JavaScript, and have a consistent TLS fingerprint that anti-bot systems recognize as legitimate. Your script sends only the headers you've explicitly set, has no JS execution, and uses a different TLS stack. Use `httpbin.org/get` to compare what your script sends vs. what a browser sends and bridge the gap.
Why do I keep getting CAPTCHAs even with residential proxies?
Residential IPs reduce the probability of CAPTCHA challenges dramatically, but they don't eliminate them. If you're still hitting CAPTCHAs with residential proxies, the issue is likely behavioral — your request patterns are too robotic (fixed timing, no interaction signals, no resource loading). Switch to a headless browser with human emulation or integrate a CAPTCHA solving service.
How many concurrent threads can I run before getting blocked?
This varies completely by target. Some sites tolerate hundreds of concurrent requests from different IPs; others block after 5 requests per minute per IP. Start conservatively at 1–2 requests per IP per minute and scale up while monitoring block rates. With IPCloud's rotating residential pool, you can maintain high throughput while keeping per-IP request rates low.
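One conservative way to enforce that per-IP budget across worker threads is a shared sliding-window limiter — a sketch, with `rate` and `window` as the tunables to scale up:

```python
import threading
import time

class RateLimiter:
    """Allow at most `rate` acquisitions per `window` seconds across threads."""

    def __init__(self, rate: int, window: float):
        self.rate = rate
        self.window = window
        self._lock = threading.Lock()
        self._stamps: list[float] = []

    def acquire(self):
        while True:
            with self._lock:
                now = time.monotonic()
                # Drop timestamps that have aged out of the window.
                self._stamps = [t for t in self._stamps if now - t < self.window]
                if len(self._stamps) < self.rate:
                    self._stamps.append(now)
                    return
            time.sleep(0.05)  # wait for the window to free a slot
```

Call `limiter.acquire()` before each request to a given IP; starting at roughly `RateLimiter(2, 60.0)` matches the 1–2 requests per IP per minute baseline above.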
My proxies were working fine yesterday and now nothing works. What happened?
Most likely causes: the site deployed a bot detection update, your IP pool was burned by concurrent users on shared proxies (switch to dedicated or residential), or the site's HTML structure changed. Run the diagnostic checklist at the top of this guide to isolate which category your failure falls into.
Can I use the same proxy pool for scraping multiple different sites?
Yes, but configure domain-level session management. Use IPCloud's rotating proxies to ensure that IPs burned on one target don't poison requests to other targets. For high-volume multi-target operations, consider separate proxy allocations per domain.
Is web scraping legal?
Scraping publicly accessible data is generally legal in most jurisdictions based on established case law (hiQ v. LinkedIn, Ryanair v. PR Aviation). However, legality depends on data type, jurisdiction, and how data is used. Always comply with a site's terms of service and avoid scraping personal data without a legal basis. IPCloud's infrastructure is designed for legitimate data collection use cases.
What's the most reliable proxy type for high-protection e-commerce sites?
Rotating residential proxies with sticky sessions for multi-step flows. For the absolute hardest targets (sneaker drops, ticket releases), mobile proxies provide the highest trust level as they route through real 4G/5G device IPs. IPCloud offers both options — start with 100MB free to test on your target before committing.
---
The Fix Is Usually Simpler Than You Think
Most web scraping failures come down to three things: your IP is blocked, your headers look robotic, or the site changed its structure. Work through the checklist at the top of this guide and you'll isolate the cause in minutes rather than hours.
For the majority of persistent blocking issues, the fix is switching proxy type or improving your rotation strategy. IPCloud's residential proxies eliminate the ASN-level blocking that takes down most datacenter-based scrapers, and the 2-minute dashboard setup means you can be running on fresh residential IPs before your next test run.
Get 100MB of Residential Proxies Free — No Credit Card
---
IPCloud provides enterprise-grade proxy infrastructure for web scraping, data extraction, ad verification, and automation. 10M+ residential IPs. Python, JavaScript & Node.js APIs.