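A Practical Guide to Scraping Dynamic Data
Understanding How Dynamic Data Is Loaded
Modern websites make heavy use of JavaScript and AJAX to load content dynamically, so a traditional crawler that only downloads the initial HTML cannot see this data. Dynamic content is usually fetched through asynchronous requests made with XHR (XMLHttpRequest) or the Fetch API, and is rendered into the page once the server responds. Understanding this mechanism is a prerequisite for scraping it.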
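To make the mechanism concrete, here is a small sketch that parses a made-up initial HTML response with Python's standard html.parser. The markup, the data-item class, and the /api/data path are assumptions for illustration only. The list container arrives empty, because the items would only be inserted by the page's script after its request completes.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML, as a plain HTTP client would receive it.
# The list is empty: the page's script fills it in after calling /api/data.
INITIAL_HTML = """
<html><body>
  <ul id="list"></ul>
  <script>
    fetch('/api/data').then(r => r.json()).then(render);
  </script>
</body></html>
"""


class ItemCounter(HTMLParser):
    """Counts start tags whose class attribute contains 'data-item'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if "data-item" in classes:
            self.count += 1


def count_items(html: str) -> int:
    """Return how many 'data-item' elements are present in the given HTML."""
    parser = ItemCounter()
    parser.feed(html)
    return parser.count


if __name__ == "__main__":
    # The static HTML contains no items; they only exist after the script runs.
    print(count_items(INITIAL_HTML))
```

A crawler that stops at this response finds nothing to extract, which is why the following sections look at the requests the script makes.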
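Analyzing Network Requests
Open the browser developer tools (F12) and use the Network panel to monitor XHR/Fetch requests. Filter for AJAX requests and inspect each one's request URL, parameters, and response format (JSON or HTML). In the Headers tab, pay particular attention to the Request Method (GET or POST), the Query String Parameters, and the Request Headers.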
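Once a request URL has been copied from the Network panel, it helps to split it into the endpoint and its query parameters so that each can be varied separately. A minimal sketch using only the standard library follows. The function name split_captured_url and the example URL are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlsplit


def split_captured_url(url: str):
    """Split a URL copied from the Network panel into (endpoint, params)."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # Note: dict() keeps only the last value if a key is repeated in the query.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    return endpoint, params


if __name__ == "__main__":
    endpoint, params = split_captured_url(
        "https://api.example.com/data?page=1&size=20"
    )
    print(endpoint)  # https://api.example.com/data
    print(params)    # {'page': '1', 'size': '20'}
```

The resulting endpoint and params can then be passed to an HTTP client, with the Request Method and Request Headers taken from the same panel.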
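Calling the API Directly
If the site has a decoupled front end and back end, you can often call the data endpoint directly by imitating the AJAX request. Use Python's requests library and copy the headers and parameters of the original request: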
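import requests

# Headers copied from the request seen in the Network panel
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest'
}
params = {'page': 1, 'size': 20}
# A timeout keeps the script from hanging if the server never answers
response = requests.get('https://api.example.com/data', headers=headers, params=params, timeout=10)
response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
print(response.json())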
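Endpoints of this kind usually return one page at a time. Below is a minimal sketch of walking through the pages. It assumes, hypothetically, that the endpoint accepts page and size and returns a JSON list that is shorter than size on the last page. Real APIs may signal the end differently, for example with a total count or a next-page token. fetch_page is a hypothetical callable that wraps the actual request, so the loop can be tried without a network.

```python
from typing import Callable, Iterator, List


def iter_items(fetch_page: Callable[[int, int], List[dict]],
               size: int = 20,
               max_pages: int = 100) -> Iterator[dict]:
    """Yield items page by page until a short (or empty) page is returned."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page, size)
        yield from items
        if len(items) < size:
            # Assumption: a page with fewer than `size` items is the last one.
            break


def fake_fetch(page: int, size: int) -> List[dict]:
    """Stand-in for a real request: serves 45 made-up records in pages."""
    data = [{"id": i} for i in range(45)]
    start = (page - 1) * size
    return data[start:start + size]


if __name__ == "__main__":
    print(len(list(iter_items(fake_fetch, size=20))))  # 45
```

With requests, fetch_page could be a small function that calls requests.get with params={'page': page, 'size': size} and returns response.json(). The max_pages cap is a safety net against an endpoint that never returns a short page.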
Using a Headless Browser
For pages with complex rendering, tools such as Selenium or Playwright can drive a real browser:
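from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument('--headless')
driver = webdriver.Chrome(options=opts)
try:
    driver.get('https://example.com')
    # execute_async_script waits until the callback (its last argument) is called
    ajax_content = driver.execute_async_script(
        "const done = arguments[arguments.length - 1];"
        "fetch('/api/data').then(r => r.json()).then(done);"
    )
    print(ajax_content)
finally:
    driver.quit()  # always release the browser process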
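Dealing with Anti-Scraping Measures
Dynamic sites often deploy anti-scraping measures. Common ways to cope with them:
- Add random delays to avoid high-frequency requests
- Maintain cookies and sessions
- Handle CAPTCHAs (this requires a third-party service)
- Rotate the User-Agent and proxy IPs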
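import random
import time
import requests
url = 'https://example.com/api/data'  # placeholder URL, replace with the real target
time.sleep(random.uniform(1, 3))  # pause 1 to 3 seconds between requests
# Map both schemes; with only 'http', HTTPS requests would bypass the proxy
proxy = 'http://proxy.example.com:8080'
proxies = {'http': proxy, 'https': proxy}
requests.get(url, proxies=proxies, timeout=10)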
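The rotation and delay items from the list above can be combined in a small helper. The sketch below uses only the standard library and builds keyword arguments that could be passed on to requests. The User-Agent strings, the proxy hostnames, and the names next_request_kwargs and polite_delay are all placeholders chosen for this illustration.

```python
import itertools
import random

# Placeholder values; substitute real User-Agent strings and proxy addresses.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

# Proxies are used in round-robin order.
_proxy_cycle = itertools.cycle(PROXIES)


def next_request_kwargs(rng=random):
    """Return headers/proxies/timeout for the next request."""
    proxy = next(_proxy_cycle)
    return {
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


def polite_delay(rng=random, low=1.0, high=3.0):
    """Return a random pause in seconds, to be passed to time.sleep()."""
    return rng.uniform(low, high)


if __name__ == "__main__":
    print(next_request_kwargs())
    print(polite_delay())
```

To keep cookies between calls as well, the same arguments can be used with a requests.Session, for example session.get(url, **next_request_kwargs()), since a Session stores cookies it receives and sends them on later requests.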
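Parsing and Storing Data
Choose a parsing method based on the response format:
- JSON data: parse it directly with json.loads()
- HTML fragments: parse them with BeautifulSoup
- Binary data (such as images): needs special handling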
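from bs4 import BeautifulSoup
# ajax_html: the HTML fragment returned by the AJAX request (for example, response.text)
soup = BeautifulSoup(ajax_html, 'html.parser')
items = soup.select('.data-item')  # every element with class "data-item"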
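One way to choose between these paths is to look at the response's Content-Type header. The sketch below shows such a dispatch using only the standard library. The function name parse_body and the assumption that text bodies are UTF-8 encoded are illustrative; a real response may declare a different charset.

```python
import json


def parse_body(content_type: str, body: bytes):
    """Return (kind, value) chosen from the Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        # Assumption: JSON bodies are UTF-8 encoded.
        return "json", json.loads(body.decode("utf-8"))
    if media_type in ("text/html", "application/xhtml+xml"):
        # Hand the decoded text to an HTML parser such as BeautifulSoup.
        return "html", body.decode("utf-8", errors="replace")
    # Anything else (images, archives, ...) is kept as raw bytes.
    return "binary", body


if __name__ == "__main__":
    print(parse_body("application/json; charset=utf-8", b'{"page": 1}'))
    # ('json', {'page': 1})
```

With requests, the inputs would be response.headers.get('Content-Type', '') and response.content. Raw bytes returned for the binary case can be saved by writing them to a file opened in 'wb' mode.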
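Keeping the Scraper Maintainable
Check regularly for API changes and build in exception handling. Log failed requests and retry them automatically: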
import logging
import requests
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))  # give up after three attempts
def safe_request(url):
    try:
        return requests.get(url, timeout=10)
    except Exception as e:
        logging.error(f"Request failed: {e}")
        raise  # re-raise so tenacity sees the failure and retries
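As configured, the decorator above only caps the number of attempts. For readers who want to see the retry logic spelled out, or to add a growing pause between attempts, here is a standard-library sketch with exponential backoff. It is not how tenacity is implemented. The names retry_call and base_delay, and the injectable sleep parameter, are choices made for this illustration.

```python
import logging
import time


def retry_call(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            logging.error("Attempt %d/%d failed: %s", attempt + 1, attempts, exc)
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    waits = []
    print(retry_call(flaky, sleep=waits.append))  # ok
    print(waits)  # [1.0, 2.0]
```

Passing sleep as a parameter lets the backoff schedule be checked without actually waiting. In a scraper, func would typically be a lambda that wraps the request, such as lambda: requests.get(url, timeout=10).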
