Scraping the Douban TOP250 Movie List with Python: A Complete Guide
Scraping the Douban TOP250 movie list is a classic exercise for learning Python web scraping. By analyzing the page structure, sending HTTP requests, and parsing the HTML, you can extract each film's title, rating, director, cast, and other details.
Environment setup
You need the requests library to send HTTP requests and BeautifulSoup (or lxml) to parse the HTML. Install them with:
pip install requests beautifulsoup4
Analyzing the page structure
Open the Douban TOP250 page (https://movie.douban.com/top250) and inspect the page source. Each movie's information sits inside a <div class="item"> element; the title is in <span class="title"> and the rating in <span class="rating_num">.
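To see these selectors in action, here is a minimal sketch that parses a hypothetical HTML fragment. The fragment below is invented for illustration and only mirrors the tag/class structure described above; it is not copied from the live Douban page:

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the described structure (not real Douban markup).
sample_html = """
<div class="item">
  <em>1</em>
  <span class="title">The Shawshank Redemption</span>
  <span class="rating_num">9.7</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
item = soup.find("div", class_="item")
title = item.find("span", class_="title").text
rating = item.find("span", class_="rating_num").text
print(title, rating)  # The Shawshank Redemption 9.7
```

The same find/find_all calls work unchanged on the real page once you fetch it with requests.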
Sending the HTTP request
Use the requests library to send a GET request for the page. Set a User-Agent header to mimic a browser; otherwise the request is likely to be blocked by Douban's anti-scraping measures.
import requests
from bs4 import BeautifulSoup
url = 'https://movie.douban.com/top250'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)
Parsing the HTML
Use BeautifulSoup to parse the returned HTML and extract the fields you need.
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('div', class_='item')
for movie in movies:
    title = movie.find('span', class_='title').text
    rating = movie.find('span', class_='rating_num').text
    print(f'Title: {title}, Rating: {rating}')
Handling pagination
The TOP250 list spans several pages, so you need to loop over them. Inspecting the URLs shows that pagination is controlled by the start query parameter (0, 25, 50, ...).
for start in range(0, 250, 25):
    url = f'https://movie.douban.com/top250?start={start}'
    response = requests.get(url, headers=headers)
    # parse as shown above
Storing the data
Save the scraped data to a CSV file or a database. The csv module handles simple file storage.
import csv
with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Rating', 'Director', 'Cast'])
    for movie in movies:
        # extract the fields and write a row here
        pass
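On the Douban page, each item's <div class="bd"> paragraph ends with a line like "1994 / 美国 / 犯罪 剧情" (year / region / genres). A small helper, sketched here under the assumption that this slash-separated layout holds (the function name parse_info_line is illustrative, not from the original), keeps the CSV-writing loop readable:

```python
def parse_info_line(line: str):
    """Split a 'year / region / genres' string into three fields.

    Assumes the slash-separated layout currently used on the
    Douban list page; returns stripped components.
    """
    parts = [part.strip() for part in line.split('/')]
    year, region = parts[0], parts[1]
    genres = ' '.join(parts[2:])  # genres are space-separated, keep as one field
    return year, region, genres

print(parse_info_line('1994 / 美国 / 犯罪 剧情'))  # ('1994', '美国', '犯罪 剧情')
```

Isolating this parsing also makes it easy to adjust if Douban changes the line format.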
Dealing with anti-scraping measures
Douban employs anti-scraping mechanisms, so throttle your requests by adding a random delay between them. time.sleep offers a simple way to do this.
import time
import random
time.sleep(random.uniform(1, 3))  # random 1-3 second delay
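Beyond a fixed random delay, a common pattern for retrying a failed or throttled request is exponential backoff with jitter: the wait roughly doubles after each failed attempt, with some randomness to avoid synchronized retries. This is a sketch, not part of the original tutorial; the parameter values are illustrative:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Return a retry delay in seconds: exponential growth per attempt,
    capped at `cap`, with +/-50% jitter. Parameter values are illustrative."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.5)

# Delays roughly double per attempt (~1s, ~2s, ~4s, ...), never exceeding cap * 1.5.
for attempt in range(3):
    wait = backoff_delay(attempt)
```

You would call time.sleep(backoff_delay(attempt)) before retrying a request that returned an error status.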
Complete code example
import requests
from bs4 import BeautifulSoup
import csv
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Rank', 'Title', 'Rating', 'Director/Cast', 'Year', 'Region', 'Genre'])
    for start in range(0, 250, 25):
        url = f'https://movie.douban.com/top250?start={start}'
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        movies = soup.find_all('div', class_='item')
        for movie in movies:
            rank = movie.find('em').text
            title = movie.find('span', class_='title').text
            rating = movie.find('span', class_='rating_num').text
            # The <p> in <div class="bd"> holds two lines separated by <br>:
            # "director / cast" and "year / region / genres". Join text nodes
            # with '\n' so the two lines can be split apart reliably.
            info_lines = movie.find('div', class_='bd').p.get_text('\n', strip=True).split('\n')
            credits = info_lines[0]
            year, region, genre = [part.strip() for part in info_lines[1].split('/')[:3]]
            writer.writerow([rank, title, rating, credits, year, region, genre])
        time.sleep(random.uniform(1, 3))  # throttle between pages
Notes
- Respect the site's robots.txt and throttle your crawl rate
- Avoid putting excessive load on the server
- Use the scraped data for learning and research only, not for commercial purposes
- Douban may change its page structure, so the code needs periodic maintenance
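Python's standard library can check robots.txt rules programmatically via urllib.robotparser. The rules below are invented for illustration; before crawling, fetch and parse the site's real file (e.g. with RobotFileParser.set_url and read):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
rules = """User-agent: *
Disallow: /search
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("*", "https://movie.douban.com/top250"))  # True
print(rp.can_fetch("*", "https://movie.douban.com/search"))  # False
```

If can_fetch returns False for a URL, skip it rather than requesting it.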
This case study covers the core techniques of Python web scraping: sending requests, parsing pages, storing data, and coping with anti-scraping measures.