Web-scraping notes from Bilibili tutorials
About the HTTP and HTTPS protocols
(A quick overview is enough for now; you can dig deeper when you study computer networks.)
- HTTP protocol
  - Concept: a convention for exchanging data between a server and a client
  - Common request headers:
    - User-Agent: identifies the request carrier (e.g. the Chrome browser)
    - Connection: whether to close the connection or keep it alive once the request completes
  - Common response headers:
    - Content-Type: the type of data the server sends back to the client
- HTTPS protocol
  - Concept: the 's' stands for secure; HTTPS is the secure version of the hypertext transfer protocol
  - Encryption schemes
    - Symmetric-key encryption: the client picks the encryption/decryption scheme (the key) itself, and the server uses that same key to decrypt the ciphertext it receives. The weakness: the key has to be transmitted as well, and if it is stolen or intercepted on the way, the scheme is no longer safe.
![image-20240118204529838](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118204529838.png)
    - Asymmetric-key encryption: uses a public/private key pair (A is the server, B is the client). The public key is sent to the client and the server keeps its own private key for decryption, so the ciphertext and the key are never sent to the server together. The weakness: the public key can be intercepted and swapped by a man in the middle.
![image-20240118204627896](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118204627896.png)
Here are some danmaku comments from the video that explain this (doge):
![image-20240118205237465](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118205237465.png)
    - Certificate-based encryption (what HTTPS actually uses): the client receives a certificate that has been signed by a certificate authority, encrypts with the key it contains, and sends the data to the server, which decrypts it with its private key.
![image-20240118205021490](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118205021490.png)
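A minimal sketch (not from the tutorial) contrasting the two schemes, assuming the third-party `cryptography` package is installed:

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Symmetric: one shared key does both encryption and decryption.
shared_key = Fernet.generate_key()      # this key must somehow reach the other side
f = Fernet(shared_key)
token = f.encrypt(b"hello")
assert f.decrypt(token) == b"hello"

# Asymmetric: encrypt with the public key, decrypt with the private key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()   # only the public key is shared
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(b"hello", oaep)
assert private_key.decrypt(ciphertext, oaep) == b"hello"
```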
About requests
- the urllib module
- the requests module
  - Purpose: simulate a browser sending requests
  - Steps: 1. specify the URL; 2. send the request; 3. get the response data; 4. persist it
Example 1
The goal here is to get familiar with the three arguments passed to requests.get(), and to request different pages by changing the query string (everything after the '?'):
from bs4 import BeautifulSoup
import requests
import re

# pretend to be a real browser when sending the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
# query-string parameters (everything after the '?')
params = {
    'wd': 'hello'
}
response = requests.get("https://www.baidu.com", headers=headers, params=params)
html = response.text
# persist the page to a local file
with open('./a.html', 'w', encoding='utf-8') as fp:
    fp.write(html)
Example 2
Cracking the Baidu Translate suggestion endpoint; the point is to get familiar with the arguments of requests.post().
- It is a POST request (it carries parameters)
- The response is a piece of JSON data
You can use the browser's developer tools (Inspect) to check the request parameters and whether the response really is JSON; a small programmatic check follows below the screenshots.
![image-20240118220801133](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118220801133.png)
![image-20240118220726746](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118220726746.png)
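As a complement to the developer tools, a quick sketch that checks the Content-Type response header before parsing:

```python
import requests

resp = requests.post('https://fanyi.baidu.com/sug', data={'kw': 'dog'},
                     headers={'User-Agent': 'Mozilla/5.0'})
print(resp.headers.get('Content-Type'))        # expect something like 'application/json'
if 'json' in resp.headers.get('Content-Type', ''):
    print(resp.json())                         # safe to parse as JSON
```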
from bs4 import BeautifulSoup
import requests
import re
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
# the form data carried in the POST body
params = {
    'kw': 'dog'
}
response = requests.post("https://fanyi.baidu.com/sug", headers=headers, data=params)
html = response.json()
# the response contains Chinese, so don't escape it as ASCII or the output will be unreadable
with open('./a.json', 'w', encoding='utf-8') as fp:
    json.dump(html, fp=fp, ensure_ascii=False)
Example 3
Scraping the KFC site for the restaurant names at various locations and saving them as JSON.
from bs4 import BeautifulSoup
import requests
import re
import json
import jsonlines

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
restrants = []
# page through the store-list endpoint
for pageIndex in range(1, 11):
    print(pageIndex)
    params = {
        'cname': '',
        'pid': '',
        'keyword': '北京',
        'pageIndex': pageIndex,
        'pageSize': 10,
    }
    response = requests.post("http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword", headers=headers, data=params)
    html = response.text
    json_html = json.loads(html)
    # collect the name and address of every restaurant on this page
    for all_restrant in json_html['Table1']:
        storeName = all_restrant['storeName']
        addressDetail = all_restrant['addressDetail']
        restrants.append(
            {
                'storeName': storeName,
                'addressDetail': addressDetail,
            }
        )
file_name = './a.json'
with open(file_name, 'w', encoding='utf-8') as json_file:
    json.dump(restrants, json_file, ensure_ascii=False, indent=2)
# with jsonlines.open(file_name, 'w') as jsonl_file:
#     jsonl_file.write_all(restrants)

# read the JSON file back to check it
with open(file_name, 'r', encoding='utf-8') as json_file:
    # parse the JSON content
    data = json.load(json_file)
    # print the resulting Python object
    print(data)
# fp = open('./a.json', 'w', encoding='utf-8')
# json.dump(html, fp=fp, ensure_ascii=False)
Of course, some resources are loaded dynamically. On some pages, clicking different <a> tags leads to URLs that share the same prefix and differ only in a parameter. In that case you can use the approach above to fetch the JSON first, parse out the one parameter (or the few parameters) that actually changes, and then issue a second request with it. This means analysing the page structure carefully before scraping; a sketch of the two-step pattern follows.
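A minimal sketch of that two-step pattern; the endpoint URLs and JSON layout here are made up for illustration:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

# step 1: fetch the dynamically loaded list and pull out the changing parameter
list_json = requests.get('https://example.com/api/list', headers=headers).json()  # hypothetical URL
item_id = list_json['items'][0]['id']                                             # assumed JSON layout

# step 2: reuse the shared URL prefix; only the parameter differs between items
detail = requests.get('https://example.com/api/detail', params={'id': item_id}, headers=headers)
print(detail.json())
```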
Data parsing overview
Once we have fetched a page, we need to parse it to extract just the part of the content we want.
- Types of parsing
  - Regular expressions
  - bs4
  - XPath
- How parsing works
  - Locate the target tag
  - Extract the data stored in the tag itself or in one of its attributes
About bs4
bs4 is mainly used to parse HTML; it lets you pull a lot of content out of a page quickly.
The following can be used as an exercise:
![image-20240119204222875](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240119204222875.png)
Example
from bs4 import BeautifulSoup
import requests
import re

# request headers
headers = {
    # pretend to be a browser when sending the request
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
# every <div class="hd"> block holds one movie title link
movie_href = soup.findAll("div", attrs={"class": "hd"})
About re (regular expressions)
Regular expressions are used to pick out the substrings we need; they are usually easier than plain string processing for this kind of work.
Typically we use re to extract URLs, and can then use those URLs to download image data.
- Downloading image data
from bs4 import BeautifulSoup
import requests
import re
import json
import jsonlines

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get('https://lmg.jj20.com/up/allimg/tp10/22022312542M617-0-lp.jpg', headers=headers)
# the image comes back as raw bytes
image_wb = response.content
with open('./a.jpg', 'wb') as fp:
    fp.write(image_wb)
Below is a sample regular expression; although it is long, most of it is just the non-greedy `.*?` (a stand-in sketch follows after the screenshot).
![image-20240119114927642](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240119114927642.png)
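A small stand-in sketch of the same idea: a non-greedy `.*?` pattern pulls image URLs out of HTML, which are then downloaded (the HTML snippet here is made up):

```python
import re
import requests

page_html = '<div class="pic"><img src="https://lmg.jj20.com/up/allimg/tp10/22022312542M617-0-lp.jpg" alt="x"></div>'
img_urls = re.findall(r'<img src="(.*?)"', page_html)   # .*? stops at the first closing quote
for i, url in enumerate(img_urls):
    data = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content
    with open(f'./{i}.jpg', 'wb') as fp:
        fp.write(data)
```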
Example
For example, we want to extract the href link from the <a> tag under each <div class="hd"> block:
from bs4 import BeautifulSoup
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
movie_href = soup.findAll("div", attrs={"class": "hd"})
for href in movie_href:
    s = str(href.a)
    # non-greedy match: everything between href=" and the next "
    match = re.search(r'href="(.*?)"', s)
    if match:
        href_value = match.group(1)
        print(href_value)
    else:
        print("hh")
About XPath
The most commonly used, most convenient and efficient way to parse, and it is fairly general.
How it works:
- Instantiate an etree object and load the page source to be parsed into it
- Call the object's xpath method with an XPath expression to locate tags and capture their content
How to instantiate an etree object:
- From a local HTML file: etree.parse(filePath)
- From the internet: etree.HTML(page_text)
- Then call xpath('<xpath expression>')
XPath expressions
They work much like file paths....
- / means a single level, // means any number of levels
  (single level) /html/body/div is equivalent to (any number of levels) /html//div
- Compare with bs4: in soup.select('...'), a space means "any descendant" and > means "direct child"
- Attribute filter: //div[@class="某某"]
- Index filter: //div[@class="某某"]/p[3] (indexing starts at 1)
- Get text: //div[@class="某某"]/p[3]/text() or //text()
- Get an attribute: //div[@class="某某"]/img/@src
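A small self-contained demo (not from the video) that exercises the expressions above on an inline HTML snippet:

```python
from lxml import html

page_text = """
<html><body>
  <div class="某某">
    <p>one</p><p>two</p><p>three</p>
    <img src="/img/a.jpg"/>
  </div>
</body></html>
"""
tree = html.fromstring(page_text)
print(tree.xpath('//div[@class="某某"]'))              # attribute filter -> one element
print(tree.xpath('//div[@class="某某"]/p[3]/text()'))  # 1-based index + text() -> ['three']
print(tree.xpath('//div[@class="某某"]/img/@src'))     # attribute value -> ['/img/a.jpg']
```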
(The following is what the video shows, but with newer lxml versions the plain etree import may not be resolved; you can import it like this instead:)
from lxml.html import etree
![image-20240119205742634](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240119205742634.png)
Example
from bs4 import BeautifulSoup
import requests
import re
import json
from lxml import html
import jsonlines

# request headers
headers = {
    # pretend to be a browser when sending the request
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
hh = response.content
hh_text = html.fromstring(hh)
# one long absolute XPath copied from the browser's developer tools
a = hh_text.xpath('//*[@id="content"]/div/div[1]/ol/li[7]/div/div[2]/div[2]/p[1]/text()')
print(a)
About captcha recognition
Some sites only let you access certain data after logging in, for example:
![image-20240119213113001](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240119213113001.png)
We have to enter a captcha; the options for recognising it are:
- Recognise it manually by eye (not recommended, too inefficient)
- Use a third-party automatic recognition service
from bs4 import BeautifulSoup
import requests
import re
import json
from lxml import html
import jsonlines
import base64

# the third-party captcha-recognition service (云码) endpoint and token
_custom_url = "http://api.jfbym.com/api/YmServer/customApi"
_token = "uJgigF8CS5NR-t8ALI8-LRY2OUjC6UHY294tjnoyIfw"
_headers = {
    'Content-Type': 'application/json'
}
headers = {
    # pretend to be a browser when sending the request
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
}

def common_verify(image, verify_type="50100"):
    # send the base64-encoded captcha image to the recognition service
    payload = {
        "image": base64.b64encode(image).decode(),
        "token": _token,
        "type": verify_type
    }
    resp = requests.post(_custom_url, headers=_headers, data=json.dumps(payload))
    return resp.json()['data']['data']

def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        # read the raw bytes of the image file
        image_content = image_file.read()
    return image_content

# the session keeps cookies across requests so the login state is preserved
session = requests.Session()
login_url = "https://cas.bjtu.edu.cn/auth/login/?next=/o/authorize/%3Fresponse_type%3Dcode%26client_id%3DaGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo%26state%3D1705809064%26redirect_uri%3Dhttps%3A//mis.bjtu.edu.cn/auth/callback/%3Fredirect_to%3D/home/"
login_page = requests.get(login_url, headers=headers).content
login_html = html.fromstring(login_page)
# locate the captcha image on the login page and download it
img_url = 'https://cas.bjtu.edu.cn/' + login_html.xpath('//*[@id="login"]/dl/dd[2]/div/div[3]/span/img/@src')[0]
img_page = requests.get(img_url, headers=headers).content
with open('./1.jpg', 'wb') as fp:
    fp.write(img_page)
print('Captcha image downloaded!!')
img_result = common_verify(image=image_to_base64('./1.jpg'))
print('Captcha recognised!!')
print(img_result)
after_login_page_url = 'https://cas.bjtu.edu.cn/auth/login/?next=/o/authorize/%3Fresponse_type%3Dcode%26client_id%3DaGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo%26state%3D1705809064%26redirect_uri%3Dhttps%3A//mis.bjtu.edu.cn/auth/callback/%3Fredirect_to%3D/home/'
# the login form data (credentials masked); captcha_1 is the recognised captcha text
data = {
    'next': '/o/authorize/?response_type=code&client_id=aGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo&state=1705809064&redirect_uri=https://mis.bjtu.edu.cn/auth/callback/?redirect_to=/home/',
    'csrfmiddlewaretoken': 'dNjvND4fz99P99Qc2FhYxoFy8hnJGoAgcIWZ2M4Pw7dcMPYO655VGpJlUPez9OlZ',
    'loginname': '*********',
    'password': '***********',
    'captcha_0': '373515fc2ad2c8a9d25c8c938d6285c5c6737296',
    'captcha_1': img_result
}
after_page = session.post(after_login_page_url, data=data, headers=headers)
print(after_page.status_code)
final_page_url = 'https://mis.bjtu.edu.cn/home/'
final_page = session.get(url=final_page_url, headers=headers).text
with open('./a.html', 'w') as fp:
    fp.write(final_page)
About selenium
- How selenium relates to scraping
  - It makes it easy to get data that a site loads dynamically
  - It makes it easy to simulate logging in
Example 1: headless mode and evading detection
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from lxml.html import etree
import requests
from PIL import Image
import base64
import json
from selenium.webdriver.chrome.options import Options

# run Chrome without a visible window (headless)
chrom_options = Options()
chrom_options.add_argument('--headless')
chrom_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrom_options)
driver.get('http://www.baidu.com')
print(driver.page_source)
time.sleep(3)
driver.quit()  # close the browser when finished
Example 2: simulated login to the Beijing Jiaotong University MIS system
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from lxml import html
import requests
from PIL import Image
import base64
from selenium.webdriver.chrome.options import Options
import json
from fake_useragent import UserAgent

# create a UserAgent object and generate a random User-Agent
ua = UserAgent()
user_agent = ua.random
options = Options()
# options.add_argument("--headless")
options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation flag
options.add_argument(f"user-agent={user_agent}")

# the third-party captcha-recognition service endpoint and token
_custom_url = "http://api.jfbym.com/api/YmServer/customApi"
_token = "uJgigF8CS5NR-t8ALI8-LRY2OUjC6UHY294tjnoyIfw"
_headers = {
    'Content-Type': 'application/json'
}
headers = {
    # pretend to be a browser when sending the request
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
}

def common_verify(image, verify_type="50100"):
    payload = {
        "image": base64.b64encode(image).decode(),
        "token": _token,
        "type": verify_type
    }
    resp = requests.post(_custom_url, headers=_headers, data=json.dumps(payload))
    return resp.json()['data']['data']

def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        # read the raw bytes of the image file
        image_content = image_file.read()
    return image_content

driver = webdriver.Chrome(options=options)  # create the Chrome driver
driver.get('https://mis.bjtu.edu.cn/home/')
driver.save_screenshot('./a.png')
# locate the captcha element and crop it out of the screenshot
img_ele = driver.find_element(By.XPATH, '//*[@id="login"]/dl/dd[2]/div/div[3]/span/img')
location = img_ele.location
size = img_ele.size
# the *2 accounts for a 2x (retina) display scale factor
rangle = (
    int(location['x']) * 2,
    int(location['y']) * 2,
    (int(location['x']) + size['width']) * 2,
    (int(location['y']) + size['height']) * 2,
)
i = Image.open('./a.png')
fram = i.crop(rangle)
fram.save('./aa.png')
img_result = common_verify(image=image_to_base64('./aa.png'))
print('Captcha recognised!!')
print(img_result)
time.sleep(3)
# fill in the login form
username = driver.find_element(By.ID, 'id_loginname')
passward = driver.find_element(By.ID, 'id_password')
yzm = driver.find_element(By.ID, 'id_captcha_1')
login_bt = driver.find_element(By.CSS_SELECTOR, '.btn-lg')
username.send_keys('21281201')
time.sleep(3)
passward.send_keys('LPjz9249&')
time.sleep(3)
yzm.send_keys(img_result)
time.sleep(3)
login_bt.click()
time.sleep(4)
r = driver.page_source
hh = html.fromstring(r)
print(hh.xpath('/html/body/div[2]/div/div[2]/div[2]/div[1]/ul/li[2]/a/strong/i/text()'))
# jwxt = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[3]/div/dl/dd[1]/div/ul/li[1]/div/div[2]/h3/a')
# jwxt.click()
# keep the page alive by refreshing every 100 seconds
while True:
    driver.refresh()
    time.sleep(100)
driver.quit()  # close the browser when finished
Final words
Project 1: scraping Douban movie reviews
import requests
from bs4 import BeautifulSoup
from lxml import html
import re
import json
import os
import time
from tqdm import tqdm
from fake_useragent import UserAgent
import pandas as pd

# work out the star rating from the class string (e.g. "allstar50 rating")
def starCnt(x):
    match = re.search(r'allstar(\d+) rating', x)
    if match:
        # extract the number and divide by 5 to get a score out of 10
        extracted_number = float(match.group(1)) / 5.0
        result = round(extracted_number, 1)
        return result
    else:
        return 0
ua = UserAgent()
headers = {
    'User-Agent': ua.random
}
# proxy tunnel domain:port
tunnel = "x236.kdltps.com:15818"
# proxy username and password
username = "t10653676550197"
password = "ghover0v"
proxies = {
    "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel},
    "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel}
}
movie_id = 35131346
params = {
    'percent_type': 'h',
    'limit': 1,
    'status': 'P',
    'sort': 'new_score',
}
while(1):
    try:
        # specify the URL and send the request
        response = requests.get(f'https://movie.douban.com/subject/{movie_id}/comments', headers=headers, params=params, proxies=proxies)
        h = response.text
        hh = html.fromstring(h)
        # XPaths for the movie's basic information
        # movie title
        if len(hh.xpath('//*[@id="content"]/h1/text()')) == 0:
            print('Failed to fetch the movie info, retrying!!!!')
            continue
        movie_name = str(hh.xpath('//*[@id="content"]/h1/text()')[0].split(' ')[0])
        # director
        movie_derector = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[1]/a/text()')[0])
        # main cast
        movie_actor = hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[2]/a/text()')
        # genre
        movie_type = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[3]/text()')[1])
        # region
        movie_field = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[4]/text()')[1])
        # running time
        movie_time = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[5]/text()')[1])
        # release date
        movie_date = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[6]/text()')[1])
        break
    except requests.RequestException as e:
        print(f"Error: {e}")
        print("Retrying...")
        time.sleep(2)  # wait a while before retrying
# light pre-cleaning of the scraped fields
movie_name = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_name)
movie_derector = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_derector)
movie_type = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_type)
movie_field = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_field)
movie_time = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_time)
movie_date = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_date)
pa_data = {
    # movie title
    'movie_name': movie_name,
    # director
    'movie_derector': movie_derector,
    # main cast
    'movie_actor': movie_actor,
    # genre
    'movie_type': movie_type,
    # region
    'movie_field': movie_field,
    # running time
    'movie_time': movie_time,
    # release date
    'movie_date': movie_date,
    # everything review-related
    'coments_all': [{
        'coments_type': '',
        # review text
        'content': '',
        # review score
        'starScore': 0,
        # number of "useful" votes
        'usefulCnt': 0,
    }],
}
# params settings for the short-comment loops
loop_info = [{'percent_type': 'h', 'limit': 120}, {'percent_type': 'm', 'limit': 160}, {'percent_type': 'l', 'limit': 120}]
# scrape the short comments
for loop in tqdm(loop_info):
    while(1):
        headers = {
            'User-Agent': ua.random
        }
        param = {
            'percent_type': loop['percent_type'],
            'limit': loop['limit'],
            'status': 'P',
            'sort': 'new_score',
        }
        sort_type = loop['percent_type']
        print(f'Start scraping short comments of category {sort_type}')
        response = requests.get(f'https://movie.douban.com/subject/{movie_id}/comments', headers=headers, params=param, proxies=proxies)
        h = response.text
        hh = html.fromstring(h)
        # XPath for the comment blocks
        comments_body = hh.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div[@class="comment-item "]')
        cnt = 0
        for comment in comments_body:
            # star rating
            starScore = starCnt(str(comment.xpath('./div[2]/h3/span[2]/span[2]/@class')))
            # number of "useful" votes
            usefulCnt = 0
            if len(comment.xpath('./div[2]/h3/span[1]/span/text()')) >= 1:
                usefulCnt = int(comment.xpath('./div[2]/h3/span[1]/span/text()')[0])
            # the comment text itself
            content = ''
            if len(comment.xpath('./div[2]/p/span/text()')) >= 1:
                content = str(comment.xpath('./div[2]/p/span/text()')[0])
            content = re.sub(r"[\s+\.\!\/_$%^*()+\n]+|[+——、~#¥%&*()【】;:]+|\[.+\]|\[.+\]", "", content)
            dict_data = {
                # 'dp' marks a short comment
                'coments_type': 'dp',
                # review text
                'content': content,
                # review score
                'starScore': starScore,
                # number of "useful" votes
                'usefulCnt': usefulCnt,
            }
            pa_data['coments_all'].append(dict_data)
            cnt = cnt + 1
        if cnt == 0:
            print('The request failed, retrying!!!')
            time.sleep(3)
        else:
            print(f'Scraped short comments of category {sort_type}: {cnt} comments')
            break
# scrape the long reviews
# how many to take for 5, 4, 3, 2 and 1 stars respectively
loop_info = [{'percent_type': 'h', 'limit': 20, 'star': 5},
             {'percent_type': 'h', 'limit': 10, 'star': 4},
             {'percent_type': 'm', 'limit': 40, 'star': 3},
             {'percent_type': 'l', 'limit': 25, 'star': 2},
             {'percent_type': 'l', 'limit': 5, 'star': 1}]
cnt = 0
sum = 0
for loop in loop_info:
    cnt = cnt + 1
    star = loop['star']
    limit = loop['limit']
    ccnt = 0
    for start in range(0, 40, 20):
        param = {
            'rating': star,
            'start': start,
        }
        cnt2 = 0
        while(1):
            cnt1 = 0
            print(f'Scraping {star}-star reviews, offset {start}')
            response = requests.get(f'https://movie.douban.com/subject/{movie_id}/reviews', params=param, headers=headers, proxies=proxies).text
            hh_short = html.fromstring(response)
            data_list = hh_short.xpath('//*[@id="content"]/div/div[1]/div[1]/div')
            for comment_list in data_list:
                cnt1 = cnt1 + 1
                ccnt = ccnt + 1
                if ccnt > limit:
                    break
                comment_id = comment_list.xpath('./@data-cid')
                if len(comment_id) < 1:
                    continue
                comment_id = comment_id[0]
                # request the full review page for this review id
                data_response = requests.get(f'https://movie.douban.com/review/{comment_id}/', headers=headers, proxies=proxies).text
                hh_long = html.fromstring(data_response)
                data_all = hh_long.xpath(f'//*[@id="link-report-{comment_id}"]/div[1]/p')
                comments = ""
                for p_all in data_all:
                    now_p = p_all.xpath('.//text()')
                    if len(now_p) < 1:
                        continue
                    now_p = now_p[0]
                    comments = comments + str(now_p)
                comments = re.sub(r"[\s+\.\!\/_$%^*()+\n]+|[+——、~#¥%&*()【】;:]+|\[.+\]|\[.+\]", "", comments)
                # number of "useful" votes shown on the review's button
                useful_bt = hh_long.xpath(f'//*[@id="review-{comment_id}-content"]/div[3]/button[1]/text()')
                usefulCnt = 0
                if len(useful_bt) >= 1:
                    match = re.search(r'\n 有用 (\d*)\n ', useful_bt[0])
                    usefulCnt = int(match.group(1))
                print(usefulCnt)
                dict_data = {
                    # 'yp' marks a long review
                    'coments_type': 'yp',
                    # review text
                    'content': comments,
                    # review score
                    'starScore': star * 2,
                    # number of "useful" votes
                    'usefulCnt': usefulCnt,
                }
                pa_data['coments_all'].append(dict_data)
            if cnt1 == 0 and cnt2 < 5:
                print(f'Failed to scrape the {star}-star reviews!! Retrying!!')
                cnt2 = cnt2 + 1
            else:
                break
        if ccnt >= limit:
            break
    print(f'Finished scraping the {star}-star reviews')
# create the folder for the JSON dataset
movie_datajson_dir = './movie_data_json'
try:
    os.makedirs(movie_datajson_dir)
    print(f'Folder "{movie_datajson_dir}" created')
except FileExistsError:
    print(f'Folder "{movie_datajson_dir}" already exists')
except Exception as e:
    print(f'Error while creating the folder: {e}')
# path of the JSON file to write
new_json_name = str(movie_id) + '.json'
json_file_path = os.path.join(movie_datajson_dir, new_json_name)
# use json.dumps to turn the dict into a JSON-formatted string
json_data = json.dumps(pa_data, indent=2, ensure_ascii=False)
# write the JSON string to the file
with open(json_file_path, 'w', encoding='utf-8') as json_file:
    json_file.write(json_data)
print(f'Data has been written to {json_file_path}')
# build the final Excel file from the generated JSON data
read_json_path = f'./movie_data_json/{movie_id}.json'
with open(read_json_path, 'r') as file:
    data = json.load(file)
coments_all_list = data.get('coments_all', [])
print(coments_all_list)
# column -> list of values (类型 = type, 得分 = score, 有用数 = useful votes, 内容 = content)
data_dict = {
    '类型': [],
    '得分': [],
    '有用数': [],
    '内容': [],
}
for coments in coments_all_list:
    lx = coments['coments_type']
    df = coments['starScore']
    yys = coments['usefulCnt']
    nr = coments['content']
    data_dict['类型'].append(lx)
    data_dict['得分'].append(df)
    data_dict['有用数'].append(yys)
    data_dict['内容'].append(nr)
dt = pd.DataFrame(data_dict)
print(dt)
# create the folder for the Excel dataset
movie_dataexcel_dir = './movie_data_excel'
try:
    os.makedirs(movie_dataexcel_dir)
    print(f'Folder "{movie_dataexcel_dir}" created')
except FileExistsError:
    print(f'Folder "{movie_dataexcel_dir}" already exists')
except Exception as e:
    print(f'Error while creating the folder: {e}')
# path of the Excel file to write
new_excel_name = str(movie_id) + '.xlsx'
new_excel_name = os.path.join(movie_dataexcel_dir, new_excel_name)
print(new_excel_name)
dt.to_excel(new_excel_name, index=False, header=False)
Hi everyone, the images were inserted from my local Typora so they don't display; I'll fix them once I'm less busy, but they aren't really important anyway hh
Could I ask the OP whether a session can be used to log in to Luogu? I tried before but the captcha was always rejected, so in the end I used selenium, which is far too slow 😢
Is it the captcha that's failing? I use the Yunma (云码) platform for captcha recognition; take a look at my captcha section, and you're welcome to use my token.
I double-checked and the captcha recognition is fine. I captured the URL that the login form posts the username/password/captcha parameters to, and after sending the request with requests.Session the status_code is 200, but requesting the pages that require login still fails and I'm told to log in. This is the login URL, so I don't think I got it wrong 😢
You could take a look at this video: https://www.bilibili.com/video/BV17v4y1w7rA/?spm_id_from=333.999.0.0 The login probably triggers a redirect; I'll write about this later. A rough sketch of how to check for it is below.
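A rough check (placeholder URL and form fields, not the real Luogu endpoint):

```python
import requests

session = requests.Session()
resp = session.post('https://example.com/login',
                    data={'user': 'u', 'password': 'p', 'captcha': 'xxxx'})
print(resp.status_code)   # 200 here may just be the page you were redirected to
print(resp.history)       # a non-empty list of 30x responses means a redirect happened
print(session.cookies)    # check whether the login actually set any session cookies
```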
Thanks! Using selenium the whole time is just too slow 😢
Right, and selenium also needs a browser driver. If a project's data has to come from scraping, simulating the login with requests is usually the better fit.
Nice