Web-scraping notes from Bilibili tutorials
About the HTTP and HTTPS protocols
(A quick overview is enough for now; you can dig deeper when you study computer networks.)
- HTTP protocol
  - Concept: a convention for exchanging data between a server and a client
  - Common request headers:
    - User-Agent: identifies the request carrier (e.g. the Chrome browser)
    - Connection: whether to close the connection or keep it alive once the request completes
  - Common response headers:
    - Content-Type: the type of data the server sends back to the client
- HTTPS protocol
  - Concept: the 's' stands for secure; HTTPS is the secure version of the hypertext transfer protocol
  - Encryption schemes
    - Symmetric-key encryption: the client picks the encryption/decryption scheme (the key) itself, and the server uses that same key to decrypt the ciphertext it receives. The weakness: the key has to be transmitted as well, and if it is stolen or intercepted on the way, the scheme is no longer safe.
![image-20240118204529838](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118204529838.png)
    - Asymmetric-key encryption: uses a public/private key pair (A is the server, B is the client). The public key is sent to the client and the server keeps its own private key for decryption, so the ciphertext and the key are never sent to the server together. The weakness: the public key can be intercepted and swapped by a man in the middle.
![image-20240118204627896](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118204627896.png)
Here are some danmaku comments from the video that explain this (doge):
![image-20240118205237465](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118205237465.png)
    - Certificate-based encryption (what HTTPS actually uses): the client receives a certificate that has been signed by a certificate authority, encrypts with the key it contains, and sends the data to the server, which decrypts it with its private key.
![image-20240118205021490](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118205021490.png)
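A minimal sketch (not from the tutorial) contrasting the two schemes, assuming the third-party `cryptography` package is installed:

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Symmetric: one shared key does both encryption and decryption.
shared_key = Fernet.generate_key()      # this key must somehow reach the other side
f = Fernet(shared_key)
token = f.encrypt(b"hello")
assert f.decrypt(token) == b"hello"

# Asymmetric: encrypt with the public key, decrypt with the private key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()   # only the public key is shared
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(b"hello", oaep)
assert private_key.decrypt(ciphertext, oaep) == b"hello"
```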
About requests
- the urllib module
- the requests module
  - Purpose: simulate a browser sending requests
  - Steps: 1. specify the URL; 2. send the request; 3. get the response data; 4. persist it
Example 1
The goal here is to get familiar with the three arguments passed to requests.get(), and to request different pages by changing the query string (everything after the '?'):
from bs4 import BeautifulSoup
import requests
import re

# pretend to be a real browser when sending the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
# query-string parameters (everything after the '?')
params = {
    'wd': 'hello'
}
response = requests.get("https://www.baidu.com", headers=headers, params=params)
html = response.text
# persist the page to a local file
with open('./a.html', 'w', encoding='utf-8') as fp:
    fp.write(html)
Example 2
Cracking the Baidu Translate suggestion endpoint; the point is to get familiar with the arguments of requests.post().
- It is a POST request (it carries parameters)
- The response is a piece of JSON data
You can use the browser's developer tools (Inspect) to check the request parameters and whether the response really is JSON; a small programmatic check follows below the screenshots.
![image-20240118220801133](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118220801133.png)
![image-20240118220726746](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240118220726746.png)
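As a complement to the developer tools, a quick sketch that checks the Content-Type response header before parsing:

```python
import requests

resp = requests.post('https://fanyi.baidu.com/sug', data={'kw': 'dog'},
                     headers={'User-Agent': 'Mozilla/5.0'})
print(resp.headers.get('Content-Type'))        # expect something like 'application/json'
if 'json' in resp.headers.get('Content-Type', ''):
    print(resp.json())                         # safe to parse as JSON
```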
from bs4 import BeautifulSoup
import requests
import re
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
# the form data carried in the POST body
params = {
    'kw': 'dog'
}
response = requests.post("https://fanyi.baidu.com/sug", headers=headers, data=params)
html = response.json()
# the response contains Chinese, so don't escape it as ASCII or the output will be unreadable
with open('./a.json', 'w', encoding='utf-8') as fp:
    json.dump(html, fp=fp, ensure_ascii=False)
Example 3
Scraping the KFC site for the restaurant names at various locations and saving them as JSON.
from bs4 import BeautifulSoup
import requests
import re
import json
import jsonlines

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
restrants = []
# page through the store-list endpoint
for pageIndex in range(1, 11):
    print(pageIndex)
    params = {
        'cname': '',
        'pid': '',
        'keyword': '北京',
        'pageIndex': pageIndex,
        'pageSize': 10,
    }
    response = requests.post("http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword", headers=headers, data=params)
    html = response.text
    json_html = json.loads(html)
    # collect the name and address of every restaurant on this page
    for all_restrant in json_html['Table1']:
        storeName = all_restrant['storeName']
        addressDetail = all_restrant['addressDetail']
        restrants.append(
            {
                'storeName': storeName,
                'addressDetail': addressDetail,
            }
        )
file_name = './a.json'
with open(file_name, 'w', encoding='utf-8') as json_file:
    json.dump(restrants, json_file, ensure_ascii=False, indent=2)
# with jsonlines.open(file_name, 'w') as jsonl_file:
#     jsonl_file.write_all(restrants)

# read the JSON file back to check it
with open(file_name, 'r', encoding='utf-8') as json_file:
    # parse the JSON content
    data = json.load(json_file)
    # print the resulting Python object
    print(data)
# fp = open('./a.json', 'w', encoding='utf-8')
# json.dump(html, fp=fp, ensure_ascii=False)
Of course, some resources are loaded dynamically. On some pages, clicking different <a> tags leads to URLs that share the same prefix and differ only in a parameter. In that case you can use the approach above to fetch the JSON first, parse out the one parameter (or the few parameters) that actually changes, and then issue a second request with it. This means analysing the page structure carefully before scraping; a sketch of the two-step pattern follows.
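A minimal sketch of that two-step pattern; the endpoint URLs and JSON layout here are made up for illustration:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

# step 1: fetch the dynamically loaded list and pull out the changing parameter
list_json = requests.get('https://example.com/api/list', headers=headers).json()  # hypothetical URL
item_id = list_json['items'][0]['id']                                             # assumed JSON layout

# step 2: reuse the shared URL prefix; only the parameter differs between items
detail = requests.get('https://example.com/api/detail', params={'id': item_id}, headers=headers)
print(detail.json())
```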
Data parsing overview
Once we have fetched a page, we need to parse it to extract just the part of the content we want.
- Types of parsing
  - Regular expressions
  - bs4
  - XPath
- How parsing works
  - Locate the target tag
  - Extract the data stored in the tag itself or in one of its attributes
About bs4
bs4 is mainly used to parse HTML; it lets you pull a lot of content out of a page quickly.
The following can be used as an exercise:
![image-20240119204222875](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240119204222875.png)
Example
from bs4 import BeautifulSoup
import requests
import re

# request headers
headers = {
    # pretend to be a browser when sending the request
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
# every <div class="hd"> block holds one movie title link
movie_href = soup.findAll("div", attrs={"class": "hd"})
About re (regular expressions)
Regular expressions are used to pick out the substrings we need; they are usually easier than plain string processing for this kind of work.
Typically we use re to extract URLs, and can then use those URLs to download image data.
- Downloading image data
from bs4 import BeautifulSoup
import requests
import re
import json
import jsonlines

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get('https://lmg.jj20.com/up/allimg/tp10/22022312542M617-0-lp.jpg', headers=headers)
# the image comes back as raw bytes
image_wb = response.content
with open('./a.jpg', 'wb') as fp:
    fp.write(image_wb)
Below is a sample regular expression; although it is long, most of it is just the non-greedy `.*?` (a stand-in sketch follows after the screenshot).
![image-20240119114927642](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240119114927642.png)
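A small stand-in sketch of the same idea: a non-greedy `.*?` pattern pulls image URLs out of HTML, which are then downloaded (the HTML snippet here is made up):

```python
import re
import requests

page_html = '<div class="pic"><img src="https://lmg.jj20.com/up/allimg/tp10/22022312542M617-0-lp.jpg" alt="x"></div>'
img_urls = re.findall(r'<img src="(.*?)"', page_html)   # .*? stops at the first closing quote
for i, url in enumerate(img_urls):
    data = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content
    with open(f'./{i}.jpg', 'wb') as fp:
        fp.write(data)
```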
Example
For example, we want to extract the href link from the <a> tag under each <div class="hd"> block:
from bs4 import BeautifulSoup
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
movie_href = soup.findAll("div", attrs={"class": "hd"})
for href in movie_href:
    s = str(href.a)
    # non-greedy match: everything between href=" and the next "
    match = re.search(r'href="(.*?)"', s)
    if match:
        href_value = match.group(1)
        print(href_value)
    else:
        print("hh")
About XPath
The most commonly used, most convenient and efficient way to parse, and it is fairly general.
How it works:
- Instantiate an etree object and load the page source to be parsed into it
- Call the object's xpath method with an XPath expression to locate tags and capture their content
How to instantiate an etree object:
- From a local HTML file: etree.parse(filePath)
- From the internet: etree.HTML(page_text)
- Then call xpath('<xpath expression>')
XPath expressions
They work much like file paths....
- / means a single level, // means any number of levels
  (single level) /html/body/div is equivalent to (any number of levels) /html//div
- Compare with bs4: in soup.select('...'), a space means "any descendant" and > means "direct child"
- Attribute filter: //div[@class="某某"]
- Index filter: //div[@class="某某"]/p[3] (indexing starts at 1)
- Get text: //div[@class="某某"]/p[3]/text() or //text()
- Get an attribute: //div[@class="某某"]/img/@src
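A small self-contained demo (not from the video) that exercises the expressions above on an inline HTML snippet:

```python
from lxml import html

page_text = """
<html><body>
  <div class="某某">
    <p>one</p><p>two</p><p>three</p>
    <img src="/img/a.jpg"/>
  </div>
</body></html>
"""
tree = html.fromstring(page_text)
print(tree.xpath('//div[@class="某某"]'))              # attribute filter -> one element
print(tree.xpath('//div[@class="某某"]/p[3]/text()'))  # 1-based index + text() -> ['three']
print(tree.xpath('//div[@class="某某"]/img/@src'))     # attribute value -> ['/img/a.jpg']
```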
(The following is what the video shows, but with newer lxml versions the plain etree import may not be resolved; you can import it like this instead:)
from lxml.html import etree
![image-20240119205742634](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240119205742634.png)
Example
from bs4 import BeautifulSoup
import requests
import re
import json
from lxml import html
import jsonlines

# request headers
headers = {
    # pretend to be a browser when sending the request
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
hh = response.content
hh_text = html.fromstring(hh)
# one long absolute XPath copied from the browser's developer tools
a = hh_text.xpath('//*[@id="content"]/div/div[1]/ol/li[7]/div/div[2]/div[2]/p[1]/text()')
print(a)
About captcha recognition
Some sites only let you access certain data after logging in, for example:
![image-20240119213113001](/Users/wangjiawei/Library/Application Support/typora-user-images/image-20240119213113001.png)
We have to enter a captcha; the options for recognising it are:
- Recognise it manually by eye (not recommended, too inefficient)
- Use a third-party automatic recognition service
from bs4 import BeautifulSoup
import requests
import re
import json
from lxml import html
import jsonlines
import base64

# the third-party captcha-recognition service (云码) endpoint and token
_custom_url = "http://api.jfbym.com/api/YmServer/customApi"
_token = "uJgigF8CS5NR-t8ALI8-LRY2OUjC6UHY294tjnoyIfw"
_headers = {
    'Content-Type': 'application/json'
}
headers = {
    # pretend to be a browser when sending the request
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
}

def common_verify(image, verify_type="50100"):
    # send the base64-encoded captcha image to the recognition service
    payload = {
        "image": base64.b64encode(image).decode(),
        "token": _token,
        "type": verify_type
    }
    resp = requests.post(_custom_url, headers=_headers, data=json.dumps(payload))
    return resp.json()['data']['data']

def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        # read the raw bytes of the image file
        image_content = image_file.read()
    return image_content

# the session keeps cookies across requests so the login state is preserved
session = requests.Session()
login_url = "https://cas.bjtu.edu.cn/auth/login/?next=/o/authorize/%3Fresponse_type%3Dcode%26client_id%3DaGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo%26state%3D1705809064%26redirect_uri%3Dhttps%3A//mis.bjtu.edu.cn/auth/callback/%3Fredirect_to%3D/home/"
login_page = requests.get(login_url, headers=headers).content
login_html = html.fromstring(login_page)
# locate the captcha image on the login page and download it
img_url = 'https://cas.bjtu.edu.cn/' + login_html.xpath('//*[@id="login"]/dl/dd[2]/div/div[3]/span/img/@src')[0]
img_page = requests.get(img_url, headers=headers).content
with open('./1.jpg', 'wb') as fp:
    fp.write(img_page)
print('Captcha image downloaded!!')
img_result = common_verify(image=image_to_base64('./1.jpg'))
print('Captcha recognised!!')
print(img_result)
after_login_page_url = 'https://cas.bjtu.edu.cn/auth/login/?next=/o/authorize/%3Fresponse_type%3Dcode%26client_id%3DaGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo%26state%3D1705809064%26redirect_uri%3Dhttps%3A//mis.bjtu.edu.cn/auth/callback/%3Fredirect_to%3D/home/'
# the login form data (credentials masked); captcha_1 is the recognised captcha text
data = {
    'next': '/o/authorize/?response_type=code&client_id=aGex8GLTLueDZ0nW2tD3DwXnSA3F9xeFimirvhfo&state=1705809064&redirect_uri=https://mis.bjtu.edu.cn/auth/callback/?redirect_to=/home/',
    'csrfmiddlewaretoken': 'dNjvND4fz99P99Qc2FhYxoFy8hnJGoAgcIWZ2M4Pw7dcMPYO655VGpJlUPez9OlZ',
    'loginname': '*********',
    'password': '***********',
    'captcha_0': '373515fc2ad2c8a9d25c8c938d6285c5c6737296',
    'captcha_1': img_result
}
after_page = session.post(after_login_page_url, data=data, headers=headers)
print(after_page.status_code)
final_page_url = 'https://mis.bjtu.edu.cn/home/'
final_page = session.get(url=final_page_url, headers=headers).text
with open('./a.html', 'w') as fp:
    fp.write(final_page)
About selenium
- How selenium relates to scraping
  - It makes it easy to get data that a site loads dynamically
  - It makes it easy to simulate logging in
Example 1: headless mode and evading detection
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from lxml.html import etree
import requests
from PIL import Image
import base64
import json
from selenium.webdriver.chrome.options import Options

# run Chrome without a visible window (headless)
chrom_options = Options()
chrom_options.add_argument('--headless')
chrom_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrom_options)
driver.get('http://www.baidu.com')
print(driver.page_source)
time.sleep(3)
driver.quit()  # close the browser when finished
Example 2: simulated login to the Beijing Jiaotong University MIS system
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from lxml import html
import requests
from PIL import Image
import base64
from selenium.webdriver.chrome.options import Options
import json
from fake_useragent import UserAgent

# create a UserAgent object and generate a random User-Agent
ua = UserAgent()
user_agent = ua.random
options = Options()
# options.add_argument("--headless")
options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation flag
options.add_argument(f"user-agent={user_agent}")

# the third-party captcha-recognition service endpoint and token
_custom_url = "http://api.jfbym.com/api/YmServer/customApi"
_token = "uJgigF8CS5NR-t8ALI8-LRY2OUjC6UHY294tjnoyIfw"
_headers = {
    'Content-Type': 'application/json'
}
headers = {
    # pretend to be a browser when sending the request
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
}

def common_verify(image, verify_type="50100"):
    payload = {
        "image": base64.b64encode(image).decode(),
        "token": _token,
        "type": verify_type
    }
    resp = requests.post(_custom_url, headers=_headers, data=json.dumps(payload))
    return resp.json()['data']['data']

def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        # read the raw bytes of the image file
        image_content = image_file.read()
    return image_content

driver = webdriver.Chrome(options=options)  # create the Chrome driver
driver.get('https://mis.bjtu.edu.cn/home/')
driver.save_screenshot('./a.png')
# locate the captcha element and crop it out of the screenshot
img_ele = driver.find_element(By.XPATH, '//*[@id="login"]/dl/dd[2]/div/div[3]/span/img')
location = img_ele.location
size = img_ele.size
# the *2 accounts for a 2x (retina) display scale factor
rangle = (
    int(location['x']) * 2,
    int(location['y']) * 2,
    (int(location['x']) + size['width']) * 2,
    (int(location['y']) + size['height']) * 2,
)
i = Image.open('./a.png')
fram = i.crop(rangle)
fram.save('./aa.png')
img_result = common_verify(image=image_to_base64('./aa.png'))
print('Captcha recognised!!')
print(img_result)
time.sleep(3)
# fill in the login form
username = driver.find_element(By.ID, 'id_loginname')
passward = driver.find_element(By.ID, 'id_password')
yzm = driver.find_element(By.ID, 'id_captcha_1')
login_bt = driver.find_element(By.CSS_SELECTOR, '.btn-lg')
username.send_keys('21281201')
time.sleep(3)
passward.send_keys('LPjz9249&')
time.sleep(3)
yzm.send_keys(img_result)
time.sleep(3)
login_bt.click()
time.sleep(4)
r = driver.page_source
hh = html.fromstring(r)
print(hh.xpath('/html/body/div[2]/div/div[2]/div[2]/div[1]/ul/li[2]/a/strong/i/text()'))
# jwxt = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[3]/div/dl/dd[1]/div/ul/li[1]/div/div[2]/h3/a')
# jwxt.click()
# keep the page alive by refreshing every 100 seconds
while True:
    driver.refresh()
    time.sleep(100)
driver.quit()  # close the browser when finished
Final words
Project 1: scraping Douban movie reviews
import requests
from bs4 import BeautifulSoup
from lxml import html
import re
import json
import os
import time
from tqdm import tqdm
from fake_useragent import UserAgent
import pandas as pd

# work out the star rating from the class string (e.g. "allstar50 rating")
def starCnt(x):
    match = re.search(r'allstar(\d+) rating', x)
    if match:
        # extract the number and divide by 5 to get a score out of 10
        extracted_number = float(match.group(1)) / 5.0
        result = round(extracted_number, 1)
        return result
    else:
        return 0
ua = UserAgent()
headers = {
    'User-Agent': ua.random
}
# proxy tunnel domain:port
tunnel = "x236.kdltps.com:15818"
# proxy username and password
username = "t10653676550197"
password = "ghover0v"
proxies = {
    "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel},
    "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel}
}
movie_id = 35131346
params = {
    'percent_type': 'h',
    'limit': 1,
    'status': 'P',
    'sort': 'new_score',
}
while(1):
    try:
        # specify the URL and send the request
        response = requests.get(f'https://movie.douban.com/subject/{movie_id}/comments', headers=headers, params=params, proxies=proxies)
        h = response.text
        hh = html.fromstring(h)
        # XPaths for the movie's basic information
        # movie title
        if len(hh.xpath('//*[@id="content"]/h1/text()')) == 0:
            print('Failed to fetch the movie info, retrying!!!!')
            continue
        movie_name = str(hh.xpath('//*[@id="content"]/h1/text()')[0].split(' ')[0])
        # director
        movie_derector = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[1]/a/text()')[0])
        # main cast
        movie_actor = hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[2]/a/text()')
        # genre
        movie_type = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[3]/text()')[1])
        # region
        movie_field = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[4]/text()')[1])
        # running time
        movie_time = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[5]/text()')[1])
        # release date
        movie_date = str(hh.xpath('//*[@id="content"]/div/div[2]/div[1]/div/span/p[6]/text()')[1])
        break
    except requests.RequestException as e:
        print(f"Error: {e}")
        print("Retrying...")
        time.sleep(2)  # wait a while before retrying
# light pre-cleaning of the scraped fields
movie_name = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_name)
movie_derector = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_derector)
movie_type = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_type)
movie_field = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_field)
movie_time = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_time)
movie_date = re.sub(r"[\s+\.\!\/_,$%^*()+\"\'\?\n]+|[+——!,。?、~@#¥%……&*()【】;:]+|\[.+\]|\[.+\]", "", movie_date)
pa_data = {
    # movie title
    'movie_name': movie_name,
    # director
    'movie_derector': movie_derector,
    # main cast
    'movie_actor': movie_actor,
    # genre
    'movie_type': movie_type,
    # region
    'movie_field': movie_field,
    # running time
    'movie_time': movie_time,
    # release date
    'movie_date': movie_date,
    # everything review-related
    'coments_all': [{
        'coments_type': '',
        # review text
        'content': '',
        # review score
        'starScore': 0,
        # number of "useful" votes
        'usefulCnt': 0,
    }],
}
# params settings for the short-comment loops
loop_info = [{'percent_type': 'h', 'limit': 120}, {'percent_type': 'm', 'limit': 160}, {'percent_type': 'l', 'limit': 120}]
# scrape the short comments
for loop in tqdm(loop_info):
    while(1):
        headers = {
            'User-Agent': ua.random
        }
        param = {
            'percent_type': loop['percent_type'],
            'limit': loop['limit'],
            'status': 'P',
            'sort': 'new_score',
        }
        sort_type = loop['percent_type']
        print(f'Start scraping short comments of category {sort_type}')
        response = requests.get(f'https://movie.douban.com/subject/{movie_id}/comments', headers=headers, params=param, proxies=proxies)
        h = response.text
        hh = html.fromstring(h)
        # XPath for the comment blocks
        comments_body = hh.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div[@class="comment-item "]')
        cnt = 0
        for comment in comments_body:
            # star rating
            starScore = starCnt(str(comment.xpath('./div[2]/h3/span[2]/span[2]/@class')))
            # number of "useful" votes
            usefulCnt = 0
            if len(comment.xpath('./div[2]/h3/span[1]/span/text()')) >= 1:
                usefulCnt = int(comment.xpath('./div[2]/h3/span[1]/span/text()')[0])
            # the comment text itself
            content = ''
            if len(comment.xpath('./div[2]/p/span/text()')) >= 1:
                content = str(comment.xpath('./div[2]/p/span/text()')[0])
            content = re.sub(r"[\s+\.\!\/_$%^*()+\n]+|[+——、~#¥%&*()【】;:]+|\[.+\]|\[.+\]", "", content)
            dict_data = {
                # 'dp' marks a short comment
                'coments_type': 'dp',
                # review text
                'content': content,
                # review score
                'starScore': starScore,
                # number of "useful" votes
                'usefulCnt': usefulCnt,
            }
            pa_data['coments_all'].append(dict_data)
            cnt = cnt + 1
        if cnt == 0:
            print('The request failed, retrying!!!')
            time.sleep(3)
        else:
            print(f'Scraped short comments of category {sort_type}: {cnt} comments')
            break
# scrape the long reviews
# how many to take for 5, 4, 3, 2 and 1 stars respectively
loop_info = [{'percent_type': 'h', 'limit': 20, 'star': 5},
             {'percent_type': 'h', 'limit': 10, 'star': 4},
             {'percent_type': 'm', 'limit': 40, 'star': 3},
             {'percent_type': 'l', 'limit': 25, 'star': 2},
             {'percent_type': 'l', 'limit': 5, 'star': 1}]
cnt = 0
sum = 0
for loop in loop_info:
    cnt = cnt + 1
    star = loop['star']
    limit = loop['limit']
    ccnt = 0
    for start in range(0, 40, 20):
        param = {
            'rating': star,
            'start': start,
        }
        cnt2 = 0
        while(1):
            cnt1 = 0
            print(f'Scraping {star}-star reviews, offset {start}')
            response = requests.get(f'https://movie.douban.com/subject/{movie_id}/reviews', params=param, headers=headers, proxies=proxies).text
            hh_short = html.fromstring(response)
            data_list = hh_short.xpath('//*[@id="content"]/div/div[1]/div[1]/div')
            for comment_list in data_list:
                cnt1 = cnt1 + 1
                ccnt = ccnt + 1
                if ccnt > limit:
                    break
                comment_id = comment_list.xpath('./@data-cid')
                if len(comment_id) < 1:
                    continue
                comment_id = comment_id[0]
                # request the full review page for this review id
                data_response = requests.get(f'https://movie.douban.com/review/{comment_id}/', headers=headers, proxies=proxies).text
                hh_long = html.fromstring(data_response)
                data_all = hh_long.xpath(f'//*[@id="link-report-{comment_id}"]/div[1]/p')
                comments = ""
                for p_all in data_all:
                    now_p = p_all.xpath('.//text()')
                    if len(now_p) < 1:
                        continue
                    now_p = now_p[0]
                    comments = comments + str(now_p)
                comments = re.sub(r"[\s+\.\!\/_$%^*()+\n]+|[+——、~#¥%&*()【】;:]+|\[.+\]|\[.+\]", "", comments)
                # number of "useful" votes shown on the review's button
                useful_bt = hh_long.xpath(f'//*[@id="review-{comment_id}-content"]/div[3]/button[1]/text()')
                usefulCnt = 0
                if len(useful_bt) >= 1:
                    match = re.search(r'\n 有用 (\d*)\n ', useful_bt[0])
                    usefulCnt = int(match.group(1))
                print(usefulCnt)
                dict_data = {
                    # 'yp' marks a long review
                    'coments_type': 'yp',
                    # review text
                    'content': comments,
                    # review score
                    'starScore': star * 2,
                    # number of "useful" votes
                    'usefulCnt': usefulCnt,
                }
                pa_data['coments_all'].append(dict_data)
            if cnt1 == 0 and cnt2 < 5:
                print(f'Failed to scrape the {star}-star reviews!! Retrying!!')
                cnt2 = cnt2 + 1
            else:
                break
        if ccnt >= limit:
            break
    print(f'Finished scraping the {star}-star reviews')
# create the folder for the JSON dataset
movie_datajson_dir = './movie_data_json'
try:
    os.makedirs(movie_datajson_dir)
    print(f'Folder "{movie_datajson_dir}" created')
except FileExistsError:
    print(f'Folder "{movie_datajson_dir}" already exists')
except Exception as e:
    print(f'Error while creating the folder: {e}')
# path of the JSON file to write
new_json_name = str(movie_id) + '.json'
json_file_path = os.path.join(movie_datajson_dir, new_json_name)
# use json.dumps to turn the dict into a JSON-formatted string
json_data = json.dumps(pa_data, indent=2, ensure_ascii=False)
# write the JSON string to the file
with open(json_file_path, 'w', encoding='utf-8') as json_file:
    json_file.write(json_data)
print(f'Data has been written to {json_file_path}')
# build the final Excel file from the generated JSON data
read_json_path = f'./movie_data_json/{movie_id}.json'
with open(read_json_path, 'r') as file:
    data = json.load(file)
coments_all_list = data.get('coments_all', [])
print(coments_all_list)
# column -> list of values (类型 = type, 得分 = score, 有用数 = useful votes, 内容 = content)
data_dict = {
    '类型': [],
    '得分': [],
    '有用数': [],
    '内容': [],
}
for coments in coments_all_list:
    lx = coments['coments_type']
    df = coments['starScore']
    yys = coments['usefulCnt']
    nr = coments['content']
    data_dict['类型'].append(lx)
    data_dict['得分'].append(df)
    data_dict['有用数'].append(yys)
    data_dict['内容'].append(nr)
dt = pd.DataFrame(data_dict)
print(dt)
# create the folder for the Excel dataset
movie_dataexcel_dir = './movie_data_excel'
try:
    os.makedirs(movie_dataexcel_dir)
    print(f'Folder "{movie_dataexcel_dir}" created')
except FileExistsError:
    print(f'Folder "{movie_dataexcel_dir}" already exists')
except Exception as e:
    print(f'Error while creating the folder: {e}')
# path of the Excel file to write
new_excel_name = str(movie_id) + '.xlsx'
new_excel_name = os.path.join(movie_dataexcel_dir, new_excel_name)
print(new_excel_name)
dt.to_excel(new_excel_name, index=False, header=False)
Hi everyone, the images were inserted from my local Typora so they don't display; I'll fix them once I'm less busy, but they aren't really important anyway hh
Could I ask the OP whether a session can be used to log in to Luogu? I tried before but the captcha was always rejected, so in the end I used selenium, which is far too slow 😢
Is it the captcha that's failing? I use the Yunma (云码) platform for captcha recognition; take a look at my captcha section, and you're welcome to use my token.
I double-checked and the captcha recognition is fine. I captured the URL that the login form posts the username/password/captcha parameters to, and after sending the request with requests.Session the status_code is 200, but requesting the pages that require login still fails and I'm told to log in. This is the login URL, so I don't think I got it wrong 😢
You could take a look at this video: https://www.bilibili.com/video/BV17v4y1w7rA/?spm_id_from=333.999.0.0 The login probably triggers a redirect; I'll write about this later. A rough sketch of how to check for it is below.
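A rough check (placeholder URL and form fields, not the real Luogu endpoint):

```python
import requests

session = requests.Session()
resp = session.post('https://example.com/login',
                    data={'user': 'u', 'password': 'p', 'captcha': 'xxxx'})
print(resp.status_code)   # 200 here may just be the page you were redirected to
print(resp.history)       # a non-empty list of 30x responses means a redirect happened
print(session.cookies)    # check whether the login actually set any session cookies
```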
Thanks! Using selenium the whole time is just too slow 😢
Right, and selenium also needs a browser driver. If a project's data has to come from scraping, simulating the login with requests is usually the better fit.
Nice