Introduction
DrissionPage is broadly similar to Selenium, which we studied earlier: it is also a Python-based page automation tool. The difference is that DrissionPage can both control a browser and send/receive packets directly, combining the two capabilities in one tool. It is more powerful than Selenium and its syntax is more concise and elegant, so with a Selenium background DrissionPage is easy to pick up.
Official site: https://www.drissionpage.cn/
Environment setup
Operating systems: Windows, Linux, and macOS.
Python version: 3.6 or later
Supported browsers: Chromium-based (e.g. Chrome and Edge)
pip install DrissionPage
Element locating
This section covers creating a browser object and using its built-in methods to visit a site and locate elements.
ele: locates a single element matching the condition (supports XPath as well as selector syntax; a timeout can be set). Parameter reference: https://www.drissionpage.cn/browser_control/get_elements/find_in_object/#-ele
import time
from DrissionPage.common import By
from DrissionPage import ChromiumPage
# 1. Create the browser object and open the Baidu homepage
page = ChromiumPage()
page.get('https://www.baidu.com')
# 2. Maximize the browser window
page.set.window.max()
# 3. Type the search keyword and submit
# page.ele('xpath://input[@id="kw"]').input('jk')
# page.ele('xpath://input[@id="su"]').click()
page.ele((By.XPATH, '//input[@id="kw"]')).input('jk')
page.ele((By.XPATH, '//input[@id="su"]')).click()
time.sleep(5)
# 4. Close the browser
page.quit()
Waiting for elements
This section covers the page object's wait.eles_loaded method, which waits for one element, or all given elements, to be loaded into the page.
Parameter reference: https://www.drissionpage.cn/browser_control/waiting/#-waiteles_loaded
from DrissionPage import ChromiumPage
page = ChromiumPage()
page.get('https://www.baidu.com')
flag = page.wait.eles_loaded('xpath://input[@id="su"]', timeout=2)
print(flag)
page.quit()
Locating multiple elements
This section covers the eles method for locating multiple elements, and how to extract text and attribute values from ele / eles objects.
eles: locates every element matching the condition; the return value is a list. Parameter reference: https://www.drissionpage.cn/browser_control/get_elements/find_in_object/#-eles
from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://movie.douban.com/top250')
div_list = page.eles("xpath://ol[@class='grid_view']/li/div[@class='item']")
for item in div_list:
    movie_name = item.ele('xpath:./div[@class="info"]/div[@class="hd"]//span[1]').text
    movie_link = item.ele('xpath:./div[@class="pic"]/a').attr('href')
    """
    attr: gets the value of one attribute
    attrs: gets every attribute and value of the current element, returned as a dict
    Details: https://www.drissionpage.cn/SessionPage/get_ele_info/#%EF%B8%8F%EF%B8%8F-attrs
    """
    # movie_attr = item.ele('xpath:./div[@class="pic"]/a').attrs
    print(movie_name, movie_link)
page.quit()
Switching into iframes
In Selenium, controlling content inside an iframe requires the switch_to.frame method; DrissionPage needs no such step.
DrissionPage can also operate on an iframe individually: https://www.drissionpage.cn/browser_control/iframe
import time
from DrissionPage.common import By
from DrissionPage import ChromiumPage
page = ChromiumPage()
page.get('https://www.douban.com')
page.ele((By.CLASS_NAME, 'account-tab-account')).click()
page.ele((By.ID, 'username')).input('admin')
page.ele((By.ID, 'password')).input('admin123')
time.sleep(3)
page.quit()
import time
from DrissionPage import ChromiumPage

class MyEmail:
    def __init__(self):
        self.page = ChromiumPage()
        self.url = 'https://mail.163.com/'

    def login_email(self, email, password):
        self.page.get(self.url)
        self.page.ele('xpath://input[@name="email"]').input(email)
        self.page.ele('xpath://div[@class="u-input box"]//input[@name="password"]').input(password)
        self.page.ele('xpath://*[@id="dologin"]').click()
        time.sleep(10)
        self.page.quit()

if __name__ == '__main__':
    my_email = MyEmail()
    my_email.login_email('admin@163.com', 'admin123')
API listening (Hunan government procurement site)
This section covers using DrissionPage to listen on a given API endpoint and read its data directly.
page.listen.start('endpoint address'): starts the listener
page.listen.steps(count=number of packets to capture): returns an iterable for a for loop; each iteration yields one packet
<DataPacket>.response.body: the data carried in the packet
Target site: http://www.ccgp-hunan.gov.cn/page/content/more.jsp?column_code=2
Parameter reference: https://www.drissionpage.cn/browser_control/listener/#-listenstart
from DrissionPage.common import By
from DrissionPage import ChromiumPage
page = ChromiumPage()
url = 'http://www.ccgp-hunan.gov.cn/page/content/more.jsp?column_code=2'
# API listening: start the listener before making the request
page.listen.start('/mvc/getContentList.do')
page.get(url)
# Get the API data (page.listen.steps() returns an iterator)
# print(page.listen.steps())
# Fetch packets in a loop (count: number of packets to capture)
page_num = 1
for item in page.listen.steps(count=8):
    # Print the captured packet object
    # print(item)
    # Print the data
    print(f'Page {page_num} (data type [{type(item.response.body)}]):', item.response.body)
    # Next page
    flag = page.ele((By.LINK_TEXT, '下一页'), timeout=3)
    if flag:
        flag.click()
    else:
        print('No more pages...')
        page.quit()
        break
    page_num += 1
Action chains (Runoob example)
This section covers the built-in methods of the actions action-chain object.
Method reference: https://www.drissionpage.cn/browser_control/actions/#%EF%B8%8F-%E4%BD%BF%E7%94%A8%E6%96%B9%E6%B3%95
import time
from DrissionPage.common import By
from DrissionPage import ChromiumPage
page = ChromiumPage()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
page.get(url)
source = page.ele((By.ID, 'draggable'))
target = page.ele((By.ID, 'droppable'))
page.actions.hold(source).release(target)
time.sleep(2)
page.quit()
Slider captcha (Yunnan construction regulation site)
import time
from DrissionPage.common import By
from DrissionPage import ChromiumPage
url = 'https://ynjzjgcx.com/dataPub/enterprise'
page = ChromiumPage()
page.get(url)
# Locate the slider control and press it
button = page.ele((By.CLASS_NAME, 'slide-verify-slider-mask-item'))
page.actions.hold(button)
# Move right by the given number of pixels, then release
page.actions.right(150).release()
time.sleep(3)
page.quit()
"""
Actually bypassing the captcha requires a dedicated OCR / image-recognition library.
"""
Using SessionPage
SessionPage uses requests for its network connections, so every request method built into requests is available: get(), post(), head(), options(), put(), patch(), delete().
Method reference: https://www.drissionpage.cn/SessionPage/visit/
from DrissionPage import SessionPage

url = 'http://www.ccgp-hunan.gov.cn/mvc/getContentList.do'
form_data = {
    'column_code': 2,
    'title': '',
    'pub_time1': '',
    'pub_time2': '',
    'own_org': 1,
    'page': 1,
    'pageSize': 18
}
page = SessionPage()
page.post(url, data=form_data)
print(page.user_agent)  # request headers are built automatically
print(page.response.json())
# page.close()
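To crawl several pages through this API, only the page field in the form needs to change. A minimal sketch of a payload builder (build_form is a name of my own, not part of DrissionPage):

```python
def build_form(page_num: int, page_size: int = 18) -> dict:
    """Build the form payload for one page of the getContentList.do API."""
    return {
        'column_code': 2,
        'title': '',
        'pub_time1': '',
        'pub_time2': '',
        'own_org': 1,
        'page': page_num,
        'pageSize': page_size,
    }

# Payloads for the first three pages
payloads = [build_form(n) for n in range(1, 4)]
print([p['page'] for p in payloads])  # → [1, 2, 3]
```

Each payload can then be posted in a loop with page.post(url, data=build_form(n)).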
Using WebPage
WebPage is a page object that combines the two above: it can control a browser and send/receive packets, and login state is shared between the two.
It has two working modes: d mode and s mode. d mode controls the browser; s mode sends and receives packets. WebPage can switch between the modes, but is only ever in one of them at a time.
In recent versions the author has deliberately de-emphasized this object, and it is no longer recommended.
Official documentation: https://drissionpage.cn/DP32Docs/get_start/basic_concept/#webpage
import time
from DrissionPage import WebPage

# Create the page object in d mode (the default)
page = WebPage('d')
# Open Baidu
page.get('http://www.baidu.com')
# Locate the input box and type the keyword
page.ele('#kw').input('DrissionPage')
# Click the "百度一下" button
page.ele('@value=百度一下').click()
# Wait for the page to start loading
page.wait.load_start()
# Switch to s mode
page.change_mode()
# Get all result elements
results = page.eles('tag:h3')
# Iterate over the results and print their text
for result in results:
    print(result.text)
time.sleep(3)
page.quit()
Docin (signed request headers)
A crawler example where the request headers carry an encrypted signature.
Target site: https://kaoyan.docin.com/pdfreader/web/#/docin/documents?type=1&keyword=%E5%A4%8D%E8%AF%95%E4%BB%BF%E7%9C%9F%E6%A8%A1%E6%8B%9F
import requests

headers = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Content-Type": "application/json",
    "Origin": "https://kaoyan.docin.com",
    "Pragma": "no-cache",
    "Referer": "https://kaoyan.docin.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "X-Application": "Pdfreader.Web",
    "X-Nonce": "02fec55d-8a78-2453-cb18-42f62ef46732",
    "X-Sign": "C0FC7B5784DDE2FD774DD00C6AB89BE5",
    "X-Timestamp": "1733317931",
    "X-Token": "null",
    "X-Version": "V2.2",
    "sec-ch-ua": "Google Chrome;v=131, Chromium;v=131, Not_A Brand;v=24",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "macOS"
}
url = "https://www.handebook.com/api/web/document/list"
json_data = {
    "SearchType": 0,
    "SearchKeyword": "复试仿真模拟",
    "DocumentType": " ",
    "UniversityCode": "",
    "MajorCode": "",
    "ExamSubjectList": [],
    "PageIndex": 1,
    "PageSize": 30
}
response = requests.post(url, headers=headers, json=json_data)
print(response.json())
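Of the signed headers above, X-Timestamp appears to be a plain Unix timestamp in seconds and X-Nonce a UUID-shaped random string; X-Sign is the actual signature and would have to be reverse-engineered from the site's JavaScript (not covered here). Assuming that reading is right, the first two are trivial to generate:

```python
import time
import uuid

x_timestamp = str(int(time.time()))  # seconds-resolution Unix timestamp, e.g. "1733317931"
x_nonce = str(uuid.uuid4())          # random UUID-shaped nonce
print(x_timestamp.isdigit(), len(x_nonce))  # → True 36
```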
import pymongo
from DrissionPage import WebPage
from DrissionPage.common import By

class DouDing:
    mongo_client = pymongo.MongoClient()

    def __init__(self):
        self.url = 'https://kaoyan.docin.com/pdfreader/web/#/docin/documents?type=1&keyword=%E5%A4%8D%E8%AF%95%E4%BB%BF%E7%9C%9F%E6%A8%A1%E6%8B%9F'
        # self.page = WebPage(mode='d')  # mode selects session or driver mode; the default is driver
        self.page = WebPage()

    def get_info(self):
        self.page.listen.start('/api/web/document/list')
        self.page.get(self.url)
        for item in self.page.listen.steps(count=20):
            # print(item)
            self.parse_info(item.response.body)
            # Next page
            self.page.ele((By.CLASS_NAME, 'btn-next'), timeout=3).click()
        self.page.quit()

    def parse_info(self, info):
        for temp in info['Data']['DocumentInfos']:
            item = dict()
            item['DocumentGuid'] = temp['DocumentGuid']
            item['DocumentName'] = temp['DocumentName']
            item['DocumentPrice'] = temp['DocumentPrice']
            self.save_info(item)

    def save_info(self, info_dict):
        collection = self.mongo_client['py_spider']['dou_ding']
        collection.insert_one(info_dict)
        print('Saved:', info_dict)

if __name__ == '__main__':
    dou_ding = DouDing()
    dou_ding.get_info()
File downloads
DrissionPage has a built-in downloader, so there is no need to build open() file objects by hand.
Official docs: https://www.drissionpage.cn/download/intro
import os
from DrissionPage import SessionPage

url = 'https://www.lpbzj.vip/allimg'
page = SessionPage()
page.get(url)
element_div = page.s_eles("xpath://div[@id='posts']")[0]
detail_url_list = element_div.s_eles("xpath:.//div[@class='img']/a")
save_path = './美女写真/'
os.makedirs(save_path, exist_ok=True)  # make sure the target directory exists
for detail_url in detail_url_list:
    page.get(detail_url.attr('href'))
    div_element = page.s_eles("xpath://div[@class='article-content clearfix']")[0]
    image_url_list = div_element.s_eles("xpath:.//img")
    for image_url in image_url_list:
        img_src = image_url.attr('src')
        # Synchronous download
        # res = page.download(img_src, save_path)
        # print('task status:', res)
        # Queue the task for concurrent download
        page.download.add(img_src, save_path)
        print(f"Downloading image from: {img_src}")
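page.download.add() only queues a task; the built-in downloader then runs the queued tasks concurrently in the background. The same queue-then-run pattern can be illustrated with the standard library alone (fetch_image here is a dummy stand-in for a real HTTP download):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_image(src: str) -> str:
    # Dummy stand-in for a real HTTP download
    return f'saved {src}'

srcs = ['a.jpg', 'b.jpg', 'c.jpg']
with ThreadPoolExecutor(max_workers=3) as pool:
    # submit/map queues the tasks, like page.download.add(); map keeps input order
    results = list(pool.map(fetch_image, srcs))
print(results)  # → ['saved a.jpg', 'saved b.jpg', 'saved c.jpg']
```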
1 - Vipshop API scraping (hands-on project)
import json
from DrissionPage.common import By
from DrissionPage import ChromiumPage

page = ChromiumPage()
url = 'https://category.vip.com/suggest.php?keyword=%E7%94%B5%E8%84%91&ff=235%7C12%7C1%7C1&tfs_url=%2F%2Fmapi-pc.vip.com%2Fvips-mobile%2Frest%2Fshopping%2Fpc%2Fsearch%2Fproduct%2Frank&page=1'
page.listen.start('vips-mobile/rest/shopping/pc/product/module/list/v2')
page.get(url)
for item in page.listen.steps():
    response_body = item.response.body
    try:
        # Extract the JSON portion of the response body
        start_index = response_body.find('{')
        end_index = response_body.rfind('}') + 1
        json_str = response_body[start_index:end_index]
        info_dict = json.loads(json_str)
        for temp in info_dict['data']['products']:
            shop_info = dict()
            shop_info['title'] = temp['title']
            shop_info['brand'] = temp['brandShowName']
            shop_info['price'] = temp['price']['salePrice']
            print(shop_info)
        button = page.ele((By.CLASS_NAME, 'cat-paging-next'), timeout=3)
        if button:
            button.click()
        else:
            print('Crawl finished...')
            page.quit()
    except AttributeError:
        # .find() raises AttributeError when the body is not a string
        print('Empty data:', response_body)
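The brace-slicing trick above (first { to last }) strips a JSONP-style callback wrapper before parsing. It can be checked offline on a plain string (the callback name and payload below are made up):

```python
import json

# Made-up sample of a JSONP-wrapped response body
response_body = 'sampleCallback({"data": {"products": [{"title": "ThinkPad"}]}})'
start_index = response_body.find('{')      # index of the first '{'
end_index = response_body.rfind('}') + 1   # one past the last '}'
info_dict = json.loads(response_body[start_index:end_index])
print(info_dict['data']['products'][0]['title'])  # → ThinkPad
```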
2 - Xiaohongshu homepage feed scraping
from DrissionPage import ChromiumPage

page = ChromiumPage()
page.listen.start('api/sns/web/v1/homefeed')
page.get('https://www.xiaohongshu.com/explore')
while True:
    # Scroll to the bottom to trigger the next batch of feed requests
    js_code = "document.documentElement.scrollTop = document.documentElement.scrollHeight"
    page.run_js(js_code)
    # Returns False if nothing was captured, otherwise a list of DataPacket objects
    is_api_list = page.listen.wait(count=5, timeout=1)
    print('Packet status:', is_api_list)
    if is_api_list:
        for item in is_api_list:
            print(item.response.body)
3 - Xiaohongshu auto-comment script (for self-study)
import time
from DrissionPage import ChromiumPage

search = str(input("Enter a keyword: "))
content = str(input("Enter the comment text: "))
page = ChromiumPage()
page.get('https://www.xiaohongshu.com/search_result?keyword=' + search + '&source=unknown&type=51')
print("Page loaded!")
time.sleep(2)
for time_button in range(1, 20):  # scroll as many times as you need
    time.sleep(2)
    page.scroll.to_bottom()
    print("Scrolled", time_button, "times;", 20 - time_button, "more before scraping starts...")
print("Scrolling finished, collecting links from the page!")
my_list = list()
ele = page.eles('.cover ld mask')
name_ele = page.eles('.title')
for href, name in zip(ele, name_ele):
    lian = href.link
    na = name.text
    print(na, lian)
    my_list.extend(lian.split(','))
sums = 0
print("Collected", len(my_list), "items this run")
for like_list in my_list:
    sums = sums + 1
    print("No.:", sums, "link:", like_list)
    page.get(like_list)
    time.sleep(1)
    page.ele('.chat-wrapper').click()
    page.ele('.content-input').input(content)
    time.sleep(0.5)
    page.ele('.btn submit').click()
    print("Comment sent:", content)
    time.sleep(2)
    print("*" * 30)
input("Main program finished...")
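The script above splices the raw keyword into the URL, which only works if the keyword is already percent-encoded. The standard library can build the query string safely:

```python
from urllib.parse import urlencode

keyword = '考研'
params = {'keyword': keyword, 'source': 'unknown', 'type': 51}
# urlencode percent-encodes the Chinese keyword automatically
url = 'https://www.xiaohongshu.com/search_result?' + urlencode(params)
print(url)  # → https://www.xiaohongshu.com/search_result?keyword=%E8%80%83%E7%A0%94&source=unknown&type=51
```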