Scoping out the site

Open the site.
Scrolling down, we see the page loads content dynamically.
Keep scrolling and it stops at 500 images, so the daily ranking serves 500 pictures per day.
Well then.
Scraping a single image

Click into one of the images.
The image can be enlarged, so naturally we want the high-resolution version.
Press F12 and inspect the page.
This URL turns out to be the original full-size image!
Let's fetch the page source and save it to a txt file.
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
url = 'https://www.pixiv.net/artworks/85300112'
response = requests.get(url, headers=headers)
# save the page source for inspection
f = open('M:/a.txt', 'wb')
f.write(response.text.encode('utf8'))
f.close()
Copy the original-image URL we just found and search for it in the txt file:
Looking at the surrounding text, the key right before that URL is original. Searching for original gives only three matches; of the other two, one is capitalized and one is unquoted, so the quoted, lowercase "original" key is unique. That means a regex keyed on it can pull the original-image URL straight out.
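Rather than eyeballing the search results, you can confirm the key's uniqueness programmatically. A minimal sketch, using a made-up stand-in string instead of the real page source:

```python
import re

# Stand-in for the saved page source: it contains a capitalized
# "Original", an unquoted original, and the quoted lowercase key.
html = ('{"Original":1,"urls":{original:0,'
        '"original":"https://i.pximg.net/img-original/img/sample_p0.jpg"},"tags":[]}')

# Only the quoted, lowercase form should match.
matches = re.findall('"original"', html)
print(len(matches))  # 1
```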
Now extract the original-image URL with a regex:
import re
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
url = 'https://www.pixiv.net/artworks/85300112'
response = requests.get(url, headers=headers)
picture = re.search('"original":"(.+?)"},"tags"', response.text)
print(picture.group(1))
And there it is: the original-image URL, extracted successfully!
With the original URL in hand, let's download and save the image:
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
url = 'https://www.pixiv.net/artworks/85300112'
response = requests.get(url, headers=headers)
picture = re.search('"original":"(.+?)"},"tags"', response.text)
print(picture.group(1))
pic = requests.get(picture.group(1), headers=headers)
f = open('M:/1.%s' % (picture.group(1)[-3:]), 'wb')
f.write(pic.content)
f.close()
Got it!
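A side note: the `[-3:]` slice assumes a three-letter extension, which happens to hold for jpg/png but would mangle a .jpeg. A slightly more robust sketch, run on a made-up URL in the style of a pixiv original-image address:

```python
import os

# Made-up example URL in the style of a pixiv original-image address.
url = 'https://i.pximg.net/img-original/img/2020/11/07/00/00/01/12345678_p0.jpg'

# splitext keeps whatever follows the last dot, regardless of length.
ext = os.path.splitext(url)[1].lstrip('.')
print(ext)  # jpg
```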
Naming the files by hand is painful, so let's extract the image's own title too. Just above the spot where we found the original URL sits a "illustTitle" key; a quick search shows this key is unique as well. Great, so we write one more regex to grab the title.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
url = 'https://www.pixiv.net/artworks/85300112'
response = requests.get(url, headers=headers)
name = re.search('"illustTitle":"(.+?)"', response.text)
print(name.group(1))
picture = re.search('"original":"(.+?)"},"tags"', response.text)
print(picture.group(1))
pic = requests.get(picture.group(1), headers=headers)
f = open('M:/%s.%s' % (name.group(1), picture.group(1)[-3:]), 'wb')
f.write(pic.content)
f.close()
Runs fine.
Now wrap the single-image logic into a function:
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
# save path
path = 'M:/'

def getSinglePic(url):
    response = requests.get(url, headers=headers)
    # extract the image title
    name = re.search('"illustTitle":"(.+?)"', response.text)
    # extract the original image URL
    picture = re.search('"original":"(.+?)"},"tags"', response.text)
    pic = requests.get(picture.group(1), headers=headers)
    f = open(path + '%s.%s' % (name.group(1), picture.group(1)[-3:]), 'wb')
    f.write(pic.content)
    f.close()
Test it on a different image:
url = 'https://www.pixiv.net/artworks/85317626'
getSinglePic(url)
No problems:
Getting every image URL in the daily ranking
While scoping out the site we already noticed the page loads dynamically, so scroll to the bottom and check the Network tab.
Comparing the captured requests, every one hits https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=?&format=json and only the value of p changes. There are 10 requests in total, each carrying 50 images, so we can simply request this URL with p running from 1 to 10 to get the info for all 500 images!
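The pagination above can be sketched by enumerating the ten endpoints up front (no request is actually made here):

```python
# Build the ten paginated JSON endpoints; p runs from 1 to 10.
base = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%d&format=json'
pages = [base % p for p in range(1, 11)]
print(len(pages))  # 10
print(pages[0])
```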
- But what exactly should we extract here?
- Expanding a single entry shows two image URLs, but clicking through reveals they are not the originals; the resolution is poor.
- We already have single-image scraping working, so all we need is the artwork-page address of each of the 500 images and we can grab the originals directly.
- Those addresses are not in the JSON, though. Opening a few images and comparing their URLs:
  https://www.pixiv.net/artworks/85311013
  https://www.pixiv.net/artworks/85318602
  https://www.pixiv.net/artworks/85316875
- Only the trailing number changes, and that number is exactly the illust_id, so extracting every illust_id is all we need!
Let's see where illust_id lives in the response.
Searching for it finds exactly 50 occurrences!
Straight to a regex to extract illust_id:
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=10&format=json'
res = requests.get(url, headers=headers)
illust_id = re.findall(r'"illust_id":(\d+?),', res.text)
print(len(illust_id), illust_id)
The output shows we got them all.
Now glue each ID onto the base address to get the artwork URLs:
url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=10&format=json'
res = requests.get(url, headers=headers)
illust_id = re.findall(r'"illust_id":(\d+?),', res.text)
picUrl = ['https://www.pixiv.net/artworks/' + i for i in illust_id]
for i in picUrl:
    print(i)
Output: success.
Downloading every image
Wrap the URL-gathering code into a function, and call the single-image downloader from inside it:
def getAllPicUrl():
    count = 1
    for n in range(1, 10 + 1):
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%d&format=json' % n
        response = requests.get(url, headers=headers)
        illust_id = re.findall(r'"illust_id":(\d+?),', response.text)
        picUrl = ['https://www.pixiv.net/artworks/' + i for i in illust_id]
        for url in picUrl:
            print('Downloading image %d' % count, end=' ')
            getSinglePic(url)
            print('done')
            count += 1
    return None
Run a test:
getAllPicUrl()
Runs perfectly.
Full code (or so we thought)
import requests
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
# download path
path = 'M:/'

def getSinglePic(url):
    response = requests.get(url, headers=headers)
    # extract the image title
    name = re.search('"illustTitle":"(.+?)"', response.text)
    # extract the original image URL
    picture = re.search('"original":"(.+?)"},"tags"', response.text)
    pic = requests.get(picture.group(1), headers=headers)
    f = open(path + '%s.%s' % (name.group(1), picture.group(1)[-3:]), 'wb')
    f.write(pic.content)
    f.close()

def getAllPicUrl():
    count = 1
    for n in range(1, 10 + 1):
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%d&format=json' % n
        response = requests.get(url, headers=headers)
        illust_id = re.findall(r'"illust_id":(\d+?),', response.text)
        picUrl = ['https://www.pixiv.net/artworks/' + i for i in illust_id]
        for url in picUrl:
            print('Downloading image %d' % count, end=' ')
            getSinglePic(url)
            print('done')
            count += 1
    return None

getAllPicUrl()
Happily downloading away, and then...
Look at the error message: invalid path name? A file can't be named ****.jpg?
Let's check the folder.
...seriously?!
Fine. We'll replace the illegal characters in the name.
# global counter, used to keep sanitized names from colliding
repeat = 1

name = re.search('"illustTitle":"(.+?)"', response.text)
name = name.group(1)
# replace characters Windows forbids in filenames
if re.search(r'[\\/*?":<>|]', name) is not None:
    name = re.sub(r'[\\/*?":<>|]', str(repeat), name)
    repeat += 1
Run it again, and the problem is gone.
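The sanitizing step can be exercised on its own. A standalone sketch of the same logic, run on a made-up title (each forbidden character becomes the current counter value, then the counter advances):

```python
import re

# Character class of names Windows forbids in filenames.
ILLEGAL = r'[\\/*?":<>|]'

def sanitize(name, repeat):
    # Replace every illegal character with the counter, then bump it
    # so the next sanitized name gets a different digit.
    if re.search(ILLEGAL, name):
        name = re.sub(ILLEGAL, str(repeat), name)
        repeat += 1
    return name, repeat

name, repeat = sanitize('who?am*i', 1)
print(name)    # who1am1i
print(repeat)  # 2
```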
Full code (for real this time)
import requests
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://www.pixiv.net/ranking.php?mode=daily&content=illust',
}
# download path
path = 'M:/'
# global counter, used to keep sanitized names from colliding
repeat = 1

def getSinglePic(url):
    global repeat
    response = requests.get(url, headers=headers)
    # extract the image title
    name = re.search('"illustTitle":"(.+?)"', response.text)
    name = name.group(1)
    # replace characters Windows forbids in filenames
    if re.search(r'[\\/*?":<>|]', name) is not None:
        name = re.sub(r'[\\/*?":<>|]', str(repeat), name)
        repeat += 1
    # extract the original image URL
    picture = re.search('"original":"(.+?)"},"tags"', response.text)
    pic = requests.get(picture.group(1), headers=headers)
    f = open(path + '%s.%s' % (name, picture.group(1)[-3:]), 'wb')
    f.write(pic.content)
    f.close()

def getAllPicUrl():
    count = 1
    for n in range(1, 10 + 1):
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%d&format=json' % n
        response = requests.get(url, headers=headers)
        illust_id = re.findall(r'"illust_id":(\d+?),', response.text)
        picUrl = ['https://www.pixiv.net/artworks/' + i for i in illust_id]
        for url in picUrl:
            print('Downloading image %d' % count, end=' ')
            getSinglePic(url)
            print('done')
            count += 1
    return None

getAllPicUrl()