First, set up the Splash environment:
https://blog.csdn.net/weixin_43343144/article/details/88022756
Errors you may run into when installing Splash:
https://blog.csdn.net/weixin_43343144/article/details/89305941
Install scrapy-splash:
pipenv install scrapy-splash
Official Splash documentation:
https://splash.readthedocs.io/en/stable/api.html#render-html
Official scrapy-splash documentation:
https://github.com/scrapy-plugins/scrapy-splash
A translation of the scrapy-splash documentation by another author:
https://www.e-learn.cn/content/qita/800748
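scrapy-splash usage summary: replacing Request with SplashRequest is enough to scrape pages whose content is rendered by JavaScript. SplashRequest subclasses Request and adds parameters that control the JavaScript rendering. Using it works on the same principle as scraping a static page, with one extra step: Splash renders the page first.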
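To make the extra rendering step concrete, here is a minimal sketch of the kind of URL that Splash's render.html endpoint (documented at the API link above) accepts. It assumes the endpoint is queried with plain GET parameters; scrapy-splash builds its own request to Splash, so this is illustrative only. The helper name build_render_html_url is hypothetical and not part of either library.

```python
from urllib.parse import urlencode

# Same Splash address as the one used in settings.py in this post.
SPLASH_URL = "http://192.168.99.100:8050"


def build_render_html_url(target_url, splash_url=SPLASH_URL, **splash_args):
    """Hypothetical helper: build a render.html URL that asks Splash
    to load target_url and return the rendered HTML."""
    params = {"url": target_url, **splash_args}
    return f"{splash_url.rstrip('/')}/render.html?{urlencode(params)}"


if __name__ == "__main__":
    # images=0 and timeout=30 mirror the args passed to SplashRequest in the spider.
    print(build_render_html_url("http://quotes.toscrape.com/js", images=0, timeout=30))
```

Opening the printed URL in a browser (with Splash running) is a quick way to check that Splash itself renders the page before involving Scrapy.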
settings.py configuration file:
SPLASH_URL = 'http://192.168.99.100:8050'
# Add the following three entries to DOWNLOADER_MIDDLEWARES in settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Add the following entry to SPIDER_MIDDLEWARES in settings.py
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
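Since a missing entry here tends to show up only as confusing behaviour at crawl time, a small sanity check can help. The sketch below is a hypothetical helper, not part of Scrapy or scrapy-splash. It assumes the settings are available as a plain dict and only checks for the entries listed above; HTTPCACHE_STORAGE is skipped because it only matters when the HTTP cache is enabled.

```python
# Hypothetical helper: reports which Splash-related settings are missing.
REQUIRED_DOWNLOADER_MIDDLEWARES = (
    "scrapy_splash.SplashCookiesMiddleware",
    "scrapy_splash.SplashMiddleware",
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware",
)
DEDUPE_MIDDLEWARE = "scrapy_splash.SplashDeduplicateArgsMiddleware"
DUPEFILTER = "scrapy_splash.SplashAwareDupeFilter"


def missing_splash_settings(settings):
    """Return a list of problems; an empty list means nothing is missing."""
    problems = []
    if not settings.get("SPLASH_URL"):
        problems.append("SPLASH_URL is not set")
    downloader = settings.get("DOWNLOADER_MIDDLEWARES", {})
    for path in REQUIRED_DOWNLOADER_MIDDLEWARES:
        if path not in downloader:
            problems.append(f"DOWNLOADER_MIDDLEWARES lacks {path}")
    if DEDUPE_MIDDLEWARE not in settings.get("SPIDER_MIDDLEWARES", {}):
        problems.append(f"SPIDER_MIDDLEWARES lacks {DEDUPE_MIDDLEWARE}")
    if settings.get("DUPEFILTER_CLASS") != DUPEFILTER:
        problems.append(f"DUPEFILTER_CLASS is not {DUPEFILTER}")
    return problems


# The same values as in the settings.py shown above.
example_settings = {
    "SPLASH_URL": "http://192.168.99.100:8050",
    "DOWNLOADER_MIDDLEWARES": {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    },
    "SPIDER_MIDDLEWARES": {DEDUPE_MIDDLEWARE: 100},
    "DUPEFILTER_CLASS": DUPEFILTER,
}

if __name__ == "__main__":
    for problem in missing_splash_settings(example_settings):
        print(problem)
```

With the configuration from this post the loop prints nothing; removing any entry from example_settings produces one line per missing item.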
quotes.py spider file:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.http import Response


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    # start_urls = ['http://quotes.toscrape.com/js/']
    url = "http://quotes.toscrape.com/js"

    def start_requests(self):
        # SplashRequest can read the JavaScript-rendered page directly
        # (this requires a working Splash environment).
        # SplashRequest subclasses Request and adds the Splash integration.
        yield SplashRequest(url=self.url, callback=self.parse,
                            args={"images": 0, "timeout": 30})

    def parse(self, response):
        res = response  # type: Response
        # The quote text only exists in the HTML after Splash has run the page's JavaScript.
        quotes = res.css("div.quote .text::text").getall()
        # Yield one item per quote so the crawl produces output.
        for quote in quotes:
            yield {"text": quote}
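Copyright notice: this is an original article by weixin_43343144, published under the CC 4.0 BY-SA license. When reposting, include a link to the original source and this notice.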