1. Crawl the title and URL of each news item on every page of cnblogs. Write the following into cnblogs.py (this adds crawling of every page):
# -*- coding: utf-8 -*-
import scrapy
import sys
import io
from ..items import cnlogsItem
from scrapy.selector import Selector
from scrapy.http import Request
# from scrapy.dupefilters import RFPDupeFilter

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        print(response.url)
        # Collect the post bodies with a Selector
        line1 = Selector(response=response).xpath('//div[@id="post_list"]//div[@class="post_item_body"]')
        for obj in line1:
            title = obj.xpath('./h3/a[@class="titlelnk"]/text()').extract()
            href = obj.xpath('./h3/a[@class="titlelnk"]/@href').extract()  # take the href value
            # Wrap title and href into one item object
            item_obj = cnlogsItem(title=title[0], href=href[0])
            # Hand the item over to the pipeline
            yield item_obj
        # Follow the pager links to crawl every page
        line2 = Selector(response=response).xpath('//div[@class="pager"]/a/@href').extract()
        for url in line2:
            url = "http://cnblogs.com%s" % url  # build the absolute URL
            # Add the new URL to the scheduler
            yield Request(url=url, callback=self.parse)
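The pager hrefs extracted above are site-relative paths (for example /sitehome/p/2), which is why the code prefixes the domain before yielding a new Request. A minimal sketch of the same joining step using the standard library's urljoin (the sample hrefs here are assumptions for illustration):

```python
from urllib.parse import urljoin

# Sample site-relative hrefs, as extracted from the pager div
pager_hrefs = ["/sitehome/p/2", "/sitehome/p/3"]

# Join each relative path against the site root to get a crawlable URL
absolute_urls = [urljoin("http://cnblogs.com/", href) for href in pager_hrefs]
print(absolute_urls)
```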
2. Deduplication: create a duplication.py file and add the following content:
class RepeatFilter(object):
    def __init__(self):  # execution order 2: initialize the object
        self.visited_set = set()

    @classmethod
    def from_settings(cls, settings):  # execution order 1: create the object
        return cls()

    def request_seen(self, request):  # execution order 4: dedup check, has this URL been visited?
        if request.url in self.visited_set:
            return True
        self.visited_set.add(request.url)
        return False

    def open(self):  # can return a deferred  # execution order 3: crawling starts
        print('open')

    def close(self, reason):  # can return a deferred  # execution order 5: crawling stops
        print('close')

    def log(self, request, spider):  # log that a request has been filtered
        pass
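The dedup logic in request_seen can be exercised outside of Scrapy. The sketch below reproduces it with a stand-in request object (the SimpleNamespace stand-in is an assumption for illustration; inside Scrapy the argument is a real Request):

```python
from types import SimpleNamespace

class RepeatFilter(object):
    """Same dedup logic as duplication.py: remember every URL seen."""
    def __init__(self):
        self.visited_set = set()

    def request_seen(self, request):
        if request.url in self.visited_set:
            return True   # seen before: Scrapy will drop this request
        self.visited_set.add(request.url)
        return False      # first visit: let the request through

f = RepeatFilter()
url = "https://www.cnblogs.com/sitehome/p/2"
first = f.request_seen(SimpleNamespace(url=url))
second = f.request_seen(SimpleNamespace(url=url))
print(first, second)  # False True
```

The first call records the URL and returns False (not a duplicate); the second call finds it in visited_set and returns True, so the scheduler skips it.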
3. Add the configuration parameters in settings.py:
# DEPTH_LIMIT = 2  # limit how many pages deep the crawl goes
DUPEFILTER_CLASS = "pabokeyuan.duplication.RepeatFilter"  # path to duplication.py inside the pabokeyuan package
4. In cmd, run E:\pycharm\pabokeyuan>scrapy crawl cnblogs --nolog (with DEPTH_LIMIT = 2 set in settings.py). The following crawled URLs are printed:
open
https://www.cnblogs.com/
https://www.cnblogs.com/sitehome/p/4
https://www.cnblogs.com/sitehome/p/7
https://www.cnblogs.com/sitehome/p/10
https://www.cnblogs.com/sitehome/p/9
https://www.cnblogs.com/sitehome/p/11
https://www.cnblogs.com/sitehome/p/8
https://www.cnblogs.com/sitehome/p/6
https://www.cnblogs.com/sitehome/p/5
https://www.cnblogs.com/
https://www.cnblogs.com/sitehome/p/2
https://www.cnblogs.com/sitehome/p/3
https://www.cnblogs.com/sitehome/p/13
https://www.cnblogs.com/sitehome/p/12
https://www.cnblogs.com/sitehome/p/16
https://www.cnblogs.com/sitehome/p/200
https://www.cnblogs.com/sitehome/p/15
https://www.cnblogs.com/sitehome/p/14
https://www.cnblogs.com/sitehome/p/196
https://www.cnblogs.com/sitehome/p/197
https://www.cnblogs.com/sitehome/p/195
https://www.cnblogs.com/sitehome/p/198
https://www.cnblogs.com/sitehome/p/199
close
5. A news.json file now appears in the project directory. Opening it shows content like the following: thousands of crawled news titles and URLs.
创建FTP访问的YUM源
https://www.cnblogs.com/ylovew/p/11620870.html
Android开发——Toolbar常用设置
https://www.cnblogs.com/kexing/p/11620853.html
『王霸之路』从0.1到2.0一文看尽TensorFlow奋斗史
https://www.cnblogs.com/xiaosongshine/p/11620816.html
javascript中字符串对象常用的方法和属性
https://www.cnblogs.com/jjgw/p/11608617.html
非对称加密openssl协议在php7实践
https://www.cnblogs.com/wscsq789/p/11620733.html
数据结构(3):队列的原理和实现
https://www.cnblogs.com/AIThink/p/11620724.html
SUSE Ceph 的 'MAX AVAIL' 和 数据平衡 - Storage 6
https://www.cnblogs.com/alfiesuse/p/11620474.html
Flume初见与实践
https://www.cnblogs.com/novwind/p/11620626.html
Eureka实战-2【构建Multi Zone Eureka Server】
https://www.cnblogs.com/idoljames/p/11620616.html
C#刷遍Leetcode面试题系列连载(1) - 入门与工具简介
https://www.cnblogs.com/enjoy233/p/csharp_leetcode_series_1.html
朱晔和你聊Spring系列S1E11:小测Spring Cloud Kubernetes @ 阿里云K8S
https://www.cnblogs.com/lovecindywang/p/11620544.ht
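The news.json above is written by the item pipeline set up in the previous article (linked below). As a rough sketch of what a pipeline of that shape could look like (the class name and exact output format here are assumptions, not the previous article's code):

```python
class CnlogsPipeline(object):
    """Hypothetical sketch: write each item's title and URL to news.json."""
    def open_spider(self, spider):
        self.file = open("news.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One line for the title, one for the URL, matching the output above
        self.file.write(item["title"] + "\n" + item["href"] + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```

It would be enabled through ITEM_PIPELINES in settings.py, e.g. ITEM_PIPELINES = {"pabokeyuan.pipelines.CnlogsPipeline": 300}.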
**Note:** this article builds on the previous one; see the following link for the pipeline and item file configuration:
https://blog.csdn.net/doudou_wsx/article/details/101977627
Copyright notice: this is an original article by doudou_wsx, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.