1: Overview
For web scraping in Python, XPath offers a major advantage when crawling and parsing HTML pages, and the rich functionality of the lxml library makes writing crawlers ever simpler.
2: Installing lxml
pip install lxml
1: AttributeError: module 'lxml' has no attribute 'etree'
Cause: if the lxml package is installed in Anaconda's base environment, a virtual environment may raise this error.
Solution: run `pip uninstall -y lxml` in the base environment.
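After installing (or reinstalling) the package, a quick import check confirms that `etree` is available. This verification step is my addition, not part of the original post:

```python
from lxml import etree

# LXML_VERSION is a tuple such as (4, 9, 3, 0); if this import and
# attribute access succeed, the AttributeError above is resolved.
print(etree.LXML_VERSION)
```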
3: Example
from lxml import etree
text = """
<div>
<ul>
<li name="zhang" class="two">张三</li>
<li name="li" class="three">李四</li>
<li name="wang" class="four">王五</li>
</ul>
</div>
"""
html = etree.HTML(text)
results = html.xpath('//li')
print(results)
for r in results:
    print(r.tag)
    print(r.text)
    print(r.attrib)
# Output:
li
张三
{'name': 'zhang', 'class': 'two'}
li
李四
{'name': 'li', 'class': 'three'}
li
王五
{'name': 'wang', 'class': 'four'}
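Besides selecting element nodes, XPath can extract text and attribute values directly with `text()` and `@attr`, which often saves the loop over `.text` and `.attrib`. A minimal sketch using the same sample HTML:

```python
from lxml import etree

text = """
<div>
<ul>
<li name="zhang" class="two">张三</li>
<li name="li" class="three">李四</li>
<li name="wang" class="four">王五</li>
</ul>
</div>
"""
html = etree.HTML(text)

# text() selects the text content of each <li>; @name selects attribute values.
names = html.xpath('//li/text()')
attrs = html.xpath('//li/@name')
# A predicate on an attribute narrows the selection to one node.
picked = html.xpath('//li[@class="three"]/text()')

print(names)   # ['张三', '李四', '王五']
print(attrs)   # ['zhang', 'li', 'wang']
print(picked)  # ['李四']
```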
4: Crawler Example
import requests
from lxml import etree
"""
@author: rainsty
@file: xpath-basis.py
@time: 2019-09-03 17:42
@description:
"""
def get_html_xpath(xpath):
    url = 'http://www.jinxiaoke.com/'
    resp = requests.get(url)
    code = resp.encoding
    # print(resp.text.encode(code).decode('utf-8'))
    html = etree.HTML(resp.text.encode(code).decode('utf-8'))
    results = html.xpath(xpath)
    return [(r.text, r.attrib) for r in results if r.text is not None]


def main():
    print(get_html_xpath('/html/body/div[1]/section[5]/div[@class="container"]/div[1]/div[2]//p'))
    print(get_html_xpath('/html/body/div[1]/section[5]/div[@class="container"]/div[1]/div[2]/p'))
    print(get_html_xpath('/html/body/div[1]/section[5]/div[@class="container"]/div[1]/div[2]/p[last()-2]'))
    print(get_html_xpath('/html/body/div[1]/section[5]/div[@class="container"]/div[1]/div[2]/p[position()<2]'))


if __name__ == '__main__':
    main()
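The `last()` and `position()` predicates used above are easier to see on a small local document than on a live site. A minimal sketch with hypothetical sample HTML (not the original target page):

```python
from lxml import etree

# Four sibling <p> elements stand in for the paragraphs on the real page.
text = "<div><p>one</p><p>two</p><p>three</p><p>four</p></div>"
html = etree.HTML(text)

# last() evaluates to 4 here, so last()-2 selects the 2nd <p>.
print(html.xpath('//p[last()]/text()'))        # ['four']
print(html.xpath('//p[last()-2]/text()'))      # ['two']
# position() is 1-based, so position()<2 keeps only the 1st <p>.
print(html.xpath('//p[position()<2]/text()'))  # ['one']
```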
5: Source Code
Source code on GitHub: [https://github.com/Rainstyed/rainsty/blob/master/LearnPython/xpath_basis.py]
Copyright notice: this is an original article by weixin_43933475, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.