Parsing HTML with XPath Using the lxml Library





1: Overview

For web scraping in Python, XPath offers a major advantage when it comes to extracting and parsing HTML pages, and the rich feature set of the lxml library makes writing crawlers increasingly simple.



2: Installing the lxml Library




pip install lxml



1: AttributeError: module 'lxml' has no attribute 'etree'

Cause: if the lxml package is also installed in Anaconda's base environment, the virtual environment will raise this error.
Solution: run pip uninstall -y lxml in the base environment.
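A quick way to confirm that the active environment sees a working lxml build is shown below. Note that etree must be imported explicitly (from lxml import etree), since import lxml alone does not load the submodule. A minimal sketch:

from lxml import etree

# Prints the version tuple of the lxml build in the current environment;
# an ImportError here means the install is still broken.
print(etree.LXML_VERSION)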



3: Example

from lxml import etree

text = """
<div>
    <ul>
        <li name="zhang" class="two">张三</li>
        <li name="li" class="three">李四</li>
        <li name="wang" class="four">王五</li>
    </ul>
</div>
"""

# Parse the HTML fragment into an element tree.
html = etree.HTML(text)
# Select every <li> element in the document.
results = html.xpath('//li')
print(results)
for r in results:
    print(r.tag)     # tag name
    print(r.text)    # text content
    print(r.attrib)  # attribute dictionary
    
# The loop above prints the following:
li
张三
{'name': 'zhang', 'class': 'two'}
li
李四
{'name': 'li', 'class': 'three'}
li
王五
{'name': 'wang', 'class': 'four'}
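XPath can also return text and attribute values directly, so iterating over Element objects is optional. The following minimal sketch extends the example above and reuses its html tree:

# Extract text nodes and attribute values directly with XPath.
names = html.xpath('//li/text()')                      # ['张三', '李四', '王五']
attrs = html.xpath('//li/@name')                       # ['zhang', 'li', 'wang']
by_class = html.xpath('//li[@class="three"]/text()')   # ['李四']
print(names, attrs, by_class)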



4: Crawler Example

"""
@author: rainsty
@file:   xpath-basis.py
@time:   2019-09-03 17:42
@description:
"""

import requests
from lxml import etree


def get_html_xpath(xpath):
    url = 'http://www.jinxiaoke.com/'
    resp = requests.get(url)
    code = resp.encoding
    # print(resp.text.encode(code).decode('utf-8'))
    # Re-encode with the charset detected by requests and decode as UTF-8
    # to avoid mojibake when the guessed encoding is wrong.
    html = etree.HTML(resp.text.encode(code).decode('utf-8'))
    results = html.xpath(xpath)
    # Keep only elements that actually carry text, as (text, attributes) pairs.
    return [(r.text, r.attrib) for r in results if r.text is not None]


def main():
    # //p selects all descendant <p> elements under the target div.
    print(get_html_xpath('/html/body/div[1]/section[5]/div[@class="container"]/div[1]/div[2]//p'))
    # /p selects only the direct <p> children.
    print(get_html_xpath('/html/body/div[1]/section[5]/div[@class="container"]/div[1]/div[2]/p'))
    # p[last()-2] selects the third <p> child counting from the end.
    print(get_html_xpath('/html/body/div[1]/section[5]/div[@class="container"]/div[1]/div[2]/p[last()-2]'))
    # p[position()<2] selects only the first <p> child.
    print(get_html_xpath('/html/body/div[1]/section[5]/div[@class="container"]/div[1]/div[2]/p[position()<2]'))


if __name__ == '__main__':
    main()
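As an alternative to the manual encode/decode step, the raw response bytes can be passed to etree.HTML directly, which lets the parser use the encoding declared in the page itself. A minimal sketch, assuming the same target URL:

import requests
from lxml import etree

resp = requests.get('http://www.jinxiaoke.com/')
# Passing bytes lets the parser honour the charset declared in the document,
# avoiding the manual encode/decode round trip used above.
html = etree.HTML(resp.content)
print(html.xpath('//title/text()'))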



5: Source Code

Source code on GitHub: https://github.com/Rainstyed/rainsty/blob/master/LearnPython/xpath_basis.py



Copyright notice: This is an original article by weixin_43933475, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.