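HTML structure
The source below is part of a Weibo page. We need to extract the post text, but the text under the <p> tag is split apart by <em> tags. How should this case be handled?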
<div class="content" node-type="like">
<div class="info">
<div class="menu s-fr">
<a href="javascript:void(0);" action-type="fl_menu"><i class="wbicon">c</i></a>
<ul style="display:none;" node-type="fl_menu_right">
<li><a onclick="javascript:window.open('//service.account.weibo.com/reportspam?rid=4488118096861246&type=1&from=10501&url=&bottomnav=1&wvr=6', 'newwindow', 'height=700, width=550, toolbar =yes, menubar=no, scrollbars=yes, resizable=yes, location=no, status=no');" href="javascript:void(0);">投诉</a></li>
</ul>
</div>
<div>
<a class="name" href="//weibo.com/2864108830?refer_flag=1001030103_" target="_blank" suda-data="key=tblog_search_weibo&value=seqid:158609447248102927726|type:1|t:0|pos:2-0|q:%E7%97%98%E7%97%98%E5%8E%8B%E5%8A%9B|ext:cate:31,mpos:19,click:user_name" nick-name="一Z_c一">一Z_c一</a>
<a title="微博达人" href="//club.weibo.com/intro" target="_blank"><i class="icon-vip icon-daren"></i></a>
<!--广告微博加关注按钮 -->
</div>
</div>
<p class="txt" node-type="feed_list_content" nick-name="一Z_c一">
忌甜忌辣忌油忌熬夜否则就会长<em class="s-color-red">痘痘</em>变丑 忌咖啡忌可可忌巧克力忌熬夜忌<em class="s-color-red">压力</em>忌受刺激忌紧张忌生气否则就会偏头痛 我也太难了.. </p>
<p class="from">
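To make the problem concrete, here is a minimal sketch using lxml on a shortened stand-in for the markup above. The SAMPLE string is an assumption made for illustration, not the full page. Selecting text() on the <p> element returns only its direct text nodes, so the highlighted keyword inside <em> is lost and the sentence comes back in pieces.

```python
from lxml import html

# Shortened stand-in for the post markup shown above (an assumption, not the real page).
SAMPLE = (
    '<div class="content">'
    '<p class="txt">忌熬夜否则就会长<em class="s-color-red">痘痘</em>变丑</p>'
    '</div>'
)

tree = html.fromstring(SAMPLE)

# text() selects only the text nodes that are direct children of <p>.
# The text inside <em> belongs to <em>, so it is not included.
fragments = tree.xpath("//p[@class='txt']/text()")
print(fragments)
```

The result is a list of two separate fragments with the <em> text missing, which is why a plain text() query is not enough here.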
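XPath extraction method
The XPath string() function returns the concatenated text of a node and all of its descendants, so it joins the pieces split by <em>. The code is as follows: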
blog_content = str(blog.xpath("string(div[@class = 'card']//div/div[2]/p)").strip())
Here blog is a single post block extracted from the page with XPath.
The code for that step is as follows:
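from lxml import html
# response is assumed to be an HTTP response fetched earlier, e.g. with requests
tree = html.fromstring(response.text)
blog_list = tree.xpath("//div[@class='card-wrap']")  # one element per post
print(len(blog_list))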
for blog in blog_list:
......
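The whole approach can be tried offline with the sketch below. It assumes a simplified SAMPLE page with two post blocks (made up for illustration, reusing phrases from the post above) and uses a shorter relative path, .//p[@class='txt'], instead of the positional path from the article. It also checks that lxml's text_content() method gives the same result as string().

```python
from lxml import html

# Simplified stand-in for a search result page with two posts (an assumption).
SAMPLE = """
<div class="card-wrap"><div class="card">
  <p class="txt"> 忌熬夜否则就会长<em class="s-color-red">痘痘</em>变丑 </p>
</div></div>
<div class="card-wrap"><div class="card">
  <p class="txt">忌<em class="s-color-red">压力</em>忌紧张</p>
</div></div>
"""

tree = html.fromstring(SAMPLE)
blog_list = tree.xpath("//div[@class='card-wrap']")

contents = []
for blog in blog_list:
    # string() concatenates the text of the first matching <p>
    # together with the text of all its descendants, including <em>.
    text = blog.xpath("string(.//p[@class='txt'])").strip()

    # Alternative: text_content() on the element yields the same text.
    p = blog.xpath(".//p[@class='txt']")[0]
    assert p.text_content().strip() == text

    contents.append(text)

print(contents)
```

Note that string() only uses the first node the path matches, so the path inside it should identify exactly the <p> that holds the post text.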
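Copyright notice: this is an original article by weixin_43165512, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.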