Regular expressions with re (using web scraping in Python as an example)

Post author:xfxia
Post published: July 19, 2023
Post category:python

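Regular expressions, provided in Python by the re module, are a way of operating on strings. They let us extract the data we want when scraping web pages.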

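Getting to know re

1. re quick overview
2. re1
3. re quantifiers: special symbols (how much to match)
4. re character classes (what to match)
5. re exercises
6. Other re metacharacters
7. Greedy and non-greedy modes
8. Escape characters
9. Some re functions
10. Processing strings with re
11. Processing scraped HTML strings with re
12. Scraping a paginated site with re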

1. re quick overview


for i in range(0, 3):
i takes the values 0 to 2 in turn, 3 iterations in total
for i in range(0, 25, 1):
i takes the values 0 to 24 in turn, 25 iterations in total

【4】 findall
string8 = "{ymd:'2018-01-01',tianqi:'晴',aqiInfo:'轻度污染'}," \
          "{ymd:'2018-01-02',tianqi:'阴~小雨',aqiInfo:'优'}," \
          "{ymd:'2018-01-03',tianqi:'小雨~中雨',aqiInfo:'优'}," \
          "{ymd:'2018-01-04',tianqi:'中雨~小雨',aqiInfo:'优'}"

() means: return only what the parentheses captured
print(re.findall(r"tianqi:'(.*?)'", string8))
Note:
re.findall returns a list of strings, so there is no result.group() to call.
Just print the return value of re.findall() directly.
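As a small sketch of how capture groups change what re.findall returns, the snippet below runs three patterns over a shortened copy of string8. The variable names (weather, full, one_group, two_groups) are illustrative and not from the original post.

```python
import re

# Shortened sample in the same shape as string8 (illustrative data only).
weather = ("{ymd:'2018-01-01',tianqi:'晴',aqiInfo:'轻度污染'},"
           "{ymd:'2018-01-02',tianqi:'阴~小雨',aqiInfo:'优'}")

# No group: findall returns the full text of every match.
full = re.findall(r"tianqi:'.*?'", weather)

# One group: findall returns only what the group captured.
one_group = re.findall(r"tianqi:'(.*?)'", weather)

# Two groups: findall returns one tuple per match.
two_groups = re.findall(r"ymd:'(.*?)',tianqi:'(.*?)'", weather)

print(full)
print(one_group)
print(two_groups)
```

With no group you get whole matches, with one group a flat list of captures, and with several groups a list of tuples.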



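【1】
Matching a string: match() returns a single match and always starts at the beginning of the string

result = re.match(r'[-\d]', text)
print(result.group())

Dot (.): matches any single character except a newline; pass re.DOTALL to make it match newlines too

\d: matches any single digit

\D: matches any single character that is not a digit

\s: matches a whitespace character  Note: \n, \t and \r all count as whitespace

\w (lowercase): matches a-z, A-Z, digits and underscore

\W: matches any character that \w does not match

[] : ->> a character set; any one character listed inside the brackets matches

With [] in hand:
\d  ->>  [0-9]
\D  ->>  [^0-9]
\w  ->>  [0-9a-zA-Z_]
\W  ->>  [^0-9a-zA-Z_]
[\d\D], [\w\W]  -->  match any character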
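The shorthand classes can be checked directly. One caveat applies in Python 3: for str patterns \w also matches Unicode word characters such as Chinese characters, so it equals [0-9a-zA-Z_] exactly only when the re.ASCII flag is set. The sample string below is made up for illustration.

```python
import re

sample = "py_3 中文-!"

# Default (Unicode) mode: \w also matches the Chinese characters.
unicode_words = re.findall(r"\w+", sample)

# ASCII mode: \w behaves exactly like [0-9a-zA-Z_].
ascii_words = re.findall(r"\w+", sample, re.ASCII)
explicit_class = re.findall(r"[0-9a-zA-Z_]+", sample)

print(unicode_words)
print(ascii_words)
print(explicit_class)
```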


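【2】
A special symbol placed after an item changes how many times it is matched

Asterisk (*): zero or more times
Plus (+): one or more times
Question mark (?): zero times or once

text = '-a158-5555-6582'
With ?, the [] set is matched zero times or once
The set matches -, a digit or a; matching starts at the beginning of the string, and zero matches is still a successful (empty) match
result = re.match(r'[-a\d]?', text)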



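【3】
(1) Matching from the start vs. scanning the whole string
re.match(): matches only at the beginning of the string
re.search(): scans the string from left to right and returns the first match; later occurrences are not returned

text = 'aapythpyon'
result = re.match('y', text)    # None: the string does not start with 'y'
result = re.search('py', text)  # py

(2) Uses of ^
 a. Inside square brackets it negates the set
 b. Outside square brackets it anchors the match at the start of the string
text = 'pppypthon'

result = re.search(r'[^\d]+', text)
# pppypthon

result = re.search('^p+', text)
# ppp

(3) ...$: matches only at the end of the string
Extract data that ends with com; if the string does not end that way, search() returns None

text = 'python123@163.com'
result = re.search(r'[\w]+@[a-z0-9]+[.]com$', text)

(4) |: alternation between several expressions or strings
Putting https|http|ftp|file inside [] turns it into a set of single characters rather than four alternative strings
(1) [] treats everything inside as individual characters
(2) () treats the alternatives as separate strings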

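2. re1

import re
'''
1. Matching a string: match() returns a single match
and always starts at the beginning of the string
'''
text = 'cpython'
result = re.match('py', text)
# print(result)          # None, because 'cpython' does not start with 'py'
# print(result.group())  # raises AttributeError because result is None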

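'''
Dot (.) matches any single character
Tips:
    1. It does not match a newline
    2. Matching starts at the beginning of the string
'''
text = 'cpython'
result = re.match('.', text)
# c

r'''
\d: matches any single digit
    1. Matches digits only, nothing else
    2. Starts at the beginning of the string
    3. Matches exactly one character
'''
text = "211python"
result = re.match(r'\d', text)
# 2

r'''
\D: matches anything except a digit
    1. Matches non-digits only
    2. Starts at the beginning of the string
    3. Matches exactly one character
'''
text = "python"
result = re.match(r'\D', text)
# p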

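r'''
\s: matches a whitespace character
    1. Matching starts at the beginning of the string
    2. \n, \t and \r all count as whitespace
    3. The s must be lowercase
    4. Matches exactly one whitespace character
'''
text = '\npython'
result = re.match(r'\s', text)
# the match is the newline character

r'''
\w (lowercase): matches a-z, A-Z, digits and underscore
    1. Lowercase w
    2. Matching starts at the beginning of the string
    3. In Python 3, \w also matches Chinese characters, but not punctuation (Chinese punctuation included)
'''
text = "_python"
result = re.match(r'\w', text)
# _

r'''
\W: matches any character that \w does not match
    1. Matches everything that \w cannot match
    2. Uppercase W
    3. Matching starts at the beginning of the string
'''
text = '--python'
result = re.match(r'\W', text)
# -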

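'''
[] : ->> a character set; any one character listed inside the brackets matches
Tips:
    1. Every character inside [] is a candidate
    2. Multiple entries inside [] are combined with "or": a character matches if it equals any one of them
    3. Matching starts at the beginning of the string
    4. This section (demo14) only covers matching a single character
'''
text = ' thon'
result = re.match(r'[-\s]', text)
# matches the leading space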
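Because re.match returns None when nothing matches at the start of the string, calling .group() on the result without checking raises AttributeError. A minimal sketch of a guard follows. The helper name first_match is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if nothing matches."""
    result = re.match(pattern, text)
    # re.match returns None on failure, so check before calling .group().
    return result.group() if result else None

print(first_match(r"py", "cpython"))    # no match at position 0
print(first_match(r"\d", "211python"))
print(first_match(r"[-\s]", " thon"))
```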

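3. re quantifiers: special symbols (how much to match)

import re
'''
Asterisk (*): matches zero or more times
'''

text = '158-5555-6582'
# Without *, matching starts at the beginning and consumes a single character.
result = re.match(r'[\d]', text)
# 1

# With *, matching starts at the beginning and repeats the [] set zero or more times.
result = re.match(r'[\d]*', text)
# 158

# With *, the set accepts either - or \d at each step, so the whole string matches.
result = re.match(r'[-\d]*', text)
# 158-5555-6582

# With *, zero repetitions are allowed: the string does not start with -, so the match succeeds but is empty.
result = re.match(r'[-]*', text)
# '' (an empty match, not None)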


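'''
Plus (+): matches one or more times
'''
text = 'a158-5555-6582'
# With +, the [] set must match at least once.
result = re.match(r'[\d]+', text)
# result is None here: the string starts with 'a', not a digit, so result.group() would raise AttributeError

# With +, the [] set matches one or more times and stops at the first character outside the set.
result = re.match(r'[a\d]+', text)
# print(result.group())  # a158

'''
Question mark (?): matches zero times or once
'''
text = '-a158-5555-6582'
# With ?, the [] set is matched zero times or once.
# The set matches -, a digit or a; matching starts at the beginning, and zero matches is still a successful (empty) match.
result = re.match(r'[-a\d]?', text)
# -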

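'''
{m}: matches exactly m times
'''
text = '158-5555-6582'
# With {k}, the [] set is matched exactly k times from the beginning.
# If fewer than k consecutive characters match, match() returns None.
result = re.match(r'[\d]{3}', text)
# 158

'''
{m,n}: matches between m and n times
Greedy by default: as many repetitions as possible
'''
text = '158-5m55-6582'
result = re.match(r'[\d]{2,4}', text)
# 158
result = re.match(r'[-\d]{2,4}', text)
# 158-
result = re.match(r'[-\d]{2,6}', text)
# 158-5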
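To compare the quantifiers side by side, the sketch below applies each one to \d on the same phone-style string used in this section. The dictionary name results is illustrative. The last pattern adds a trailing ?, which makes the quantifier lazy.

```python
import re

text = "158-5555-6582"

# Each pattern applies one quantifier to \d, matched from the start of the string.
patterns = [r"\d*", r"\d+", r"\d?", r"\d{3}", r"\d{2,4}", r"\d{2,4}?"]
results = {p: re.match(p, text).group() for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern:10} -> {matched!r}")
```

Every pattern here matches at position 0, so calling .group() directly is safe. With other input the result should be checked for None first.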

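4. re character classes (what to match)

import re
r'''
\d ==> [0-9]: matches any digit
'''
# Combined with * (zero or more), this matches a run of digits and hyphens
text = '158-5555-6582'
result = re.match('[-0-9]*', text)
# print(result.group()) # 158-5555-6582

r'''
\D ==> [^0-9]: matches any non-digit
'''
# With + (one or more), matches one or more non-digit characters
# With * (zero or more), matches zero or more non-digit characters
text = '158-5555-6582'
result = re.match('[^0-9]*', text)
# The match is the empty string; no error is raised


r'''
\w ==> [0-9a-zA-Z_]: matches any digit, letter or underscore
'''
# - is outside the set
text = '158-5555-6582'
result = re.match('[0-9a-zA-Z_]+', text)
# 158


r'''
\W ==> [^0-9a-zA-Z_]: matches anything that is not a digit, letter or underscore
'''
text = '我158-5555-6582'
result = re.match('[^0-9a-zA-Z_]+', text)
# 我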


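r'''
[\d\D], [\w\W]: match any character
'''

text = '---------123_145\n45 \t 678中文。'
result = re.match(r'[\w\W]+', text)
print(result.group()) # prints the whole string



'''
Dot (.): matches any single character
'''
text = 'python12--//'
result = re.match('[.]+', text)
# print(result.group()) # raises AttributeError because result is None
# [Question]: . matches any character, so with + it should match all of text?
# [Reason]: inside brackets, [.] matches only a literal dot; with + it needs one or more dots, and text does not start with '.', so there is no match

# Without the brackets, . matches any character (except a newline)
result = re.match('.+', text)
print(result.group())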
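The difference between a bare dot, a dot with re.DOTALL, and the [\d\D] set shows up as soon as the text contains a newline. A minimal sketch, using a made-up two-line string:

```python
import re

text = "line1\nline2"

# By default "." stops at the newline.
default_dot = re.match(r".+", text).group()

# With re.DOTALL, "." also matches "\n".
dotall = re.match(r".+", text, re.DOTALL).group()

# [\d\D] matches any character, newline included, without needing a flag.
any_char = re.match(r"[\d\D]+", text).group()

print(repr(default_dot))
print(repr(dotall))
print(repr(any_char))
```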

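5. re exercises

import re

'''
Validate a mobile phone number
1. Must be 11 digits
2. The first digit must be 1             1
3. The second digit must be 3-9          [3-9]
4. Digits 3 to 11 have no restriction    [0-9]{9}
'''
text = '18726981556'
result = re.match('1[3-9][0-9]{9}', text)
# print(result.group()) # 18726981556
# Note: match() only anchors the start, so a longer string that begins with these 11 digits also matches

'''
Validate an email address
...@xxx.com
python123@163.com
1. User name (before the @) ==> letters, digits and underscores
2. Domain (after the @)     ==> digits and letters (usually lowercase)
'''
text = 'p_ython123@qq.com'
result = re.match(r'\w+@[0-9a-z]+[.]com', text)
# match succeeds


'''
Validate a simplified ID card number [18 characters]
Characteristics of the ID number
First 17 characters: [0-9]
18th character:      [0-9xX]
'''
text = '342423199805200591'
result = re.match('[0-9]{17}[0-9xX]', text)
# print(result.group()) # match succeeds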
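For validation, the pattern usually has to cover the entire string, not just a prefix. One way to get that, sketched here with the same patterns, is re.fullmatch. The names PHONE, ID_CARD and is_valid are illustrative helpers, not part of the original exercises.

```python
import re

PHONE = re.compile(r"1[3-9][0-9]{9}")
ID_CARD = re.compile(r"[0-9]{17}[0-9xX]")

def is_valid(pattern, text):
    """True only if the whole string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

print(is_valid(PHONE, "18726981556"))    # 11 digits
print(is_valid(PHONE, "187269815561"))   # 12 digits
# re.match only anchors the start, so the 12-digit string still matches:
print(re.match(r"1[3-9][0-9]{9}", "187269815561") is not None)
```

Appending $ to the pattern and keeping re.match is a common alternative, but $ also matches just before a trailing newline, which fullmatch does not allow.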

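6. Other re metacharacters

import re
'''
1. Matching from the start vs. scanning the whole string
re.match(): matches only at the beginning of the string
re.search(): scans the string from left to right and returns the first match; later occurrences are not returned
'''
text = 'aapythpyon'
result = re.match('y', text) # None: the string does not start with 'y'

result = re.search('py', text)
# py

'''
2. Uses of ^
(1) Inside square brackets it negates the set
(2) Outside square brackets it anchors the match at the start of the string
'''
text = 'pppypthon'
result = re.search(r'[^\d]+', text)
# pppypthon

result = re.search('^p+', text)
# ppp

'''
3. $: matches only at the end of the string
Extract data that ends with com; if the string does not end that way, search() returns None
'''
text = 'python123@163.com'
result = re.search(r'[\w]+@[a-z0-9]+[.]com$', text)
# match succeeds


'''
4. |: alternation between several expressions or strings
Putting https|http|ftp|file inside [] turns it into a set of single characters rather than four alternative strings
(1) [] treats everything inside as individual characters
(2) () treats the alternatives as separate strings
'''
text = 'https://www.baidu.com/'
result = re.search('https|http|ftp|file', text)
# https
result = re.search('[https|http|ftp|file]', text)
# h
result = re.search('[https|http|ftp|file]+', text)
# https
result = re.search('(https|http|ftp|file)', text)
# https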
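The [] version above only appears to work because the text happens to start with characters from the set. A short sketch makes the difference visible. The helper url_scheme and the test strings are hypothetical.

```python
import re

# The group keeps the four alternatives together; ^ anchors at the start of the string.
scheme_pattern = re.compile(r"^(https|http|ftp|file)://")

def url_scheme(url):
    """Return the scheme if it is one of the four alternatives, else None."""
    result = scheme_pattern.match(url)
    return result.group(1) if result else None

print(url_scheme("https://www.baidu.com/"))
print(url_scheme("mailto:someone@example.com"))

# The character set accepts any mix of its letters and "|", not the four words:
set_accepts_nonsense = re.fullmatch(r"[https|http|ftp|file]+", "shift|") is not None
print(set_accepts_nonsense)
```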

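7. Greedy and non-greedy modes

import re

'''
Greedy mode: the regex matches as many characters as possible [this is the default]

'''
text = 'python'
# With +, the [] set matches one or more times and keeps going until a character falls outside the set.
result = re.match('[a-z]+', text)
# python


'''
Non-greedy mode: the regex matches as few characters as possible [append ? to the quantifier]
'''
text = 'python'
result = re.match('[a-z]+?', text)
# p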

text = \
    """
<tr class="兰智数加学院">
<tr class="1">shujia1</tr>
<tr class="2">shujia2</tr>
<tr class="3">shujia3</tr>
<tr class="4">shujia4</tr>
<tr class="5">shujia5</tr>
<tr class="6">爬虫</tr>
"""
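result = re.match(r'\n<tr[\d\D]+>', text)
# print(result.group()) # prints everything up to the last >
# Why does r'\n<tr[\d\D]+>' match all of it? In the default greedy mode, the final > in the pattern is matched against the last > in the text


result = re.match(r'\n<tr[\d\D]+?>', text)
# print(result.group()) # <tr class="兰智数加学院"> (preceded by the leading newline)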

string8 = "{ymd:'2018-01-01',tianqi:'晴',aqiInfo:'轻度污染'}," \
          "{ymd:'2018-01-02',tianqi:'阴~小雨',aqiInfo:'优'}," \
          "{ymd:'2018-01-03',tianqi:'小雨~中雨',aqiInfo:'优'}," \
          "{ymd:'2018-01-04',tianqi:'中雨~小雨',aqiInfo:'优'}"

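# () returns only what the parentheses captured
print(re.findall(r"tianqi:'(.*?)'", string8))
# re.findall returns a list of strings, so there is no result.group() to call
# Just print the return value of re.findall(); the parentheses mark the part we get back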
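A compact way to see greedy and lazy matching side by side is to run both over two adjacent tags. The one-line html string below is a shortened, illustrative version of the table rows used in this section.

```python
import re

html = '<tr class="1">shujia1</tr><tr class="2">shujia2</tr>'

# Greedy: .* runs to the end, then backs up to the last ">".
greedy = re.findall(r"<tr.*>", html)

# Lazy: .*? stops at the first ">" that lets the pattern succeed.
lazy = re.findall(r"<tr.*?>", html)

# Lazy quantifiers plus a group pull out each cell's text.
cells = re.findall(r"<tr.*?>(.*?)</tr>", html)

print(greedy)
print(lazy)
print(cells)
```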

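8. Escape characters

import re

r'''
Escape character: \
An escaped symbol keeps its literal meaning; know both what the symbol means unescaped and what it means escaped
'''
pi = '3....1415926***'
# Escape . and * so they are treated as ordinary characters
result = re.match(r'\d\.+\d+\*+', pi)

r'''
. ->> matches any character
\. escaped: here . is a literal dot; write it as [.] or \.

* ->> repeats the preceding item zero or more times
\* escaped: here * is a literal asterisk; write it as [*] or \*

'''
print(result.group())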
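When the literal text comes from a variable, escaping each metacharacter by hand is error-prone. The standard library function re.escape does it for you. A minimal sketch with made-up strings:

```python
import re

pi = "3....1415926***"

# Escaping by hand, as in the example above.
manual = re.match(r"\d\.+\d+\*+", pi).group()

# re.escape() backslash-escapes the regex metacharacters in a literal string.
literal = "1.5*2"
escaped = re.escape(literal)
found = re.findall(escaped, "1.5*2 = 3, but 1x5*2 is not a match")

print(manual)
print(escaped)
print(found)
```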

9. Some re functions

import re
'''

In Python
message = "hello,i am no.{}".format(x)
arguments can be passed in through format
'''
x = "x"
# Avoid naming the variable str, which would shadow the built-in type
message = "hello,i am no.{}".format(x)
print(message)


'''
group(): use () in the pattern to capture sub-matches
'''
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
result = re.search(r'[\s\w]+\s(\w+@[0-9a-z]+\.com)[\s\w]+\s(\w+@[0-9a-z]+\.com)', text)
# print(result.group())
# my email is 2781162818@qq.com and PYTHON123@163.com

# print(result.group(1))
# 2781162818@qq.com

# print(result.group(2))
# PYTHON123@163.com



'''
findall(): finds every non-overlapping match in the whole string
[returns a list]
'''
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
result = re.findall(r'\s(\w+@[0-9a-z]+\.com)', text)
# print(result)
# ['2781162818@qq.com', 'PYTHON123@163.com']


'''
sub('a', 'b', text): replaces text
[every match of the pattern is replaced with the given string]
'''
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
result = re.sub(r'\s(\w+@[0-9a-z]+\.com)', ' xxx', text)
print(result)
# Commonly used for data cleaning
# Remove \n, \r, \t and other whitespace by replacing every \s match with the empty string
result = re.sub(r"\s", "", text)


'''
split(): splits a string
'''
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
# Split on spaces
result = re.split(' ', text)
print(result)
# Split on anything that is not \w: space, @ and . are all outside \w
result = re.split(r'[^\w]', text)
print(result)


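'''
compile(): compiles a regular expression into a reusable pattern object (with re.VERBOSE the pattern can also carry comments)
'''
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
r = re.compile(r"""
    \s # the space before the email address
    (\w+ # first part of the address, before the @
    @ # the @ sign of the address
    [0-9a-z]+ # second part of the address, after the @ and before the dot
    \.com)  # the end of the address
""", re.VERBOSE)
result = re.findall(r, text)
print(result)
# ['2781162818@qq.com', 'PYTHON123@163.com']

'''
Here is the pattern that extracts both addresses, with comments
'''
text = 'my email is 2781162818@qq.com and PYTHON123@163.com'
r = re.compile(r"""
    [\sa-z]+ # whitespace and lowercase letters before the address: my email is
    \s      # the space right before 278
    (\w+    # part of the first address before the @
    @       # the @ sign
    [0-9a-z]+   # part of the first address after the @
    \.com)   # end of the first address

    [\sa-z]+ # whitespace and lowercase letters before the second address: and
    \s      # the space right before PYTHON
    (\w+    # part of the second address before the @
    @       # the @ sign
    [0-9a-z]+   # part of the second address after the @
    \.com)  # end of the second address
""", re.VERBOSE)
result = re.findall(r, text)
print(result)
# [('2781162818@qq.com', 'PYTHON123@163.com')]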
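A compiled pattern object exposes the same operations as the module-level functions, so one pattern can be reused for finding, replacing and splitting. A brief sketch on the same sample text, with the variable name email chosen for illustration:

```python
import re

text = "my email is 2781162818@qq.com and PYTHON123@163.com"

# Compile once, then call methods on the pattern object.
email = re.compile(r"\w+@[0-9a-z]+\.com")

all_emails = email.findall(text)
first = email.search(text).group()
masked = email.sub("xxx", text)
parts = email.split(text)

print(all_emails)
print(first)
print(masked)
print(parts)
```

Note that split leaves a trailing empty string here, because the text ends with a match.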

10. Processing strings with re

text = \
    """
<ul class="ullist" padding="1" spacing="1">
    <li>
        <div id="top">
            <span class="position" width="350">职位名称</span>
            <span>职位类别</span>
            <span>人数</span>
            <span>地点</span>
            <span>发布时间</span>
        </div>
        <div id="even">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=33824&amp;keywords=python&amp;tid=87&amp;lid=2218">python开发工程师</a>
            </span>
            <span>技术类</span>
            <span>2</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="odd">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=29938&amp;keywords=python&amp;tid=87&amp;lid=2218">python后端</a>
            </span>
            <span>技术类</span>
            <span>2</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="even">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=31236&amp;keywords=python&amp;tid=87&amp;lid=2218">高级Python开发工程师</a>
            </span>
            <span>技术类</span>
            <span>2</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="odd">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=31235&amp;keywords=python&amp;tid=87&amp;lid=2218">python架构师</a>
            </span>
            <span>技术类</span>
            <span>1</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="even">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=34531&amp;keywords=python&amp;tid=87&amp;lid=2218">Python数据开发工程师</a>
            </span>
            <span>技术类</span>
            <span>1</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="odd">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=34532&amp;keywords=python&amp;tid=87&amp;lid=2218">高级图像算法研发工程师</a>
            </span>
            <span>技术类</span>
            <span>1</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="even">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=31648&amp;keywords=python&amp;tid=87&amp;lid=2218">高级AI开发工程师</a>
            </span>
            <span>技术类</span>
            <span>4</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="odd">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=32218&amp;keywords=python&amp;tid=87&amp;lid=2218">后台开发工程师</a>
            </span>
            <span>技术类</span>
            <span>1</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="even">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=32217&amp;keywords=python&amp;tid=87&amp;lid=2218">Python开发（自动化运维方向）</a>
            </span>
            <span>技术类</span>
            <span>1</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
        <div id="odd">
            <span class="l square">
              <a target="_blank" href="position_detail.php?id=34511&amp;keywords=python&amp;tid=87&amp;lid=2218">Python数据挖掘讲师 </a>
            </span>
            <span>技术类</span>
            <span>1</span>
            <span>合肥</span>
            <span>2018-10-23</span>
        </div>
    </li>
</ul>
"""
import re



# 1. Get all div tags
result = re.findall(r'<div[\d\D]*?</div>', text)
'''
The * repeats the set zero or more times; the ? makes the match lazy, so each match stops at the nearest closing div tag instead of swallowing every div at once
findall() keeps scanning and returns every non-overlapping match
'''

# print(result)
# Or
result = re.findall('<div.*?</div>', text, re.DOTALL)

# 2. Get div tags with a given attribute (div tags that have an id attribute)
result = re.findall(r'<div\sid.*?</div>', text, re.DOTALL)

# 3. Get all tags with id="even"
result = re.findall(r'<div\sid="even".*?</div>', text, re.DOTALL)

# 4. Get the value of a tag attribute
# Get the values of all ids
result = re.findall('<div id="(.*?)".*?</div>', text, re.DOTALL)
print(result)

# 5. Get the href attribute of the a tags
result = re.findall('<a.*?href="(.*?)">', text, re.DOTALL)
# print(result)

# 6. All the job details inside the divs
result = re.findall('<span>(.*?)</span>', text, re.DOTALL)
print(result)

# 7. Get the job titles
result = re.findall('<a.*?>(.*?)</a>', text, re.DOTALL)
print(result)
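The separate findall calls above each return one flat list. To keep the fields of each job together, one possible approach is to first cut the text into div blocks and then extract the fields from each block. The sketch below uses a trimmed two-row sample in the same shape as the listing. The dictionary keys (title, category, count, city, date) are illustrative names.

```python
import re

# A trimmed, two-row sample in the same shape as the listing above.
sample = """
<div id="even">
    <span class="l square">
      <a target="_blank" href="position_detail.php?id=33824">python开发工程师</a>
    </span>
    <span>技术类</span>
    <span>2</span>
    <span>合肥</span>
    <span>2018-10-23</span>
</div>
<div id="odd">
    <span class="l square">
      <a target="_blank" href="position_detail.php?id=31235">python架构师</a>
    </span>
    <span>技术类</span>
    <span>1</span>
    <span>合肥</span>
    <span>2018-10-23</span>
</div>
"""

jobs = []
# Step 1: one string per job row (lazy match up to the nearest </div>).
for block in re.findall(r'<div id="(?:even|odd)".*?</div>', sample, re.DOTALL):
    # Step 2: pull the fields out of that row only.
    title = re.search(r"<a.*?>(.*?)</a>", block, re.DOTALL).group(1).strip()
    category, count, city, date = re.findall(r"<span>(.*?)</span>", block)
    jobs.append({"title": title, "category": category,
                 "count": int(count), "city": city, "date": date})

print(jobs)
```

The tuple unpacking assumes exactly four plain span tags per row, which holds for this sample. Real pages may need a length check first.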

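11. Processing scraped HTML strings with re

Once decoded, the content fetched with the requests library is an HTML string, and regular expressions are exactly a tool for processing strings.

So we can apply regular expressions to extract data from HTML strings.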

import requests
import re
url = "https://s.weibo.com/top/summary?cate=realtimehot"

headers = {
    'cookie': 'SUB=_2AkMTRi1qf8NxqwFRmPEVzmLha4R1yQ3EieKlGtyxJRMxHRl-yT9kqmsmtRB6OMYDhdCTGr2FB95K0HLKHeAZHPKKREb3; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WF8HDv2g_bdeD8IhYlDFZ.M; _s_tentry=-; Apache=2377795257586.004.1679467161310; SINAGLOBAL=2377795257586.004.1679467161310; ULV=1679467161315:1:1:1:2377795257586.004.1679467161310:',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}

response = requests.get(url, headers=headers)
content = response.content.decode('utf8')
print(type(content)) # <class 'str'>

# Get the names on the hot-search list ([1:] drops the first entry)
names = re.findall('<td class="td-02">.*?<a.*?>(.*?)</a>', content, re.DOTALL)[1:]
# print(names)

# Get the heat values
hots = re.findall('<td class="td-02">.*?<span>(.*?)</', content, re.DOTALL)
# print(hots)

# Store the data
sinas = []
for name,hot in zip(names, hots):
    sina = {
        "name":name,
        "hot":hot
    }
    sinas.append(sina)

print(sinas)
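Pairing two independently extracted lists with zip relies on both lists staying aligned. An alternative sketch captures the name and the heat value with one pattern, so each pair always comes from the same cell. The markup in sample is hypothetical, shaped only after what the patterns above expect, and the live page may differ.

```python
import re

# Hypothetical markup; the first cell has no heat value.
sample = """
<td class="td-02"><a href="/weibo?q=top">pinned topic</a></td>
<td class="td-02"><a href="/weibo?q=a">topic A</a><span>12345</span></td>
<td class="td-02"><a href="/weibo?q=b">topic B</a><span>6789</span></td>
"""

# [^<]* and [^>]* cannot cross a tag boundary, so a match never spills into the next cell.
row = re.compile(
    r'<td class="td-02">\s*<a[^>]*>([^<]*)</a>\s*<span>([^<]*)</span>'
)

sinas = [{"name": name, "hot": hot} for name, hot in row.findall(sample)]
print(sinas)
```

In this sample the cell without a span is skipped automatically, so no manual [1:] slice is needed.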

12. Scraping a paginated site with re

import re
import requests
import csv

gushis = []

urls = []
for i in range(0, 5, 1):
    url = 'https://www.gushiwen.cn/default_{}.aspx'.format(i)
    urls.append(url)
    # print(url)

# Define the request headers
headers = {
    "cookie" : "login=flase; Hm_lvt_9007fab6814e892d3020a64454da5a55=1679637970; __bid_n=187123b38accfde6624207; FPTOKEN=/iNBklILJISaHg5CmgUVlxivXpv2j8GpwuWrSVKewp1C1HAJE873KXSPPU2Wh6ScBaR1VTAH0m+o44lxRanXXJICZc5mUXYJyNY6+YTF25f9/qE9DUYCzxes7r0Xfkzw0qtfDIW9gWbtt37qnkAYymMequLn1jAyYdzl3Q8M8vctJvoKbEZlf4RLlc16+cT4+aIJiHKDbpe0GKunIpw/71nWFSJgRB7FiSx5ucE07KBux6wfEyuIBxeHp3Ujnx8uvaZQVLZCPjfEwnnTiBw0Py1647QplmV8Qd60x1XLo1huueuZB/k8kL8fzD4q3Lx0jdViFQ8LpQa7xr7vJf1MYLIcGUyyRm5EzWxfp3oJ3PBUR9LN4iG4IFTOMlqYw52yq1cDcIW915wgWN0Oy8oiiWZGgNXdH53GeI0o15VjRGdP6GaFoebQa0RT7tDbM23/|U0Mzy8G2Y2S5UU23cAStWAVr0Z0bxruHz6SZnhkZe9Q=|10|46f2a7e0241d7c98763a077979d80ff8; Hm_lpvt_9007fab6814e892d3020a64454da5a55=1679643073",
    "user-agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
}

for url in urls:
    response = requests.get(url, headers=headers)
    content = response.content.decode('utf8')

    # Poem titles
    titles = re.findall('<b>(.*?)</b>', content, re.DOTALL)
    # print(titles)

    # Authors
    authors = re.findall('<p class="source".*?<a.*?>(.*?)</a>', content, re.DOTALL)
    # Dynasties
    dynasties = re.findall('<p class="source".*?<a.*?<a.*?>(.*?)</a>', content, re.DOTALL)
    # Poem text
    poems = re.findall('<div class="contson".*?>(.*?)</div>', content, re.DOTALL)

    new_poems = []
    for poem in poems:
        # Strip the HTML tags, then remove all whitespace (including the full-width space U+3000)
        new_poem = re.sub('<.*?>', '', poem)
        new_poem = re.sub(r'[\s\u3000]', '', new_poem)
        new_poems.append(new_poem)
    # print(new_poems)

    # Store the data
    for title,author,dynasty,new_poem in zip(titles,authors,dynasties,new_poems):
        gushi = {
            "诗名" : title,
            "作者" : author,
            "朝代" : dynasty,
            "内容" : new_poem
        }
        gushis.append(gushi)

# Write the results to a local file
with open("gushis.csv", "w", encoding='utf8', newline="") as f:
    # Field names
    fieldnames = ['诗名', "作者", "朝代", "内容"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    # print(gushis)
    writer.writerows(gushis)
    print("Data written successfully!")
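The cleaning and writing steps can be tried offline before hitting the site. The sketch below wraps the two re.sub calls in a hypothetical helper clean_poem and writes one illustrative row to an in-memory buffer instead of gushis.csv. The fragment in raw is made up to resemble the body of a "contson" div.

```python
import csv
import io
import re

def clean_poem(fragment):
    """Strip tags and all whitespace (including U+3000) from an HTML fragment."""
    no_tags = re.sub(r"<.*?>", "", fragment)
    return re.sub(r"[\s\u3000]", "", no_tags)

# Illustrative fragment; not fetched from the site.
raw = "\n\u3000\u3000床前明月光，<br />疑是地上霜。<br />\n"
rows = [{"诗名": "静夜思", "作者": "李白", "朝代": "唐代", "内容": clean_poem(raw)}]

# Same DictWriter calls as above, but writing to memory so nothing touches disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["诗名", "作者", "朝代", "内容"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

In Python 3, \s already covers U+3000 for str patterns, so listing it explicitly in the set is harmless but not strictly required.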

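Copyright notice: This is an original article by lizyviking_, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.

Original link: https://blog.csdn.net/lizyviking_/article/details/129817820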