Section 7.3: Parsing Web Pages with BeautifulSoup

薯条老师 2021-03-15 08:01:55

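7.3.1 Introduction to BeautifulSoup

Beautiful Soup is a Python library for parsing HTML and XML documents and extracting data from them. The name comes from the classic novel Alice's Adventures in Wonderland. The following verse is taken from Chapter 10 of the book:

Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Soup of the evening, beautiful Soup!
Beau--ootiful Soo--oop!
Beau--ootiful Soo--oop!
Soo--oop of the e--e--evening,
Beautiful, beautiful Soup!

On the Beautiful Soup website you can see the Mock Turtle, the Gryphon, and Alice standing in a dark Wonderland.

In the fairy tale, a bowl of rich green soup costs just two pence, soothing and fragrant. Programmers in the real world also need a bowl of productivity-boosting soup, one that frees us from the tedium of parsing web pages by hand. So let's drink up this delicious soup together.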

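7.3.2 Installing BeautifulSoup

Open a command line and run pip install beautifulsoup4 to install the latest version of Beautiful Soup. To parse web pages with BeautifulSoup in a program, import the class from the bs4 module. After the installation finishes, you can verify it in the interactive interpreter: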

D:\>python
Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup

If the import completes without any error, the installation succeeded.
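As a further check, the installed version can be printed from a script. This is a small sketch rather than a required step; it relies on the __version__ attribute of the bs4 package and on Python's built-in html.parser, and the one-line test document is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# The bs4 package records its version string in __version__
print("Beautiful Soup version:", bs4.__version__)

# Parsing a tiny (made-up) document confirms that the built-in parser works
soup = BeautifulSoup("<p>ok</p>", features="html.parser")
print(soup.p.string)
```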

7.3.3 Constructing a BeautifulSoup Object

After importing BeautifulSoup in the interactive interpreter, run help(BeautifulSoup) to view its constructor:

>>> from bs4 import BeautifulSoup
>>> help(BeautifulSoup)
Help on class BeautifulSoup in module bs4:

class BeautifulSoup(bs4.element.Tag)
| BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)

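Of the constructor's parameters, markup and features are the two to master. markup is the markup text to parse, for example an HTML or XML document. features selects the parser. The parsers commonly used with BeautifulSoup are listed in the following table:

Parser	Constructor call
html.parser	BeautifulSoup(markup, features="html.parser")
lxml	BeautifulSoup(markup, features="lxml")
lxml-xml	BeautifulSoup(markup, features="lxml-xml") or BeautifulSoup(markup, features="xml")
html5lib	BeautifulSoup(markup, features="html5lib")

Of these parsers, html.parser ships with Python and needs no extra installation, while lxml and html5lib must be installed separately. In parsing speed, lxml is the fastest, followed by html.parser, with html5lib the slowest. In tolerance of malformed markup, html5lib is the most lenient, followed by lxml and html.parser. To install lxml from the command line: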

pip install lxml

To install html5lib from the command line:

pip install html5lib
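Because lxml and html5lib are optional installs, a script may want to fall back to html.parser when they are missing. The sketch below shows one way to do that, assuming that bs4 raises FeatureNotFound when the requested parser is not installed. The helper name make_soup and the preference order are illustrative choices, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper written for this example.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, features=parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")


# The p tag is deliberately left unclosed; every parser listed repairs it
soup, used = make_soup("<p>unclosed paragraph")
print(used)
print(soup.p.string)
```

Since html.parser is always available, the loop should always succeed; which parser is reported depends on what is installed on the machine.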

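Code example: constructing a BeautifulSoup object:

from bs4 import BeautifulSoup
HTML = '<html><body><div><h1>www.chipscoco.com</h1></div></body></html>'
# Build a BeautifulSoup object with Python's built-in html.parser
bs = BeautifulSoup(HTML, features="html.parser")
 
# A BeautifulSoup object can also be built from an open file object
# (this assumes index.html exists and is UTF-8 encoded)
with open("index.html", encoding="utf-8") as f:
    bs = BeautifulSoup(f, features="html.parser")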

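7.3.4 The Four Kinds of BeautifulSoup Objects

The following is quoted from the official BeautifulSoup documentation:

Beautiful Soup converts an HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

(1) The BeautifulSoup object

The object returned by the BeautifulSoup constructor is a BeautifulSoup object; it represents the entire HTML or XML document.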
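A short sketch can make this concrete. As the help() output earlier shows, BeautifulSoup is a subclass of Tag, so the document object can be navigated like any tag; its name is the special value [document]. The sample markup below is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

bs = BeautifulSoup("<html><body><p>hello</p></body></html>", features="html.parser")

# The object represents the whole document
print(type(bs).__name__)    # BeautifulSoup
# BeautifulSoup inherits from Tag, so tag-style navigation works on it
print(isinstance(bs, Tag))  # True
# The document object carries a special name
print(bs.name)              # [document]
print(bs.p.string)          # hello
```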

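(2) The Tag object

A tag object corresponds to a tag in the HTML or XML document, such as an h1, img, or a tag. Once the BeautifulSoup object is built, a tag can be reached directly with the dot (member access) operator. The name attribute of a tag object returns the tag name, and the attrs attribute returns a dictionary of the tag's attributes:

from bs4 import BeautifulSoup
HTML = '<h1 id="heading">www.chipscoco.com</h1>'
bs = BeautifulSoup(HTML, features="html.parser")
 
# Access the h1 tag object in the document
tag = bs.h1
# Print the tag name
print(tag.name)
# Print the class and id attributes of the h1 tag. This h1 has no class
# attribute, so use get(), which returns None instead of raising KeyError
print(tag.attrs.get("class"), tag.attrs["id"])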
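The following sketch extends the listing above with a tag that does carry a class attribute. It illustrates two details worth knowing: a tag can be indexed like a dictionary, and class is a multi-valued attribute, so Beautiful Soup returns it as a list. The class names title and big are invented for the example.

```python
from bs4 import BeautifulSoup

HTML = '<h1 class="title big" id="heading">www.chipscoco.com</h1>'
tag = BeautifulSoup(HTML, features="html.parser").h1

# Indexing the tag is shorthand for tag.attrs[...]
print(tag["id"])         # heading
# class may hold several values, so it comes back as a list
print(tag["class"])      # ['title', 'big']
# get() returns None for an attribute the tag does not have
print(tag.get("style"))  # None
```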

(3) The NavigableString object

The string attribute of a tag object is a NavigableString object:

from bs4 import BeautifulSoup
HTML = '<h1 id="heading">www.chipscoco.com</h1>'
bs = BeautifulSoup(HTML, features="html.parser")
 
# Access the string attribute of the tag object
navigable_string = bs.h1.string
print(type(navigable_string))

"""
Output: <class 'bs4.element.NavigableString'>
 
"""

To handle a NavigableString as an ordinary string object, convert it with str(). (The unicode() function served this purpose in Python 2 but does not exist in Python 3.)
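A minimal sketch of that conversion, reusing the same one-line document. It also shows that a NavigableString keeps a link back to the tag that contains it, which a plain string does not.

```python
from bs4 import BeautifulSoup

HTML = '<h1 id="heading">www.chipscoco.com</h1>'
bs = BeautifulSoup(HTML, features="html.parser")

navigable_string = bs.h1.string
# str() produces a plain Python string detached from the parse tree
plain = str(navigable_string)
print(type(plain))                   # <class 'str'>
print(plain.upper())                 # WWW.CHIPSCOCO.COM

# The NavigableString itself still knows which tag it belongs to
print(navigable_string.parent.name)  # h1
```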

(4) The Comment object

A Comment object is a special kind of NavigableString. When a tag contains only a comment, the tag's string attribute returns a Comment object:

from bs4 import BeautifulSoup
HTML = '<b><!--it is a comment?--></b>'
bs = BeautifulSoup(HTML, features="html.parser")
 
# Access the string of the b tag; here the string is a Comment object
comment = bs.b.string
print(type(comment))

"""
Output: <class 'bs4.element.Comment'>
"""

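In a crawler it is often useful to tell comments apart from visible text. The sketch below, using a made-up paragraph, separates the two with isinstance. Note that when a tag has more than one child, its string attribute is None, so the children are inspected one by one instead.

```python
from bs4 import BeautifulSoup, Comment

HTML = '<p>Hello<!-- hidden note -->World</p>'
bs = BeautifulSoup(HTML, features="html.parser")

visible = []
comments = []
for node in bs.p.contents:
    # Comment is a subclass of NavigableString, so check for it first
    if isinstance(node, Comment):
        comments.append(str(node))
    else:
        visible.append(str(node))

print(visible)      # ['Hello', 'World']
print(comments)     # [' hidden note ']
print(bs.p.string)  # None, because p has three children
```
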
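7.3.5 Traversing Nodes with BeautifulSoup

Node traversal here mainly means walking from the current node to its parent, child, or sibling nodes.

(1) Traversing child nodes

The contents attribute of a tag object returns all direct children of the current node as a list. Note that the elements of this list are tag objects or, where the node contains text, NavigableString objects. The children attribute also walks the direct children of the current node; it is an iterator and can be consumed in a loop.

Example code

from bs4 import BeautifulSoup
HTML = '<div><p></p><h1></h1></div>'
# In the HTML above, the p node and the h1 node are both children of the div node
bs = BeautifulSoup(HTML, features="html.parser")
 
div = bs.div
print(div.contents)

"""
Output: [<p></p>, <h1></h1>]
"""
 
# Iterate over all direct children of the div node with children
for child in div.children:
    print(child.name)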
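Real pages usually contain line breaks and indentation between tags, and those show up as text nodes among the children. The following sketch, with a made-up snippet, shows the effect and one way to keep only the tag children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Unlike the listing above, this markup has newlines between the tags
HTML = '<div>\n<p>one</p>\n<h1>two</h1>\n</div>'
bs = BeautifulSoup(HTML, features="html.parser")
div = bs.div

# Three newline text nodes plus two tags
print(len(div.contents))  # 5

# Keep only the children that are tags
tags_only = [child for child in div.children if isinstance(child, Tag)]
names = [tag.name for tag in tags_only]
print(names)              # ['p', 'h1']
```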

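(2) Traversing parent nodes

Use the parent or parents attribute to move up from the current node. parent points to the node's immediate parent; parents is a generator that yields each ancestor in turn, from the immediate parent up to the root of the document.

Example code

from bs4 import BeautifulSoup
HTML = '<body><div><p></p><h1></h1></div></body>'
# In the HTML above, div is the parent of h1, and body is the parent of div
bs = BeautifulSoup(HTML, features="html.parser")
 
h1 = bs.h1
# Get the parent of the h1 node, which is div
print(h1.parent.name)

"""
Output: div
"""
 
# Walk all ancestors of the h1 node in a loop
for parent in h1.parents:
    # Each parent is a tag object; the last one is the BeautifulSoup
    # object itself, whose name is [document]
    print(parent.name)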
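One practical use of parents is to describe where a tag sits in the document. The sketch below builds such a path; tag_path is a hypothetical helper written for this example and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def tag_path(tag):
    """Return a path such as 'body > div > h1' for the given tag."""
    # Skip the BeautifulSoup object at the top, whose name is [document]
    names = [tag.name] + [p.name for p in tag.parents if p.name != "[document]"]
    return " > ".join(reversed(names))


HTML = '<body><div><p></p><h1></h1></div></body>'
bs = BeautifulSoup(HTML, features="html.parser")

path = tag_path(bs.h1)
print(path)  # body > div > h1
```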

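(3) Traversing sibling nodes

Use next_sibling or previous_sibling to reach the siblings of the current node: next_sibling is the node that follows it and previous_sibling is the node that precedes it. To walk all following or all preceding siblings, iterate over next_siblings or previous_siblings.

Example code

from bs4 import BeautifulSoup
HTML = '<div><ul></ul><p></p><a></a><h1></h1><img /></div>'
bs = BeautifulSoup(HTML, features="html.parser")
 
a = bs.a
 
# Walk all preceding siblings of the a node in a loop
# In the HTML above, the preceding siblings of the a node are p and ul
for previous_sibling in a.previous_siblings:
    # Each previous_sibling here is a tag object
    print(previous_sibling.name)
 
"""
Output:
p
ul
"""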

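For completeness, here is a sketch of the forward direction on the same markup, together with the single-step attributes. A node with no sibling on one side returns None for that attribute.

```python
from bs4 import BeautifulSoup

HTML = '<div><ul></ul><p></p><a></a><h1></h1><img /></div>'
bs = BeautifulSoup(HTML, features="html.parser")
a = bs.a

# Single steps: the node right after and right before a
print(a.next_sibling.name)      # h1
print(a.previous_sibling.name)  # p

# All siblings that follow a, in document order
following = [sibling.name for sibling in a.next_siblings]
print(following)                # ['h1', 'img']

# ul is the first child of div, so nothing precedes it
print(bs.ul.previous_sibling)   # None
```
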
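7.3.6 Searching the Document with BeautifulSoup

This section covers the search methods most often used when writing crawlers; for the complete set of operations, consult the official documentation.

Official BeautifulSoup 4.4.0 documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

① find(name=None, attrs={}, recursive=True, text=None, **kwargs)

Parameter	Description
name	The tag name to search for
attrs	A dictionary of attribute name/value pairs the tag must have
recursive	Whether to search all descendants recursively (True) or only direct children (False)
text	The text to match
kwargs	Additional keyword arguments; for example, id and class_ find tags with a given id or class

Return value

The first matching tag object, or None if nothing matches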
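The sketch below exercises the parameters from the table on a small made-up document. The ids and class names are invented for illustration.

```python
from bs4 import BeautifulSoup

HTML = """
<div id="main">
  <h1 id="heading" class="title">Heading</h1>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
</div>
"""
bs = BeautifulSoup(HTML, features="html.parser")

by_name = bs.find("p")                    # first p tag in the document
by_id = bs.find(id="heading")             # keyword argument for the id
by_class = bs.find("p", class_="intro")   # class_ because class is a keyword
by_attrs = bs.find(attrs={"id": "main"})  # same idea through attrs
missing = bs.find("table")                # no such tag in the document

print(by_name.string)     # First paragraph
print(by_id.name)         # h1
print(by_class["class"])  # ['intro']
print(by_attrs.name)      # div
print(missing)            # None
```

Because find returns None when nothing matches, it is worth checking the result before reading attributes from it.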

② find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

"""
The parameters are the same as for find; limit caps the number of tags returned.
"""

"""
Return value:
find_all returns a list containing every matching tag
"""

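A brief sketch of find_all on a made-up list, including the limit parameter. When nothing matches, the result is an empty list rather than None.

```python
from bs4 import BeautifulSoup

HTML = """
<ul>
  <li class="item">one</li>
  <li class="item">two</li>
  <li class="item last">three</li>
</ul>
"""
bs = BeautifulSoup(HTML, features="html.parser")

items = bs.find_all("li")               # every li tag
first_two = bs.find_all("li", limit=2)  # stop after two matches
last = bs.find_all("li", class_="last") # matches one of several classes
none = bs.find_all("table")             # nothing matches

texts = [str(li.string) for li in items]
print(texts)           # ['one', 'two', 'three']
print(len(first_two))  # 2
print(last[0].string)  # three
print(len(none))       # 0
```
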
③ select(selector, namespaces=None, limit=None, **kwargs)

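Parameter	Description
selector	A CSS selector string
namespaces	A dictionary that maps namespace prefixes to namespace URIs; the default value is usually sufficient
limit	The maximum number of tags to return
kwargs	Additional keyword arguments; see the official documentation for details

Return value

A list containing every matching tag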
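The following sketch shows a few common CSS selectors with select, again on a made-up document. It also uses select_one, a companion method that returns only the first match. Both assume a Beautiful Soup release recent enough to ship CSS selector support through the soupsieve package.

```python
from bs4 import BeautifulSoup

HTML = """
<div id="main">
  <h1 id="heading">Heading</h1>
  <p class="intro">First <a href="/a">link A</a></p>
  <p>Second <a href="/b">link B</a></p>
</div>
"""
bs = BeautifulSoup(HTML, features="html.parser")

# Descendant selector: every a inside a p inside #main
hrefs = [a["href"] for a in bs.select("#main p a")]
# Class selector
intro = bs.select("p.intro")
# limit caps the number of results
limited = bs.select("p", limit=1)
# select_one returns the first match instead of a list
heading = bs.select_one("#heading")

print(hrefs)           # ['/a', '/b']
print(len(intro))      # 1
print(len(limited))    # 1
print(heading.string)  # Heading
```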

7.3.7 BeautifulSoup in Practice

Follow the steps below for a hands-on walk-through of parsing a web page with BeautifulSoup.

(1) Create chipscoco.html

Create an HTML file named chipscoco.html with the following content:

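<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
        <title>Python Web Crawlers: Core Principles and Practice</title>
    </head>

    <body>
        <div id="header">
            <div id="logo">
                <a href="https://www.chipscoco.com/"><img src="logo.png"></a>
            </div>
        </div>

        <div id="main">
            <h1 id="heading">The Entanglement of Crawlers and Data</h1>
            <p>
                Across the vast internet there lives a kind of "crawler creature" that roams the World Wide Web, devours the pages that carry information, and hands them to a search engine to be digested,
                absorbed, and finally "hatched" into structured data that people can search and display quickly. This "creature" is called the "web spider".<br/>
                Real spiders look frightening, so most people overlook the fact that they benefit humans.<br/>
                A web spider feeds on data, yet the producers of that data, websites, rely on crawlers to submit their pages to search engines
            </p>
        </div>

    </body>
</html>

(2) Create bs_parser.py

bs_parser.py lives in the same directory as chipscoco.html. The code in bs_parser.py:

from bs4 import BeautifulSoup
 
 
if __name__ == "__main__":
    """
    (1) Build the BeautifulSoup object with the lxml parser
    (2) The lxml parser must be installed first by running pip install lxml on the command line
    """
 
    # Read the contents of chipscoco.html; the with block closes the file afterwards
    with open("chipscoco.html", "r", encoding="utf-8") as html:
        bs = BeautifulSoup(html, "lxml")
 
    # Read the text of the HTML title tag directly from the BeautifulSoup object
    title = bs.title.string
    
    # Use the attrs parameter of find to get the tag whose id is heading
    h1 = bs.find(attrs={"id": "heading"})
    h1_text = h1.string if h1 else ""  
 
    # Call find on the BeautifulSoup object to get the a tag in the HTML
    link = bs.find("a")
 
    # Read the href attribute of the a tag
    href = link["href"] if link else ""
    print("html title:{}\narticle title:{}\nhref:{}".format(title, h1_text, href))

(3) Run bs_parser.py

Open a command line, change to the directory that contains bs_parser.py, and run python bs_parser.py. The program prints:

html title:Python Web Crawlers: Core Principles and Practice
article title:The Entanglement of Crawlers and Data
href:https://www.chipscoco.com/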
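As a variation on the steps above, the same kind of extraction can be written with CSS selectors. The sketch below is self-contained: it parses a shortened stand-in page held in a string (not the real chipscoco.html) and uses html.parser so that no extra install is needed. The page text is invented for illustration.

```python
from bs4 import BeautifulSoup

# A shortened, made-up stand-in for chipscoco.html
PAGE = """
<html><head><title>Sample page</title></head>
<body>
<div id="header"><div id="logo">
<a href="https://www.chipscoco.com/"><img src="logo.png"></a>
</div></div>
<div id="main">
<h1 id="heading">Sample heading</h1>
<p>First line.<br/>Second line.</p>
</div>
</body></html>
"""

bs = BeautifulSoup(PAGE, features="html.parser")

# The logo image nested inside the #logo container
logo_src = bs.select_one("#logo img")["src"]

# stripped_strings yields each piece of text with whitespace trimmed,
# which splits the paragraph at its br tags
paragraph = bs.select_one("#main p")
lines = list(paragraph.stripped_strings)

print(logo_src)  # logo.png
print(lines)     # ['First line.', 'Second line.']
```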

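7.3.8 Key Points

(1) The name Beautiful Soup comes from the classic novel Alice's Adventures in Wonderland.
(2) Run pip install beautifulsoup4 to install the latest version of BeautifulSoup. To parse web pages with BeautifulSoup in a program, import the class from the bs4 module.
(3) When constructing a BeautifulSoup object, specify the parser to use. lxml and html5lib must both be installed separately, with pip install lxml and pip install html5lib respectively.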
