Parsing with bs4

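Install: pip install bs4 (this pulls in the beautifulsoup4 package; the lxml parser used below is installed separately with pip install lxml)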

Sample document: Alice in Wonderland

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>

<p class="story">...</p>
"""

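Pretty-printing the document with bs4

from bs4 import BeautifulSoup
# Use lxml as the parser
soup = BeautifulSoup(html_doc, "lxml")
# Print the parsed document with one tag per line and consistent indentation
print(soup.prettify())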

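lxml is the recommended parser because it is faster. On Python 2 releases before 2.7.3 and Python 3 releases before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not robust enough.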
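The parser choice above can be made tolerant of a missing lxml install. The following is a minimal sketch, not part of the original article: make_soup is a hypothetical helper name, and falling back to the standard library's "html.parser" is an assumption about what is acceptable for your input.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is installed, else with html.parser.

    make_soup is a hypothetical helper used only for this illustration.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

The two parsers can build slightly different trees for malformed HTML, so pin one parser explicitly when results must be reproducible.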

Ways to navigate the structured data

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.title.name)
# title

print(soup.title.string)
# The Dormouse's story

print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# class is a multi-valued attribute, so its value is a list
print(soup.p['class'])
# ['title']

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
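Two details of the lookups above are easy to trip over: class comes back as a list, and find() returns None when nothing matches. The sketch below illustrates both under stated assumptions: the extra class name "lead" and the id "link9" are made up for the example, and "html.parser" is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: "lead" is an invented second class name.
html = (
    '<p class="title lead"><b>The Dormouse\'s story</b></p>'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# class is multi-valued, so indexing returns a list of class names.
classes = soup.p["class"]

# find() returns None when no element matches; check before using it.
missing = soup.find(id="link9")

# .get() returns None for an absent attribute, while tag["title"]
# would raise KeyError.
first_id = soup.a.get("id")
no_attr = soup.a.get("title")

print(classes, missing, first_id, no_attr)
```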

Extracting the href attribute of every <a> tag

for link in soup.find_all("a"):
    print(link.get("href"))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
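When some <a> tags have no href, link.get("href") yields None for them. A small sketch of one way to skip such tags by passing href=True to find_all(); the fragment and the id "anchor-only" are invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the middle <a> has no href attribute.
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="anchor-only">no link here</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Without a filter, the tag that lacks href contributes None.
all_values = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_values)
print(hrefs)
```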

Getting all the text content

print(soup.get_text())
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
#             Elsie,
#             Lacie and
#             Tillie;
#             and they lived at the bottom of a well.
# ...
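As the output above shows, get_text() keeps the source indentation and line breaks. A minimal sketch of two ways to get cleaner text, using a shortened fragment made up for the example: the separator and strip arguments of get_text(), and the stripped_strings generator.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the Alice document.
html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
    <a>Elsie</a>
    <a>Lacie</a>
    <a>Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator is placed between the fragments that remain.
text = soup.get_text(separator="|", strip=True)

# stripped_strings yields the same trimmed fragments one at a time.
parts = list(soup.stripped_strings)

print(text)
print(parts)
```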

Traversing the document tree

Using the Alice document as the example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>

<p class="story">...</p>
"""

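Child nodes

A Tag may contain several strings or other Tags; all of these are its child nodes. Beautiful Soup provides many attributes for working with and iterating over child nodes.

The simplest way to navigate the tree is to name the tag you want. To get the <head> tag, use soup.head: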

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

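This is a handy trick for getting a tag: you can chain the attribute access repeatedly to move deeper into the tree.

The following code gets the first <p> tag inside the <body> tag:

soup.body.p
# <p class="title"><b>The Dormouse's story</b></p>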

Dotted attribute access returns only the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
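Dotted access only reaches the first matching tag, so iterating over all children needs the child-node attributes mentioned at the start of this section. A minimal sketch using .contents, .children and find_all(..., recursive=False); the one-line fragment is a simplified stand-in for the story paragraph, not the original markup.

```python
from bs4 import BeautifulSoup, Tag

# Simplified, hypothetical version of the "story" paragraph.
html = (
    '<p class="story">Once <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
story = soup.p

# .contents is a list of direct children: strings and tags both count.
child_count = len(story.contents)

# .children iterates over the same nodes; keep only the Tag objects.
child_tag_ids = [c["id"] for c in story.children if isinstance(c, Tag)]

# recursive=False limits find_all() to direct children.
direct_links = story.find_all("a", recursive=False)

print(child_count, child_tag_ids, len(direct_links))
```

Here the paragraph has seven direct children: four strings and three <a> tags.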

Parent nodes

Every tag and every string has a parent: the tag that contains it.

parent

Use the parent attribute to get an element's parent. In the Alice document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>

title_tag.parent
# <head><title>The Dormouse's story</title></head>

Sibling nodes

Consider this snippet:

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "lxml")
print(soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

The <b> tag and the <c> tag are at the same level: they are children of the same element, so <b> and <c> are called siblings. When a document is pretty-printed, siblings share the same indentation level. The same relationship can be used in code.

next_sibling and previous_sibling

Use the next_sibling and previous_sibling attributes to query sibling nodes in the tree:

# Next sibling
soup.b.next_sibling
# <c>text2</c>

# Previous sibling
soup.c.previous_sibling
# <b>text1</b>

Searching the document tree

Beautiful Soup defines many search methods. This section focuses on two of them: find() and find_all(). The other methods take similar arguments and are used in a similar way.

Again using the Alice document as the example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tags whose name is exactly that string. The following example finds all <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

Lists

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> tags and <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Searching by CSS class

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. Since Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ argument:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The limit argument

Three tags in the document tree match the search, but only two are returned because the number of results is limited:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags with CSS selector syntax:

soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Finding tags nested under other tags

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]

Finding direct children of a tag

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

Finding by CSS class name

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finding by tag id

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
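To relate the search methods and the CSS selectors, here is a small sketch comparing find(), find_all() with limit, select() and select_one() on the same fragment. The fragment is a trimmed copy of the story paragraph, and the class name "brother" is invented to show the no-match case.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match itself (or None), not a list.
first_by_find = soup.find("a", class_="sister")

# find_all(..., limit=1) returns a list holding that same first match.
first_by_limit = soup.find_all("a", class_="sister", limit=1)

# select_one() is the CSS-selector counterpart of find().
first_by_css = soup.select_one("a.sister")

# select() always returns a list, like find_all().
ids = [a["id"] for a in soup.select("p > a.sister")]

# "brother" is a made-up class: nothing matches, so the result is None.
no_match = soup.select_one("a.brother")

print(first_by_find["id"], first_by_css["id"], ids, no_match)
```

Which style to use is mostly a matter of taste: keyword filters read naturally for single-attribute lookups, while selectors are more compact for nested or positional conditions.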

Original article: http://www.cnblogs.com/blog4lyh/p/16904853.html
