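BeautifulSoup4 Study Notes

When scraping pages you can combine urllib with regular expressions, but regular expressions are tedious to write and easy to get wrong, which is where bs4 comes in. Besides bs4 there is also Scrapy, a full-scale crawling framework.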

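Quick start

import urllib.request
from bs4 import BeautifulSoup

url = 'http://example.com/'  # placeholder URL, replace with the page to scrape
header = {'User-Agent': 'Mozilla/5.0'}  # placeholder request headers

# Fetch the page and build a BeautifulSoup object
req = urllib.request.Request(url, headers=header)
response = urllib.request.urlopen(req)
data = response.read()
soup = BeautifulSoup(data, 'html.parser')
# Print the parsed document, pretty-printed
print(soup.prettify())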
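The same steps can be tried without touching the network. The sketch below assumes a hard-coded byte string (a made-up stand-in for what response.read() would return) and only shows that BeautifulSoup accepts raw bytes directly:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by response.read()
data = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# html.parser ships with Python, so no extra parser package is needed
soup = BeautifulSoup(data, "html.parser")

title_text = soup.title.string
body_text = soup.p.get_text()
print(title_text)  # Demo
print(body_text)   # Hello
```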

Navigating structured data

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
# The value returned here has type <class 'bs4.element.NavigableString'>
# To get a plain str, use str(soup.title.string)
soup.title.parent.name
# 'head'
# The name of the <title> tag's parent tag
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
# The first <p> tag
soup.p['class']
# ['title']
# The <p> tag's class attribute (class is multi-valued, so a list is returned)
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Getting the href of every <a> tag

for link in soup.find_all('a'):
    print(link.get('href'))

Getting all the text in the document

p = soup.p
print(p.get_text())
print(soup.get_text())
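get_text() also accepts separator and strip arguments, which help when the text is spread over several child tags. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment, used only for illustration
html = "<p>Once upon a time <a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>"
soup = BeautifulSoup(html, "html.parser")

# Default: every string concatenated as-is
plain = soup.get_text()

# Each string is stripped, empty ones are dropped, the rest are joined with "|"
parts = soup.get_text(separator="|", strip=True)

print(plain)  # Once upon a time Elsie, Lacie and Tillie
print(parts)  # Once upon a time|Elsie|,|Lacie|and|Tillie
```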

Object types

Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node is one of four kinds of object:
Tag, NavigableString, BeautifulSoup, and Comment.

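Tag

A Tag corresponds to an HTML tag in the document, for example the element wrapping the text hello beautiful.
The two features that matter most are a tag's name and its attributes (attrs).

# Every tag has a name
# Attributes are what JavaScript calls attr, e.g. class, data-id, id
# Attributes are accessed like dictionary keys
# The 'lxml' parser requires the lxml package to be installed
soup1 = BeautifulSoup("<p class='p' id='p1'></p>", 'lxml')
p = soup1.p
print(p['class'])
print(p.attrs)
# ['p']
# {'class': ['p'], 'id': 'p1'}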

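Multi-valued attributes

Some attributes can hold several values; the most common is class. Accessing such an attribute like a dictionary key returns a list containing every class value:

soup1 = BeautifulSoup("<p class='p p1 p2' id='p1'></p>", 'lxml')
p = soup1.p
print(p['class'])
# ['p', 'p1', 'p2']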
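By contrast, an attribute such as id is not treated as multi-valued, so it comes back as a single string even when it contains a space. A short sketch on made-up markup, which also shows one way to turn the class list back into the original attribute text:

```python
from bs4 import BeautifulSoup

# Made-up markup: the id deliberately contains a space
soup = BeautifulSoup("<p class='p p1 p2' id='my id'></p>", "html.parser")
p = soup.p

classes = p["class"]            # list, because class is multi-valued
pid = p["id"]                   # plain string, because id is not
class_attr = " ".join(classes)  # back to the original attribute text
has_p1 = "p1" in p.get("class", [])

print(classes)     # ['p', 'p1', 'p2']
print(pid)         # my id
print(class_attr)  # p p1 p2
print(has_p1)      # True
```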

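NavigableString (a navigable string)

Text usually sits inside a tag. Beautiful Soup wraps that text in the NavigableString class:

# tag is assumed to be a tag whose only content is the text "Extremely bold"
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>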

Converting to a plain string

# On Python 3, str() takes the place of the Python 2 unicode() call
unicode_string = str(tag.string)
unicode_string
# 'Extremely bold'
type(unicode_string)
# <class 'str'>

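NavigableString objects support most of the attributes described under traversing and searching the document tree, but not all of them. In particular, a string cannot contain anything else (a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method.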

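BeautifulSoup

A BeautifulSoup object represents the whole content of a document. Most of the time you can treat it as a Tag object.

Because a BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes. It is still sometimes handy to look at its .name, so the BeautifulSoup object carries a special .name whose value is "[document]".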
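A quick check of the special .name value, on a made-up one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")

doc_name = soup.name   # special name of the BeautifulSoup object itself
tag_name = soup.p.name # ordinary tag name

print(doc_name)  # [document]
print(tag_name)  # p
```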

Comments and other special strings

The main use is reading the content of a comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))
print(comment)
# <class 'bs4.element.Comment'>
# Hey, buddy. Want to buy a used parser?

A Comment object is a special type of NavigableString.
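Because Comment is a subclass of NavigableString, an isinstance check can tell comments apart from ordinary text. The sketch below adds a made-up <i> tag to the markup and collects only the comments; filtering with a function passed as string= is one possible approach, not the only one:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# The <i> tag is added here only to have some ordinary text next to the comment
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b><i>visible</i>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string
is_comment = isinstance(comment, Comment)
is_navigable = isinstance(comment, NavigableString)

# Keep only the string nodes that are comments
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(is_comment, is_navigable)  # True True
print(comments)
```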

Core: traversing the document tree

The examples below use the following HTML:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
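"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Working with tags
head = soup.head
title = soup.title
b = soup.body.b
# Dot access returns only the first matching tag; to get every match, use find_all (findAll is the legacy alias)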
a_list = soup.find_all('a')
print(a_list)
print(type(a_list))
print(soup.find_all('b'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#<class 'bs4.element.ResultSet'>
#[<b>The Dormouse's story</b>]

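.contents and .children

A tag's .contents attribute returns the tag's direct children as a list; .children yields the same nodes through an iterator:

# In the HTML above, the first <p> contains a <b> child; the code below prints the list of that <p>'s children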
tags = soup.p.contents
print(tags)
#[<b>The Dormouse's story</b>]
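.children walks the same direct children without building a list first. A minimal sketch, using a made-up <p> with an extra <i> child so that there is more than one node to iterate over:

```python
from bs4 import BeautifulSoup

# Made-up fragment: the <i> tag is not part of the original example
html = '<p class="title"><b>The Dormouse\'s story</b><i>extra</i></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

contents = p.contents                         # a list of direct children
names = [child.name for child in p.children]  # iterate over the same nodes

print(len(contents))  # 2
print(names)          # ['b', 'i']
```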

.descendants

The .contents and .children attributes only cover a tag's direct children.
.descendants also returns the children's children (grandchildren), and so on down the tree.

print(type(soup.body))
#<class 'bs4.element.Tag'>
tags = soup.body
for child in tags.descendants:
    print(child)
# <p class="title"><b>The Dormouse's story</b></p>
# <b>The Dormouse's story</b>
# The Dormouse's story
#
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# Once upon a time there were three little sisters; and their names were
#
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Elsie
# ,
#
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# Lacie
#  and
#
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# Tillie
# ;
# and they lived at the bottom of a well.
#
#
# <p class="story">...</p>
# ...

.string

If a tag has exactly one child and that child is a NavigableString, .string returns that child.
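A small sketch of how .string behaves, on made-up markup. It also covers two neighbouring cases: a tag whose only child is another tag, and a tag with several children:

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration
html = "<p><b>The Dormouse's story</b></p><div>one<i>two</i></div>"
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string      # <b> has a single string child
p_string = soup.p.string      # <p>'s only child is <b>, so its string is passed up
div_string = soup.div.string  # <div> has two children, so the result is None

print(b_string)    # The Dormouse's story
print(p_string)    # The Dormouse's story
print(div_string)  # None
```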

Parent node: .parent

The .parent attribute returns an element's parent node. In the example "Alice" document, the <head> tag is the parent of the <title> tag.
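A minimal sketch of .parent at several levels, on a cut-down version of the example document:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")
title = soup.title

tag_parent = title.parent.name            # parent of <title>
string_parent = title.string.parent.name  # a string's parent is its enclosing tag
top_parent = soup.html.parent.name        # parent of <html> is the BeautifulSoup object
root_parent = soup.parent                 # the BeautifulSoup object has no parent

print(tag_parent)     # head
print(string_parent)  # title
print(top_parent)     # [document]
print(root_parent)    # None
```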

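All ancestors: .parents

The counterpart of .descendants: it iterates over all of an element's ancestors, from the nearest parent up to the top of the document.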
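A short sketch that lists every ancestor of an <a> tag, on a cut-down version of the example document:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# Walk upward from the nearest parent to the BeautifulSoup object
names = [parent.name for parent in link.parents]

print(names)  # ['p', 'body', 'html', '[document]']
```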

Sibling nodes
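A minimal sketch of sibling navigation, using Beautiful Soup's .next_sibling, .previous_sibling and .next_siblings attributes on a small made-up document:

```python
from bs4 import BeautifulSoup

# Made-up fragment: <b> and <c> are siblings because they share the parent <a>
soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

after_b = soup.b.next_sibling       # the <c> tag
before_c = soup.c.previous_sibling  # the <b> tag
before_b = soup.b.previous_sibling  # nothing precedes <b>, so None
following = [s.name for s in soup.b.next_siblings]

print(after_b)    # <c>text2</c>
print(before_c)   # <b>text1</b>
print(before_b)   # None
print(following)  # ['c']
```

In real pages the nearest sibling is often a whitespace string rather than a tag, so the result of .next_sibling may need to be checked before use.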

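Searching the document tree (important)

The sections above mostly cover how to reach a tag, the elements inside it, and its parent, child and sibling elements.
This section covers how to search the document.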

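find and find_all

First, the kinds of filter you can pass in.

Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The example below finds every <b> tag in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]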

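Regular expressions

If you pass in a regular expression object, Beautiful Soup filters against it using its search() method. The example below finds every tag whose name starts with b, so both <body> and <b> are found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b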
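To make the role of the ^ anchor visible, the sketch below runs an anchored and an unanchored pattern against a small made-up document. Without the anchor, any tag name that contains the letter is matched:

```python
import re
from bs4 import BeautifulSoup

# Made-up document with the same tag names as the example
html = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: tag names that start with "b"
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

# Unanchored: tag names that contain "t" anywhere
contains_t = [t.name for t in soup.find_all(re.compile("t"))]

print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']
```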

Lists

If you pass in a list, Beautiful Soup returns everything that matches any item in the list. The code below finds every <a> tag and every <b> tag in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

True matches any value. The code below finds every tag in the document but returns no string nodes:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

Some search examples

Find every <p> tag:

soup.find_all('p')

Find the tag whose id is apple:

soup.find_all(id='apple')

Search by arbitrary attributes:

data_soup.find_all(attrs={"attr": "value"})
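The attrs dictionary is most useful for attribute names that cannot be written as Python keyword arguments, such as data-* attributes. A minimal sketch, assuming a made-up <div> with a data-foo attribute:

```python
from bs4 import BeautifulSoup

# Made-up markup: data-foo is not a valid Python identifier,
# so it cannot be passed as find_all(data-foo="value")
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

matches = data_soup.find_all(attrs={"data-foo": "value"})
no_matches = data_soup.find_all(attrs={"data-foo": "other"})

print(matches)     # [<div data-foo="value">foo!</div>]
print(no_matches)  # []
```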

Find the <a> elements whose class is sister (class is a reserved word in Python, so the keyword is class_):

soup.find_all('a', class_='sister')

The string argument

The string argument searches the text in the document. Like the name argument, it accepts a string, a regular expression, a list, or True. For example:

soup.find_all(string="Elsie")
# ['Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The limit argument

# Return only the first two matches
soup.find_all("a", limit=2)
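A short sketch of limit next to find(), which behaves much like find_all with limit=1 except for what it returns when nothing matches. The three links are a cut-down version of the example document:

```python
from bs4 import BeautifulSoup

html = (
    '<a id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    '<a id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("a", limit=2)  # stops after two matches
first = soup.find("a")                   # a single tag, not a list

missing_one = soup.find("table")         # None when nothing matches
missing_all = soup.find_all("table")     # empty result when nothing matches

print([a["id"] for a in first_two])  # ['link1', 'link2']
print(first["id"])                   # link1
print(missing_one)                   # None
print(len(missing_all))              # 0
```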

CSS selectors

# Get the title
soup.select("title")
# [<title>The Dormouse's story</title>]
# Get the third <p> element
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags nested beneath other tags:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags that are direct children of another tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []

Find sibling tags:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class name:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Elements whose class attribute contains the word sister
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find elements that match any of several CSS selectors:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# Elements whose href starts with http://example.com/
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

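Match on language codes:

multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup, 'html.parser')
# All <p> elements whose lang is en or starts with en-
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]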

Return only the first element found:

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
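Unlike select(), which returns a list, select_one() returns a single tag, or None when the selector matches nothing. A minimal sketch on a cut-down version of the example document (the .brother class is made up and deliberately absent):

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one(".sister")     # first match only
all_found = soup.select(".sister")     # every match, as a list
missing = soup.select_one(".brother")  # no such class in this document

print(first["id"])     # link1
print(len(all_found))  # 2
print(missing)         # None
```

Checking the result for None before indexing into it avoids a TypeError on pages where the element is absent.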
