Diff

1,155 bytes added, 14:14, 18 September 2020 (Fri)

→ Parsing

==[[Beautiful Soup]] (HTML/XML parsing)==

[[Python]] |

*http://www.crummy.com/software/BeautifulSoup/

**http://www.crummy.com/software/BeautifulSoup/documentation.html

**http://tdoc.info/beautifulsoup/ (Japanese)

===Beautiful Soup 4===

*https://www.crummy.com/software/BeautifulSoup/bs4/doc/

*http://kondou.com/BS4/ Beautiful Soup 4 (Japanese)

*Development of BS3 ended in May 2012, and using BS4 is now recommended

*BS3 does not support Python 3

*However, most BS3 scripts also run on BS4 just by changing the import statement

==Installation==

===Installing with [[PIP]]===

# pip install BeautifulSoup

*Python3

$ pip install beautifulsoup4

===Download and extract===

===Installing a parser===

*[[Beautiful Soup]] supports the HTML parser included in Python's standard library.

*It also supports several third-party parsers.

**One of them is the lxml parser, which can be installed as follows.

$ apt-get install python-lxml

$ pip install lxml

*The following parsers are available

**Parser from the [[Python]] standard library

**lxml [[HTML]] parser

**lxml [[XML]] parser

**html5lib
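As a minimal sketch of how a parser from the list above is chosen, the parser name can be passed as the second argument to the BeautifulSoup constructor. The sample markup is made up for illustration, and only the standard-library parser is used here so that nothing extra has to be installed; the names "lxml" and "html5lib" can be passed in the same position once those packages are installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the original page.
markup = "<p>Hello <b>world</b></p>"

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(markup, "html.parser")

print(soup.b.string)
```

Naming the parser explicitly keeps the result the same on machines where different third-party parsers happen to be installed.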

==Import==

from BeautifulSoup import BeautifulSoup # For processing HTML

from BeautifulSoup import BeautifulStoneSoup # For processing XML

import BeautifulSoup # To get everything

from bs4 import BeautifulSoup # To get everything
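For a script that has to run against either version, one possible approach (a sketch, not something the original page prescribes) is to try the BS4 import first and fall back to the BS3 module name only when bs4 is missing.

```python
try:
    # Beautiful Soup 4 (works on Python 2 and Python 3)
    from bs4 import BeautifulSoup
except ImportError:
    # Beautiful Soup 3 fallback; this module only exists on Python 2
    from BeautifulSoup import BeautifulSoup

# Either way, the class is available under the same name afterwards.
print(BeautifulSoup.__name__)
```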

==Parsing==

===Parsing a document from a string or file handle===

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")
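The two calls above can be tried end to end with the following sketch. The file name index.html, the sample markup and the temporary directory are assumptions made for the example; the file is opened in a with block so that the handle is closed after parsing.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative document, not taken from the original page.
html = "<html><head><title>Sample</title></head><body><p>data</p></body></html>"

# 1) Parse directly from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Parse from an open file handle (written to a temporary directory first).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as out:
        out.write(html)
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.title.string)
```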

===Parsing from a URL===

*[https://docs.python.org/ja/2.7/library/urllib2.html The urllib2 module was split into urllib.request and urllib.error in Python 3.]

<pre>
import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://xxxxx.com'))
</pre>

*Python3

<pre>
import urllib.request
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.request.urlopen('http://xxxxx.com'))
</pre>

===Converting the encoding===

*What to do when the text comes out garbled (for example with Shift_JIS)

<pre>
response = urllib.request.urlopen(url)
html = response.read().decode(response.headers.get_content_charset(), errors='ignore')
parsed_html = BeautifulSoup(html, 'html.parser')
</pre>
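One caveat with the decoding step: get_content_charset() returns None when the response does not declare a charset, and decode() cannot take None. The following standard-library sketch shows one way to guard against that. The helper name decode_body and the UTF-8 fallback are assumptions for illustration, not part of the original page.

```python
def decode_body(raw, declared_charset, fallback="utf-8"):
    """Decode response bytes, falling back when no usable charset is declared.

    decode_body is a hypothetical helper; the fallback encoding is an assumption.
    """
    charset = declared_charset or fallback
    try:
        return raw.decode(charset, errors="ignore")
    except LookupError:
        # The declared charset name is unknown to Python.
        return raw.decode(fallback, errors="ignore")


# Bytes as a Shift_JIS server would send them.
sjis_bytes = "日本語".encode("shift_jis")
print(decode_body(sjis_bytes, "shift_jis"))

# No charset declared: the fallback is used instead of failing.
print(decode_body(b"plain ascii", None))
```

With a real response object, the call would look like decode_body(response.read(), response.headers.get_content_charset()).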

==Objects==

*[[Beautiful Soup]] converts a complex HTML document into a tree of Python objects

*You only need to deal with the following four kinds of objects

===Tag===

*Corresponds to a tag in the XML or [[HTML]] document

tag = soup.b

===Name===

del tag['id']

===Multi-valued attributes===

*In [[HTML]] 4, some attributes can have multiple values

*The best known is class

*[[Beautiful Soup]] treats such values as a list

css_soup = BeautifulSoup('<p class="body strikeout"></p>')

css_soup.p['class']

# ["body", "strikeout"]

*In [[XML]], attributes are not treated as multi-valued

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')

xml_soup.p['class']

# u'body strikeout'
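The list behaviour can be checked with a short sketch. The markup is illustrative, and the standard-library HTML parser is used so that lxml is not required; the XML case is left out here because the 'xml' parser depends on lxml. The id attribute is included as a contrast, since it is not multi-valued and comes back as one string.

```python
from bs4 import BeautifulSoup

# class is a multi-valued attribute in HTML, so it is returned as a list.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
classes = css_soup.p["class"]

# id is not multi-valued, so the value stays a single string.
id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
element_id = id_soup.p["id"]

print(classes)
print(element_id)
```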

===NavigableString===

*[[Beautiful Soup]] uses NavigableString to hold strings

*Almost the same as a [[Python]] Unicode string

*It also supports features for navigating and searching the tree

tag.string

# u'Extremely bold'

type(tag.string)

# <class 'bs4.element.NavigableString'>

*unicode() (str() in Python 3) converts it to a Unicode string

unicode_string = unicode(tag.string)
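On Python 3 the same conversion is done with str(). The sketch below assumes a small illustrative document and shows that the value is a NavigableString that still knows its parent tag, while the converted copy is a plain string.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

text = tag.string                     # a NavigableString
is_navigable = isinstance(text, NavigableString)
parent_name = text.parent.name        # still attached to the tree

plain = str(text)                     # plain Python 3 string, detached from the tree

print(is_navigable, parent_name, type(plain).__name__)
```

Converting to a plain string is useful when the text is kept after the soup itself is no longer needed, because a NavigableString holds a reference to the whole tree.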

==Comments and other special strings==

===Comments===

*A Comment object is a special type of NavigableString

markup = ""

soup = BeautifulSoup(markup)
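Since the markup string above is empty on this page, here is a self-contained sketch with assumed sample markup (a b tag that contains only an HTML comment) showing how a Comment turns up in the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Assumed sample markup: a <b> tag whose only child is an HTML comment.
markup = "<b><!--Hey, buddy. Want to buy a used comb?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string

print(type(comment).__name__)
print(isinstance(comment, NavigableString))
print(comment)
```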

# <c>text2</c>

sibling_soup.c.previous_sibling

# text1

====.next_siblings and .previous_siblings====

for sibling in soup.a.next_siblings:
    print(repr(sibling))

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
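The sibling attributes can be tried on a tiny tree. The markup below is an assumption chosen so that the b and c tags are direct siblings with no whitespace between them.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: <b> and <c> are siblings inside <a>.
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

next_of_b = sibling_soup.b.next_sibling          # the <c> tag
previous_of_c = sibling_soup.c.previous_sibling  # the <b> tag
after_c = sibling_soup.c.next_sibling            # nothing follows <c>

# The plural forms are generators over all following or preceding siblings.
following_names = [sibling.name for sibling in sibling_soup.b.next_siblings]

print(next_of_b, previous_of_c, after_c, following_names)
```

In a real document the neighbouring sibling is often a whitespace string rather than a tag, so it can help to check the type before using it.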

===Going back and forth===

====.next_element and .previous_element====

====.next_elements and .previous_elements====
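Unlike the sibling attributes, these follow the order in which the document was parsed, so the element after a tag is usually its own first child. A minimal sketch with assumed sample markup:

```python
from bs4 import BeautifulSoup

# Assumed sample markup, same shape as a simple sibling example.
soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

# The element parsed right after <b> is the string inside it ...
first = soup.b.next_element
# ... and the element parsed after that string is the <c> tag.
second = first.next_element

# .next_elements walks the rest of the document in parse order.
rest = [repr(element) for element in soup.b.next_elements]

print(repr(first), second.name, len(rest))
```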

===Searching the tree===

====string====

soup.find_all('b')

====[[正規表現|Regular expressions]]====

import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

=====Passing a tag name=====

soup.findAll('b')

=====Using [[正規表現|regular expressions]]=====

import re

tagsStartingWithB = soup.findAll(re.compile('^b'))

soup.find('title')

# <title>The Dormouse's story</title>

=====Searching by [[CSS]] class=====

soup.find("b", { "class" : "lime" })

# Lime
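The searches in this section can be combined into one runnable sketch using the BS4 method names. The sample document is an assumption; class_ is the BS4 keyword for matching a CSS class, used here in place of the attribute dictionary.

```python
import re

from bs4 import BeautifulSoup

# Assumed sample document for illustration.
html = """<html><head><title>The Dormouse's story</title></head>
<body><p><b class="lime">Lime</b></p><blockquote>quote</blockquote></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Search by tag name.
bold_tags = soup.find_all("b")

# Search by regular expression: every tag whose name starts with "b".
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# find() returns only the first match (or None).
title = soup.find("title")

# Search by CSS class.
lime = soup.find("b", class_="lime")

print(bold_tags, b_names, title.string, lime.string)
```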

====find_parents() と find_parent()====

====find_next_siblings() と find_next_sibling()====

*Follows an object's nextSibling member variable and collects the specified Tag or NavigableText objects.

paraText = soup.find(text='This is paragraph ')

paraText.findNextSibling(text = lambda text: len(text) == 1)

# u'.'
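A BS4 version of the same search might look like the sketch below. The sample paragraph is an assumption, and the string keyword is used in place of text, which is the spelling newer BS4 releases prefer.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: a string, then a <b> tag, then a one-character string.
soup = BeautifulSoup("<p>This is paragraph <b>one</b>.</p>", "html.parser")

para_text = soup.find(string="This is paragraph ")

# Walk the following siblings and return the first string of length 1.
one_char = para_text.find_next_sibling(string=lambda s: len(s) == 1)

print(repr(one_char))
```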

====find_previous_siblings() and find_previous_sibling()====

====find_all_next() and find_next()====

====find_all_previous() and find_previous()====

===[[CSS]] selectors===

====タグ====

soup.select("title")

====直下のタグ====

soup.select("head > title")

====[[CSS]] class====

soup.select(".sister")

====ID====
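An ID selector uses the # prefix. The sketch below runs it next to the other selectors from this section on an assumed sample document; the ids, class names and example.com URLs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample document for illustration.
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#link1")           # element with id="link1"
by_tag = soup.select("title")           # all <title> tags
direct = soup.select("head > title")    # <title> directly under <head>
by_class = soup.select(".sister")       # elements with class "sister"

print(by_id[0].string, len(by_tag), len(direct), len(by_class))
```

select() always returns a list, even when the selector can match only one element.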

===Non-pretty printing===

====Using unicode() or str()====

===[[XML]]===

from BeautifulSoup import BeautifulStoneSoup

Get the translation from an ItemId

'''

url = r'http://btonic.est.co.jp/NetDic/NetDicV09.asmx/GetDicItemLite?Dic=EJdict&Item={0}&Loc=&Prof=XHTML'

url = url.format(itemid)

if hasattr(soup, "title") and hasattr(soup.title, "string"):
    print(soup.title.string)

===Extracting [[リンク|links]] from image tags===

soup = BeautifulSoup(urllib2.urlopen(url))

for tag in soup.findAll():
    if tag.name == 'img':
        print(tag['src'])
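With BS4 the same extraction can be written more directly by asking only for img tags. In the sketch below the markup is an inline assumption instead of a downloaded page, and src=True skips any img tag without a src attribute, which would otherwise raise KeyError on tag['src'].

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the second <img> has no src attribute.
html = '<div><img src="a.png"><img alt="no source"><img src="b.jpg"></div>'
soup = BeautifulSoup(html, "html.parser")

# Only <img> tags that actually carry a src attribute are returned.
sources = [img["src"] for img in soup.find_all("img", src=True)]

print(sources)
```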

Piroto


MyMemoWiki
