Diff

558 bytes added, 13:54, 20 September 2020 (Sun)

==[[自然言語処理|Natural Language Processing]]==

[[Python]] | [[Python ライブラリ|Python libraries]] |

===[[言語|Language]] processing===

====Python morphological analysis libraries====

*[https://taku910.github.io/mecab/ MeCab]

*[https://mocobeta.github.io/janome/ janome]

*[https://github.com/taishi-i/nagisa nagisa]

====Installing NLTK (Natural Language Toolkit)====

*Python NLTK (Natural Language Toolkit)

====Installing Python MeCab (Japanese morphological analysis)====

*Python MeCab (Japanese morphological analysis)

====Installing [[Beautiful Soup]] (HTML parsing)====

*[[Beautiful Soup]] (HTML parsing)

===Graphs===

====Installing NumPy ([[数学|mathematical]] functions)====

*NumPy ([[数学|mathematical]] functions)

====Installing matplotlib (graphing)====

*matplotlib (graphing)

===Corpus===

*A text corpus is a large body of text.

*It is [[デザイン|designed]] to contain a balanced selection of material drawn from one or more genres.

*Install Python NLTK (Natural Language Toolkit), then try the following.

===Searching text===

*Look up the word "monstrous" in text1 (Moby Dick by Herman Melville 1851):

>>> text1.concordance('monstrous')

Building index...

Displaying 11 of 11 matches:

ll over with a heathenish array of monstrous clubs and spears . Some were thick

d as you gazed , and wondered what monstrous cannibal and savage could ever hav

that has survived the flood ; most monstrous and most mountainous ! That Himmal

they might scout at Moby Dick as a monstrous fable , or still worse and more de

th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l

ing Scenes . In connexion with the monstrous pictures of whales , I am strongly

ere to enter upon those still more monstrous stories of them which are to be fo

of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
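The concordance view above can be approximated in a few lines of plain Python. This is a minimal sketch, not NLTK's implementation: it assumes Python 3, an already tokenized text, and a context measured in tokens rather than characters. The function name `concordance`, the `width` parameter, and the sample sentence are made up for illustration.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]      # tokens before the match
            right = tokens[i + 1:i + 1 + width]     # tokens after the match
            lines.append(" ".join(left + [tok] + right))
    return lines

# Hypothetical sample text, not taken from the NLTK corpus.
sample = "a monstrous size and a most monstrous fable of the sea".split()
for line in concordance(sample, "monstrous"):
    print(line)
```

Matching is case-insensitive here, which is why "Monstrous" and "monstrous" would both be found.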

==Simple [[統計|statistical]] processing==

===Displaying with a dispersion plot===

*Shows where, and how many times, a given word appears between the beginning and the end of a text.

*The following libraries are required:

**[[Python NumPy]]

**[[Python matplotlib]]

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

[[File:0058_nltk_dispersion_plot01.png]]
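The data behind such a plot is simply the list of token positions at which each word occurs. The sketch below computes those positions in plain Python 3 without drawing anything. It is an illustration under that assumption, not NLTK's code, and `word_offsets` and the sample sentence are hypothetical.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    wanted = set(targets)
    offsets = {w: [] for w in targets}
    for position, tok in enumerate(tokens):
        if tok in wanted:
            offsets[tok].append(position)
    return offsets

# Hypothetical sample text standing in for a real corpus.
sample = "freedom and duties of citizens secure freedom for citizens".split()
offsets = word_offsets(sample, ["citizens", "freedom"])
print(offsets)
```

Plotting one row per word with a mark at each position would give a picture of the same kind as the one above.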

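=====Configuring the graph window=====

<blockquote>On [[CentOS]] the graph window could not be displayed with the steps above, so matplotlib.use("WebAgg") is set to display the graph in a browser instead.</blockquote>

*[[Python matplotlib]]

*This only needs to be done once, when the module is loaded.

*On [[Windows]], the graph was displayed with the default TkAgg backend.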

import matplotlib

matplotlib.use("WebAgg")

===Frequency distributions===

*Frequency distributions are needed often in [[言語|language]] processing.

*NLTK supports them out of the box.

====Example: extracting the 50 most frequent words from Moby Dick with FreqDist====

>>> fdist = FreqDist(text1)

>>> fdist

<FreqDist with 19317 samples and 260819 outcomes>

>>> vocabulary = fdist.keys()

>>> vocabulary[:50]

[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
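The same kind of count can be sketched with the standard library's `collections.Counter`, which needs no corpus download. This assumes Python 3 and a made-up token list. Here "samples" corresponds to the number of distinct tokens and "outcomes" to the total number of tokens.

```python
from collections import Counter

# Hypothetical token list standing in for text1.
sample = "the whale , the sea and the whale".split()

fdist = Counter(sample)
n_samples = len(fdist)              # distinct tokens
n_outcomes = sum(fdist.values())    # total tokens
top = fdist.most_common(2)          # most frequent tokens first

print(n_samples, n_outcomes)
print(top)
```

In recent NLTK releases `FreqDist` is itself a `Counter` subclass, so `most_common(50)` is the usual way to get the top 50 there, rather than slicing `keys()`.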

====Trying it out: counting word frequencies on the NLTK home page====

print freq.keys()[:50]

=====Result=====

<FreqDist: '': 547, 'NLTK': 23, '>>>': 11, 'and': 10, 'with': 8, '.': 7, 'a': 7, 'for': 7, '=': 6, 'Python': 5, ...>

[', '\xc2\xb6', ''JJ'),', ')', 'tagged', 'tokens', ''CD'),', ''NNP'),', ''RB'),', '('Thursday',', '('eight',', '('morning',', '('on',', '("o'clock",', '0', '3', '3.0', 'API', 'Contents', 'Data', 'Development', 'HOWTO', 'Index', 'NLTK,', 'News', 'Processing', 'Search' '', 'NLTK', '>>>', 'and', 'with', '.', 'a', 'for', '=', 'Python', 'is', 'the', 'to', ''IN'),', '–', '(', 'Installing', 'Language', 'Natural', 'list', 'mailing', 'nltk', 'of', ']

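<blockquote>The example above used BeautifulSoup, but nltk.clean_html() can also strip the tags from an [[HTML]] document. This applies to NLTK 2.x; in NLTK 3 the function is no longer implemented and directs users to BeautifulSoup's get_text() instead.</blockquote>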

html = urllib2.urlopen("http://nltk.org/").read()

untagged = nltk.clean_html(html)

|Generates a cumulative frequency plot

|-

|fdist1 < fdist2

|Tests whether the samples in fdist1 occur less frequently than in fdist2

|-

*To find the words that have a property P, we can describe them as "the set of all w that belong to V (the vocabulary) and have property P". [http://typea.info/material/a_yan_rdb_introduction.pdf [Reference]]

{ w | w ∈ V & P(w) }

Expressed as a [[Python]] list comprehension, this becomes

[ w for w in V if p(w) ]

l.extend(re.split('[ \n\t]',unicode(t)))

freq = FreqDist(l)

print sorted([w for w in set(l) if len(w) >= 5 and freq[w] >= 3])

=====Result=====

[u''IN'),', u''JJ'),', u'–', u'>>>', u'Installing', u'Language', u'Natural', u'Python', u'mailing', u'tagged', u'tokens']
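The set-builder pattern above can be tried without fetching a web page. The following Python 3 sketch applies the same shape of filter, `{ w | w ∈ V & P(w) }`, to a small made-up word list. The thresholds (length at least 7, frequency at least 2) are chosen to suit the hypothetical sample and are not those used on the NLTK home page.

```python
from collections import Counter

# Hypothetical word list standing in for the page's tokens.
words = ("natural language toolkit natural language "
         "natural python python python").split()
freq = Counter(words)

# V is set(words); P(w) is "long enough and frequent enough".
selected = sorted(w for w in set(words) if len(w) >= 7 and freq[w] >= 2)
print(selected)
```

"toolkit" is long enough but occurs only once, and "python" is frequent but too short, so neither passes the filter.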

**For example, replacing "red" in "red wine" with "brown" sounds very unnatural.

*A pair of adjacent words is called a bigram, and bigrams can be extracted easily with the bigrams() function.

>>> from nltk import *

>>> bigrams(['this','is','a','pen'])

[('this', 'is'), ('is', 'a'), ('a', 'pen')]

*Each pair shown here as a tuple is a bigram; a collocation is essentially a bigram that occurs frequently.

trigram_measures = nltk.collocations.TrigramAssocMeasures()

finder = BigramCollocationFinder.from_words(tokens)

print finder.nbest(bigram_measures.pmi, 10) # NORMALIZE WHITESPACE

=====Result=====

[(u'Applications', u'Success'), (u'CORBA', u'objects'), (u'CPython', u'uses'), (u'Communications', u'Engine'), (u'Core', u'Development'), (u'Documentation', u'Download'), (u'Download', u'\u4e0b\u8f7d'), (u'English', u'Resources'), (u'Getting', u'Started'), (u'Insider', u'Blog')]
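To make the ranking less of a black box, here is a plain Python 3 sketch of bigram extraction and a pointwise mutual information (PMI) score. It is an illustration only: the helper names and the sample are hypothetical, and the probability estimates are deliberately simple, so the numbers may differ from what NLTK's `BigramAssocMeasures.pmi` returns.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

def pmi_scores(tokens):
    """Score each bigram (x, y) by log2( P(x, y) / (P(x) * P(y)) )."""
    unigram = Counter(tokens)
    pairs = bigrams(tokens)
    pair_counts = Counter(pairs)
    n_tokens, n_pairs = len(tokens), len(pairs)
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs              # relative frequency of the pair
        p_x = unigram[x] / n_tokens         # relative frequency of each word
        p_y = unigram[y] / n_tokens
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical sample: "red wine" co-occurs more than chance would predict.
sample = "red wine red wine the the the the".split()
scores = pmi_scores(sample)
best = max(scores, key=scores.get)
print(best)
```

PMI rewards pairs whose words appear together more often than their individual frequencies would suggest, which is why it tends to surface rare but tightly bound pairs like those in the result above.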

==WordNet==

*An [[英語|English]] dictionary organized by meaning

===Synonyms===

====Example: looking up synonyms of 'motorcar'====

motorcar

===Hierarchical structure===

*WordNet synsets (synonym sets) correspond to abstract concepts and are arranged in a hierarchy.

*To look up hyponyms (more specific terms), use synset.hyponyms().

*To look up hypernyms (more general terms), use synset.hypernyms().

*** Synset('tree.n.01') ***

1.definition:

a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms

2.examples:

[]

Synset('forest.n.01')

:(omitted)
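The hypernym/hyponym relationship can be pictured with a toy taxonomy in plain Python 3. This sketch does not use WordNet at all: the `HYPERNYM` table and both helper functions are hypothetical, and unlike real synsets each entry here has exactly one parent.

```python
# Hypothetical toy taxonomy: each term maps to its direct hypernym (parent).
HYPERNYM = {
    "oak": "tree",
    "pine": "tree",
    "tree": "woody plant",
    "woody plant": "plant",
}

def hypernym_chain(word):
    """Return the increasingly general terms above `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def direct_hyponyms(word):
    """Return the more specific terms directly below `word`, sorted."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)

print(hypernym_chain("oak"))
print(direct_hyponyms("tree"))
```

Walking upward yields ever more abstract concepts, and walking downward yields more specific ones, which is the structure that synset.hypernyms() and synset.hyponyms() expose one level at a time.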

==Text [[正規化|normalization]]==

===Stemmers===

*The process of stripping all affixes from a word is called stemming.

===Lemmatization===

*Converting a word form into the form listed in a dictionary is called "lemmatization".

*The WordNet lemmatizer (a lemmatization tool) removes an affix only when the resulting word is in its dictionary.

*The WordNet lemmatizer is a good choice when you want to collect the vocabulary of a text, or need a list of valid lemmas.

====Example: using the Porter stemmer, the Lancaster stemmer, and the WordNet lemmatizer====

import nltk

import urllib2

porter = nltk.PorterStemmer()

lancaster = nltk.LancasterStemmer()

lemmatizer = nltk.WordNetLemmatizer()

print "****************** porter *********************"

print [t + "->" + porter.stem(t) for t in tokens if t != porter.stem(t)]

print "****************** lancaster *********************"

print [t + "->" + lancaster.stem(t) for t in tokens if t != lancaster.stem(t)]

print "****************** lemmatizer *********************"

print [t + "->" + lemmatizer.lemmatize(t) for t in tokens if t != lemmatizer.lemmatize(t)]

=====Example output=====

****************** porter *********************

['semantic->semant', 'programming->program', 'sentence->sentenc', 'resources->resourc', 'tagged->tag', 'using->use', 'linguistics->linguist', 'terms->term', 'classification->classif', 'simple->simpl', 'writing->write', 'only->onli', 'has->ha'...(omitted)]

****************** lancaster *********************

['all->al', 'semantic->sem', 'programming->program', 'sentence->sent', 'over->ov', 'resources->resourc', 'tagged->tag', 'JJ->jj', 'using->us', 'Bird->bird', 'linguistics->lingu', 'terms->term', 'classification->class', 'Klein->klein'...(omitted)]

****************** lemmatizer *********************

['resources->resource', 'terms->term', 'has->ha', 'modules->module', 'updates->update', 'linguists->linguist', 'educators->educator', 'users->user', 'entities->entity', 'guides->guide', 'libraries->library', 'students->student', 'interfaces->interface'...(omitted)]
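The difference between the two approaches can be illustrated with a deliberately crude Python 3 sketch. It is neither the Porter algorithm nor the WordNet lemmatizer: the suffix list, the word list, and both function names are hypothetical. It only shows the principle that a stemmer strips suffixes unconditionally while a lemmatizer strips them only if the result is a known word.

```python
SUFFIXES = ("ing", "es", "s")                 # hypothetical, far smaller than Porter's rules
DICTIONARY = {"resource", "term", "user"}     # hypothetical list of valid lemmas

def crude_stem(word):
    """Strip the first matching suffix, whether or not the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

def crude_lemmatize(word):
    """Strip a suffix only when the result is in DICTIONARY; otherwise keep the word."""
    if word in DICTIONARY:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in DICTIONARY:
            return word[:-len(suffix)]
    return word

for w in ["resources", "terms", "using"]:
    print(w, crude_stem(w), crude_lemmatize(w))
```

With this toy data the stemmer turns "resources" into the non-word "resourc", while the dictionary check lets the lemmatizer settle on "resource" and leave "using" untouched.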

Piroto
