NLTK is a very practical text-processing toolkit, mainly geared towards working with English text.
Installation
pip install nltk
import nltk
nltk.download()
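Calling nltk.download() with no arguments opens the interactive downloader. In scripts or headless environments it is usually more convenient to download the needed packages by name; a minimal sketch using the resource names that appear later in this post:
import nltk

# Fetch only the resources used below; NLTK skips packages
# that are already up to date.
for pkg in ['punkt', 'stopwords', 'averaged_perceptron_tagger',
            'maxent_ne_chunker', 'words']:
    nltk.download(pkg)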
Usage
Tokenization
import nltk
from nltk.text import Text
from nltk.tokenize import word_tokenize
# needs to be downloaded once
# nltk.download('punkt')
input_str = 'Very warm, and feeling humid, especially in the southeast'
tokens = word_tokenize(input_str)
print(f'tokens: {tokens}')
tokens = [word.lower() for word in tokens]
print(tokens[:2])
tokens: ['Very', 'warm', ',', 'and', 'feeling', 'humid', ',', 'especially', 'in', 'the', 'southeast']
['very', 'warm']
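Besides word-level tokenization, nltk.tokenize also offers sent_tokenize for splitting text into sentences (it relies on the same punkt resource); a minimal sketch with a made-up two-sentence string:
from nltk.tokenize import sent_tokenize

text = 'Very warm today. Rain is expected tomorrow.'
# splits on sentence boundaries learned by the punkt model
print(sent_tokenize(text))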
The Text object
Wrapping the tokens in a Text object makes the follow-up operations more convenient.
t = Text(tokens)
t.count('warm')
t.index('warm')
t.plot(5)
<Axes: xlabel='Samples', ylabel='Counts'>
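Beyond count, index and plot, the Text object has a few other handy helpers: concordance prints every occurrence of a word with its surrounding context, and vocab returns a frequency distribution over the tokens. A minimal sketch using the same t as above:
t.concordance('warm')      # show 'warm' in context
fd = t.vocab()             # FreqDist built from the tokens
print(fd.most_common(3))   # the three most frequent tokens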
Stop words
# nltk.download('stopwords')
from nltk.corpus import stopwords
# what the stop-word corpus is for
stopwords.readme().replace('\n', ' ')
'Stopwords Corpus This corpus contains lists of stop words for several languages. These are high-frequency grammatical words which are usually ignored in text retrieval applications. They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/ The stop words for the Romanian language were obtained from: http://arlc.ro/resources/ The English list has been augmented https://github.com/nltk/nltk_data/issues/22 The German list has been corrected https://github.com/nltk/nltk_data/pull/49 A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52 A Nepali list has been added https://github.com/nltk/nltk_data/pull/83 An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100 A Greek list has been added https://github.com/nltk/nltk_data/pull/103 An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112 '
stopwords.fileids()  # languages included in the corpus
['arabic',
'azerbaijani',
'basque',
'bengali',
'catalan',
'chinese',
'danish',
'dutch',
'english',
'finnish',
'french',
'german',
'greek',
'hebrew',
'hinglish',
'hungarian',
'indonesian',
'italian',
'kazakh',
'nepali',
'norwegian',
'portuguese',
'romanian',
'russian',
'slovene',
'spanish',
'swedish',
'tajik',
'turkish']
stopwords.raw('english').replace('\n', ' ')
"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "
test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)
test_words_set.intersection(set(stopwords.words('english')))
{'and', 'in', 'the', 'very'}
filtered = [w for w in test_words_set if (w not in stopwords.words('english'))]
print(f'filtered: {filtered}')
filtered: ['humid', ',', 'feeling', 'warm', 'southeast', 'especially']
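Note that the comma survives the filtering above because punctuation is not part of the stop-word list; an isalpha check can be combined with the stop-word test to drop it as well. A minimal sketch (building the stop-word set once so it is not recomputed per token):
stop_set = set(stopwords.words('english'))
filtered = [w for w in test_words if w.isalpha() and w not in stop_set]
print(filtered)
# expected: ['warm', 'feeling', 'humid', 'especially', 'southeast']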
Part-of-speech tagging
- POS Tag: meaning (a subset of the Penn Treebank tag set)
- CC: coordinating conjunction
- CD: cardinal number
- DT: determiner
- EX: existential "there"
- FW: foreign word
- IN: preposition or subordinating conjunction
- JJ: adjective
- JJR: adjective, comparative
- JJS: adjective, superlative
- LS: list item marker
- MD: modal verb
- NN: noun, singular
- RB: adverb
- RBR: adverb, comparative
- RBS: adverb, superlative
- RP: particle
- UH: interjection
- VB: verb, base form
- VBD: verb, past tense
- VBG: gerund or present participle
- VBN: verb, past participle
- VBP: verb, non-3rd person singular present
- VBZ: verb, 3rd person singular present
- WDT: wh-determiner
# Download averaged_perceptron_tagger. When nltk.download() is run without arguments it lets you choose which packages to fetch (a GUI window pops up on desktop systems).
nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/jovyan/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
True
from nltk import pos_tag
tags = pos_tag(tokens)
tags
[('very', 'RB'),
('warm', 'JJ'),
(',', ','),
('and', 'CC'),
('feeling', 'VBG'),
('humid', 'NN'),
(',', ','),
('especially', 'RB'),
('in', 'IN'),
('the', 'DT'),
('southeast', 'NN')]
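The meaning of any tag can also be looked up programmatically with nltk.help.upenn_tagset, which needs the tagsets resource (the resource name may differ slightly across NLTK versions); a minimal sketch:
# nltk.download('tagsets')        # needed once
nltk.help.upenn_tagset('VBG')     # prints the definition and examples for VBG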
Chunking
from nltk.chunk import RegexpParser
sentence = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN')]
grammar = 'MY_NP: {<DT>?<JJ>*<NN>}'
cp = RegexpParser(grammar)  # compile the chunk grammar
result = cp.parse(sentence)
print(result)
# result.draw() opens the tree in a window (uses tkinter, so it needs a GUI)
(S (MY_NP the/DT little/JJ dog/NN))
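The same chunker can be reused on the tagged weather sentence from the POS-tagging step, and the matched noun phrases extracted by walking the subtrees labelled MY_NP; a minimal sketch:
weather_tree = cp.parse(tags)   # 'tags' from the POS-tagging example above
for subtree in weather_tree.subtrees(filter=lambda st: st.label() == 'MY_NP'):
    print(subtree.leaves())
# expected chunks: [('humid', 'NN')] and [('the', 'DT'), ('southeast', 'NN')]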
Named entity recognition
nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data] /home/jovyan/nltk_data...
[nltk_data] Error downloading 'maxent_ne_chunker' from
[nltk_data] <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data] pages/packages/chunkers/maxent_ne_chunker.zip>:
[nltk_data] <urlopen error [SSL: UNEXPECTED_EOF_WHILE_READING] EOF
[nltk_data] occurred in violation of protocol (_ssl.c:1002)>
[nltk_data] Downloading package words to /home/jovyan/nltk_data...
[nltk_data] Package words is already up-to-date!
True
from nltk import ne_chunk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
sentence = 'Edison went to Tsinghua University today.'
print(ne_chunk(pos_tag(word_tokenize(sentence))))
---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
Cell In[16], line 6
3 from nltk.tokenize import word_tokenize
5 sentence = 'Edison went to Tsinghua University today.'
----> 6 print(ne_chunk(pos_tag(word_tokenize(sentence))))
File /opt/conda/lib/python3.11/site-packages/nltk/chunk/__init__.py:183, in ne_chunk(tagged_tokens, binary)
181 else:
182 chunker_pickle = _MULTICLASS_NE_CHUNKER
--> 183 chunker = load(chunker_pickle)
184 return chunker.parse(tagged_tokens)
File /opt/conda/lib/python3.11/site-packages/nltk/data.py:750, in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
747 print(f"<<Loading {resource_url}>>")
749 # Load the resource.
--> 750 opened_resource = _open(resource_url)
752 if format == "raw":
753 resource_val = opened_resource.read()
File /opt/conda/lib/python3.11/site-packages/nltk/data.py:876, in _open(resource_url)
873 protocol, path_ = split_resource_url(resource_url)
875 if protocol is None or protocol.lower() == "nltk":
--> 876 return find(path_, path + [""]).open()
877 elif protocol.lower() == "file":
878 # urllib might not use mode='rb', so handle this one ourselves:
879 return find(path_, [""]).open()
File /opt/conda/lib/python3.11/site-packages/nltk/data.py:583, in find(resource_name, paths)
581 sep = "*" * 70
582 resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
--> 583 raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource maxent_ne_chunker not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('maxent_ne_chunker')
For more information see: https://www.nltk.org/data.html
Attempted to load chunkers/maxent_ne_chunker/PY3/english_ace_multiclass.pickle
Searched in:
- '/home/jovyan/nltk_data'
- '/opt/conda/nltk_data'
- '/opt/conda/share/nltk_data'
- '/opt/conda/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
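The LookupError above is just the knock-on effect of the failed maxent_ne_chunker download (the SSL error shown earlier). Once the network issue is resolved and the package downloads successfully, the same call should work; a minimal sketch of the retry:
nltk.download('maxent_ne_chunker')   # retry after fixing the network/SSL issue
nltk.download('words')

tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)   # named entities appear as labelled subtrees, e.g. PERSON / ORGANIZATION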
Data-cleaning example
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
eng_stpwrd = stopwords.words('english')
s = ''  # the raw text to clean goes here
# remove HTML entities, hashtags and @mentions
s = re.sub(r'\&\w*;|#\w*|@\w*', '', s)
# collapse repeated whitespace and strip leading spaces
s = re.sub(r'\s\s+', ' ', s).lstrip(' ')
# tokenize
tokens = word_tokenize(s)
# remove stop words
ss = [i for i in tokens if i not in eng_stpwrd]
print(' '.join(ss))
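Since s is left empty above, here is the same cleaning pipeline wrapped in a small function and run on a made-up, tweet-like sample string (the sample text is purely illustrative):
def clean_text(text):
    # strip HTML entities, hashtags and @mentions
    text = re.sub(r'\&\w*;|#\w*|@\w*', '', text)
    # collapse repeated whitespace and strip leading spaces
    text = re.sub(r'\s\s+', ' ', text).lstrip(' ')
    # tokenize and drop English stop words (tokens are not lowercased here,
    # so capitalized stop words such as 'The' would survive)
    return [w for w in word_tokenize(text) if w not in eng_stpwrd]

sample = '@weatherbot Very warm &amp; humid in the southeast #weather'
print(clean_text(sample))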