transformers is a state-of-the-art machine learning library for PyTorch, TensorFlow, and JAX
Introduction
Text classification (sentiment-analysis) example
- Install the transformers and datasets dependencies
pip install transformers datasets
pip install torch
On the first run, the script below downloads the model and tokenizer files from the Hugging Face Hub by default. Cache directory: ~/.cache/huggingface/hub/
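The cache location can usually be redirected with environment variables. The sketch below resolves the directory using only the standard library. It assumes the precedence HF_HUB_CACHE, then HF_HOME/hub, then XDG_CACHE_HOME/huggingface/hub, then ~/.cache/huggingface/hub, which is how recent huggingface_hub releases appear to behave; older versions may use other variable names. The helper name `resolve_hub_cache` is hypothetical and not part of any library.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Return the directory where hub downloads are expected to be cached.

    Hypothetical helper. Assumed precedence:
    HF_HUB_CACHE > HF_HOME/hub > XDG_CACHE_HOME/huggingface/hub
    > ~/.cache/huggingface/hub
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # Fall back to the XDG cache root, or ~/.cache when it is not set.
    base = Path(env.get("XDG_CACHE_HOME") or Path.home() / ".cache").expanduser()
    return base / "huggingface" / "hub"


if __name__ == "__main__":
    print(resolve_hub_cache())
```

Passing a plain dict instead of the real environment makes it easy to check how a given setting would change the path before exporting it.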
#!/usr/bin/env python
import warnings
warnings.filterwarnings("ignore")
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
results = classifier(raw_inputs)
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
from transformers import AutoTokenizer
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')
print(inputs)
Output of the first run:
$ python test_sentiment_analysis.py
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Downloading (…)lve/main/config.json: 100%|...| 629/629 [00:00<00:00, 158kB/s]
Downloading pytorch_model.bin: 100%|...| 268M/268M [01:43<00:00, 2.58MB/s]
Downloading (…)okenizer_config.json: 100%|...| 48.0/48.0 [00:00<00:00, 22.5kB/s]
Downloading (…)solve/main/vocab.txt: 100%|...| 232k/232k [00:00<00:00, 1.43MB/s]
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
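The warning in the log recommends naming the model and revision explicitly. A minimal sketch of one way to do that, reusing the checkpoint name and revision printed in the log above. The function name `build_classifier` is hypothetical, and the import is deferred so that nothing is loaded or downloaded until the function is called.

```python
# Checkpoint name and revision copied from the log output above.
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_classifier():
    """Create a sentiment-analysis pipeline pinned to a fixed model and revision."""
    # Lazy import: transformers is only loaded (and weights only fetched)
    # when this function is actually called.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)
```

Calling `build_classifier()` returns an object that is used the same way as `classifier` in the script, for example `build_classifier()(raw_inputs)`.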
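Notes
pipeline wraps three steps:
- Tokenizer preprocessing
  - Tokenization: split the text into words or subwords and add special tokens (start, end, separator, and so on); each resulting piece is called a token
  - Map each token to an id (every vocabulary entry has a unique id, which the model later looks up to get its embedding vector)
  - Generate auxiliary inputs such as attention_mask
- Pass the inputs through the model
- Post-processing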
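The three steps above can be sketched in plain Python. This is a toy illustration, not the library's implementation: the vocabulary, the whitespace tokenizer, and the word-counting `toy_model` are all hypothetical stand-ins. Only the ids for [PAD], [UNK], [CLS] and [SEP] follow the BERT uncased convention. The post-processing step assumes the scores come from a softmax over the model's logits.

```python
import math

# Toy vocabulary (hypothetical). Special-token ids follow the BERT uncased
# convention; all other entries are made up for illustration.
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "i": 1, "hate": 2, "love": 3, "this": 4}


def tokenize(text):
    """Step 1a: split into tokens and add start/end special tokens."""
    return ["[CLS]"] + text.lower().split() + ["[SEP]"]


def encode(texts):
    """Step 1b/1c: map tokens to ids, pad to equal length, build attention_mask."""
    ids = [[VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokenize(t)] for t in texts]
    width = max(len(row) for row in ids)
    input_ids = [row + [VOCAB["[PAD]"]] * (width - len(row)) for row in ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def toy_model(batch):
    """Step 2: stand-in for the real network; returns [negative, positive] logits."""
    logits = []
    for row, mask in zip(batch["input_ids"], batch["attention_mask"]):
        real = [i for i, m in zip(row, mask) if m == 1]  # skip padding positions
        neg = float(real.count(VOCAB["hate"]))
        pos = float(real.count(VOCAB["love"]))
        logits.append([neg, pos])
    return logits


def postprocess(logits, labels=("NEGATIVE", "POSITIVE")):
    """Step 3: softmax over the logits, then report the best label and its score."""
    results = []
    for row in logits:
        exps = [math.exp(x - max(row)) for x in row]
        probs = [e / sum(exps) for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": labels[best], "score": probs[best]})
    return results


batch = encode(["I love this", "I hate this so much"])
results = postprocess(toy_model(batch))
print(batch)
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```

The shorter sentence is padded with id 0 and gets zeros in its attention_mask, which is the same shape of output that `tokenizer(raw_inputs, padding=True, ...)` prints in the script.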
Reference figure