Lesson 4.7: Fine-Tuning and Evaluation in Practice, a DIY Fine-Tuning Workflow -- Large Language Model Application Development Course

# Fine-tuning walkthrough code, provided for learners to study and run.
import os
import pandas as pd
import transformers as tr
from datasets import load_dataset
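# Note: `DA.paths.datasets`, `DA.paths.working_dir`, `local_training_root`, and `display()`
# used below are assumed to be provided by the course (Databricks) environment setup cells,
# which are not shown in this snippet.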
### Step 1 - Prepare the data
# The first step in fine-tuning is to pick a specific task and a dataset that supports it. In this notebook, the task is classifying movie reviews: each review is provided as plain text, and we want to decide whether it is positive or negative.
# The IMDB dataset (https://huggingface.co/datasets/imdb) works well for this task. It conveniently provides train and test splits with binary sentiment labels, plus an unlabeled split.
imdb_ds = load_dataset("imdb")
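# Optional sanity check (a minimal sketch, not part of the original flow): inspect the
# loaded splits and peek at one raw training example.
print(imdb_ds)  # expect train / test / unsupervised splits
print(imdb_ds["train"][0]["label"], imdb_ds["train"][0]["text"][:200])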
### Step 2 - Choose a pretrained model
# Here we use the t5-small model, roughly 60M parameters (https://huggingface.co/docs/transformers/model_doc/t5) (paper: https://arxiv.org/pdf/1910.10683.pdf)
model_checkpoint = "t5-small"
# Hugging Face provides the Auto* classes (https://huggingface.co/docs/transformers/model_doc/auto) to conveniently instantiate the components associated with a pretrained model. Here we use AutoTokenizer (https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) to load the tokenizer associated with the t5-small model.
# load the tokenizer that was used for the t5-small model
tokenizer = tr.AutoTokenizer.from_pretrained(
    model_checkpoint, cache_dir=DA.paths.datasets
)  # Use a pre-cached model
# As noted above, IMDB is a binary sentiment dataset, so its labels are encoded as integer values (-1 - unknown; 0 - negative; 1 - positive). To use this dataset with a text-to-text model such as T5, the label set must be represented as strings. There are several ways to do this; here we simply map each label id to its corresponding string value.
def to_tokens(
    tokenizer: tr.models.t5.tokenization_t5_fast.T5TokenizerFast, label_map: dict
) -> callable:
    """
    Given a `tokenizer`, this closure returns `apply()`, which tokenizes a batch `x`.
    It is mapped over a dataset and produces input ids, an attention mask, and labels.
    """

    def apply(x) -> tr.tokenization_utils_base.BatchEncoding:
        """From a formatted dataset `x` a batch encoding `token_res` is created."""
        target_labels = [label_map[y] for y in x["label"]]
        token_res = tokenizer(
            x["text"],
            text_target=target_labels,
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        return token_res

    return apply
imdb_label_lookup = {0: "negative", 1: "positive", -1: "unknown"}
imdb_to_tokens = to_tokens(tokenizer, imdb_label_lookup)
tokenized_dataset = imdb_ds.map(
    imdb_to_tokens, batched=True, remove_columns=["text", "label"]
)
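# Optional check (a small sketch for illustration): after mapping, each split should expose
# the `input_ids`, `attention_mask`, and `labels` columns produced by the tokenizer.
print(tokenized_dataset["train"].column_names)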
### Step 3 - Training setup
# The training process is highly configurable. The TrainingArguments class (https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) exposes the configurable aspects of the process and allows them to be customized as needed. Here we set up a run that trains for a single epoch with a batch size of 16, using adamw_torch as the optimizer.
checkpoint_name = "test-trainer"
local_checkpoint_path = os.path.join(local_training_root, checkpoint_name)
training_args = tr.TrainingArguments(
    local_checkpoint_path,
    num_train_epochs=1,  # default number of epochs to train is 3
    per_device_train_batch_size=16,
    optim="adamw_torch",
    report_to=["tensorboard"],
)
# `t5-small` can be loaded with the AutoModelForSeq2SeqLM class (https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSeq2SeqLM).
# Load the pretrained model
model = tr.AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint, cache_dir=DA.paths.datasets
)  # Use a pre-cached model
# A data collator helps the Trainer batch the data.
data_collator = tr.DataCollatorWithPadding(tokenizer=tokenizer)
trainer = tr.Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
### Step 4 - Train
trainer.train()
# Save the trained model
trainer.save_model()
trainer.save_state()
# Persist the fine-tuned model to DBFS.
final_model_path = f"{DA.paths.working_dir}/llm04_fine_tuning/{checkpoint_name}"
trainer.save_model(output_dir=final_model_path)
### Step 5 - Prediction / Inference
fine_tuned_model = tr.AutoModelForSeq2SeqLM.from_pretrained(final_model_path)
reviews = [
    """
'Despicable Me' is a cute and funny movie, but the plot is predictable and the characters are not very well-developed. Overall, it's a good movie for kids, but adults might find it a bit boring.""",
    """ 'The Batman' is a dark and gritty take on the Caped Crusader, starring Robert Pattinson as Bruce Wayne. The film is a well-made crime thriller with strong performances and visuals, but it may be too slow-paced and violent for some viewers.
""",
    """
The Phantom Menace is a visually stunning film with some great action sequences, but the plot is slow-paced and the dialogue is often wooden. It is a mixed bag that will appeal to some fans of the Star Wars franchise, but may disappoint others.
""",
    """
I'm not sure if The Matrix and the two sequels were meant to have a tight consistency but I don't think they quite fit together. They seem to have a reasonably solid arc but the features from the first aren't in the second and third as much, instead the second and third focus more on CGI battles and more visuals. I like them but for different reasons, so if I'm supposed to rate the trilogy I'm not sure what to say.
""",
]
inputs = tokenizer(reviews, return_tensors="pt", truncation=True, padding=True)
pred = fine_tuned_model.generate(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
pdf = pd.DataFrame(
    zip(reviews, tokenizer.batch_decode(pred, skip_special_tokens=True)),
    columns=["review", "classification"],
)
display(pdf)
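# The lesson title also covers evaluation. Below is a minimal accuracy sketch on a small
# slice of the test split; the slice size (100) and the plain string comparison of generated
# labels against the mapped ground-truth labels are illustrative assumptions, not part of
# the original notebook.
eval_texts = imdb_ds["test"]["text"][:100]
eval_labels = [imdb_label_lookup[y] for y in imdb_ds["test"]["label"][:100]]
eval_inputs = tokenizer(eval_texts, return_tensors="pt", truncation=True, padding=True)
eval_pred = fine_tuned_model.generate(
    input_ids=eval_inputs["input_ids"], attention_mask=eval_inputs["attention_mask"]
)
eval_pred_text = tokenizer.batch_decode(eval_pred, skip_special_tokens=True)
accuracy = sum(p.strip() == t for p, t in zip(eval_pred_text, eval_labels)) / len(eval_labels)
print(f"Accuracy on {len(eval_labels)} test reviews: {accuracy:.2%}")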
# As model architectures evolve and grow, they keep pushing the limits of the available compute. For example, some large LLMs with tens of billions of parameters are too large to fit into GPU memory in some settings. Models of that scale therefore require distributed processing, high-end hardware, or sometimes both, to support training. This makes training large models expensive, so speeding up the training process is highly desirable.
# As mentioned above, Microsoft's DeepSpeed framework (https://github.com/microsoft/DeepSpeed) can be used to accelerate model training. It offers advances in compression, distributed training, mixed precision, gradient accumulation, and checkpointing.
# Note that DeepSpeed is designed to accelerate large models that do not fit into device memory. The t5-base model used below is not a large model, so DeepSpeed is not expected to provide much benefit here.
### DeepSpeed configuration
# Environment setup
# DeepSpeed is intended to be used in a distributed computing environment, so each node's environment is assigned a rank and a local_rank corresponding to the size of the distributed setup. Here we set world_size to 1 and both ranks to 0.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994" # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
### Configuration
# DeepSpeed has many configuration options for optimizing training and inference. The Hugging Face TrainingArguments accepts this configuration either from a JSON file or from a dictionary; here we define a dictionary.
# [configuration options](https://www.deepspeed.ai/docs/config-json/)
# [ZeRO optimization](https://www.deepspeed.ai/training/#memory-efficiency)
zero_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "allgather_partitions": True,
        "allgather_bucket_size": 5e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": True,
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto",
            "torch_adam": True,
        },
    },
    "train_batch_size": "auto",
}
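# Alternative (a sketch, not used below): the same configuration could be saved to a JSON
# file and its path passed as the `deepspeed` argument instead of the dictionary, e.g.
#   import json
#   with open("zero_config.json", "w") as f:  # hypothetical file name
#       json.dump(zero_config, f)
#   tr.TrainingArguments(..., deepspeed="zero_config.json")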
model_checkpoint = "t5-base"
tokenizer = tr.AutoTokenizer.from_pretrained(
    model_checkpoint, cache_dir=DA.paths.datasets
)
imdb_to_tokens = to_tokens(tokenizer, imdb_label_lookup)
tokenized_dataset = imdb_ds.map(
    imdb_to_tokens, batched=True, remove_columns=["text", "label"]
)
model = tr.AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint, cache_dir=DA.paths.datasets
)
### Training
# Only two things change in the training setup. The first is a new checkpoint name. The second is adding the deepspeed configuration to the TrainingArguments.
checkpoint_name = "test-trainer-deepspeed"
checkpoint_location = os.path.join(local_training_root, checkpoint_name)
training_args = tr.TrainingArguments(
    checkpoint_location,
    num_train_epochs=3,  # default number of epochs to train is 3
    per_device_train_batch_size=8,
    deepspeed=zero_config,  # add the deepspeed configuration
    report_to=["tensorboard"],
)
data_collator = tr.DataCollatorWithPadding(tokenizer=tokenizer)
trainer = tr.Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model()
trainer.save_state()
# Persist the fine-tuned model to DBFS
final_model_path = f"{DA.paths.working_dir}/llm04_fine_tuning/{checkpoint_name}"
trainer.save_model(output_dir=final_model_path)
### Prediction / Inference
fine_tuned_model = tr.AutoModelForSeq2SeqLM.from_pretrained(final_model_path)
review = [
    """
I'm not sure if The Matrix and the two sequels were meant to have a tight consistency but I don't think they quite fit together. They seem to have a reasonably solid arc but the features from the first aren't in the second and third as much, instead the second and third focus more on CGI battles and more visuals. I like them but for different reasons, so if I'm supposed to rate the trilogy I'm not sure what to say."""
]
inputs = tokenizer(review, return_tensors="pt", truncation=True, padding=True)
pred = fine_tuned_model.generate(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
pdf = pd.DataFrame(
    zip(review, tokenizer.batch_decode(pred, skip_special_tokens=True)),
    columns=["review", "classification"],
)
display(pdf)