
GATO Survey Report

2023-06-27 20:50 · Author: Chaton丫

GATO

Inspired by progress in large-scale language modeling, the authors apply a similar approach to building a single generalist model beyond the realm of text outputs. The agent, referred to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on context whether to output text, joint torques, button presses, or other tokens. The report describes the model and the data, and documents Gato's current capabilities.


At its core, Gato is a model that predicts the next token from the previous tokens.

Prediction Problem

  • GATO does not predict observations, only the tokens of the next action

Similar to Decision Transformers (but with no rewards in the sequence)

It simply imitates the behaviour of experts.

  • Instead of one-hot task identifiers, prompt conditioning is used

  • Similar to the T5 architecture: <prompt> + <sequence>

  • Prompt: samples from a demonstration episode, with 50% taken from the end of the episode and 50% sampled uniformly (see the sketch below)

goal-directed learning without rewards??
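A minimal sketch of how such prompt conditioning could be assembled. The function name, the `prompt_len` value, and the representation of episodes as flat lists of token ids are all hypothetical; Gato's actual data pipeline is not public.

```python
import random

def build_prompt_conditioned_sequence(episodes, current_tokens, prompt_len=256, max_len=1024):
    """Prepend a prompt drawn from a demonstration episode of the same task.

    Roughly half of the prompts are taken from the end of an episode
    (goal-like conditioning), the rest from a uniformly sampled position.
    """
    demo = random.choice(episodes)              # tokenized demonstration episode
    if random.random() < 0.5:
        prompt = demo[-prompt_len:]             # 50%: tokens from the end of the episode
    else:
        start = random.randrange(max(1, len(demo) - prompt_len))
        prompt = demo[start:start + prompt_len]  # 50%: uniformly sampled window
    # T5-style concatenation: <prompt> + <sequence>, truncated to the context size
    return (prompt + current_tokens)[:max_len]
```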



Details

The training objective is analogous to a cross-entropy loss for supervised learning.

Minimize the masked log-likelihood loss L(θ, B):

$$
\mathcal{L}(\theta, \mathcal{B}) = -\sum_{b=1}^{|\mathcal{B}|} \sum_{l=1}^{L} m(b, l)\, \log p_\theta\!\left(s_l^{(b)} \mid s_1^{(b)}, \ldots, s_{l-1}^{(b)}\right)
$$

where the mask m(b, l) is 1 if the l-th token of sequence b is a text or action token and 0 otherwise, so the loss is only applied to the tokens the model is asked to predict.
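A rough PyTorch sketch of this masked next-token loss. The tensor shapes and the averaging over predicted tokens are assumptions (the paper sums over the batch); Gato's implementation is not public.

```python
import torch
import torch.nn.functional as F

def gato_loss(logits, targets, loss_mask):
    """Masked next-token cross-entropy, a sketch of L(theta, B).

    logits:    (batch, seq_len, vocab) - model predictions for each position
    targets:   (batch, seq_len)        - ground-truth token ids (already shifted by one)
    loss_mask: (batch, seq_len)        - 1 for text/action tokens, 0 for
                                         observation tokens (not predicted)
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch*seq_len, vocab)
        targets.reshape(-1),                   # (batch*seq_len,)
        reduction="none",
    ).reshape_as(targets)                      # back to (batch, seq_len)
    # average over the predicted (masked-in) tokens only
    return (ce * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```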

Tokenization


| Data Type | Method | Ordering | Range |
| --- | --- | --- | --- |
| Text | SentencePiece with 32000 subwords | Text order | [0, 32000] |
| Images | Split into non-overlapping 16×16 patches and use ViT | Raster order | [-1, 1] for each pixel, divided by the square root of the patch size (i.e. 4) |
| Discrete values (e.g. Atari actions) | Flattened into sequences of integers | Row-major order | [0, 1024] |
| Continuous values (e.g. proprioceptive inputs) | Flattened into sequences of floating-point values | Row-major order | Mu-law encoded to [-1, 1], discretized to 1024 uniform bins, then shifted to [32000, 33024] |
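As an illustration of the Images row, a small NumPy sketch of splitting an image into non-overlapping 16×16 patches in raster order and applying the normalisation described above. The uint8 input format and the function name are assumptions.

```python
import numpy as np

def image_to_patch_tokens(img, patch_size=16):
    """Split an image into non-overlapping patches in raster order (a sketch).

    img: (H, W, C) uint8 array; H and W are assumed divisible by patch_size.
    Pixels are scaled to [-1, 1] and divided by sqrt(patch_size), following
    the normalisation described in the tokenization table.
    """
    h, w, c = img.shape
    x = img.astype(np.float32) / 127.5 - 1.0          # scale to [-1, 1]
    x = x / np.sqrt(patch_size)                       # divide by sqrt(16) = 4
    patches = (
        x.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
         .transpose(0, 2, 1, 3, 4)                    # (rows, cols, ph, pw, c)
         .reshape(-1, patch_size, patch_size, c)      # raster order: row by row
    )
    return patches
```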

• Episodes are presented to the agent in order of time (timesteps).

• Timesteps in turn are presented in the following order:

– Observations ([y1:k, x1:m, z1:n]) are ordered lexicographically by key; each item is sequenced as follows:

∗ Text tokens (y1:k) are in the same order as the raw input text.

∗ Image patch tokens (x1:m) are in raster order.

∗ Tensors (z1:n) (such as discrete and continuous observations) are in row-major order.

– Separator (′|′); a designated separator token is provided after observations.

– Actions (a1:A) are tokenized as discrete or continuous values and in row-major order.

A full sequence of tokens is thus given as the concatenation of data from T timesteps:

$$
s_{1:L} = \Big[\, \big[y^1_{1:k},\, x^1_{1:m},\, z^1_{1:n},\, \text{'|'},\, a^1_{1:A}\big],\ \ldots,\ \big[y^T_{1:k},\, x^T_{1:m},\, z^T_{1:n},\, \text{'|'},\, a^T_{1:A}\big] \,\Big]
$$

where L = T (k + m + n + 1 + A) is the total number of tokens.
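A toy sketch of this per-timestep concatenation, treating every modality as a list of integer token ids for simplicity (in Gato the image patches are embedded directly rather than mapped to ids, and the separator id used here is hypothetical):

```python
SEPARATOR = 33024   # hypothetical id for the '|' separator token

def timestep_tokens(text_tokens, image_tokens, tensor_tokens, action_tokens):
    """Concatenate one timestep in the order [y_1:k, x_1:m, z_1:n, '|', a_1:A]."""
    return text_tokens + image_tokens + tensor_tokens + [SEPARATOR] + action_tokens

def episode_tokens(timesteps):
    """Full sequence: concatenation over T timesteps, L = T*(k + m + n + 1 + A) tokens."""
    seq = []
    for ts in timesteps:   # each ts is a dict of per-modality token lists
        seq += timestep_tokens(ts["text"], ts["image"], ts["tensor"], ts["action"])
    return seq
```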

Embedding Inputs

  • A parameterized embedding function f(·; θe) is applied to each token

  • Text, discrete- or continuous-valued observations, and actions are embedded via a lookup table into a learned vector embedding space

  • Image patches are embedded using a ResNet block to obtain one vector per patch

  • A learnable position-encoding vector is added to each embedding
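A minimal PyTorch sketch of the lookup-table embedding plus learnable position encoding for non-image tokens. The class name and d_model are hypothetical; image patches would go through the ResNet path instead.

```python
import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Sketch of the embedding function f(.; theta_e) for non-image tokens."""

    def __init__(self, vocab_size=33024, d_model=512, max_len=1024):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # lookup table
        self.pos_emb = nn.Embedding(max_len, d_model)         # learnable positions

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids for text / discrete / continuous tokens
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token_emb(token_ids) + self.pos_emb(positions)
```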



  • Image Embedding

Similar to ViT

  • Tokenization + Embedding Pipeline (Image + Discrete actions)

  • Tokenization + Embedding Pipeline (Proprioception + Continuous actions)

  • Mu-law Encoding

Non-uniform quantization is used to address the shortcomings of uniform quantization. The basic idea is to compress large signals while amplifying small signals; because small signals receive more amplification, their signal-to-noise ratio improves considerably. The commonly used companding schemes are logarithmic: A-law and μ-law compression. The μ-law formula is y = ln(1 + μx) / ln(1 + μ), where x is the normalized quantizer input and y is the normalized quantizer output. The larger the constant μ, the greater the companding gain for small signals; μ = 255 is commonly used.
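A NumPy sketch of the whole continuous-value pipeline from the tokenization table: μ-law companding, clipping to [-1, 1], discretization into 1024 uniform bins, and shifting into the [32000, 33024) range. The formula follows the one quoted above, extended with a sign term for negative inputs; μ = 255 is taken from the text, and the Gato paper's exact constants may differ.

```python
import numpy as np

def mu_law_tokenize(x, mu=255, num_bins=1024, shift=32000):
    """Companding + discretization for continuous values (a sketch).

    x: array of floats, assumed pre-scaled to roughly [-1, 1].
    """
    x = np.asarray(x, dtype=np.float32)
    # log companding: small magnitudes get a larger share of the bins
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    y = np.clip(y, -1.0, 1.0)
    # map [-1, 1] onto 1024 uniform bins, then shift past the text vocabulary
    bins = np.minimum(((y + 1.0) / 2.0 * num_bins).astype(np.int64), num_bins - 1)
    return bins + shift
```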

  • Local Position Embedding


Original Transformer:

  • neighbouring tokens have high similarity

  • tokens further apart have low similarity

but this changes once position encodings are added

This may explain why the local position encodings of action tokens are all the same: different local position encodings would introduce a bias and change the meaning of the original action tokens.

Training Details

  • Hardware: 16×16 TPU v3 slice

  • Timesteps: 1M

  • Batch size: 512

  • Token sequence length: 1024

  • Training time: 4 days

Datasets


It can be seen that a large share of the sampling weight goes to 3D gaming environments, while only about 15% goes to the text side.

Training Procedure

  • Mimic expert trajectories from SOTA or near-SOTA agents

  • Train only on episodes whose return is at least 80% of the expert return (a filtering step sketched below)
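A sketch of this expert-filtering step; the episode representation and field names are hypothetical.

```python
def filter_expert_episodes(episodes, expert_return, threshold=0.8):
    """Keep only episodes whose total return reaches 80% of the expert return (a sketch).

    episodes: list of dicts with a 'return' field holding the total episode reward.
    """
    return [ep for ep in episodes if ep["return"] >= threshold * expert_return]
```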

Decision Transformers, by contrast, train on episodes of any quality, because the total reward gained is provided as the goal (return conditioning).

Gato, however, cannot learn from bad samples, because it has no notion of reward in any form.

So a fundamental limitation of Gato is that it learns only from good samples.

Because it learns only from good trajectories, it never sees how bad situations arise; if it is randomly initialized into a bad zone, it is likely to fail, since it has never encountered such a situation before.

RoboCat actually improves considerably on this aspect.

  • Future work: supposedly possible to learn via RL from scratch

extrinsic rewards: environment rewards

Sparse-reward settings take very long to learn (maybe do intrinsic reward modelling)

I personally have doubts about learning from scratch in this way.

Is the general agent as good as the expert?


It varies by environment: in harder environments the performance is actually not that good, but overall it is acceptable.

Is GATO scalable?


The answer is yes

  • Normalized return (see the sketch after this list):

  • For each task, compute the model's performance as a percentage of the expert score

  • Average the percentage scores across all tasks of a domain

  • Mean-aggregate the per-domain scores across all domains

  • Increasing tokens trained = increased performance

  • Increasing model size = increased performance
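A sketch of the normalized-return aggregation described in the list above; the dictionary-based representation and names are hypothetical.

```python
from statistics import mean

def normalized_score(task_scores, expert_scores, task_domains):
    """Aggregate per-task scores into one number (a sketch of the evaluation recipe).

    task_scores / expert_scores: dicts mapping task name -> score.
    task_domains: dict mapping task name -> domain name (e.g. 'Atari', 'DM Lab').
    """
    # 1. per-task performance as a percentage of the expert score
    pct = {t: 100.0 * task_scores[t] / expert_scores[t] for t in task_scores}
    # 2. average percentage within each domain
    domains = {}
    for task, p in pct.items():
        domains.setdefault(task_domains[task], []).append(p)
    per_domain = {d: mean(ps) for d, ps in domains.items()}
    # 3. mean-aggregate across domains
    return mean(per_domain.values())
```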

Can GATO generalize (zero-shot)?

  • Not too well on the held-out set; zero-shot transfer is a common problem with ML techniques in general

Generalizability experiments (few-shot)

  • Should ideally learn by conditioning on different prompts

  • Sequence lengths of tokenized demonstrations are too long

  • Maximum context length is insufficient to describe the task

  • Instead, fine-tune the agent's parameters on the new task and evaluate the fine-tuned model's performance in the environment

  • 3 models are compared:

  • same domain only data: pretrained only on data from the same domain as the task to be fine-tuned on

  • no control data: pretrained only on non-control data

  • scratch: no pretraining at all

  • Non-image data: if we do not have the right proprioception data, there is no use in training on other data

  • Image data: differs across different tasks

