GATO Survey Report
GATO
Inspired by progress in large-scale language modelling, the authors apply a similar approach to building a single generalist model beyond the realm of text outputs. The agent, referred to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. The report describes the model and the data, and documents Gato's current capabilities.


At its core, it is next-token prediction: given the previous tokens, predict the next one
Prediction Problem
GATO does not predict observations, only the tokens of the next action
Similar to Decision Transformers (except there are no rewards in the sequence)
It simply imitates the behaviour of experts
Instead of one-hot task identifiers, prompt conditioning is used
Similar to the T5 architecture: <prompt> + <sequence>
Prompt: samples from an episode, with 50% taken from the end of the episode and 50% sampled uniformly (see the sketch below)
goal-directed learning without rewards??
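A minimal sketch of the prompt sampling described above; the function name and data layout are hypothetical, only the 50% from-the-end / 50% uniform split follows the notes:

```python
import random

def sample_prompt(episode_tokens, prompt_len):
    """Sample a prompt from one tokenized episode (hypothetical helper).

    Half of the time the prompt is taken from the end of the episode
    (goal-like conditioning), otherwise a uniformly sampled window is used.
    """
    if len(episode_tokens) <= prompt_len:
        return list(episode_tokens)
    if random.random() < 0.5:
        return list(episode_tokens[-prompt_len:])            # 50%: end of episode
    start = random.randint(0, len(episode_tokens) - prompt_len)
    return list(episode_tokens[start:start + prompt_len])    # 50%: uniform window
```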


Details
Trained with a cross-entropy loss, as in supervised sequence modelling
Minimize L(θ, B) over a batch B of token sequences
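For reference, my reading of the paper's objective is a masked log-likelihood over a batch B of sequences (treat the exact notation as an approximation):

L(θ, B) = − Σ_{b=1}^{|B|} Σ_{l=1}^{L} m(b, l) · log p_θ(s_l^{(b)} | s_1^{(b)}, ..., s_{l−1}^{(b)})

where the mask m(b, l) is 1 only when token l of sequence b is a text or action token, so observation tokens are never predicted.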

Tokenization
| Data Type | Method | Ordering | Range |
| --- | --- | --- | --- |
| Text | SentencePiece with 32000 subwords | Text order | [0, 32000] |
| Images | Split into non-overlapping 16×16 patches and embedded ViT-style | Raster order | [-1, 1] for each pixel, divided by the square root of the patch size (i.e. 4) |
| Discrete values (e.g. Atari actions) | Flattened into sequences of integers | Row-major order | [0, 1024] |
| Continuous values (e.g. proprioceptive inputs) | Flattened into sequences of floating-point values | Row-major order | Mu-law encoded to [-1, 1], discretized into 1024 uniform bins, then shifted to [32000, 33024] |
• Episodes are presented to the agent in order of time (timesteps).
• Timesteps in turn are presented in the following order:
– Observations ([y1:k, x1:m, z1:n]) are ordered lexicographically by key; each item is sequenced as follows:
∗ Text tokens (y1:k) are in the same order as the raw input text.
∗ Image patch tokens (x1:m) are in raster order.
∗ Tensors (z1:n) (such as discrete and continuous observations) are in row-major order.
– Separator (′|′); a designated separator token is provided after observations.
– Actions (a1:A) are tokenized as discrete or continuous values and in row-major order.
A full sequence of tokens is thus given as the concatenation of data from T timesteps:
s_{1:L} = [[y^1_{1:k}, x^1_{1:m}, z^1_{1:n}, '|', a^1_{1:A}], ..., [y^T_{1:k}, x^T_{1:m}, z^T_{1:n}, '|', a^T_{1:A}]]
where L = T (k + m + n + 1 + A) is the total number of tokens.
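A rough sketch of assembling such a sequence under the ordering above; the helper names and data layout are assumptions, only the ordering and the separator follow the description:

```python
SEPARATOR = "|"  # stand-in for the designated separator token

def assemble_timestep(text_tokens, image_patch_tokens, tensor_tokens, action_tokens):
    """One timestep: observations (text y, image patches x, tensors z),
    then the separator, then the actions a."""
    return (list(text_tokens)           # y_{1:k}, raw text order
            + list(image_patch_tokens)  # x_{1:m}, raster order
            + list(tensor_tokens)       # z_{1:n}, row-major order
            + [SEPARATOR]               # '|'
            + list(action_tokens))      # a_{1:A}

def assemble_episode(timesteps):
    """Full sequence over T timesteps, L = T * (k + m + n + 1 + A) tokens."""
    seq = []
    for y, x, z, a in timesteps:
        seq += assemble_timestep(y, x, z, a)
    return seq
```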
Embedding Inputs
A parameterized embedding function f(·; θ_e) is applied to each token
Text tokens and discrete- or continuous-valued observations/actions are embedded via a lookup table into a learned vector embedding space
Image patches are embedded using a ResNet block to obtain a vector per patch
A learnable position encoding vector is added to each token embedding
Image Embedding
Similar to ViT
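A simplified, ViT-style sketch of the image path; a single conv projection stands in for the per-patch ResNet block from the paper, and the learnable position encoding is a flat per-patch table rather than whatever scheme the paper actually uses:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 16x16 patches and embed each patch."""
    def __init__(self, embed_dim=768, patch=16, max_patches=256, in_ch=3):
        super().__init__()
        # conv with stride = kernel = patch size yields one vector per patch
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        # learnable position encoding, one vector per patch position
        self.pos = nn.Parameter(torch.zeros(1, max_patches, embed_dim))

    def forward(self, img):               # img: (B, 3, H, W), pixels in [-1, 1]
        x = self.proj(img)                # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, D), raster order
        return x + self.pos[:, :x.shape[1]]
```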


Tokenization + Embedding Pipeline (Image + Discrete actions)

Tokenization + Embedding Pipeline (Proprioception + Continuous actions)

Mu-law Encoding
Non-uniform quantization addresses the problems of uniform quantization. The basic idea is to compress large signals while strongly amplifying small signals; because small-signal amplitudes are amplified more, the signal-to-noise ratio for small signals improves greatly. The common companding methods are logarithmic A-law and μ-law compression. The μ-law formula is y = ln(1 + μx) / ln(1 + μ), where x is the normalized quantizer input and y is the normalized quantizer output. The larger the constant μ, the greater the companding gain for small signals; μ = 255 is commonly used.
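A sketch of the continuous-value tokenization (mu-law companding to [-1, 1], 1024 uniform bins, shift into the [32000, 33024] range). The constants mu = 100 and M = 256 are my reading of the Gato paper and should be treated as assumptions:

```python
import numpy as np

MU, M = 100.0, 256.0  # assumed companding constants

def mu_law_encode(x):
    # sgn(x) * log(|x| * mu + 1) / log(M * mu + 1)
    return np.sign(x) * np.log(np.abs(x) * MU + 1.0) / np.log(M * MU + 1.0)

def tokenize_continuous(x, num_bins=1024, offset=32000):
    y = np.clip(mu_law_encode(np.asarray(x, dtype=np.float64)), -1.0, 1.0)
    bins = np.floor((y + 1.0) / 2.0 * (num_bins - 1)).astype(int)  # 0..1023
    return bins + offset                                           # 32000..33023
```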
Local Position Embedding

Original Transformer:
neighbouring tokens have high similarity, tokens further apart have low similarity
but this changes once position encodings are added
This may explain why the local position encodings of all action tokens are the same: different local position encodings would introduce a bias and change the meaning of the original action tokens.
Training Details
Hardware: 16x16 TPU v3 slice
Training steps: 1M
Batch size: 512
Token sequence length: 1024
Training time: 4 days
Datasets

It can be seen that a large share of the sampling weight is on 3D gaming environments, and only 15% is on text.
Training Procedure
Mimic expert trajectories from SOTA or near-SOTA agents
Train only on episodes with a return of at least 80% of the expert return (see the sketch below)
Decision Transformers, by contrast, train on all kinds of episodes, because the target total return is part of the conditioning
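A trivial sketch of that return filter; the episode layout and the expert_return lookup are hypothetical:

```python
def filter_demonstrations(episodes, expert_return, threshold=0.8):
    """Keep only episodes whose total return reaches at least 80% of the
    expert return for the task (per the notes above)."""
    return [ep for ep in episodes
            if sum(ep["rewards"]) >= threshold * expert_return]
```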
But Gato cannot learn from bad samples, since it has no notion of reward in any form
So Gato's fundamental limitation is that it only learns from good samples
Because it only learns from good behaviour, it never sees how bad outcomes arise; if it is randomly initialized into a bad zone it is very likely to go wrong, since it has never encountered such situations before
RoboCat actually made large improvements on this point
Future work: supposedly possible to learn via RL from scratch
extrinsic rewards: environment rewards
sparse reward settings take very long to learn (maybe do intrinsic reward modelling)
I am personally skeptical about learning from scratch
Is the general agent as good as the expert?

It varies by environment; in harder environments the performance is not that good, but overall it is acceptable
Is GATO scalable?

The answer is yes
Normalized return (see the sketch below):
For each task, compute the model's performance as a percentage of the expert score
Average the percentage scores across all tasks of a domain
Mean-aggregate the percentage scores across all domains
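A small sketch of this aggregation, assuming per-task raw scores, per-task expert scores, and a task-to-domain mapping (all names hypothetical):

```python
from collections import defaultdict

def normalized_return(model_scores, expert_scores, task_domains):
    """model_scores / expert_scores: dict task -> raw score;
    task_domains: dict task -> domain. Returns the mean over domains of the
    per-domain mean percentage-of-expert score."""
    per_domain = defaultdict(list)
    for task, score in model_scores.items():
        pct = 100.0 * score / expert_scores[task]      # % of expert score
        per_domain[task_domains[task]].append(pct)
    domain_means = [sum(v) / len(v) for v in per_domain.values()]
    return sum(domain_means) / len(domain_means)       # mean across domains
```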
Increasing tokens trained = increased performance
Increasing model size = increased performance
Can GATO generalize (zero-shot)?


Not too well on the held-out set; zero-shot transfer is a common problem for ML techniques in general
Generalizability experiments (few-shot)
Ideally the agent should learn new tasks just by conditioning on different prompts
but the sequence lengths of tokenized demonstrations are too long
the maximum context length is insufficient to describe a task
Instead, fine-tune the agent's parameters on the new task, and evaluate the fine-tuned model's performance in the environment
3 models:
same domain only data: pretrained only on data from the same domain as the task to be fine-tuned on
no control data: pretrained only on non-control data
scratch: no pretraining at all

Non-image data: if we do not have the right proprioception data, there is little use in training on other data
Image data: here the story is different; pretraining on data from other tasks can help