
What exactly is Clip skip in stable-diffusion?

2023-03-24 21:20 Author: Gnedl

The CLIP model (the text embedding present in 1.x models) has a structure composed of layers. Each layer is more specific than the last. For example, if layer 1 is "Person", then layer 2 could be "male" and "female"; then if you go down the path of "male", layer 3 could be man, boy, lad, father, grandpa, etc. Note this is not exactly how the CLIP model is structured, but it serves as an example.

The 1.5 model, for example, is 12 layers deep, where the 12th layer is the last layer of the text embedding. Each layer is a matrix of some size, and each layer has additional matrices under it: a 4x4 first layer has four 4x4 matrices under it, and so forth. So the text space is dimensionally huge.
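For the curious, here is a minimal sketch that inspects this structure, assuming the transformers library and that the SD 1.x text encoder is openai/clip-vit-large-patch14 (which it is for the official 1.x checkpoints):

```python
# Minimal sketch: inspect the depth and width of the SD 1.x text encoder.
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
config = text_encoder.config

print(config.num_hidden_layers)  # 12 transformer layers, as described above
print(config.hidden_size)        # 768-dimensional embedding per token
```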

Now why would you want to stop earlier in the CLIP layers? Well, if you want a picture of "a cow", you might not care about the subcategories of "cow" the text model might have, especially since these can have varying degrees of quality. So if you want "a cow", you might not want "an Aberdeen Angus bull".

You can imagine CLIP skip to basically be a setting for how accurate you want the text model to be. You can test it out, with the X/Y plot script for example. You can see that each CLIP stage has more definition in the descriptive sense. So if you have a detailed prompt about a young man standing in a field, with lower CLIP stages you'd get a picture of "a man standing", then deeper "a young man standing", "a young man standing in a field", etc.
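To make "stopping earlier" concrete, here is a minimal sketch of how a CLIP-skip-style setting can be applied, assuming the transformers library and the SD 1.x encoder; the clip_skip convention (1 = last layer, 2 = second-to-last) follows the webui's Clip skip setting:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a young man standing in a field",
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    out = text_encoder(**tokens, output_hidden_states=True)

# hidden_states[0] is the token embedding; [1]..[12] are the 12 layer outputs.
clip_skip = 2  # webui convention: 1 = last layer, 2 = second-to-last, ...
hidden = out.hidden_states[-clip_skip]

# The webui re-applies the encoder's final layer norm to the earlier layer,
# so the skipped embedding stays on the scale the UNet expects.
conditioning = text_encoder.text_model.final_layer_norm(hidden)
```

With clip_skip = 1 this reduces to the ordinary last-layer embedding; larger values hand the UNet a less specific description, which is exactly the effect described above.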

CLIP skip really becomes useful when you use models that are structured in a special way, like Booru models, where the "1girl" tag can break down into many sub-tags that connect to that one major tag. Whether you get any use out of CLIP skip is really just trial and error.

Now keep in mind that CLIP skip only works in models that use CLIP, or are based on models that use CLIP, i.e. 1.x models and their derivatives. 2.0 models and their derivatives do not interact with CLIP because they use OpenCLIP.


It took some time to find a fitting and easy-to-understand explanation, but it paid off.

This answer is reposted from https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/5674

If you're interested, go take a look.

