
Generative Pretraining from Pixels

Mark Chen¹  Alec Radford¹  Rewon Child¹  Jeff Wu¹  Heewoo Jun¹  Prafulla Dhariwal¹  David Luan¹  Ilya Sutskever¹
Abstract
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
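The training objective described above — auto-regressive next-pixel prediction over a flattened image, with no 2D structure built in — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy image, the 8-level quantization, and the stand-in uniform predictor are invented here purely to show the shape of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 4x4 grid of pixel intensities quantized to 8 levels
# (the actual paper quantizes color pixels to a small palette; the values
# and sizes here are invented for illustration).
image = rng.integers(0, 8, size=(4, 4))

# Flatten in raster order -- the model only ever sees a 1D sequence and is
# given no knowledge of the original 2D layout.
seq = image.reshape(-1)  # shape (16,)

def uniform_logits(prefix, vocab_size=8):
    """Stand-in for the Transformer: p(x_t | x_<t) uniform over levels."""
    return np.zeros(vocab_size)  # equal unnormalized log-scores

def autoregressive_nll(seq, model, vocab_size=8):
    """Average negative log-likelihood of next-pixel prediction."""
    total = 0.0
    for t in range(1, len(seq)):
        logits = model(seq[:t], vocab_size)
        log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
        total -= log_probs[seq[t]]
    return total / (len(seq) - 1)

nll = autoregressive_nll(seq, uniform_logits)
print(f"NLL under the uniform model: {nll:.3f} nats")  # log(8) ≈ 2.079
```

A trained model replaces `uniform_logits` with context-dependent predictions, and minimizing this same cross-entropy is what drives representation learning.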
1. Introduction
Unsupervised pre-training played a central role in the resurgence of deep learning. Starting in the mid-2000s, approaches such as the Deep Belief Network (Hinton et al., 2006) and Denoising Autoencoder (Vincent et al., 2008) were commonly used in neural networks for computer vision (Lee et al., 2009) and speech recognition (Mohamed et al., 2009). It was believed that a model which learned the data distribution P(X) would also learn beneficial features for the subsequent supervised modeling of P(Y|X) (Lasserre et al., 2006; Erhan et al., 2010). However, advancements such as piecewise linear activation functions (Nair & Hinton, 2010), improved initializations (Glorot & Bengio, 2010), and normalization strategies (Ioffe & Szegedy, 2015; Ba et al., 2016) removed the need for pre-training in order to achieve strong results. Other research cast doubt on the benefits of deep unsupervised representations and reported strong results using a single layer of learned features (Coates et al., 2011), or even random features (Huang et al., 2014; May et al., 2017). The approach fell out of favor as the state of the art increasingly relied on directly encoding prior structure into the model and utilizing abundant supervised data to directly learn representations (Krizhevsky et al., 2012; Graves & Jaitly, 2014). Retrospective study of unsupervised pre-training demonstrated that it could even hurt performance in modern settings (Paine et al., 2014).

¹OpenAI, San Francisco, CA, USA. Correspondence to: Mark
Instead, unsupervised pre-training flourished in a different domain. After initial strong results for word vectors (Mikolov et al., 2013), it has pushed the state of the art forward in Natural Language Processing on most tasks (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). Interestingly, the training objective of a dominant approach like BERT, the prediction of corrupted inputs, closely resembles that of the Denoising Autoencoder, which was originally developed for images.
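The resemblance noted above can be made concrete: both objectives corrupt an input and train a model to recover the original, differing mainly in how the corruption is applied. A minimal sketch of the two corruption schemes follows; the data, mask rate, and noise scale are invented for illustration, and no model is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

# A clean input sequence: think word tokens for BERT, pixel values for a
# Denoising Autoencoder (DAE). Values are invented for illustration.
x = rng.integers(1, 100, size=12)

# BERT-style corruption: hide a random ~15% of positions behind a MASK id;
# the model is trained to predict the original values at masked positions.
MASK = 0
is_masked = rng.random(12) < 0.15
x_bert = np.where(is_masked, MASK, x)

# DAE-style corruption: perturb every position with noise; the model is
# trained to reconstruct the clean original everywhere.
x_dae = x + rng.normal(0.0, 1.0, size=12)

# Both objectives share the same shape: corrupt(x) -> model -> predict x.
assert (x_bert[~is_masked] == x[~is_masked]).all()  # unmasked tokens kept
assert x_dae.shape == x.shape                        # every position noised
```

The difference is where the loss is taken: BERT scores only the masked positions with a cross-entropy, while a DAE scores a reconstruction error over the whole input.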
As a higher dimensional, noisier, and more redundant modality than text, images are believed to be difficult for generative modeling. Here, self-supervised approaches designed to encourage the modeling of more global structure (Doersch et al., 2015) have shown significant promise. A combination of new training objectives (Oord et al., 2018), more recent architectures (Gomez et al., 2017), and increased model capacity (Kolesnikov et al., 2019) has allowed these methods to achieve state of the art performance in low data settings (Hénaff et al., 2019) and sometimes even outperform supervised representations in transfer learning settings (He et al., 2019; Misra & van der Maaten, 2019; Chen et al., 2020).
Given that it has been a decade since the original wave of generative pre-training methods for images and considering their substantial impact in NLP, this class of methods is due for a modern re-examination and comparison with the recent progress of self-supervised methods. We re-evaluate generative pre-training on images and demonstrate that when using a flexible architecture (Vaswani et al., 2017), a tractable and efficient likelihood based training objective (Larochelle & Murray, 2011; Oord et al., 2016), and significant compute resources (2048 TPU cores), generative pre-training is competitive with other self-supervised approaches and learns