AI Literature Translation Practice v2

Article link: https://blog.csdn.net/siper12138/article/details/146518460

1.《ImageNet Classification with Deep Convolutional Neural Networks》

ImageNet Classification with Deep Convolutional Neural Networks: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution（高分辨率） images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved（达到） top-1 and top-5 error rates of 37.5% and 17.0% which is considerably（相当大地） better than the previous state-of-the-art（最先进的）. The neural network, which has 60 million parameters and 650,000 neurons（神经元）, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating(非饱和的) neurons and a very efficient GPU implementation(实现) of the convolution operation. To reduce overfitting(过拟合) in the fully-connected layers we employed a recently-developed regularization(正则化) method called “dropout” that proved to be very effective. We also entered a variant(变体) of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry(参赛作品).

Translation (Chinese)

我们训练了一个大型深度卷积神经网络，用于将 ImageNet LSVRC - 2010 竞赛中的 120 万张高分辨率图像分类到 1000 个不同类别中。在测试数据上，我们实现了 37.5% 的 top - 1 错误率和 17.0% 的 top - 5 错误率，这比之前的最先进水平有了显著提升。该神经网络有 6000 万个参数和 65 万个神经元，由五个卷积层（其中一些后面跟着最大池化层）以及三个全连接层组成，最后是一个具有 1000 路的 softmax 层。为了加快训练速度，我们使用了非饱和神经元以及卷积操作的高效 GPU 实现。为了减少全连接层中的过拟合现象，我们采用了一种最近开发的名为 “随机失活（dropout）” 的正则化方法，事实证明这种方法非常有效。我们还将该模型的一个变体参加了 ILSVRC - 2012 竞赛，并取得了 15.3% 的 top - 5 测试错误率，获得胜利，相比之下，排名第二的参赛作品的错误率为 26.2% 。

convolutional layers

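Convolutional layers are a key building block of convolutional neural networks (CNNs); their main job is to extract features from the input data.

Principle

Convolution operation: a convolutional layer slides a kernel (also called a filter) over the input. At each position it multiplies the local patch element-wise with the kernel and sums the products, which gives one output value. This extracts local features of the input.
Parameter sharing: the same kernel is applied at every position of the input. This greatly reduces the number of parameters and the amount of computation, and it makes the layer translation-equivariant: shifting the input shifts the feature map by the same amount.
Multi-channel convolution: if the input has several channels (a colour image has red, green and blue channels), the kernel has the same number of channels. Each channel is convolved with its own slice of the kernel and the results are summed.

Role

Feature extraction: convolutional layers extract features at different levels, from low-level features such as edges and textures to more complex, abstract features.
Dimensionality reduction: with a suitable kernel size and stride, a convolutional layer reduces the spatial size of the data and so lowers the computational cost.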
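The sliding multiply-and-sum described above can be sketched in a few lines of plain Python. This is a minimal single-channel illustration without padding, not the GPU implementation from the paper; the function name `conv2d`, the toy image and the edge kernel are all assumptions made for the example.

```python
def conv2d(image, kernel, stride=1):
    """Single-channel 2-D convolution (cross-correlation form, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Element-wise product of the local patch and the kernel, then sum.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Toy image: bright on the left, dark on the right.
image = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
# Hypothetical kernel that responds where brightness drops from left to right.
kernel = [[1, -1], [1, -1]]

feature_map = conv2d(image, kernel)             # 3 x 4 map, peak at the edge
strided_map = conv2d(image, kernel, stride=2)   # 2 x 2 map: fewer outputs
```

The same four kernel weights are reused at every position (parameter sharing), and the stride-2 call shows how a larger stride shrinks the output.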

max-pooling layers

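Max-pooling layers are the most commonly used kind of pooling layer in convolutional neural networks (CNNs).

Principle

A max-pooling layer slides a fixed-size window (the pooling window) over the input and outputs the maximum value inside each window. The steps are:

Define the pooling window: choose the window size (for example 2x2 or 3x3) and the stride (how far the window moves each time).
Slide the window: move the pooling window over the input with the chosen stride.
Take the maximum: for each region the window covers, output the largest value in that region.

Role

Dimensionality reduction: max pooling markedly reduces the spatial size of the data, which lowers the computation and the number of parameters needed by later layers and speeds up training.
Feature extraction: by keeping only the maximum, max pooling preserves the strongest responses in the input and makes the model more robust to small variations in the features.
Translation invariance: max pooling is invariant to small shifts of the input to some degree, so the model responds more stably to the same feature appearing at slightly different positions.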
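The three steps above (define a window, slide it, take the maximum) can be sketched as follows. This is a minimal illustration on a nested list; the function name `max_pool2d` and the sample values are assumptions.

```python
def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2-D list with a square window and no padding."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    return [
        [
            # Largest value inside the window whose top-left corner is (i*stride, j*stride).
            max(
                x[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


x = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool2d(x)  # 4 x 4 input becomes 2 x 2
```

With a 2x2 window and stride 2 the output has a quarter as many values as the input, which is the size reduction described above.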

fully-connected layers

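Fully-connected layers are a common layer type in neural networks and play a key role in many deep learning architectures.

Principle

In a fully-connected layer every neuron is connected to all neurons of the previous layer, so the input of each neuron is a weighted sum of all outputs of the previous layer. Written out, with $x_i$ the output of neuron $i$ in the previous layer, $w_{ji}$ the weight from neuron $i$ to neuron $j$, and $b_j$ a bias term:

$$y_j = \sum_i w_{ji}\, x_i + b_j$$

or in matrix form $y = Wx + b$.

Role

Feature integration: a fully-connected layer combines the features extracted by earlier layers, turning local features into higher-level abstract features.
Classification decision: in classification networks the fully-connected layers map the combined features to the class space, and an output function such as softmax turns the result into a probability for each class.
Non-linearity: together with a suitable activation function (ReLU, sigmoid and so on) fully-connected layers add non-linearity to the model and increase its expressive power.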

softmax

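The softmax function is widely used in deep learning, most often in the output layer of multi-class classifiers.

Principle

Given a vector of $K$ raw scores (logits) $z = (z_1, \dots, z_K)$, softmax exponentiates each score and normalizes by the sum, so the outputs are positive and add up to 1:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Role

Multi-class classification: applying softmax in the output layer converts the network output into a probability for each class. In image classification, for example, each element of the softmax output is the probability that the image belongs to the corresponding class.
Gradient computation: combined with the cross-entropy loss, softmax has a simple gradient, which makes parameter updates by backpropagation straightforward.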
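A short sketch may make both points concrete: softmax turns raw scores into probabilities, and with cross-entropy loss the gradient with respect to the scores is simply the probabilities minus the one-hot target. The function names and sample logits are assumptions for illustration; subtracting the maximum score is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, target):
    """Gradient of -log(softmax(logits)[target]) with respect to the logits."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]


logits = [2.0, 1.0, 0.1]   # hypothetical network outputs for 3 classes
probs = softmax(logits)
grad = cross_entropy_grad(logits, target=0)
```

The gradient is negative for the true class (its score should go up) and positive for the others.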

overfitting(过拟合)

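Definition

Overfitting means a model performs very well on its training data but poorly on new data it was not trained on (such as the test set). Put simply, the model has learned the details and noise of the training data and treats these unrepresentative quirks as general rules, so its ability to generalize gets worse.

Symptoms

Large gap between training and test error: training error keeps falling, while test error falls for a while and then starts to rise, so the gap between the two widens.
High model complexity: a model with too many parameters or an overly complex structure overfits more easily. A decision tree that is grown too deep is a typical example.

Causes

Data
- Too little data: with a small training set the model may over-learn the features of those few samples and perform poorly on new data.
- Noisy data: if the training data contains a lot of noise or outliers, the model may learn the noise as well, which hurts generalization.
Model
- High model complexity: a very expressive model can fit the noise and outliers in the training data. High-order polynomial regression is a classic example.
- Training for too long: with too many epochs the model keeps adjusting its parameters to fit the training data and eventually overfits.

Remedies

Data level
- More data: collect more training data so the model sees a wider range of features. In image classification, data augmentation (rotation, flipping, scaling and so on) is a common way to enlarge the dataset.
- Data cleaning: remove noise and outliers from the training data to improve its quality.
Model level
- Regularization: add a penalty term to the loss function that limits the size of the parameters and so keeps the model from becoming too complex. L1 and L2 regularization are the common choices; the dropout used in the paper above is another regularizer.
- Lower model complexity: simplify the model and reduce its parameter count, for example by using fewer neurons or layers in a neural network.
- Early stopping: monitor performance on a validation set during training and stop when it no longer improves.
- Ensemble learning: combine several weak models to improve generalization. Random forests, built from decision trees, are an example.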
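Early stopping is easy to show in code. The sketch below only decides which epoch to keep from a list of validation losses; the function name, the `patience` parameter and the loss values are assumptions, and a real training loop would also save and restore the model weights.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before training is stopped.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss stopped improving: stop here
    return best_epoch


# Hypothetical validation curve: improves, then starts to rise.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
kept = early_stopping_epoch(val_losses, patience=2)
```

With `patience=2` training stops after epoch 4 and keeps epoch 2, so the later improvement at epoch 5 is never seen; a larger patience trades extra training time for a lower chance of stopping too early.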

2.《Automatic Chain of Thought Prompting in Large Language Models》

arXiv 2210.03493: https://arxiv.org/pdf/2210.03493

ABSTRACT

Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like “Let’s think step by step” to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one by one, i.e., let’s think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot

ABSTRACT

Large language models (LLMs) can perform complex reasoning(推理) by generating(生成) intermediate(中间的) reasoning steps. Providing these steps for prompting(提示) demonstrations(演示、示范) is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms(范式). One leverages(利用) a simple prompt like “Let’s think step by step” to facilitate(促进) step-by-step thinking before answering a question. The other uses a few manual(手工的) demonstrations one by one, each composed of(每一个都由……组成) a question and a reasoning chain that leads to an answer. The superior(更好的) performance of the second paradigm hinges on(取决于) the hand-crafting(手工制作) of task-specific(特定任务的) demonstrations one by one. We show that such manual efforts may be eliminated(消除) by leveraging(利用) LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one by one, i.e., let’s think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate(减轻；缓解) the effect of such mistakes, we find that diversity(多样性) matters for automatically(自动地) constructing(构建) demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity(多样性) and generates reasoning chains to construct demonstrations. On ten public benchmark(基准测试) reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds(超过) the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot

Translation (Chinese)

大型语言模型（LLMs）可以通过生成中间推理步骤来执行复杂推理。提供这些步骤用于提示示范被称为思维链（CoT）提示。CoT 提示有两种主要范式。一种利用像 “让我们一步一步地思考” 这样简单的提示，以便在回答问题前推动逐步思考。另一种则逐个使用一些人工示范，每个示范都由一个问题和一条得出答案的推理链组成。第二种范式的卓越性能取决于逐个精心制作特定任务的示范。我们表明，通过利用大语言模型结合 “让我们一步一步地思考” 的提示，逐个生成用于示范的推理链，即不仅要一步一步地思考，还要逐个地思考，就可以省去这种人工工作。然而，这些生成的推理链往往存在错误。为了减轻此类错误的影响，我们发现多样性对于自动构建示范很重要。我们提出了一种自动 CoT 提示方法：自动思维链（Auto-CoT）。它多样化地抽样问题并生成推理链来构建示范。在使用 GPT-3 进行的十个公开基准推理任务上，自动思维链（Auto-CoT）始终达到或超过需要人工设计示范的 CoT 范式的性能。代码可在 https://github.com/amazon-research/auto-cot 获取。

chain-of-thought (CoT) prompting

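Chain-of-thought (CoT) prompting is a technique in natural language processing for steering a model to produce output that contains explicit reasoning steps.

Definition and principle: CoT prompting asks the model to show a step-by-step reasoning process instead of giving only the final answer. The prompt either contains a few examples, each made of a question and the detailed reasoning that solves it, or a simple trigger such as “Let’s think step by step”. The model imitates this reasoning style on the new question, which makes its answer easier to interpret and improves performance on tasks that need complex reasoning.
Application scenarios: it is widely used in tasks that need reasoning, such as math problem solving, commonsense reasoning and logical reasoning. On a math word problem, for example, the model works through the calculation step by step from the given conditions; on a commonsense task it lays out the path from the known facts to the conclusion.
How it is done: reasoning examples are placed in the input text at inference time in a fixed format; the model's weights are not changed. The examples can be written by hand or generated automatically, as Auto-CoT does. For instance, after stating a question the prompt may continue with a lead such as “Reasoning: first ... then ... finally ...” so that the model answers along the same lines.
Benefits: CoT prompting can noticeably improve accuracy on complex reasoning tasks. Because the reasoning chain is written out, people can read it, check it, and spot errors and biases in the model. It also helps the model cope with a wider range of reasoning problems, especially those that combine several kinds of knowledge and rules.
Challenges and limits: high-quality demonstrations with detailed reasoning steps are hard to obtain; manual annotation is costly and subjective. Different reasoning tasks may need different prompt formats, and designing a format that is both general and effective is still an open problem. CoT prompting also does not fully remove logical errors or misreadings of complex semantics.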
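The two prompting paradigms from the abstract differ only in how the prompt string is built, which a small sketch can show. The helper names, the `Q:`/`A:` layout and the sample demonstration are assumptions for illustration; the call to an actual language model is left out.

```python
def zero_shot_cot(question):
    """Zero-shot CoT: append the trigger phrase instead of giving examples."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot(demos, question):
    """Few-shot CoT: each demo is (question, reasoning chain, answer)."""
    parts = [
        f"Q: {q}\nA: {chain} The answer is {answer}."
        for q, chain, answer in demos
    ]
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)


# Hypothetical hand-written demonstration.
demos = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
        "5",
    )
]
prompt = few_shot_cot(demos, "A box holds 4 pens. How many pens are in 3 boxes?")
```

Auto-CoT, as described in the abstract, would fill `demos` automatically by running the zero-shot prompt on a diverse sample of questions instead of relying on hand-written chains.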

3.DeepSeek

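3.1.《DeepSeek LLM: Scaling Open-Source Language Models with Longtermism》

Released on 5 January 2024. The paper studies scaling laws and, guided by them, introduces DeepSeek LLM (7B and 67B), a project for advancing open-source language models from a long-term perspective.

Paper link: https://arxiv.org/abs/2401.02954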

Abstract

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large scale models in two prevalent used open source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B across a range of benchmarks, especially in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that our DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Abstract

The rapid（快速的） development of open-source（开源的） large language models (LLMs) has been truly(真正地) remarkable(卓越的). However, the scaling laws（缩放定律） described in previous literature（文献资料） presents varying conclusions, which casts a dark cloud over（给……蒙上阴影） scaling LLMs. We delve（深入研究） into the study of scaling laws and present our distinctive（独特的） findings that facilitate（促进） the scaling of large scale models in two prevalent（流行的） used open source configurations（配置）, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective（视角）. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion（万亿） tokens（词元） and is continuously expanding. We further conduct（执行） supervised（有监督的） fine-tuning (SFT) and direct preference（偏好） optimization（优化） (DPO) on DeepSeek LLM Base models, resulting in the creation（创建） of DeepSeek Chat models. Our evaluation（评估） results demonstrate（证明） that DeepSeek LLM 67B surpasses（超越） LLaMA-2 70B across a range of benchmarks, especially in the domains（领域） of code, mathematics, and reasoning. Furthermore, open-ended（开放式的） evaluations reveal that our DeepSeek LLM 67B Chat exhibits（展示） superior performance compared to GPT-3.5.

Translation (Chinese)

开源大语言模型（LLMs）的快速发展着实令人瞩目。然而，以往文献中所描述的缩放定律得出了不同的结论，这给大语言模型的扩展蒙上了一层阴影。我们深入研究缩放定律，并展示了独特的研究结果，这些结果有助于在两种常用的开源配置（7B 和 67B）下扩展大规模模型。

在缩放定律的指导下，我们推出了 DeepSeek LLM 项目，该项目致力于从长远角度推动开源语言模型的发展。为支持预训练阶段，我们开发了一个数据集，目前该数据集包含 2 万亿个词元，并且还在不断扩充。

我们进一步对 DeepSeek LLM 基础模型进行了监督微调（SFT）和直接偏好优化（DPO），从而创建了 DeepSeek Chat 模型。评估结果表明，DeepSeek LLM 67B 在一系列基准测试中超越了 LLaMA - 2 70B，尤其是在代码、数学和推理领域。此外，开放式评估显示，我们的 DeepSeek LLM 67B Chat 与 GPT - 3.5 相比表现更优。

tokens

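In natural language processing (NLP), tokens (rendered in Chinese as 词元 or 标记) are the basic units obtained by segmenting, or tokenizing, a text.

How they are produced: during preprocessing the text is split into tokens according to fixed rules. For English, spaces and punctuation usually serve as boundaries, so “The quick brown fox jumps over the lazy dog.” becomes ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]. Chinese and other East Asian languages have no natural separator between words, so dedicated segmentation algorithms (dictionary-based, statistical and so on) cut a sentence into characters or words; “我喜欢自然语言处理” may become ["我", "喜欢", "自然语言", "处理"]. Large language models usually go one step further and split text into subword units learned from data, so a rare word can be covered by several tokens.
Role: tokens are the basic unit an NLP model works with. The model learns the semantics and syntax of tokens in order to understand and generate text. Representing text as a token sequence lets it be turned into numbers and vectors. A bag-of-words model represents a text by how often each token occurs; a Transformer also takes tokens as input, builds a vector for each one, and uses these vectors for tasks such as text classification, sentiment analysis and machine translation.
Difference from related terms: tokens are similar to words but not identical. A word is the smallest unit of natural language that carries meaning on its own, while a token is whatever unit the tokenization rules produce: a whole word, part of a word, a punctuation mark and so on. In text with abbreviations, digits or special symbols, each of these can be a token of its own. Tokens also differ from lemmas. A lemma is the base form of a word (the lemma of “running” is “run”) and lemmatization reduces words to that form, whereas tokenization only splits the text into units and leaves the word forms unchanged.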
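Both tokenization styles mentioned above can be sketched with the standard library. The regular expression and the forward-maximum-matching routine are simple illustrations under assumed names and an assumed toy vocabulary; production tokenizers (and the subword tokenizers used by LLMs) are considerably more involved.

```python
import re


def tokenize_english(text):
    """Split on whitespace and keep punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def forward_max_match(text, vocab, max_len=4):
    """Dictionary-based segmentation: take the longest vocabulary match
    at each position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


english = tokenize_english("The quick brown fox jumps over the lazy dog.")

vocab = {"我", "喜欢", "自然语言", "处理"}  # hypothetical toy dictionary
chinese = forward_max_match("我喜欢自然语言处理", vocab)
```

Note that the final period becomes its own token, which illustrates why tokens and words are not the same thing.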

supervised fine-tuning (SFT)

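Supervised fine-tuning (SFT) is a technique for adapting an existing large pre-trained model to a specific task or dataset by further training it on labeled examples.

Fine-tuning process
- Prepare labeled data: collect and organize a dataset with explicit labels for the task. Sentiment classification needs many texts labeled positive or negative; question answering needs questions paired with correct answers.
- Choose a pre-trained model: pick a suitable pre-trained language model for the task, such as BERT or a GPT-style model. These models have been trained on large corpora and have learned rich linguistic knowledge and general-purpose features.
- Fine-tune: train the chosen model on the labeled dataset, adjusting its parameters so that its output is as close as possible to the labels or target outputs. A gradient-based optimizer, such as stochastic gradient descent (SGD) or an adaptive method like AdamW, minimizes the loss function.
Advantages
- Efficient use of data: because the pre-trained model already holds a large amount of general linguistic knowledge, a relatively small amount of task-specific labeled data is enough to reach good performance, which eases the data-scarcity problem.
- Transfer learning: knowledge is transferred from a large general corpus to the specific task. The grammar, semantics and syntax learned during unsupervised pre-training help the model adapt to the new task faster, saving training time and compute.
- Clear performance gains: on tasks such as text classification, named entity recognition and sentiment analysis, a fine-tuned model usually far outperforms a model trained from scratch on the same small labeled dataset, and can match or exceed complex models designed specifically for the task.
Application areas
- Information retrieval: fine-tuning on labeled data that reflects users' search intent helps the model understand queries better and return more accurate, more relevant results, improving precision and recall.
- Text summarization: fine-tuning on human-written summaries teaches the model to pick out the key information in a text and produce a short, accurate summary.
- Intelligent customer service: fine-tuning on labeled customer-service data lets the model understand the intent behind a customer's question and give a fitting answer, improving efficiency and customer satisfaction.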

Direct Preference Optimization（DPO）

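Direct Preference Optimization (DPO) is a method for training language models on human preference data.

Principle: DPO optimizes the model directly on preference pairs, each consisting of a prompt, a preferred response and a rejected response, so that generated text matches human preferences more closely. Instead of fitting a separate reward function, it uses a classification-style loss that raises the likelihood of the preferred response relative to the rejected one, measured against a frozen reference model.
Advantages: compared with the usual reinforcement-learning pipeline, DPO needs no separately trained reward model and no RL loop. Training on the preference data directly removes one source of error propagation and makes training simpler and more efficient.
Applications: it is used across text generation and dialogue systems to bring the content and style of model output closer to what users want. In news writing it can favor reports that readers find more engaging and readable; in dialogue systems it makes replies follow human conversational habits and expectations more closely.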
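For a single preference pair the DPO loss is short enough to write out. The sketch below takes the summed log-probabilities of the preferred and rejected responses under the policy and under a frozen reference model; the function name, argument names and the numbers are assumptions, and `beta` is the usual temperature-like hyperparameter.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen margin - rejected margin))).

    All four inputs are log-probabilities of a full response.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has moved toward the preferred response: the loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

No reward model appears anywhere: the log-probability ratio against the reference plays the role of an implicit reward.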

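3.2.《DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models》

Released on 11 January 2024. The paper proposes fine-grained expert segmentation and shared-expert isolation, which allow more flexible combinations of experts and improve model performance at unchanged computational cost.
Paper link: https://arxiv.org/abs/2401.06066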

Abstract

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-𝐾 out of 𝑁 experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into 𝑚𝑁 ones and activating 𝑚𝐾 from them, allowing for a more flexible combination of activated experts; (2) isolating 𝐾𝑠 experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5× expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which set the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.

Abstract

In the era of large language models, Mixture-of-Experts (MoE)(专家混合体) is a promising(有前途的) architecture for managing computational costs when scaling up(扩大规模) model parameters. However, conventional(传统的) MoE architectures like GShard, which activate(激活) the top-𝐾 out of 𝑁 experts(专家), face challenges in ensuring expert specialization(专业化), i.e. each expert acquires non-overlapping(不重叠的) and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate(终极的) expert specialization. It involves two principal(主要的) strategies: (1) finely segmenting the experts into 𝑚𝑁 ones and activating 𝑚𝐾 from them, allowing for a more flexible(灵活的) combination of activated experts; (2) isolating(隔离) 𝐾𝑠 experts as shared ones, aiming at capturing(捕获；获取) common knowledge and mitigating(减轻；缓解) redundancy(冗余、多余、过剩) in routed(已选路的) experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5× expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense(密集的) counterpart(对等物) with the same number of total parameters, which set the upper bound of(上限) MoE models. Subsequently(随后), we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary(初步的) efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial(大量的) advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.

Translation (Chinese)

在大语言模型时代，混合专家（MoE）架构是一种在扩展模型参数规模时控制计算成本的有前景的架构。然而，像 GShard 这样的传统 MoE 架构，从 N 个专家中激活前 K 个专家，在确保专家专业化方面面临挑战，即每个专家获取不重叠且集中的知识。对此，我们提出了旨在实现极致专家专业化的 DeepSeekMoE 架构。它包含两个主要策略：（1）将专家精细划分为 mN 个，从中激活 mK 个，从而实现激活专家的更灵活组合；（2）分离出 Ks 个专家作为共享专家，旨在捕捉共同知识并减少路由专家中的冗余。从 20 亿参数的较小规模开始，我们证明 DeepSeekMoE 20 亿参数模型与具有 29 亿参数的 GShard 性能相当，而 GShard 的专家参数和计算量是 DeepSeekMoE 20 亿参数模型的 1.5 倍。此外，DeepSeekMoE 20 亿参数模型几乎接近具有相同总参数数量的密集模型的性能，而密集模型的性能为 MoE 模型设定了上限。随后，我们将 DeepSeekMoE 扩展到 160 亿参数，并表明它与 LLaMA2 70 亿参数模型性能相当，但计算量仅为其约 40%。此外，我们将 DeepSeekMoE 扩展到 1450 亿参数的初步尝试，持续验证了其相对于 GShard 架构的显著优势，并表明其性能与 DeepSeek 670 亿参数模型相当，而计算量仅为其 28.5%（甚至可能低至 18.2%）。

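Mixture of Experts (MoE)

Mixture of Experts (MoE) is a machine learning technique that combines several specialized sub-models.

How it works: an MoE model combines several “expert” models, each good at one part of the input space or one aspect of the task. A gating mechanism decides, for a given input, which expert or experts should process it and how their outputs are combined into the final result. In image recognition, for example, different experts might specialize in different kinds of objects, and the gate gives more weight to the experts that suit the objects in the current image.
Advantages: MoE is flexible and scalable. It can handle complex, multi-modal tasks and picks suitable experts based on the input, which helps generalization and performance. It also uses parameters efficiently: only the selected experts run for each input, so total parameter count can grow without a matching growth in computation, and parameters are not wasted on inputs they do not suit.
Application areas: in natural language processing (machine translation, text generation) different experts handle different styles and topics; in computer vision (object detection, image classification) experts focus on different targets or image features; in speech recognition experts handle different accents, speaking rates and language settings, which improves robustness.
Current state: with the growth of deep learning, MoE is now widely used in large pre-trained models. Google's GShard and Switch Transformer both use MoE layers to raise performance and efficiency on very large datasets and complex tasks.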
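The gating step can be sketched with a toy top-K router of the conventional kind the abstract attributes to GShard: score every expert, keep the K best, and mix their outputs with softmax weights. Everything here (function names, the gate matrix, experts that merely scale their input) is an assumption for illustration; DeepSeekMoE's fine-grained and shared experts are not modeled.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_scaling_expert(factor):
    """Hypothetical expert: multiplies its input vector by a constant."""
    return lambda v: [factor * vi for vi in v]


def moe_forward(x, gate_weights, experts, k=2):
    """Route x to the top-k experts and mix their outputs by gate weight."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top


experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]  # one row per expert
output, chosen = moe_forward([1.0, 0.0], gate_weights, experts, k=2)
```

Only two of the four experts run for this input, which is how MoE keeps computation roughly constant while the total parameter count grows.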

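3.3.《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model》

Released in May 2024. The model introduces Multi-head Latent Attention and the DeepSeekMoE architecture, improves inference efficiency and training cost, and lays the foundation for later versions.
Paper link: https://arxiv.org/abs/2405.04434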

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical(经济的；节约的；合算的) training and efficient(高效的) inference(推理). It comprises(包括；由…… 组成) 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts(采用) innovative(创新的) architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees(保证；担保；保障) efficient inference through significantly(显著地) compressing(压缩) the Key-Value (KV) cache into a latent(潜在的) vector(向量), while DeepSeekMoE enables training strong models at an economical cost through sparse(稀疏的) computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache(键值缓存) by 93.3%, and boosts the maximum generation throughput(吞吐量) to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source(多源的) corpus(语料库) consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock(解锁) its potential(潜能). Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier(顶级的) performance among open-source models. The model checkpoints(模型检查点) are available at https://github.com/deepseek-ai/DeepSeek-V2.

Translation (Chinese)

我们推出了 DeepSeek-V2，这是一款强大的混合专家（MoE）语言模型，具有经济的训练过程和高效的推理能力。它总共有 2360 亿个参数，每个词元激活 210 亿个参数，支持 12.8 万个词元的上下文长度。DeepSeek-V2 采用了包括多头潜在注意力（MLA）和 DeepSeekMoE 在内的创新架构。MLA 通过将键值（KV）缓存显著压缩成一个潜在向量，确保了高效的推理，而 DeepSeekMoE 则通过稀疏计算，以较低成本训练强大的模型。与 DeepSeek 670 亿参数模型相比，DeepSeek-V2 性能显著提升，同时节省了 42.5% 的训练成本，将 KV 缓存减少了 93.3%，并将最大生成吞吐量提高到 5.76 倍。我们在一个由 8.1 万亿个词元组成的高质量多源语料库上对 DeepSeek-V2 进行了预训练，并进一步进行了监督微调（SFT）和强化学习（RL），以充分释放其潜力。评估结果表明，即使只有 210 亿个激活参数，DeepSeek-V2 及其聊天版本在开源模型中仍能取得顶级性能。模型检查点可在https://github.com/deepseek-ai/DeepSeek-V2获取。

Multi-head Latent Attention (MLA)

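Multi-head Latent Attention (MLA) is an efficient attention mechanism that combines multi-head attention with a learned low-dimensional latent representation of the keys and values.

Background and motivation: in standard multi-head attention the key-value (KV) cache grows with sequence length, number of heads and head dimension, so inference on long sequences is expensive. MLA targets this bottleneck by caching a compressed latent vector instead of the full keys and values, while aiming to keep modeling quality.
Core idea: MLA was proposed by DeepSeek. It uses low-rank joint compression to map the high-dimensional key and value information into a low-dimensional latent space, and the per-head keys and values are recovered from that latent vector, so different heads can still focus on different semantic subspaces. Concretely, the key and value projections are factorized into a shared down-projection followed by up-projections, which represents the original information with fewer cached numbers and reduces the memory needed during inference.
Technical characteristics: only the latent vector has to be cached for each token, so the cache no longer scales with the number of heads. The DeepSeek-V2 abstract above reports a 93.3% smaller KV cache and 5.76 times the maximum generation throughput compared with DeepSeek 67B.
Application: MLA is one of the key techniques DeepSeek uses to lower the cost of running large models. On 24 February 2025 DeepSeek open-sourced FlashMLA, an efficient MLA decoding kernel optimized for Hopper GPUs and designed for variable-length sequences. MLA is reported to compress the cache to about 1/4 of its original size without hurting model quality, which greatly lowers GPU memory requirements.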
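The caching idea can be illustrated with a toy, pure-Python sketch: store one small latent vector per token and rebuild keys and values from it when attention needs them. The dimensions, matrix names and random weights are assumptions chosen for readability, and the sketch leaves out other parts of the real design such as positional encoding and query compression.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_latent, n_heads, d_head = 16, 4, 2, 8  # toy sizes (assumptions)

W_down = rand_matrix(d_latent, d_model, rng)           # joint KV down-projection
W_up_k = rand_matrix(n_heads * d_head, d_latent, rng)  # latent -> keys for all heads
W_up_v = rand_matrix(n_heads * d_head, d_latent, rng)  # latent -> values for all heads

latent_cache = []  # one d_latent-sized vector per token


def append_token(hidden_state):
    """Cache only the compressed latent for this token."""
    latent_cache.append(matvec(W_down, hidden_state))


def keys_and_values():
    """Rebuild full keys and values from the cached latents on demand."""
    keys = [matvec(W_up_k, c) for c in latent_cache]
    values = [matvec(W_up_v, c) for c in latent_cache]
    return keys, values


for _ in range(3):  # pretend three tokens have been generated
    append_token([rng.uniform(-1, 1) for _ in range(d_model)])

keys, values = keys_and_values()
floats_cached = len(latent_cache) * d_latent                 # what this sketch stores
floats_standard = len(latent_cache) * 2 * n_heads * d_head   # full K and V per token
```

With these toy sizes the latent cache holds 12 numbers where a standard KV cache would hold 96; the real saving depends on the model's actual dimensions.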

Reinforcement Learning (RL)

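Reinforcement learning (RL) is a major area of machine learning that studies how an agent should act in an environment so as to maximize cumulative reward.

Basic principle: RL is built on the Markov decision process (MDP), whose core elements are the agent, the environment, states, actions and rewards. The agent chooses an action based on the current state; the environment moves to a new state and returns a reward signal. By interacting with the environment again and again, the agent learns a policy that maximizes long-term cumulative reward.
Main algorithms: there are value-based and policy-gradient methods. Value-based methods such as Q-learning estimate the state-action value function (the Q function) and derive the policy from it; in tabular form they suit discrete action spaces and small problems. Sarsa is similar to Q-learning but is on-policy: it updates toward the value of the action the agent actually takes next, whereas Q-learning updates toward the best available next action. Policy-gradient methods such as A2C (Advantage Actor-Critic), A3C (Asynchronous Advantage Actor-Critic) and PPO (Proximal Policy Optimization) optimize the policy network directly and can handle continuous action spaces and large-scale problems.
Application areas: in robotics, RL is used for path planning and posture control so that robots adapt to different environments and tasks. In games, agents learn strategy through play; AlphaGo Zero reached remarkable strength at Go through reinforcement learning. In autonomous driving, it supports decisions and control such as lane keeping, speed control and obstacle avoidance. In resource management, such as network resource allocation and data-center energy management, it helps allocate resources sensibly, improving performance and cutting cost.
Challenges and directions: sample efficiency is low, since an agent needs a large number of interactions to learn a good policy, which can be costly or infeasible in practice. The agent must balance exploring new actions against exploiting what it already knows, or it may never find the best policy. In complex environments with large state spaces, computational cost and convergence are hard to guarantee. Future directions include deeper integration with other machine learning methods such as deep learning for harder perception and decision tasks, more sample-efficient algorithms with better generalization, and wider use in real problems such as intelligent transport and energy management.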
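The Q-learning update mentioned above fits in one function. The sketch uses a dictionary as the Q table and a two-state chain as the environment; the state names, rewards, learning rate `alpha` and discount `gamma` are all assumptions for illustration.

```python
def q_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    next_values = Q.get(next_state)          # terminal states are not in the table
    best_next = max(next_values.values()) if next_values else 0.0
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])


# Toy chain: s0 --right--> s1 --right--> end (reward 1 on reaching the end).
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 0.0},
}

q_update(Q, "s1", "right", reward=1.0, next_state="end")
q_update(Q, "s0", "right", reward=0.0, next_state="s1")
```

After these two updates the reward earned at the end of the chain has already propagated one step back to `s0`, which is how value estimates spread through the state space over many interactions.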

3.4.《DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence》

Released in 2024. The paper discusses the rise of code intelligence as large language models meet programming.

Paper link: https://arxiv.org/abs/2401.14196

Abstract

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

Abstract

The rapid development of large language models has revolutionized（彻底改变了） code intelligence in software development. However, the predominance（主导地位） of closed-source models has restricted（受限制的） extensive（大规模的） research and development. To address(应对) this, we introduce the DeepSeek-Coder series(系列), a range of open-source code models with sizes from 1.3B to 33B, trained from scratch(从头开始训练) on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank(填空) task with a 16K window to enhance code generation and infilling(填充). Our extensive(广泛的) evaluations demonstrate(演示) that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks (在多个基准上)but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive(宽松的；许可的；纵容的。) license that allows for both research(研究) and unrestricted(无限制的、不受约束的) commercial(商业的) use.

Translation (Chinese)

大语言模型的迅速发展给软件开发中的代码智能带来了变革。然而，闭源模型的主导地位限制了广泛的研究与开发。为解决这一问题，我们推出了 DeepSeek-Coder 系列，这是一系列开源代码模型，参数规模从 13 亿到 330 亿不等，在 2 万亿个词元上从头开始训练。这些模型在高质量的项目级代码语料库上进行预训练，并采用窗口为 16K 的填空任务来增强代码生成和填充能力。我们广泛的评估表明，DeepSeek-Coder 不仅在多个基准测试中在开源代码模型中取得了最先进的性能，而且超越了诸如 Codex 和 GPT-3.5 等现有的闭源模型。此外，DeepSeek-Coder 模型遵循宽松的许可协议，允许用于研究以及无限制的商业用途。

3.5.《DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models》

Released in 2024. The paper aims to push the limits of mathematical reasoning in open language models.
Paper link: https://arxiv.org/abs/2402.03300

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Abstract

Mathematical（数学的） reasoning poses(造成) a significant challenge for language models due to its complex(复杂的) and structured(结构化的) nature(性质). In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive(令人印象深刻的) score of 51.7% on the competition-level MATH benchmark without relying on external toolkits(工具包) and voting(投票) techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency(自我一致性) over 64 samples(样本) from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to(归因于) two key factors(因素): First, we harness(利用) the significant potential(潜力) of publicly available web data through a meticulously(细致地) engineered data selection pipeline(数据筛选流程). Second, we introduce（引入） Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances（增强） mathematical reasoning abilities while concurrently（同时地） optimizing the memory usage of PPO.

翻译

由于数学推理具有复杂且结构化的特性，它给语言模型带来了重大挑战。在本文中，我们介绍了 DeepSeekMath 7B。该模型使用从 Common Crawl 获取的 1200 亿与数学相关的词元，结合自然语言和代码数据，对 DeepSeek-Coder-Base-v1.5 7B 继续进行预训练。DeepSeekMath 7B 在不依赖外部工具包和投票技术的情况下，在竞赛级别的 MATH 基准测试中取得了令人瞩目的 51.7% 的成绩，接近 Gemini-Ultra 和 GPT-4 的性能水平。DeepSeekMath 7B 对 64 个样本进行自一致性处理后，在 MATH 基准测试中达到了 60.9% 的成绩。DeepSeekMath 的数学推理能力归因于两个关键因素：其一，我们通过精心设计的数据筛选流程，充分挖掘了公开网络数据的巨大潜力；其二，我们引入了近端策略优化（PPO）的变体 —— 组相对策略优化（GRPO），该方法在优化 PPO 内存使用的同时，增强了模型的数学推理能力。
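
The abstract reports 60.9% on MATH with self-consistency over 64 samples. As a minimal sketch of what that means, assuming self-consistency is implemented as a plain majority vote over the final answers of independently sampled solutions (the function name and the sample answers below are hypothetical, not taken from the paper):

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent final answer among sampled solutions.

    Assumption: each element is the final answer extracted from one
    sampled chain of reasoning; ties go to the answer seen first.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Hypothetical final answers from five sampled solutions to one problem.
sampled = ["12", "15", "12", "7", "12"]
voted = self_consistency(sampled)
```

The 51.7% figure in the abstract is the score without this voting step.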

Group Relative Policy Optimization (GRPO)

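Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm proposed by the DeepSeek team for training large language models (LLMs). It targets the high resource cost of conventional Proximal Policy Optimization (PPO) when training large models. Details follow: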

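Core idea
- Estimating the advantage from group-relative rewards: GRPO estimates the advantage function from relative rewards within a group, so it needs no separate value-function model (critic). For each input question $q$, the policy model samples a group of outputs $\{o_1, o_2, \dots, o_G\}$, and a reward model scores each of them. Each reward $r_i$ is normalized within the group to give the relative reward $\tilde{r}_i = \frac{r_i - \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})}$, which serves as the advantage for every token of output $o_i$.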
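
The group normalization above can be sketched in a few lines. This is an illustration only: the function name, the small `eps` guard against a zero standard deviation, and the use of the population standard deviation are assumptions, not details given in the text.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of sampled outputs.

    Returns (r_i - mean(r)) / std(r) for each output. `eps` is an
    assumed guard for groups in which every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for G = 4 outputs sampled for the same question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean get a positive advantage and the rest a negative one, so no learned critic is needed to supply a baseline.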
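Key steps
- Output generation: for each input question, the current policy generates multiple outputs.
- Output scoring: a reward model scores these outputs.
- Advantage computation: the mean reward of the group serves as the baseline for computing each output's advantage.
- Policy update: the policy is updated to maximize the GRPO objective, which combines an advantage term with a KL-divergence term.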
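
The policy-update step can be illustrated for a single token. The sketch below assumes a PPO-style clipped ratio for the advantage term and the non-negative estimator `ratio - log(ratio) - 1` for the KL term; the function name and the values of `clip_eps` and `beta` are illustrative choices, not settings stated in this text.

```python
import math

def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.04):
    """Per-token objective (to be maximized): clipped advantage term
    minus a KL penalty toward a reference policy.

    logp_new / logp_old / logp_ref are the token's log-probabilities
    under the current, sampling, and reference policies (assumed inputs).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    surrogate = min(ratio * advantage, clipped * advantage)
    # KL estimate: pi_ref/pi_new - log(pi_ref/pi_new) - 1, always >= 0.
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return surrogate - beta * kl

# If all three policies agree on the token, the objective equals the advantage.
same = grpo_token_objective(-1.0, -1.0, -1.0, advantage=0.5)
```

In a full training loop this quantity would be averaged over all tokens of all outputs in the group, and the policy parameters would be updated by gradient ascent on it.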
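Advantages
- Resource efficiency: by dropping the value model, GRPO substantially reduces memory and compute consumption.
- Low-cost multi-candidate optimization: in settings that already produce several candidate answers, such as mathematical reasoning and dialogue generation, GRPO improves the policy by comparing the candidates, at relatively low sampling cost.
- Training stability: GRPO keeps PPO's probability-ratio formulation and KL regularization, which keeps training stable and controllable.
Technical improvements
- Fine-grained supervision: GRPO explores how to incorporate fine-grained process supervision into the group rewards to obtain more accurate feedback.
- Robustness: how to make GRPO more robust when the reward signal is uncertain (for example, noisy labels or an imperfect reward model) is under study.
- Integration with other strategies: GRPO can be combined with other alignment methods, such as Rejection Sampling Fine-Tuning (RFT) and Direct Preference Optimization (DPO), to form a more general framework.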

Proximal Policy Optimization (PPO)

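Proximal Policy Optimization (PPO) is a family of model-free reinforcement learning algorithms developed by OpenAI in 2017 for optimizing a policy network to maximize cumulative reward. It is a policy-gradient method designed to address known weaknesses of earlier policy-gradient algorithms, such as unstable training and low sample efficiency. Details follow: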

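How it works
- Importance sampling: PPO uses importance sampling to estimate the policy gradient. This lets it reuse data collected by the old policy to estimate the gradient of the new policy without resampling, which improves sample efficiency.
- Proximal updates: PPO limits the size of each policy update so that the policy never changes too abruptly and performance does not collapse. Concretely, it introduces a clipping term that keeps the update within a small range, so the new policy cannot drift far from the old one.
Key steps
- Collect samples: run the current policy in the environment and collect a batch of trajectories containing states, actions, rewards, and so on.
- Compute the advantage function: from the collected trajectories, estimate the advantage of each state-action pair. The advantage measures how much better an action is than the policy's average behavior in that state.
- Update the policy network: using the samples and the advantage estimates, update the policy parameters with stochastic gradient descent or a similar optimizer to maximize cumulative reward. The size of each update is limited to keep the policy stable.
- Iterate: repeat these steps, collecting new samples and updating the policy, until a stopping condition is met, such as a step budget or a performance plateau.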
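
The importance ratio and the clipping term described above combine into PPO's clipped surrogate loss. The following is a minimal sketch over a small batch; the function name and the `clip_eps=0.2` default are illustrative assumptions, and the advantages are taken as given rather than computed from trajectories.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss (to be minimized) over a batch.

    logp_new / logp_old are log-probabilities of the taken actions under
    the current policy and the policy that collected the data.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance sampling ratio
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - clip_eps, 1 + clip_eps].
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return -sum(terms) / len(terms)

# With identical old and new policies the loss is minus the mean advantage.
loss = ppo_clip_loss([0.0, 0.0], [0.0, 0.0], [2.0, -1.0])
```

A complete PPO implementation usually adds a value-function loss and an entropy bonus to this term; both are omitted here to keep the sketch focused on the clipping mechanism.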
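Strengths
- Sample efficiency: through importance sampling and controlled policy updates, PPO makes fuller use of the collected data than vanilla policy gradient does, so it needs fewer samples to reach good results and trains more efficiently.
- Training stability: by limiting the size of each update, PPO prevents abrupt policy changes during training. This makes it more likely to converge to a good policy in complex environments and tasks, with less risk of performance oscillation or divergence.
- Generality: PPO is model-free and needs no explicit model of the environment, so it applies to many kinds of environments and tasks. It handles both discrete and continuous action spaces.
Application areas
- Robot control: learning and refining motion skills such as walking and grasping. By interacting with the environment repeatedly, PPO can learn control policies that adapt to different conditions and task requirements and execute actions efficiently and stably.
- Autonomous driving: training driving policies, including speed control, lane keeping, and lane-change decisions. With extensive training in simulation, PPO can learn safe and efficient policies for a range of complex traffic scenarios and road conditions.
- Games: training agents in game environments such as Atari games, Go, and chess. Through repeated interaction with the game, the agent gradually learns its rules and tactics and improves its play, in some cases beyond the level of human players.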

3.6.《DeepSeek - V3 Technical Report》

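Released on December 27, 2024. The report describes an efficient Mixture-of-Experts model that balances performance against compute cost by activating only a small share of its parameters for each token.
Paper link: https://arxiv.org/abs/2412.19437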

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient（高效的） inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers（开创；率先采用） an auxiliary-loss-free strategy（无辅助损失策略） for load balancing and sets a multi-token prediction training objective（目标） for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion（万亿） diverse（多样的） and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness（利用） its capabilities. Comprehensive（全面的） evaluations reveal（显示） that DeepSeek-V3 outperforms（胜过） other open-source models and achieves performance comparable（相当的） to leading（领先的） closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable（非常稳定）. Throughout the entire training process, we did not experience（经历；遇到） any irrecoverable（不可恢复的） loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

翻译

我们推出 DeepSeek-V3，这是一个强大的混合专家（MoE）语言模型，总参数达 6710 亿，每个词元激活 370 亿参数。为实现高效推理和经济高效的训练，DeepSeek-V3 采用了在 DeepSeek-V2 中经过充分验证的多头潜在注意力（MLA）和 DeepSeekMoE 架构。此外，DeepSeek-V3 开创了一种无辅助损失的负载均衡策略，并设定了多词元预测训练目标以提升性能。我们在 14.8 万亿个多样且高质量的词元上对 DeepSeek-V3 进行预训练，随后经过监督微调与强化学习阶段，充分发挥其能力。综合评估表明，DeepSeek-V3 性能优于其他开源模型，达到了与领先闭源模型相当的水平。尽管性能卓越，DeepSeek-V3 的完整训练仅需 278.8 万个 H800 GPU 小时。此外，其训练过程非常稳定。在整个训练过程中，我们没有遇到任何不可恢复的损失激增情况，也未进行任何回滚操作。模型检查点可在https://github.com/deepseek-ai/DeepSeek-V3获取。

auxiliary-loss-free strategy

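This strategy keeps the load across experts balanced without adding an auxiliary loss term. DeepSeek-V3 does so by adjusting a per-expert bias on the routing scores, which avoids the interfering gradients that an auxiliary loss introduces in conventional approaches and thereby improves model performance.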
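
One way to balance load without an auxiliary loss is to give each expert a bias that influences only which experts are selected, and to nudge that bias after each step according to the observed load. The sketch below illustrates the idea under stated assumptions: the function names, the step size `gamma`, and the use of the mean load as the target are all illustrative, and the routing scores are assumed to be positive.

```python
def route_top_k(scores, bias, k):
    """Pick k experts by biased score; gate weights use the raw scores.

    The bias affects selection only, so it adds no gradient term of its own.
    """
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, load, gamma=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    target = sum(load) / len(load)  # assumed target: mean tokens per expert
    new_bias = []
    for b, n in zip(bias, load):
        if n > target:
            new_bias.append(b - gamma)
        elif n < target:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

# Hypothetical scores for one token over three experts, selecting two.
gates = route_top_k([0.5, 0.3, 0.2], [0.0, 0.0, 0.0], k=2)
```

Because the balancing signal never enters the loss, the language-modeling gradient is left untouched, which is the point of the strategy.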

3.7.《DeepSeek - R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning》

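Released in January 2025. The paper builds reasoning ability primarily through large-scale reinforcement learning rather than relying on supervised fine-tuning alone, and reports markedly better results on mathematical and logical reasoning tasks.
Paper link: https://arxiv.org/abs/2501.12948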

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via(经由) large-scale(大规模的) reinforcement(强化) learning (RL) without supervised fine-tuning (SFT) as a preliminary(初步的) step, demonstrates(展现) remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally(自然地) emerges(出现) with numerous(众多) powerful and intriguing(引人入胜的) reasoning behaviors. However, it encounters(遇到；遭遇) challenges such as poor readability(可读性), and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates(包含；纳入；吸收) multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable(相当的) to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense(稠密的) models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled(蒸馏的) from DeepSeek-R1 based on(基于) Qwen and Llama.

翻译

我们介绍第一代推理模型 DeepSeek-R1-Zero 和 DeepSeek-R1。DeepSeek-R1-Zero 是一款通过大规模强化学习（RL）训练的模型，无需先进行监督微调（SFT）作为初步步骤，它展现出卓越的推理能力。通过强化学习，DeepSeek-R1-Zero 自然呈现出众多强大且引人入胜的推理行为。然而，它面临一些挑战，如可读性差和语言混杂问题。为解决这些问题并进一步提升推理性能，我们推出了 DeepSeek-R1，该模型在强化学习之前采用了多阶段训练和冷启动数据。DeepSeek-R1 在推理任务上取得了与 OpenAI-o1-1217 相当的性能。为支持科研界，我们开源了 DeepSeek-R1-Zero、DeepSeek-R1，以及基于 Qwen 和 Llama 从 DeepSeek-R1 蒸馏出的六个稠密模型（15 亿、70 亿、80 亿、140 亿、320 亿、700 亿参数）。

3.8.《Native Sparse Attention: Enabling Efficient Long - Context Modeling for Large - Scale Language Models》

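Released on February 18, 2025. The paper proposes a sparse attention method that combines hardware-aligned optimization with a training-aware design, addressing the high compute cost and memory demand of attention on long texts.
Paper link: DeepSeek | Native Sparse Attention (NSA) - 飞书云文档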

Abstract

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

Abstract

Long-context modeling is crucial(至关重要的) for next-generation language models, yet the high computational cost of standard(标准的) attention mechanisms(机制) poses(带来) significant computational challenges. Sparse(稀疏的) attention offers a promising(有前途的) direction for improving efficiency while maintaining(保持) model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates(整合) algorithmic(算法的) innovations(创新) with hardware-aligned（硬件对齐的） optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical(分层的) sparse strategy, combining coarse-grained(粗粒度的) token compression(压缩) with fine-grained(细粒度的) token selection to preserve(保留) both global context awareness and local precision(精度). Our approach advances sparse(稀疏的) attention design with two key innovations(创新): (1) We achieve substantial(可观的) speedups through arithmetic(算术) intensity-balanced(强度平衡的) algorithm design, with implementation optimizations for modern hardware(硬件). (2) We enable end-to-end training, reducing pretraining computation without sacrificing(牺牲) model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds(超过) Full Attention models across general benchmarks, long-context tasks, and instruction-based(基于指令的) reasoning. Meanwhile, NSA achieves substantial(可观的) speedups over Full Attention on 64k-length sequences(序列) across decoding, forward propagation(传播), and backward propagation, validating its efficiency(效率；效能) throughout the model lifecycle(生命周期).

翻译

长上下文建模对于下一代语言模型至关重要，然而标准注意力机制的高计算成本带来了巨大的计算挑战。稀疏注意力为在保持模型能力的同时提高效率提供了一个有前景的方向。我们提出了 NSA，一种原生可训练的稀疏注意力机制，它将算法创新与硬件适配优化相结合，以实现高效的长上下文建模。

NSA 采用动态分层稀疏策略，将粗粒度的词元压缩与细粒度的词元选择相结合，既能保留全局上下文感知，又能保证局部精度。我们的方法通过两项关键创新推动了稀疏注意力设计的发展：
（1）我们通过算术强度平衡的算法设计实现了显著的加速，并针对现代硬件进行了实现优化。
（2）我们实现了端到端训练，在不牺牲模型性能的前提下减少了预训练计算量。

如图 1 所示，实验表明，使用 NSA 预训练的模型在通用基准测试、长上下文任务和基于指令的推理中，性能保持或超过了全注意力模型。同时，在 64k 长度序列的解码、前向传播和反向传播过程中，NSA 相对于全注意力实现了显著加速，这验证了它在整个模型生命周期中的效率。
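
The coarse-to-fine idea in the abstract (compress blocks of tokens, then select a few blocks for exact attention) can be illustrated for a single query. This is a toy sketch and not the paper's implementation: mean pooling as the compression function, the block size, the block count, and all function names are assumptions, and the hardware-aligned kernels are not modeled at all.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

def sparse_attend(q, keys, values, block_size=2, top_blocks=1):
    """Coarse step: score mean-pooled key blocks (assumed compression).
    Fine step: run exact attention only over tokens in the best blocks."""
    blocks = [list(range(s, min(s + block_size, len(keys))))
              for s in range(0, len(keys), block_size)]
    summaries = [[sum(keys[i][d] for i in blk) / len(blk)
                  for d in range(len(q))] for blk in blocks]
    scores = [dot(q, s) for s in summaries]
    best = sorted(range(len(blocks)), key=lambda b: scores[b],
                  reverse=True)[:top_blocks]
    selected = sorted(i for b in best for i in blocks[b])
    out = attend(q, [keys[i] for i in selected], [values[i] for i in selected])
    return out, selected

# Toy input: the second block of keys matches the query far better.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
out, selected = sparse_attend(q, keys, values)
```

Setting `top_blocks` to the total number of blocks recovers full attention, which makes the trade-off visible: fewer selected blocks means less computation per query at the risk of missing relevant tokens.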
