DeepSeek-R1: Local Deployment and an In-Depth Look at Performance

This article takes an in-depth look at the DeepSeek-R1 large language model, which combines reinforcement learning with structured training techniques to challenge established AI market leaders on both performance and cost efficiency. It breaks down DeepSeek-R1's core features, its multi-stage training pipeline, and its strong results across a range of benchmarks. It then walks through deploying the DeepSeek-R1-Distill-Qwen-1.5B model locally with Llama.cpp, covering environment setup, model download, and a complete code example that uses LangChain to build a local question-answering assistant, helping developers understand and apply DeepSeek-R1 locally.



Introduction

On January 20, 2025, DeepSeek released the R1 model, which sent shockwaves through the global tech industry and noticeably hit the share prices of US companies with heavy AI investments, such as NVIDIA, Broadcom, and AMD; NVIDIA alone lost more than $500 billion in market value at one point. DeepSeek-R1's competitive performance at a dramatically lower cost challenges incumbents such as OpenAI and NVIDIA, signaling faster AI adoption and shifting market dynamics. This article takes a close look at the DeepSeek-R1 large language model (LLM) and how to run it locally with Llama.cpp.

Understanding the DeepSeek-R1 Model

DeepSeek-R1 is one of DeepSeek's first-generation reasoning models, released alongside DeepSeek-R1-Zero.

  • DeepSeek-R1-Zero: a model trained purely with large-scale reinforcement learning (RL). It demonstrates remarkable reasoning ability but suffers from poor readability and language mixing.
  • DeepSeek-R1: designed to address the limitations of DeepSeek-R1-Zero. It introduces a multi-stage training pipeline:
    1. Cold-start pre-training: before reinforcement learning, the model is first fine-tuned on a small amount of high-quality cold-start data, which stabilizes early training and lays a strong reasoning foundation, in particular fostering chain-of-thought (CoT) reasoning.
    2. Reinforcement learning combined with supervised fine-tuning: once the reinforcement learning converges, a new round of supervised fine-tuning (SFT) is performed with supervised data from DeepSeek-V3 (covering factual QA, self-cognition, writing, and more), followed by a second stage of reinforcement learning to further improve the model's helpfulness and harmlessness.

On reasoning tasks, DeepSeek-R1 performs on par with OpenAI-o1-1217. DeepSeek also released six dense models (1.5B, 7B, 8B, 14B, 32B, and 70B) distilled from DeepSeek-R1, built on the Qwen and Llama architectures. The research shows that reasoning patterns distilled from a large base model are crucial for improving the reasoning ability of smaller models.

Core Features and Strengths of DeepSeek-R1

DeepSeek-R1 strengthens its reasoning ability primarily through:

  • Multi-stage training: cold-start data, reinforcement learning (RL), and supervised fine-tuning (SFT) are combined into an effective training recipe.
  • Chain-of-thought (CoT): carefully curated datasets guide the model to produce detailed step-by-step explanations, improving the transparency and accuracy of its reasoning (a minimal prompt sketch follows below).
  • Language-consistency reward: a reward signal introduced during RL that effectively reduces language mixing and improves the quality of the model's language output.
  • Human-preference alignment: a second stage of reinforcement learning aligns the model with human preferences, improving its helpfulness and harmlessness.

DeepSeek-R1 performs strongly on reasoning-intensive tasks such as coding, mathematics, logical reasoning, and science.
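To make the chain-of-thought prompting style concrete, here is a minimal, illustrative sketch in Python (the question, step wording, and few-shot format are hypothetical examples, not DeepSeek's actual training template):

# Illustrative chain-of-thought prompt: a worked example followed by a new question,
# nudging the model to continue answering in the same step-by-step format.
cot_example = (
    "Question: A train travels 120 km in 2 hours. What is its average speed?\n"
    "Answer: Let's reason step by step.\n"
    "Step 1: Average speed = distance / time.\n"
    "Step 2: 120 km / 2 h = 60 km/h.\n"
    "So the answer is 60 km/h.\n"
)

new_question = "A car travels 150 km in 3 hours. What is its average speed?"
full_prompt = cot_example + f"\nQuestion: {new_question}\nAnswer: Let's reason step by step.\n"
print(full_prompt)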

DeepSeek-R1 Evaluation and Benchmarks

DeepSeek-R1 was evaluated on the following benchmarks to measure its performance comprehensively:

  • Language understanding: Massive Multitask Language Understanding (MMLU), MMLU-Redux, C-Eval, CMMLU.
  • Instruction following: IFEval (Instruction Following Evaluation).
  • Factuality, retrieval, and reasoning: FRAMES (Factuality, Retrieval, and Reasoning Measurement Set).
  • Scientific QA: GPQA (Graduate-Level Google-Proof Q&A Benchmark).
  • Simple QA: SimpleQA, C-SimpleQA.
  • Coding: SWE-bench Verified, Aider, LiveCodeBench, Codeforces.
  • Mathematics and logical reasoning: CNMO 2024 (Chinese National Mathematical Olympiad 2024), AIME 2024 (American Invitational Mathematics Examination 2024).

Evaluation setup: the maximum generation length is set to 32,768 tokens. Because greedy decoding leads to high repetition rates and instability in long-output reasoning models, the evaluation uses a pass@k strategy: k responses (typically 4 to 64) are sampled with a non-zero temperature (sampling temperature 0.6, top-p 0.95), and the final pass@1 result is reported.
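For reference, here is a minimal sketch of how pass@1 can be computed from the k sampled responses described above (the grader function and sample answers are hypothetical placeholders):

from typing import Callable, List

def pass_at_1(responses: List[str], is_correct: Callable[[str], bool]) -> float:
    # pass@1 as described above: the average correctness over the k sampled responses.
    if not responses:
        return 0.0
    return sum(is_correct(r) for r in responses) / len(responses)

# Hypothetical example: k = 4 sampled answers graded against the reference answer "42".
samples = ["42", "41", "42", "42"]
print(pass_at_1(samples, lambda r: r.strip() == "42"))  # 0.75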

Deploying DeepSeek-R1 Locally with Llama.cpp

This tutorial walks through running a DeepSeek-R1 model locally with Llama.cpp.

Step 1: Environment Setup and Library Installation

First, create and activate a Python virtual environment, then install the required libraries:

python3 -m venv deepseek_llamacpp
source deepseek_llamacpp/bin/activate
pip install langchain langchain-core langchain-community
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
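The CMAKE_ARGS="-DGGML_METAL=on" flag builds llama-cpp-python with Metal acceleration for Apple-silicon Macs; on other platforms you would omit it or use the build flags for your GPU backend. As an optional sanity check, a quick sketch that only verifies the installed packages import correctly (no model is loaded):

# Optional sanity check: confirm that llama-cpp-python and LangChain import correctly.
import llama_cpp
import langchain

print("llama-cpp-python:", llama_cpp.__version__)
print("langchain:", langchain.__version__)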

Step 2: Download the DeepSeek-R1-Distill-Qwen-1.5B Model

Use Git LFS to download the DeepSeek-R1-Distill-Qwen-1.5B model, which is already provided in GGUF format in the repository below (install Git LFS first if it is not already installed):

sudo apt install -y git-lfs # if git-lfs is not installed
git lfs clone https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
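If you prefer to fetch just the single quantization file used later in this tutorial rather than cloning the whole repository, huggingface_hub offers an alternative (a sketch assuming pip install huggingface_hub; the filename matches the Q6_K file referenced in the script below):

# Download only the Q6_K GGUF file instead of cloning the full repository.
# Assumes: pip install huggingface_hub
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
    filename="DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf",
)
print(model_path)  # pass this path to LlamaCpp(model_path=...) in the next step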

Step 3: Initialize Llama.cpp with LangChain and Build a Q&A Assistant

Next, we initialize Llama.cpp through LangChain and build a local question-answering assistant with conversation memory:

from langchain_core.prompts import PromptTemplate
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp
from langchain.memory import ConversationBufferMemory

# defining the prompt template and callback for streaming
template = """Question: {question}

Answer: Let's work this out in a step-by-step way to ensure we get the right answer."""
prompt = PromptTemplate.from_template(template)
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# setting parameters for llamacpp (gguf) model 
memory = ConversationBufferMemory()
n_gpu_layers = -1 # -1 offloads all available layers to the GPU
n_batch = 512
llm = LlamaCpp(
   model_path="/Users/sachintripathi/Documents/Py_files/Deepseek /DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf", # replace with the path to your GGUF model file
   n_gpu_layers=n_gpu_layers,
   n_batch=n_batch,
   f16_kv=True,
   callback_manager=callback_manager,
   verbose=True
)

# chaining prompt and llm
llm_chain = prompt | llm

def main_with_memory():
   memory.clear() # clear any previous conversation history
   while True:
      user_question = input("Enter your question (or type 'quit' to exit): ")
      if user_question.lower() == 'quit':
         break
      # retrieve the stored conversation history as a single string
      context = memory.load_memory_variables({}).get("history", "")
      combined_prompt = f"{context}\nQuestion: {user_question}" if context else user_question
      generated_text = llm_chain.invoke({"question": combined_prompt})
      print("Answer:", generated_text)
      # store the new exchange so it is available as context on the next turn
      memory.save_context({"input": user_question}, {"output": generated_text})

if __name__ == "__main__":
   main_with_memory()
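A note on the design: prompt | llm uses LangChain's pipe syntax to compose the prompt template and the model into a single runnable, and ConversationBufferMemory keeps the entire dialogue as one growing history string that is prepended to each new question. Over a long session that history approaches the model's context window, so you may want to truncate or summarize it before including it in the prompt.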

Sample output:

Enter your question (or type 'quit' to exit): What is the meaning of LLMs? 

First, understand the definition of LLMs. An LLM is a type of machine learning model that predicts probabilities by approximating a function using a neural network.
Step 1: Identify what an LLM is.
An LLM is a type of machine learning model. It's defined as a deep learning system composed of interconnected neurons, or units, with multiple layers.
Step 2: Break down the definition into parts to identify key terms.
The key term here is "deep learning," which refers to neural networks with many layers. The next part is "comprised of interconnected neurons." Each neuron has a weighted connection to other neurons and processes information using an activation function, typically ReLU or sigmoid.
Step 3: Identify the main components in this definition.
The main component is the LLM itself because it's the subject of study. Key terms are neural network, which refers to the structure with layers; deep learning refers to the specific subfield within machine learning that uses many layers. The activation function (ReLU or sigmoid) is also a key term.
Step 4: Consider any exceptions.
Is there any exception to this definition? For example, different types of LLMs might use different activation functions, but ReLU and sigmoid are commonly used for standard
llama_perf_context_print:      load time =     167.56 ms
llama_perf_context_print: prompt eval time =     167.13 ms /    37 tokens (    4.52 ms per token,   221.38 tokens per second)
llama_perf_context_print:        eval time =    7105.58 ms /   255 runs   (  27.87 ms per token,   35.89 tokens per second)
llama_perf_context_print:       total time =    7790.66 ms /   292 tokens
Answer:  First, understand the definition of LLMs. An LLM is a type of machine learning model that predicts probabilities by approximating a function using a neural network.
Step 1: Identify what an LLM is.
An LLM is a type of machine learning model. It's defined as a deep learning system composed of interconnected neurons, or units, with multiple layers.
Step 2: Break down the definition into parts to identify key terms.
The key term here is "deep learning," which refers to neural networks with many layers. The next part is "comprised of interconnected neurons." Each neuron has a weighted connection to other neurons and processes information using an activation function, typically ReLU or sigmoid.
Step 3: Identify the main components in this definition.
The main component is the LLM itself because it's the subject of study. Key terms are neural network, which refers to the structure with layers; deep learning refers to the specific subfield within machine learning that uses many layers. The activation function (ReLU or sigmoid) is also a key term.
Step 4: Consider any exceptions.
Is there any exception to this definition? For example, different types of LLMs might use different activation functions, but ReLU and sigmoid are commonly used for standard
Enter your question (or type 'quit' to exit): quit
ggml_metal_free: deallocating

Summary

DeepSeek-R1 is a powerful large language model that leverages cold-start data and iterative reinforcement-learning fine-tuning to achieve remarkable performance in the LLM space. Its open-source reasoning capabilities rival OpenAI and other closed-source models in mathematics, coding, and general reasoning, at better cost efficiency. By delivering competitive performance at a much lower cost, DeepSeek-R1 makes advanced AI capabilities more broadly accessible, and its open-source nature challenges the traditional dominance of closed models, demonstrating that high-performing LLMs can be developed and shared openly and putting pressure on incumbent players in the AI industry.

