[Ops] SGLang Installation Guide - Euler's Blog

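SGLang is a high-performance inference framework for large language models and can be installed in several ways. This guide gives detailed installation steps and fixes for common problems.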

System Requirements

Python 3.8+
CUDA 11.8+ (for GPU inference)
uv is recommended for dependency management because it installs packages faster
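
The requirements above can be checked before installing anything. The following is a minimal sketch using only the standard library. The helper names `python_ok` and `check_environment` are hypothetical, and finding `nvidia-smi` on the PATH only suggests that an NVIDIA driver is present; it does not confirm the CUDA version.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum Python version stated in this guide


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= tuple(minimum)


def check_environment():
    """Collect simple yes/no facts about the local machine (hypothetical helper)."""
    return {
        "python_ok": python_ok(),
        # Only tells us the tool is on PATH, not which CUDA version is installed.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "uv_found": shutil.which("uv") is not None,
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```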

Installation Methods

Method 1: Using pip or uv (recommended)

# Upgrade pip
pip install --upgrade pip

# Install uv (recommended)
pip install uv

# Install SGLang with uv
uv pip install "sglang[all]>=0.4.9.post2"

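Quick fixes for common issues:

FlashInfer issues
- SGLang currently uses torch 2.7.1, so install the flashinfer build that matches it
- To install flashinfer separately, see the FlashInfer installation documentation
- Note: the FlashInfer package on PyPI is named flashinfer-python, not flashinfer

CUDA_HOME environment variable issues

# Fix 1: set the CUDA_HOME environment variable
export CUDA_HOME=/usr/local/cuda-<your-cuda-version>

# Fix 2: install FlashInfer first, then install SGLang
# See the FlashInfer installation documentation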
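
To pick a value for CUDA_HOME, it helps to find where the CUDA toolkit actually lives. The sketch below is one possible lookup order, not something SGLang or FlashInfer prescribes: it assumes the toolkit follows the usual `<root>/bin/nvcc` layout and that `/usr/local/cuda` is a reasonable last guess on Linux. The function name `find_cuda_home` is hypothetical.

```python
import os
import shutil
from pathlib import Path


def find_cuda_home(environ=None, nvcc_path=None):
    """Guess the CUDA toolkit root; returns a Path or None."""
    environ = os.environ if environ is None else environ

    # 1. An explicitly set CUDA_HOME always wins.
    value = environ.get("CUDA_HOME")
    if value:
        return Path(value)

    # 2. Otherwise derive the root from nvcc, assuming <root>/bin/nvcc.
    nvcc = nvcc_path or shutil.which("nvcc")
    if nvcc:
        return Path(nvcc).parent.parent

    # 3. Fall back to the conventional default location (assumption).
    default = Path("/usr/local/cuda")
    return default if default.exists() else None


if __name__ == "__main__":
    print(find_cuda_home())
```

If the script prints a path, that path can be exported as CUDA_HOME before running the installer again.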

Method 2: Install from source

# Use the latest release branch
git clone -b v0.4.9.post2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

AMD ROCm systems (Instinct/MI GPUs):

# Use the latest release branch
git clone -b v0.4.9.post2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"

Method 3: Using Docker (recommended)

Docker images are available on Docker Hub: lmsysorg/sglang

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-huggingface-token>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000

Using Docker on AMD ROCm systems:

# Build the ROCm image
docker build --build-arg SGL_BRANCH=v0.4.9.post2 -t v0.4.9.post2-rocm630 -f Dockerfile.rocm .

# Define an alias
alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

# Run the server
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-huggingface-token>" \
    v0.4.9.post2-rocm630 \
    python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000

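Method 4: Using Docker Compose (recommended)

This method is recommended when you run SGLang as a service. A better option is to use k8s-sglang-service.yaml.

# 1. Copy compose.yml to your local machine
# 2. Run the command
docker compose up -d

The contents of compose.yml are as follows: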

services:
  sglang:
    image: lmsysorg/sglang:latest
    container_name: sglang
    volumes:
      - ${HOME}/.cache/huggingface:/root/.cache/huggingface
      # If you use modelscope, you need to mount this directory
      # - ${HOME}/.cache/modelscope:/root/.cache/modelscope
    restart: always
    network_mode: host # required by RDMA
    privileged: true # required by RDMA
    # Or you can only publish port 30000
    # ports:
    #   - 30000:30000
    environment:
      HF_TOKEN: <secret>
      # If you use modelscope to download models, set this environment variable
      # SGLANG_USE_MODELSCOPE: true
    entrypoint: python3 -m sglang.launch_server
    command: --model-path meta-llama/Llama-3.1-8B-Instruct
      --host 0.0.0.0
      --port 30000
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host  # lets the container use the host IPC namespace for more efficient shared memory
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:30000/health || exit 1"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
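
The healthcheck in the compose file polls `http://localhost:30000/health`. The same check can be run from a script, for example to wait until the model has finished loading before sending requests. This is a minimal sketch under that assumption; the function names, the retry count, and the delay are illustrative choices, not values taken from SGLang.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:30000/health", timeout=5.0):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(probe=is_healthy, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() up to `attempts` times; return True on the first success."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next try
    return False


if __name__ == "__main__":
    print("server ready" if wait_until_healthy() else "server did not become healthy")
```

Passing `probe` and `sleep` as parameters keeps the retry loop easy to test without a running server.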

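Method 5: Using Kubernetes

Single-node deployment (the model fits on the GPUs of one node):

kubectl apply -f docker/k8s-sglang-service.yaml

Multi-node deployment (large models that need multiple GPU nodes, such as DeepSeek-R1):

# Edit the model path and arguments, then run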
kubectl apply -f docker/k8s-sglang-distributed-sts.yaml

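Method 6: Deploying on Kubernetes or in the cloud with SkyPilot

SkyPilot can deploy to Kubernetes or to 12+ cloud platforms.

Install SkyPilot:

# Install SkyPilot and set up access to a Kubernetes cluster or cloud
# See the SkyPilot docs: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html

Deployment configuration (sglang.yaml):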

# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000

Deploy commands:

# Deploy on any cloud or Kubernetes cluster
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang

Common Notes

FlashInfer support
- FlashInfer is the default attention kernel backend
- It only supports sm75 and above
- If you hit FlashInfer-related issues on an sm75+ device, switch to other kernels:
```
--attention-backend triton --sampling-backend pytorch
```
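
The sm75 threshold corresponds to CUDA compute capability 7.5. As a rough sketch, the check can be scripted. Two assumptions are made here: PyTorch is installed on the GPU machine (`torch.cuda.get_device_capability` is used to query the device), and GPUs older than sm75 should get the same fallback flags shown above. The helper names are hypothetical.

```python
def supports_flashinfer(capability):
    """True if a (major, minor) compute capability is sm75 or newer."""
    return tuple(capability) >= (7, 5)


def backend_flags(capability):
    """Extra launch_server flags for a given compute capability."""
    if supports_flashinfer(capability):
        return []  # keep the default FlashInfer backend
    # Assumption: older GPUs use the same fallback flags quoted in this guide.
    return ["--attention-backend", "triton", "--sampling-backend", "pytorch"]


def detect_capability(device=0):
    """Query the GPU through PyTorch; None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("no CUDA device detected")
    else:
        print(cap, " ".join(backend_flags(cap)) or "(default backend)")
```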
Lightweight install
- If you only need the frontend language with OpenAI models, use:
```
pip install "sglang[openai]"
```
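Separate frontend and backend installs
- The frontend language is independent of the backend runtime
- The frontend can be installed locally (no GPU required):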
```
pip install sglang
```
- Install the backend on a GPU machine:
```
pip install "sglang[srt]"  # srt is short for SGLang runtime
```

Reinstalling FlashInfer

pip3 install --upgrade flashinfer-python --force-reinstall --no-deps
rm -rf ~/.cache/flashinfer

Verifying the Installation

After installation, you can verify it as follows:

# Check the SGLang version
python -c "import sglang; print(sglang.__version__)"

# Start a test server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
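
Because several packages have to match (SGLang, FlashInfer, torch), it can help to print all of their versions at once. The sketch below reads the installed distribution metadata through the standard library's `importlib.metadata` and never imports the packages themselves. The package list is an assumption based on the names mentioned in this guide, and `installed_version` and `report` are hypothetical helpers.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("sglang", "flashinfer-python", "torch")):
    """Map each distribution name to its version (None if not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```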

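Troubleshooting

CUDA version mismatch
- Make sure the CUDA version is compatible with the PyTorch version
- Check the nvidia-smi output to confirm the GPU driver version
Out of memory
- Increase Docker's shared memory size: --shm-size 32g
- Check available system memory
Model download fails
- Set a HuggingFace token: export HF_TOKEN=<your-token>
- Check network connectivity and firewall settings
Permission problems
- Make sure you have sufficient permissions to access the GPU devices
- With Docker, pass the --gpus all flag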
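
For the shared-memory item, Docker's `--shm-size` option sets the size of `/dev/shm` inside the container, so the value that actually took effect can be read from that mount. This is a small sketch to run inside the container; the 32 GiB threshold only mirrors the `--shm-size 32g` example in this guide and is not a hard requirement, and the helper names are hypothetical.

```python
import shutil


def shm_total_gib(path="/dev/shm"):
    """Return the total size of the mount at `path` in GiB, or None if missing."""
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        return None
    return usage.total / 2**30


def shm_is_at_least(required_gib, path="/dev/shm"):
    """True if the mount exists and is at least `required_gib` GiB in size."""
    total = shm_total_gib(path)
    return total is not None and total >= required_gib


if __name__ == "__main__":
    print("shared memory (GiB):", shm_total_gib())
    print("at least 32 GiB:", shm_is_at_least(32))
```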

Service test code

from openai import OpenAI

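# Address of the running server; change the host and port to match your deployment
base_url = "http://192.168.6.9:7890/v1"
# Placeholder key; the client requires a non-empty value
api_key = "no"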

messages = [
    {
        "role": "user",
        "content": "Hello, please write me a six-line poem with six characters per line and six rhymes that is full of sixes"
    }
]

client = OpenAI(
    base_url=base_url,
    api_key=api_key
)
response = client.chat.completions.create(
    model="no",
    messages=messages
)
print(response.choices[0].message.content)
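
If the `openai` package is not available on the machine running the test, the same request can be sent with the standard library alone. This sketch assumes the server exposes the OpenAI-compatible `/chat/completions` route under the base URL, as the client code above relies on. The helper names are hypothetical, and the URL in the usage section is the one from this guide's test code.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key="no"):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request(
        "http://192.168.6.9:7890/v1",
        model="no",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(send_chat_request(req))
```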
