GLM-4.7-flash-128k 测试报告（代码场景）

时间：2026-02-22

1) 已实施配置变更

Provider API：openai-completions -> ollama（原生）
Base URL：http://127.0.0.1:11434/v1 -> http://127.0.0.1:11434
模型参数：
- contextWindow: 131072 -> 65536
- maxTokens: 16384 -> 8192
- agents.defaults.models[ollama/glm-4.7-flash-128k].params：
- temperature: 0.2
- num_ctx: 65536
- num_predict: 4096

2) 环境与硬件快照

CPU: i3-12100F (4C/8T)
RAM: 15GiB
GPU:
- RTX 2080 Ti 22GB
- Tesla P100 16GB
Ollama: 0.16.3
OpenClaw: 2026.2.19-2

3) 压测结果（代码编写多轮）

测试文件：reports/ollama-coding-bench.json

三组配置（ctx32k / ctx64k / ctx96k），每组 5 轮代码任务。

结果：

ctx32k: 第1轮超时
ctx64k: 第1轮超时
ctx96k: 第1轮超时

额外单轮短任务验证（ctx64k, num_predict=256）：

成功返回，耗时约 13.19s

4) 结论

你的模型“能工作”，但在“长输出+代码多轮”下非常容易触发超时。
当前主要瓶颈不是消息通道，而是推理吞吐（长响应生成速度不足）。
5 轮代码压测失败说明：当前参数对该硬件+模型规模来说仍偏激进。

5) 推荐稳定参数（优先稳定）

建议改成：

num_ctx: 32768
num_predict: 1024（必要时 768）
temperature: 0.2

使用策略：

代码场景默认先短答，必要时再“继续”生成下一段
避免一次性超长代码块

6) 可观测性（你能确认我是否在工作）

建议固定用：

openclaw status
openclaw models status
ollama ps
tail -f /tmp/openclaw/openclaw-$(date +%F).log | grep -Ei "embedded run (start|done|timeout)|FailoverError|timed out"

这样你可以实时看到：是否在跑、是否超时、是否切换fallback。