How we built our Multi-Agent Research System

"Claude's Research feature is not a single agent: it is a collaboration between one Lead and several Subagents. And the token cost of that collaboration was 15 times what we first expected."

Product: Claude.ai · Research

Architecture: Orchestrator–Worker

Token cost: ~15x vs chat

Lesson: Prompts > Models

Written: 2026.05.17

Implementation

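Working code

A minimal implementation based on the Anthropic SDK. The full pipeline (Lead → Subagent dispatch → parallel execution → synthesis → citation) is expressed without a framework, using only plain Python functions and `asyncio`.

01 · Define the Subagent as a tool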

tool schema

from anthropic import AsyncAnthropic

DISPATCH_SUBAGENT = {
    "name": "dispatch_subagent",
    "description": (
        "Delegate a sub-research task to a Subagent. "
        "In task, state one single, concrete goal, e.g. "
        "'Research the 2020-2024 revenue trend of company A'. "
        "Several dispatches can be issued at once. The more independent the tasks, the better."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "task": {"type": "string", "description": "A single goal, in one sentence"},
            "tools": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["task"],
        "additionalProperties": False,
    },
}

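Takeaway. The quality of the tool description sets the quality of the Lead's task decomposition. Write it as clearly as you would explain the job to a junior colleague, and put guidance such as "single, concrete goal" and "the more independent, the better" directly in the description.

02 · Run the Subagent (isolated context)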

worker

aclient = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

SUBAGENT_SYSTEM = """You are a research Subagent. Focus on the single task only.
- Compress the result to facts. At most 5 sentences.
- Put (source: URL) next to each fact.
- Do not explore beyond the scope of the task."""

async def run_subagent(task: str, tools: list[dict]) -> dict:
    messages = [{"role": "user", "content": task}]

    for _ in range(8):  # max_steps safety cap
        resp = await aclient.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=SUBAGENT_SYSTEM,
            tools=tools,
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            # end_turn, max_tokens, ...: no tool calls left to run.
            # Join every text block; content[0] is not guaranteed to be text.
            findings = "".join(b.text for b in resp.content if b.type == "text")
            return {"task": task, "findings": findings}

        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type == "tool_use":
                # run_tool is the external tool runner, defined elsewhere
                output = await run_tool(block.name, block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(output)[:8000],  # truncate so one result cannot flood the context
                })
        messages.append({"role": "user", "content": results})

    return {"task": task, "findings": "<incomplete: max_steps>"}

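Takeaway. The Subagent's system prompt is what enforces scope. Without "Do not explore beyond the scope of the task", the Subagent widens its search on its own judgment and token usage runs away. Always truncate tool_result content as well.

03 · Parallel dispatch from the Lead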
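The truncation step in `run_subagent` can be pulled into a small helper. The sketch below is one possible shape, not the original system's code: the name `truncate_tool_output` and the marker text are hypothetical, and it assumes tool outputs are either strings or JSON-serializable values. Unlike a bare slice, it tells the model that content was cut, so the Subagent does not mistake a partial result for the whole one.

```python
import json


def truncate_tool_output(output, limit: int = 8000) -> str:
    """Return tool output as text, cut to `limit` chars with a visible marker.

    Hypothetical helper: non-string outputs are serialized to JSON first.
    """
    if isinstance(output, str):
        text = output
    else:
        text = json.dumps(output, ensure_ascii=False, default=str)

    if len(text) <= limit:
        return text

    # Tell the model how much was dropped instead of cutting silently.
    marker = f"\n[truncated {len(text) - limit} of {len(text)} chars]"
    return text[:limit] + marker
```

In the worker loop this would be used as `"content": truncate_tool_output(output)` in place of the raw slice.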

orchestrator

import asyncio

LEAD_SYSTEM = """You are the Lead Research Agent.
1) Take the user's question and think it through step by step.
2) Decompose it into sub-tasks that can be explored independently and in parallel.
3) Call the dispatch_subagent tool for several sub-tasks in a single turn.
4) Once the findings are in, synthesize.

Rules: use 3 to 8 sub-tasks. If there are strong dependencies, do not decompose."""

# Placeholder (not in the original listing): register the Subagent's tool
# schemas here, e.g. a web search tool.
SUBAGENT_TOOLS: list[dict] = []

async def lead_agent(user_query: str) -> str:
    messages = [{"role": "user", "content": user_query}]
    tools = [DISPATCH_SUBAGENT]

    for _ in range(5):
        resp = await aclient.messages.create(
            model="claude-opus-4-7",  # the Lead gets the strong model, even if it costs more
            max_tokens=4096,
            system=LEAD_SYSTEM,
            tools=tools,
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            # Join every text block; content[0] is not guaranteed to be text.
            return "".join(b.text for b in resp.content if b.type == "text")

        messages.append({"role": "assistant", "content": resp.content})

        # Collect every tool_use in this turn and run them all in parallel
        dispatches = [b for b in resp.content if b.type == "tool_use"]
        sub_results = await asyncio.gather(
            *[run_subagent(b.input["task"], SUBAGENT_TOOLS) for b in dispatches],
            return_exceptions=True,
        )

        tool_results = []
        for b, r in zip(dispatches, sub_results):
            failed = isinstance(r, Exception)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": b.id,
                "content": f"<error>{r}</error>" if failed else r["findings"],
                "is_error": failed,  # lets the Lead see that this sub-task failed
            })
        messages.append({"role": "user", "content": tool_results})

    raise RuntimeError("lead exhausted turns without end_turn")

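Takeaway. The key is to collect every tool_use block from the same assistant turn and run them together with asyncio.gather. Running them serially throws away the benefit of a multi-agent system. return_exceptions=True lets the run tolerate partial failures.

04 · Citation Agent (citation post-processing)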
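The gather pattern can be tried without any API calls. The sketch below uses a stand-in `fake_subagent` coroutine and adds a concurrency cap through `asyncio.Semaphore`. The cap is an assumption on my part, not something the text prescribes. Both `gather_bounded` and `fake_subagent` are hypothetical names. It shows that one failing sub-task comes back as an exception object in its slot while the others still return findings.

```python
import asyncio


async def gather_bounded(coros, max_concurrency: int = 4) -> list:
    """Run coroutines in parallel, at most `max_concurrency` at a time.

    Results keep input order; failures are returned as exception objects.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros), return_exceptions=True)


async def fake_subagent(task: str) -> dict:
    """Stand-in for run_subagent: sleeps briefly, fails on demand."""
    await asyncio.sleep(0.01)
    if "fail" in task:
        raise RuntimeError(f"subagent failed: {task}")
    return {"task": task, "findings": f"findings for {task}"}


def demo() -> list:
    tasks = ["revenue 2020-2024", "fail on purpose", "headcount 2020-2024"]
    return asyncio.run(
        gather_bounded([fake_subagent(t) for t in tasks], max_concurrency=2)
    )
```

Because results keep the input order, zipping them with the original tool_use blocks (as `lead_agent` does) still pairs each result with the right `tool_use_id`.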

post-processor

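CITATION_SYSTEM = """You are the Citation Agent. For each claim in the body written by the Lead,
map the matching (source: URL) from the findings as a [^N] marker.
- Leave the body unchanged and only add citation markers.
- Mark claims with no supporting evidence as [^?] (flagged for later review)."""

async def add_citations(synthesis: str, findings: list[dict]) -> str:
    # One entry per sub-task: the task line followed by its findings
    sources = "\n".join(f"- {f['task']}\n{f['findings']}" for f in findings)
    resp = await aclient.messages.create(
        model="claude-haiku-4-5",  # a small model is enough for post-processing
        max_tokens=4096,
        system=CITATION_SYSTEM,
        messages=[{"role": "user", "content":
            f"<synthesis>\n{synthesis}\n</synthesis>\n\n<findings>\n{sources}\n</findings>"}],
    )
    # Join every text block; content[0] is not guaranteed to be text.
    return "".join(b.text for b in resp.content if b.type == "text")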

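Takeaway. Separating citation from body writing greatly reduced hallucination. The LLM that writes the body does not try to recall facts and refers only to the findings, while the citation LLM handles nothing but the mapping. A small model is enough for that.

05 · End-state evaluation with an LLM judge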
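Since the Citation Agent's output is itself model-generated, a cheap deterministic check afterwards can catch broken markers. The sketch below is an illustration under assumptions, not part of the described system: `collect_sources` and `check_citations` are hypothetical names, and it assumes `[^N]` is a 1-based index into the list of unique source URLs in first-seen order, which the prompt above does not actually specify.

```python
import re

SOURCE_RE = re.compile(r"\(source:\s*(\S+?)\)")  # matches "(source: URL)"
MARKER_RE = re.compile(r"\[\^(\d+|\?)\]")        # matches "[^3]" or "[^?]"


def collect_sources(findings: list[dict]) -> list[str]:
    """Unique source URLs across all findings, in first-seen order."""
    seen: dict[str, None] = {}
    for f in findings:
        for url in SOURCE_RE.findall(f["findings"]):
            seen.setdefault(url, None)
    return list(seen)


def check_citations(cited_text: str, n_sources: int) -> dict:
    """Count unsupported claims and list markers that point at no source."""
    markers = MARKER_RE.findall(cited_text)
    numbered = [int(m) for m in markers if m != "?"]
    return {
        "unsupported": markers.count("?"),
        "out_of_range": sorted({n for n in numbered if not 1 <= n <= n_sources}),
    }
```

A non-empty `out_of_range` suggests the citation model invented a marker. A high `unsupported` count is a signal to route the answer to review.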

eval

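import re

JUDGE_RUBRIC = """Score the output from 0 to 10 on the following criteria.
- Accuracy (are the claims supported by the sources?)
- Completeness (does it cover every aspect of the question?)
- Citation quality (is the claim ↔ source mapping correct?)

Last line: score=N (integer)."""

async def judge(query: str, output: str) -> int:
    resp = await aclient.messages.create(
        model="claude-opus-4-7",  # use a strong model as the judge
        max_tokens=1024,
        system=JUDGE_RUBRIC,
        messages=[{"role": "user", "content":
            f"<query>{query}</query>\n<output>{output}</output>"}],
    )
    text = "".join(b.text for b in resp.content if b.type == "text")
    # Take the last "score=N" in the reply; tolerant of trailing blank lines or spaces
    matches = re.findall(r"score\s*=\s*(\d+)", text)
    if not matches:
        raise ValueError(f"judge returned no score: {text[-200:]!r}")
    return int(matches[-1])

# Even an evaluation set of n=20 is meaningful as an early signal
async def eval_suite(queries: list[str]) -> float:
    outputs = await asyncio.gather(*[lead_agent(q) for q in queries])
    scores = await asyncio.gather(*[judge(q, o) for q, o in zip(queries, outputs)])
    return sum(scores) / len(scores)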

Takeaway. Give the judge a different model or a different system prompt from the generator to avoid confirmation bias. Clear evaluation criteria matter more than the size of the evaluation set.
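With a small evaluation set, a single mean hides which queries went wrong. The sketch below is one hedged way to summarize judge results per query. `summarize_scores` and its `threshold` parameter are hypothetical, and it assumes scores were collected with `return_exceptions=True`, so failed runs appear as exception objects rather than integers.

```python
from statistics import mean


def summarize_scores(queries: list[str], scores: list, threshold: int = 6) -> dict:
    """Split judge results into a mean, low-scoring queries, and failed runs."""
    valid = [(q, s) for q, s in zip(queries, scores) if isinstance(s, int)]
    failed = [q for q, s in zip(queries, scores) if not isinstance(s, int)]
    return {
        "mean": mean(s for _, s in valid) if valid else None,
        "low": [q for q, s in valid if s < threshold],  # candidates for manual review
        "failed": failed,                               # runs that raised instead of scoring
    }
```

Reading the `low` and `failed` lists by hand is often where the early signal from a 20-query set shows up.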

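In production you still need to add trace infrastructure (OpenTelemetry, Langfuse, etc.), cost-cap alerts, rainbow deployment gates, and HITL checkpoints. The code above is the skeleton of the system; the surrounding infrastructure makes up 90% of production.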
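Of the items above, the cost cap is the easiest to sketch in plain Python. The class below is a minimal illustration under assumptions, not the production mechanism: `TokenBudget`, `BudgetExceeded`, and the 80% alert ratio are all hypothetical. A real system would send the alert to a pager or metrics backend instead of appending to a list.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a run spends more tokens than its cap allows."""


@dataclass
class TokenBudget:
    limit: int                 # hard cap on total tokens for one research run
    alert_ratio: float = 0.8   # assumed soft threshold for a warning
    used: int = 0
    alerts: list = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one response's usage; warn once near the cap, raise past it."""
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")
        if not self.alerts and self.used >= self.limit * self.alert_ratio:
            self.alerts.append(f"token budget at {self.used}/{self.limit}")
```

One budget object would be shared by the Lead and all Subagents of a run and fed after each call with `budget.record(resp.usage.input_tokens, resp.usage.output_tokens)`.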

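Understanding it in one line

Why multi-agent

Core architecture: Orchestrator–Worker

Token economics: the 15x cost

Prompts mattered more than model upgrades

Evaluation is hard: solve it with an LLM judge

Running in production is a new game

The structure in pictures

01 · Architecture: Lead–Subagent Fan-out

02 · Cost: Token economics

03 · Scaling: Effort Scaling

04 · Evaluation: End-state Eval with LLM Judge

Multi-Agent Run

LLM Judge

05 · Operations: Production challenges

Rainbow Deployment

Heisenbug

So when should you use multi-agent?

Where multi-agent fits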