Running a local LLM with llama.cpp: build, model download, serving, benchmark

This is a minimal workflow for running GGUF models directly on your own machine, without a cloud API. The order is build → model download → server → benchmark.

Build

The default build takes two cmake commands: one to configure, one to compile.

# build everything
cmake -B build
cmake --build build --config Release

To pull models straight from Hugging Face, curl support must be enabled. Before passing the -DLLAMA_CURL=ON flag, make sure curl and its development headers are installed on the system.

# for hugging-face, add curl option
# make sure to install curl and its development headers first
sudo apt install curl libcurl4-openssl-dev
cmake -B build -DLLAMA_CURL=ON
cmake --build build --config Release

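If you only need the server binary, there is no need to build everything; name the target instead.

# build subset (run after the configure step, cmake -B build)
cmake --build build --config Release -t llama-server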
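
When the same build is repeated on several machines, the configure and build steps can be wrapped in a small script. The following is a minimal sketch in Python and is not part of llama.cpp: the helper names cmake_commands and run_build and their parameters are hypothetical, and the script only assembles the cmake invocations used in this section.

```python
import subprocess


def cmake_commands(build_dir="build", curl=False, target=None):
    """Return the configure and build commands as argument lists.

    Hypothetical helper: it only mirrors the cmake calls from this section.
    """
    configure = ["cmake", "-B", build_dir]
    if curl:
        # Same flag as in the curl step; libcurl headers must be installed.
        configure.append("-DLLAMA_CURL=ON")

    build = ["cmake", "--build", build_dir, "--config", "Release"]
    if target:
        # Build a single target, e.g. "llama-server", instead of everything.
        build += ["-t", target]
    return [configure, build]


def run_build(**kwargs):
    """Run configure, then build, stopping at the first failing command."""
    for cmd in cmake_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Calling run_build(curl=True, target="llama-server") from the llama.cpp checkout would configure with curl support and then build only the server. Keeping command construction separate from execution makes it easy to print the commands first and check them before anything runs.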

Model download

GGUF files are typically several GB each, so it matters how you fetch them. There are two common ways.

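When cloning the whole repository with git, git-lfs must be installed and enabled so that the large files are fetched instead of small pointer files.

# for large file download from git
sudo apt install git-lfs
git lfs install
git clone {hf repository}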

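If you want to pick specific files or directories, huggingface-cli is more convenient. The basic form below downloads the whole repository into the current directory.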

# or, you can use huggingface-cli
pip install -U "huggingface_hub[cli]"
huggingface-cli download {hf repository name} --local-dir .
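
When only one GGUF file is needed, it can also be fetched over plain HTTPS without extra tooling. The sketch below uses only the Python standard library and relies on the public https://huggingface.co/{repo}/resolve/{revision}/{file} URL layout. The function names hf_file_url and download_file are hypothetical, and the sketch assumes a public repository: it does not handle access tokens for gated models or resuming interrupted downloads.

```python
import shutil
import urllib.request
from urllib.parse import quote


def hf_file_url(repo_id, filename, revision="main"):
    """Build the direct download URL for one file in a Hugging Face repo."""
    # safe="/" keeps the "owner/name" and sub-directory separators intact;
    # the revision is fully escaped so names like "refs/pr/1" stay one segment.
    return "https://huggingface.co/{}/resolve/{}/{}".format(
        quote(repo_id, safe="/"),
        quote(revision, safe=""),
        quote(filename, safe="/"),
    )


def download_file(repo_id, filename, dest, revision="main"):
    """Stream the file to `dest` without loading it fully into memory."""
    url = hf_file_url(repo_id, filename, revision)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

Streaming with shutil.copyfileobj matters here because a multi-GB model should never be read into memory in one piece.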

Use htop to watch memory and CPU core usage while the model downloads and loads.

Starting the server

llama-cpp-python, the Python binding, ships an OpenAI-compatible server out of the box.

# llama-cpp-python
pip install "llama-cpp-python[server]"
python3 -m llama_cpp.server --model {model_gguf_path} --host 0.0.0.0

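Because the server follows the OpenAI spec, clients can use the openai SDK unchanged. Point base_url at the local server, including the /v1 prefix, and pass any value for api_key, since it is not checked by default. (Match the port in base_url to the port the server is actually listening on: llama-cpp-python defaults to 8000, llama-server to 8080.)

from openai import OpenAI

client = OpenAI(
    # llama-cpp-python listens on port 8000 by default; the SDK expects the /v1 prefix
    base_url="http://localhost:8000/v1",
    api_key="none",
)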

stream = client.chat.completions.create(
    model="random",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is your name?"},
    ],
    stream=True,
    temperature=0.9,
    max_tokens=1000,
)

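for chunk in stream:
    # Each chunk carries an incremental delta; its content can be None.
    delta = chunk.choices[0].delta.content
    if delta is not None:
        # flush so tokens appear as they arrive instead of when the buffer fills
        print(delta, end="", flush=True)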

The C++ llama-server built earlier exposes the same OpenAI-compatible endpoints. Use it if you want to serve the model without any Python dependencies.
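
If the openai package is not installed either, the same endpoint can be called with the standard library alone. This is a minimal, non-streaming sketch: the helper names build_chat_request and chat are hypothetical, the "local" model name is a placeholder, and the base URL is assumed to already include the /v1 prefix.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="local", max_tokens=256, temperature=0.9):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,  # placeholder name, as in the SDK example
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,  # one JSON body instead of a token stream
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, messages, **kwargs)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, chat("http://localhost:8080/v1", [{"role": "user", "content": "What is your name?"}]) would query a llama-server instance, assuming it is listening on its default port 8080.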

Benchmark

To measure a model's throughput (tokens/s and similar figures), use llama-bench.

build/bin/llama-bench -m {gguf_path}

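Built binaries are placed under build/bin/, so point at that path directly. Note that llama-bench is only there if you built the whole project or the llama-bench target; a build limited to llama-server does not produce it.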
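
To compare runs across models or machines, it helps to capture the numbers in a script. llama-bench can emit JSON with -o json, and the sketch below reduces that output to one tokens/s figure per test. The helper names summarize and run_bench are hypothetical, and the field names n_prompt, n_gen and avg_ts are assumptions about the JSON layout that should be checked against what your build actually prints.

```python
import json
import subprocess


def summarize(records):
    """Map each test to its average tokens/s.

    Assumes each record has `n_prompt`, `n_gen` and `avg_ts` fields;
    verify this against the JSON your llama-bench build prints.
    """
    summary = {}
    for rec in records:
        n_prompt, n_gen = rec.get("n_prompt", 0), rec.get("n_gen", 0)
        if n_prompt and not n_gen:
            label = "pp{}".format(n_prompt)  # prompt processing test
        elif n_gen and not n_prompt:
            label = "tg{}".format(n_gen)  # text generation test
        else:
            label = "pp{}+tg{}".format(n_prompt, n_gen)
        summary[label] = rec.get("avg_ts")
    return summary


def run_bench(gguf_path, bench_bin="build/bin/llama-bench"):
    """Run llama-bench with JSON output and return the summary."""
    out = subprocess.run(
        [bench_bin, "-m", gguf_path, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return summarize(json.loads(out))
```

run_bench("{gguf_path}") would then return a small dictionary that can be logged or diffed between runs, assuming the JSON is written to stdout on its own.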