npuserver

NPU Server (`npuserver`)

Welcome to the npuserver documentation!

npuserver is a high-performance Python library and Flask backend tailored specifically for running Large Language Models (LLMs) locally on Intel NPUs using OpenVINO GenAI.

This server provides an OpenAI-compatible API for seamless integration with existing tools, robust NPU memory management, and dynamic on-the-fly hardware compilation of Hugging Face models into optimized NPU blobs.

Core Features

🚀 OpenAI-Compatible API: Seamlessly integrate with any existing LLM tooling (like LangChain, AutoGen, or custom frontends) using the standard /v1/chat/completions endpoint. Fully supports real-time Server-Sent Event (SSE) streaming.
🧠 Strict Memory Management: The Intel NPU has limited, highly specialized memory. This server gives you complete explicit control over it. Load and unload models programmatically while aggressively garbage-collecting to prevent NPU memory leaks.
⚡ On-The-Fly Compilation: If a downloaded Hugging Face model hasn’t been compiled for the NPU, the server intelligently intercepts the load request and dynamically compiles an optimized OpenVINO .blob before serving it.
🚫 No Background Downloads: To prevent runaway bandwidth usage and unexpected latency, the server strictly enforces that models must be downloaded locally before it attempts to load or compile them.

📦 Installation

Ensure you have Python installed and your Intel NPU drivers configured properly on Windows.

1. Clone the repository

git clone https://github.com/durgasai299792458/npuserver.git
cd npuserver

2. Setup a virtual environment

python -m venv venv
venv\Scripts\activate

3. Install the package locally

pip install -e .

Required Core Dependencies: openvino-genai, flask, huggingface-hub

🛠️ Getting Started

Starting the Server

The server runs on Flask. You can spin it up programmatically using the library or via a Python script:

from npuserver import run_server

# Starts the NPU backend on port 8080
run_server(port=8080)

By default, the npuserver binds to localhost on port 8080. All endpoint paths in this documentation are relative to this base URL: http://localhost:8080

Next Steps

Head over to the Usage Examples to see how to load models and start chatting with the server.
Explore the Endpoints & Routes section for a deep dive into the raw HTTP requests and JSON responses.