Deploying the Magistral vLLM Server on Modal with Python

This guide provides a step-by-step walkthrough for Python beginners to build, deploy, and test a Magistral reasoning model using Modal and vLLM. It covers setting up the Modal environment, creating a vLLM server with GPU acceleration, and deploying it securely to the cloud. The tutorial also details how to test the deployed server using both cURL and the OpenAI SDK.
Introduction to Modal and vLLM
Modal is a serverless platform that simplifies the process of running code remotely, including applications that require GPUs, web endpoints, and scheduled jobs. It's particularly beneficial for beginners and those who wish to avoid complex cloud infrastructure management. vLLM is an open-source library for fast and efficient LLM inference and serving.
1. Setting Up Modal
To begin, install the Modal Python client:
pip install modal
Next, set up Modal on your local machine by running:
python -m modal setup
This command will guide you through account creation and device authentication. For secure endpoint access, set a VLLM_API_KEY environment variable using a Modal Secret:
modal secret create vllm-api VLLM_API_KEY=your_actual_api_key_here
Replace your_actual_api_key_here with your chosen API key.
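A Modal Secret surfaces as an ordinary environment variable inside any function that attaches it. The short sketch below illustrates the pattern, assuming the secret is named vllm-api as above; the app and function here are purely hypothetical examples.

```python
import os

import modal

app = modal.App("secret-demo")  # hypothetical app, for illustration only


# Attaching the "vllm-api" Secret injects VLLM_API_KEY into the container environment.
@app.function(secrets=[modal.Secret.from_name("vllm-api")])
def check_key():
    api_key = os.environ["VLLM_API_KEY"]
    print(f"VLLM_API_KEY is set ({len(api_key)} characters)")
```

The deployment script in the next section uses the same mechanism to pass the API key to the vLLM server.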
2. Creating the vLLM Application with Modal
This section focuses on building a scalable vLLM inference server on Modal. It involves using a custom Docker image, persistent storage, and GPU acceleration. The tutorial uses the mistralai/Magistral-Small-2506 model.
Key steps include:
- Defining the vLLM Image: Create a Docker image based on Debian Slim with Python 3.12, installing necessary packages such as vllm, huggingface_hub, and flashinfer-python. Environment variables are set for faster model transfers and optimized inference.
- Utilizing Modal Volumes: Two Modal Volumes are created, one for Hugging Face models and one for the vLLM cache, to avoid repeated downloads and speed up cold starts.
- Configuring the Model: Specify the model name (mistralai/Magistral-Small-2506) and revision (48c97929837c3189cb3cf74b1b5bc5824eef5fcc). The vLLM V1 engine is enabled for improved performance.
- Setting Up the Modal App: Configure the Modal app with GPU resources (e.g., A100:2), scaling parameters, timeouts, storage, and secrets. Concurrent requests per replica are limited for stability.
- Creating a Web Server: A web server is set up using the Python subprocess library to execute the command that runs the vLLM server.
The tutorial provides the full Python code for this setup, defining the vllm_image, the model details, the Modal Volumes, and the serve function that launches the vLLM server; a minimal sketch of its structure is shown below.
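The sketch below assumes recent versions of Modal and vLLM; the package list, timeouts, and concurrency limit are placeholder assumptions, and the app and function names are inferred from the deployment URL used later in this guide. Treat it as an outline of the structure rather than the exact script from the tutorial (that is available in the linked GitHub repository).

```python
import subprocess

import modal

# Container image: Debian Slim + Python 3.12 with vLLM, Hugging Face Hub, and FlashInfer.
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm", "huggingface_hub[hf_transfer]", "flashinfer-python")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1", "VLLM_USE_V1": "1"})  # fast downloads, V1 engine
)

MODEL_NAME = "mistralai/Magistral-Small-2506"
MODEL_REVISION = "48c97929837c3189cb3cf74b1b5bc5824eef5fcc"

# Persistent volumes so model weights and the vLLM cache survive cold starts.
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)

app = modal.App("magistral-small-vllm")


@app.function(
    image=vllm_image,
    gpu="A100:2",                 # two A100s, matching --tensor-parallel-size below
    timeout=15 * 60,              # placeholder: allow time for the first model load
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
    secrets=[modal.Secret.from_name("vllm-api")],  # exposes VLLM_API_KEY
)
@modal.concurrent(max_inputs=32)                   # cap concurrent requests per replica
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    import os

    # Launch the OpenAI-compatible vLLM server as a background process on port 8000.
    cmd = [
        "vllm", "serve", MODEL_NAME,
        "--revision", MODEL_REVISION,
        "--host", "0.0.0.0",
        "--port", "8000",
        "--api-key", os.environ["VLLM_API_KEY"],
        "--tensor-parallel-size", "2",
    ]
    subprocess.Popen(cmd)
```

Here @modal.web_server tells Modal which port the container exposes, and subprocess.Popen keeps the vLLM process running in the background while Modal routes requests to it; @modal.concurrent is how recent Modal versions cap concurrent requests per replica, so the exact decorators may differ slightly by version.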
3. Deploying the vLLM Server on Modal
Once the vllm_inference.py file is prepared, deployment is a single command:
modal deploy vllm_inference.py
Modal builds the container image and deploys the application. The output will include a URL for the deployed web function, such as https://abidali899--magistral-small-vllm-serve.modal.run.
After deployment, the server downloads and loads model weights, which can take several minutes. You can monitor the process in your Modal dashboard.
Testing the Deployment:
Once the server is running, you can test it with cURL to list the available models:
curl -X 'GET' \
'https://abidali899--magistral-small-vllm-serve.modal.run/v1/models' \
-H 'accept: application/json' \
-H 'Authorization: Bearer <api-key>'
This command confirms that the mistralai/Magistral-Small-2506 model is loaded and accessible.
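The same check can be scripted in Python. The snippet below is a small sketch (not part of the original tutorial) that calls the same /v1/models endpoint with the requests library; the base URL and API key are placeholders.

```python
import requests

BASE_URL = "https://abidali899--magistral-small-vllm-serve.modal.run"
API_KEY = "your_actual_api_key_here"  # the key stored in the vllm-api Modal Secret

# Query the OpenAI-compatible /v1/models endpoint to confirm the model is loaded.
response = requests.get(
    f"{BASE_URL}/v1/models",
    headers={"accept": "application/json", "Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
for model in response.json()["data"]:
    print(model["id"])  # expected: mistralai/Magistral-Small-2506
```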
An image of the Modal dashboard showing deployment status is included.
The API documentation is available at the deployment URL followed by /docs, allowing direct testing of the endpoints.
An image showing the API documentation interface is included.
4. Using the vLLM Server with OpenAI SDK
Interact with your vLLM server using the OpenAI-compatible endpoints and the OpenAI Python SDK.
Steps:
- Create a .env file with your API key:
VLLM_API_KEY=your-actual-api-key-here
- Install the necessary packages:
pip install python-dotenv openai
- Create a `client.py` file to test functionality such as chat completions and streaming responses. The script includes examples for simple completion, synchronous streaming, and asynchronous streaming.
**Example `client.py`** (the function bodies below are minimal sketches of the three examples, with illustrative prompts):
```python
import asyncio
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI

load_dotenv()
api_key = os.getenv("VLLM_API_KEY")
BASE_URL = "https://abidali899--magistral-small-vllm-serve.modal.run/v1"
MODEL_NAME = "mistralai/Magistral-Small-2506"

client = OpenAI(api_key=api_key, base_url=BASE_URL)


def run_simple_completion():
    # One-shot chat completion
    messages = [{"role": "user", "content": "What is serverless computing?"}]
    response = client.chat.completions.create(model=MODEL_NAME, messages=messages)
    print(response.choices[0].message.content)


def run_streaming():
    # Print tokens as they arrive from a synchronous stream
    messages = [{"role": "user", "content": "Write a haiku about the ocean."}]
    stream = client.chat.completions.create(model=MODEL_NAME, messages=messages, stream=True)
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()


async def run_async_streaming():
    # Stream tokens using the asynchronous client
    async_client = AsyncOpenAI(api_key=api_key, base_url=BASE_URL)
    messages = [{"role": "user", "content": "Explain GPU inference in one sentence."}]
    stream = await async_client.chat.completions.create(model=MODEL_NAME, messages=messages, stream=True)
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()


if __name__ == "__main__":
    run_simple_completion()
    run_streaming()
    asyncio.run(run_async_streaming())
```
- Execute the test script:
python client.py
The output demonstrates successful and fast response generation.
An image showing the output of the client.py script is included.
Conclusion
Modal is a versatile platform suitable for various applications, from simple Python scripts to complex ML training and deployments. Its serverless nature abstracts away infrastructure complexities, making it ideal for users who want to deploy applications quickly without managing servers, storage, or networking. The platform supports tasks beyond serving endpoints, such as fine-tuning LLMs remotely.
The author, Abid Ali Awan, is a certified data scientist focused on content creation and technical blogging in AI and data science. His vision includes developing an AI product for mental health support.
Further Reading
The article lists several related posts on topics like multi-modal deep learning, server performance optimization, and deploying machine learning models.
Resources
- GitHub Repository: kingabzpro/Deploying-the-Magistral-with-Modal
- Modal Documentation: modal.com
- vLLM: vllm.ai
Original article available at: https://www.kdnuggets.com/deploying-the-magistral-vllm-server-on-modal