
RedGX

Redis GPU eXchange

Keeps your AI services steady under heavy traffic.

v1.0 (Phases 1-5 implementation complete)

REST API + Native hot-path acceleration + Asynchronous isolation of external inference engines

Architecture

Client requests pass through Nginx and enter the RedGX API Router. Regular CPU-bound REST requests are processed immediately in the main Redis, while GPU requests are brokered asynchronously through redis-gpu's 3-stage pipeline (Inbox/Queue -> Processing -> Outbox/Cache). In production, only a single inference container and its corresponding worker are activated, according to the planned service type.

The Problem We Solve

Running heavy AI inference directly on a regular web or application server is risky. Loading an AI model takes a long time and consumes a large amount of GPU memory (VRAM), and a sudden surge in user requests can exhaust memory (OOM) and halt the main server itself.

RedGX stores each AI processing request in a queue as soon as it arrives. An independent background worker then forwards the request to an isolated AI engine, and the finished result is stored in a separate outbox that the client can query at any time. This keeps the AI computation load from affecting the main service, so the service as a whole keeps running reliably around the clock.

Usage Example

GPU Embedding Async Call (curl)

# 1) Submit embedding request (returns 202 immediately with req_id)
#    Note: -k skips TLS certificate verification; use it only with self-signed test certificates.
curl -k -X POST https://gateway/api/v1/ns/shared/gpu/embed \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["Sentence 1", "Sentence 2"]}'
# → {"ok": true, "data": {"req_id": "..."}}

# 2) Query result (retrieved from the Outbox after the worker has processed the request)
#    Set REQ_ID to the req_id value returned in step 1.
curl -k https://gateway/api/v1/ns/shared/gpu/embed/$REQ_ID \
  -H "X-API-Key: $API_KEY"
# → {"ok": true, "data": {"status": "done", "vectors": [[...], [...]]}}

All requests are authenticated using the X-API-Key header, and read/write permissions can be controlled per namespace. Responses are consistently returned in the {"ok": true, "data": {...}} format.

Standard API Interface (8 Areas — Same as RedGW)

KV: Redis String
Map: Redis Hash
Queue: Redis List
Group: Redis Set
Rank: Redis Sorted Set
Event: Redis Stream + Consumer Group
Pub/Sub: Real-time + WebSocket subscription
Admin: Key management · Metrics · Clients

The endpoint structure is the same as RedGW: /api/v1/ns/{namespace}/{resource}/{key}. Detailed structures can be found on the RedGW introduction page.

GPU Operation Brokering (Supported Model Examples)

Task → Engine | Use | Sample Model | VRAM
embedding → TEI | Sentence/Document Embedding (Search/RAG) | bge-m3, etc. | ~2 GB
generation → vLLM | Text Generation, Summarization, Classification | EXAONE, etc. | ~2.4 GB
translation → NLLB (Custom) | Multilingual Translation | NLLB-200 | ~3 GB
stt → faster-whisper | Speech to Text | whisper large-v3 | ~3 GB
ocr → PaddleOCR (Custom) | Document Image to Text | PaddleOCR v5 | ~500 MB

Call Method: Send a request to POST /api/v1/ns/{ns}/gpu/{task} to obtain a request ID (req_id), then retrieve the computation result with GET /api/v1/ns/{ns}/gpu/{task}/{req_id}. RedGX can integrate different AI engines depending on infrastructure and requirements; in production, the standard setup connects only the single AI model/engine that the service needs.
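The two-step call pattern can be sketched as a small Python client. This is a minimal illustration, not an official RedGX client: the function names, the injectable transport, the polling interval, and the timeout are assumptions made for the example. The host, the X-API-Key header, and the "status": "done" field come from the curl example. How the gateway answers while a job is still running is not documented here, so the sketch simply assumes it returns a JSON envelope whose status is something other than "done".

```python
import json
import time
import urllib.request
from typing import Callable, Optional

BASE = "https://gateway/api/v1"  # host from the curl example; adjust for your deployment


def gpu_url(ns: str, task: str, req_id: Optional[str] = None) -> str:
    """Build /api/v1/ns/{ns}/gpu/{task} and, when given, append /{req_id}."""
    url = f"{BASE}/ns/{ns}/gpu/{task}"
    return f"{url}/{req_id}" if req_id else url


def http_json(method: str, url: str, api_key: str, body: Optional[dict] = None) -> dict:
    """Default transport: send one JSON request authenticated with X-API-Key."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def submit_and_wait(ns: str, task: str, payload: dict, api_key: str,
                    transport: Callable = http_json,
                    poll_s: float = 0.5, timeout_s: float = 30.0) -> dict:
    """POST the job, then poll the GET endpoint until status is 'done' or the timeout expires."""
    submitted = transport("POST", gpu_url(ns, task), api_key, payload)
    req_id = submitted["data"]["req_id"]
    deadline = time.monotonic() + timeout_s
    while True:
        result = transport("GET", gpu_url(ns, task, req_id), api_key)
        if result.get("ok") and result["data"].get("status") == "done":
            return result["data"]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"request {req_id} not finished after {timeout_s}s")
        time.sleep(poll_s)  # assumed polling interval; tune to your latency budget
```

A call such as submit_and_wait("shared", "embed", {"inputs": ["Sentence 1"]}, api_key) would then return the data object of the finished job. The transport is passed in as a parameter so the polling logic can be exercised without a live gateway.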

Inference Server Isolation — Inviolable Principle

RedGX's workers do not load AI models directly into their internal memory. All AI computations run in a completely isolated, dedicated AI engine container, and the worker merely acts as a bridge that transmits processing requests via HTTP communication.

Fault Prevention — Even if an error occurs or an Out of Memory (OOM) situation arises during AI inference, the main web/app server continues to operate safely without any disruption.
Fast Updates — When server settings are changed and restarted, they are applied immediately without the wait time required to reload massive AI models.
Inference Speed Optimization — Makes full use of the inference engine's built-in speed-optimization features, such as batching high-volume requests.
Easy Model Replacement — Switch to the model you want just by swapping the inference container — no changes to the server itself.

The AI inference server is either an open-source engine specialized for high-volume processing (vLLM, TEI, etc.) or a custom inference server tailored to the situation. In production, only one AI engine is connected, chosen according to the required function; the five functions and model names shown above are representative examples configured for integration testing.
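The bridge role of the worker can be illustrated with a short sketch. It is an illustration under stated assumptions, not RedGX's worker code: the engine address http://tei:80, the function names, and the "error" status are hypothetical. The /embed route with an "inputs" field follows TEI's public HTTP API. The point is that the worker process imports no model library at all; it only builds an HTTP request and turns engine failures into a result record.

```python
import json
import urllib.request

ENGINE_URL = "http://tei:80/embed"  # hypothetical address of the isolated engine container


def build_engine_request(inputs: list) -> urllib.request.Request:
    """Serialize a batch of texts into the HTTP request sent to the engine."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        ENGINE_URL, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def call_engine(inputs: list, timeout_s: float = 30.0,
                opener=urllib.request.urlopen) -> dict:
    """Forward a batch to the engine over HTTP.

    Network errors, HTTP errors, and timeouts are all OSError subclasses,
    so an engine crash or OOM surfaces here as an error result instead of
    taking the worker process down.
    """
    try:
        with opener(build_engine_request(inputs), timeout=timeout_s) as resp:
            return {"status": "done", "vectors": json.load(resp)}
    except OSError as exc:
        return {"status": "error", "error": str(exc)}  # "error" status is an assumption
```

Because the model lives only in the engine container, swapping the model means pointing ENGINE_URL at a different container; nothing in the worker changes.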

Features

Easy Data Management — Includes all of RedGW's existing features (8 areas such as KV, Map, and Queue), so managing data stays simple.
Broad AI Operations Support — Asynchronously handles a range of AI tasks: sentence-similarity (embedding), text generation, translation, speech recognition (STT), and optical character recognition (OCR).
Stable 3-Stage Processing — Operates securely step-by-step in the order of [Queue Registration] → [Batch Task Processing] → [Result Storage].
Automatic Batch Processing — Efficiently processes requests by automatically grouping them when a certain number accumulates or a time limit passes.
Server Overload Prevention (Backpressure) — Checks the queue size in real-time to prevent server crashes that can occur when requests pile up excessively.
Secure Access Control — Issues dedicated API keys per service name (namespace) and restricts access only to allowed IPs.
Secure Transport & Rate Control — Provides encrypted HTTPS connections and keeps any single client from flooding the service with requests.
At-a-Glance Monitoring — Monitors queue sizes and processing speeds in real-time using visual graphs (Prometheus + Grafana).
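The 3-stage processing and backpressure features listed above can be sketched with plain Python containers standing in for Redis. This is a simplified in-memory model, not RedGX's actual Redis schema: the class, the method names, the QueueFull exception, and the MAX_INBOX threshold are all hypothetical.

```python
import uuid
from collections import deque

MAX_INBOX = 1000  # hypothetical backpressure threshold


class QueueFull(Exception):
    """Raised when the inbox is at capacity; an API layer could map this to an HTTP error."""


class Pipeline:
    def __init__(self, max_inbox: int = MAX_INBOX):
        self.max_inbox = max_inbox
        self.inbox = deque()   # stage 1: accepted requests waiting for a worker
        self.processing = {}   # stage 2: requests claimed by a worker
        self.outbox = {}       # stage 3: finished results awaiting pickup

    def submit(self, payload) -> str:
        """Queue registration, rejected up front when the inbox is full (backpressure)."""
        if len(self.inbox) >= self.max_inbox:
            raise QueueFull(f"inbox already holds {len(self.inbox)} requests")
        req_id = uuid.uuid4().hex
        self.inbox.append((req_id, payload))
        return req_id

    def claim(self, n: int) -> list:
        """Move up to n requests from inbox to processing, so a claimed request stays tracked."""
        batch = []
        while self.inbox and len(batch) < n:
            req_id, payload = self.inbox.popleft()
            self.processing[req_id] = payload
            batch.append((req_id, payload))
        return batch

    def complete(self, req_id: str, result) -> None:
        """Result storage: drop the request from processing and publish its result."""
        self.processing.pop(req_id)
        self.outbox[req_id] = result

    def fetch(self, req_id: str):
        """Return the stored result, or None if the request has not finished."""
        return self.outbox.get(req_id)
```

Keeping claimed requests in a separate processing stage, rather than only in worker memory, is what would let a monitor find and retry tasks that stalled when an engine went down.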

Product Specifications

Version: v1.0 (Feature implementation complete)
License: Private (Internal company project)
Sister Project: RedGW — /en/software/redgw/
Execution Mode: Separated operation of Web API and Background Workers
Storage: Redis for base data + Separate Redis for AI computation processing
AI Engine Integration: Supports integration with specialized inference engines such as vLLM and TEI (in production, connect only the one engine required)
Security & Relay: Encrypted connections and rate limiting via Nginx
Status Monitoring: Real-time service metrics and queue monitoring
AI Model Independence: The gateway never loads AI models in-process; all inference is delegated to independent external engines, isolating the system from model-side failures.

Security & Compliance

License: Private (Internal company project)
Operating Premise: Closed Network Support — Both the AI models and runtime images are staged into the internal network ahead of deployment, with no outbound network dependency at runtime
Auth & Access Control: Client authentication via X-API-Key header and granular read/write permission control per namespace
Safe Isolation: Independent architecture where workers do not directly execute AI models, preventing AI failures from cascading to the gateway service
Transport & Limits: Provides Nginx-based HTTPS connections plus per-client rate limiting to block abusive request bursts
Security Measures Inquiry: Contact info@cubiware.co.kr (Complies with internal security guidelines)

Getting Started

Define your use case — Decide which AI functions you need (embedding, text generation, translation, etc.) and shortlist candidate models
Installation & Environment Config — Pre-import server container images and the AI models to be integrated, then configure connections
Observation & Monitoring — Monitor real-time request inflow metrics and processing status using visualization tools (Prometheus + Grafana)

During service integration, we provide guidance on tuning the batch size (batch_size) and maximum wait time (max_wait_ms) to suit your usage patterns.
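How the two parameters interact can be shown with a minimal batching sketch. It uses Python's standard queue module as a stand-in for the Redis-backed queue and is not RedGX's implementation; only the parameter names batch_size and max_wait_ms come from this page, and the function name is hypothetical.

```python
import queue
import time


def collect_batch(q: queue.Queue, batch_size: int, max_wait_ms: int) -> list:
    """Block for the first item, then gather more until the batch is full
    or max_wait_ms has elapsed since the first item arrived."""
    batch = [q.get()]  # wait as long as needed for the first request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time limit reached: send a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

A larger batch_size lets the engine process more requests per call, while max_wait_ms caps the extra latency a request can pick up waiting for the batch to fill, so the two are tuned together against the traffic pattern.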

Recent Changes

v1.0.0 Official Release — Web API brokering and asynchronous AI computation request integration implementation completed
Introduced a secure 3-stage (Queue-Process-Outbox) processing method to handle high-volume requests
Added safeguards that guarantee no requests are lost during enqueue/dequeue and apply backpressure to absorb traffic spikes gracefully

Roadmap

Create speed optimization guides for AI inference engines (TEI, vLLM, etc.) tailored to real production environments
Improve monitoring features to detect stalled tasks and automatically retry them when unexpected AI engine downtime occurs
Enhance memory utilities to continuously monitor graphics card memory occupancy and automatically clean up old processing results

Considering Cubiware for your organization?

We will guide you through setup and rollout tailored to your requirements and operating environment. Reach out for a demo or a proposal.