SOFTWARE
RedGX
Redis GPU eXchange
Keeps your AI services steady under heavy traffic.
Architecture
The Problem We Solve
Running heavy AI inference directly on a general-purpose web or app server is risky. Loading a model takes a long time and consumes a large amount of GPU memory (VRAM), and a sudden surge in user requests can exhaust memory (OOM) and halt the main server itself. RedGX places each AI processing request in a queue as soon as it is received. An independent background worker then forwards the request to an isolated AI engine and stores the finished result in a separate outbox, where the client can retrieve it at any time. Because the inference load never touches the main service, the service keeps running even when the AI engine is slow or fails.
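The flow described above can be sketched in a few lines. This is a minimal in-memory illustration, not RedGX's implementation: the `deque` and `dict` stand in for the Redis-backed queue and outbox, and the names `submit`, `worker_step`, `fetch`, `fake_engine`, plus the `pending` and `error` status values, are assumptions made for the example.

```python
import uuid
from collections import deque

# Hypothetical in-memory stand-ins for the Redis-backed queue and outbox.
queue = deque()
outbox = {}

def submit(payload):
    """Enqueue a request and return its ID at once, without running inference."""
    req_id = uuid.uuid4().hex
    queue.append((req_id, payload))
    return req_id

def worker_step(engine):
    """Take one request off the queue, call the isolated engine, store the result."""
    if not queue:
        return False
    req_id, payload = queue.popleft()
    try:
        outbox[req_id] = {"status": "done", "result": engine(payload)}
    except Exception as exc:
        # An engine failure is recorded for this request only; the API side is unaffected.
        outbox[req_id] = {"status": "error", "error": str(exc)}
    return True

def fetch(req_id):
    """Return the stored result, or a pending marker if the worker has not finished."""
    return outbox.get(req_id, {"status": "pending"})

def fake_engine(payload):
    # Placeholder for the external inference container: returns input lengths.
    return [len(text) for text in payload["inputs"]]

req_id = submit({"inputs": ["Sentence 1", "Sentence 2"]})
before = fetch(req_id)       # worker has not run yet
worker_step(fake_engine)
after = fetch(req_id)        # result is now in the outbox
```

The point of the split is that `submit` and `fetch` never call the engine, so a slow or failing engine only delays or fails individual entries in the outbox.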
Usage Example
GPU Embedding Async Call (curl)
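# 1) Submit embedding request (returns 202 immediately with req_id)
# Note: -k skips TLS certificate verification; omit it when the gateway presents a trusted certificate.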
curl -k -X POST https://gateway/api/v1/ns/shared/gpu/embed \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"inputs": ["Sentence 1", "Sentence 2"]}'
# → {"ok": true, "data": {"req_id": "..."}}
# 2) Query the result (read from the outbox once the worker has finished)
# Set REQ_ID to the req_id value returned in step 1 before running this command.
curl -k https://gateway/api/v1/ns/shared/gpu/embed/$REQ_ID \
-H "X-API-Key: $API_KEY"
# → {"ok": true, "data": {"status": "done", "vectors": [[...], [...]]}}
All requests are authenticated using the X-API-Key header, and read/write permissions can be controlled per namespace. Responses are consistently returned in the {"ok": true, "data": {...}} format.
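A client can handle the header and the response envelope in one place. The sketch below assumes only what is stated above (the `X-API-Key` header and the `{"ok": true, "data": {...}}` shape). The helper names `auth_headers`, `unwrap`, and `ApiError` are hypothetical, `abc123` is a placeholder ID, and because the shape of a failed response is not documented here, the whole envelope is passed along in the exception.

```python
import json

class ApiError(Exception):
    """Raised when the envelope does not report ok == true."""

def auth_headers(api_key):
    """Build the request headers used by every call."""
    return {"X-API-Key": api_key, "Content-Type": "application/json"}

def unwrap(body):
    """Parse a JSON response body and return its "data" member."""
    envelope = json.loads(body)
    if envelope.get("ok") is not True:
        # The error shape is not specified in this document, so keep the full envelope.
        raise ApiError(envelope)
    return envelope.get("data", {})

data = unwrap('{"ok": true, "data": {"req_id": "abc123"}}')
```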
Standard API Interface (8 Areas — Same as RedGW)
- KV: Redis String
- Map: Redis Hash
- Queue: Redis List
- Group: Redis Set
- Rank: Redis SortedSet
- Event: Redis Stream + Consumer Group
- Pub/Sub: Real-time + WebSocket subscription
- Admin: Key management · Metrics · Clients
The endpoint structure is the same as RedGW:
/api/v1/ns/{namespace}/{resource}/{key}.
Detailed structures can be found on the RedGW introduction page.
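As a small illustration of the path template, the helper below assembles a path from its parts. It is a sketch under assumptions: `resource_path` is a hypothetical name, the `kv` resource segment and the `user:1` key are made-up examples, and percent-encoding of keys is a client-side choice that the source does not confirm.

```python
from urllib.parse import quote

def resource_path(namespace, resource, key=None):
    """Build /api/v1/ns/{namespace}/{resource}/{key}, encoding each segment."""
    parts = ["api", "v1", "ns", namespace, resource]
    if key is not None:
        parts.append(key)
    # safe="" also encodes "/" so a key cannot introduce extra path segments.
    return "/" + "/".join(quote(part, safe="") for part in parts)

collection = resource_path("shared", "gpu")
item = resource_path("shared", "kv", "user:1")
```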
GPU Operation Brokering (Supported Model Examples)
- embedding → TEI: Sentence/Document Embedding (Search/RAG)
- generation → vLLM: Text Generation, Summarization, Classification
- translation → NLLB (Custom): Multilingual Translation
- stt → faster-whisper: Speech to Text
- ocr → PaddleOCR (Custom): Document Image to Text
Call Method:
Send POST /api/v1/ns/{ns}/gpu/{task} to obtain a request ID (req_id), then retrieve the computation result with
GET /api/v1/ns/{ns}/gpu/{task}/{req_id}.
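The two-step call can be wrapped in a submit-then-poll helper. This is a sketch, not an official client: `submit_and_wait` and its parameters are hypothetical, the transport is injected as a `call(method, path, payload)` function so the example runs without a gateway, and only the `done` status comes from the example responses in this document (any other status is treated as "not finished yet").

```python
import time

def submit_and_wait(call, ns, task, payload, timeout_s=30.0, interval_s=0.5,
                    sleep=time.sleep, now=time.monotonic):
    """POST the request, then poll the result endpoint until status is "done"."""
    base = f"/api/v1/ns/{ns}/gpu/{task}"
    req_id = call("POST", base, payload)["data"]["req_id"]
    deadline = now() + timeout_s
    while True:
        data = call("GET", f"{base}/{req_id}", None)["data"]
        if data.get("status") == "done":
            return data
        if now() >= deadline:
            raise TimeoutError(f"request {req_id} not finished after {timeout_s}s")
        sleep(interval_s)

def make_fake_call():
    """Stand-in transport: reports a non-final status twice, then a finished result."""
    polls = {"n": 0}
    def call(method, path, payload):
        if method == "POST":
            return {"ok": True, "data": {"req_id": "r1"}}
        polls["n"] += 1
        if polls["n"] < 3:
            return {"ok": True, "data": {"status": "pending"}}
        return {"ok": True, "data": {"status": "done", "vectors": [[0.1], [0.2]]}}
    return call

result = submit_and_wait(make_fake_call(), "shared", "embed",
                         {"inputs": ["a", "b"]}, interval_s=0,
                         sleep=lambda seconds: None)
```

In real use, `call` would send the HTTP request with the `X-API-Key` header and return the parsed JSON body.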
RedGX can integrate different AI engines to suit the infrastructure and requirements. In production, the standard setup connects only the single AI model or engine the service needs.
Inference Server Isolation — Inviolable Principle
RedGX's workers do not load AI models directly into their internal memory. All AI computations run in a completely isolated, dedicated AI engine container, and the worker merely acts as a bridge that transmits processing requests via HTTP communication.
- Fault Prevention — Even if an error occurs or an Out of Memory (OOM) situation arises during AI inference, the main web/app server continues to operate safely without any disruption.
- Fast Updates — When server settings change, the server restarts and applies them immediately, without waiting for massive AI models to reload.
- Inference Speed Optimization — Makes full use of the inference engine's built-in speed-optimization features, such as batching high-volume requests.
- Easy Model Replacement — Switch to the model you want just by swapping the inference container — no changes to the server itself.
The AI inference server is either an open-source engine specialized for high-volume processing (vLLM, TEI, etc.) or a custom inference server tailored to the situation. In production, only one AI engine is connected, chosen for the required function. The 5 functions and model names shown above are representative examples configured for integration testing.
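The bridge role of the worker can be illustrated as follows. This is a sketch under assumptions, not RedGX's worker code: `make_http_engine` is a hypothetical helper, the request and response bodies are assumed to be JSON, and the engine URL is whatever the deployment configures. The worker holds only a URL, so swapping the inference container means changing that URL.

```python
import json
import urllib.request

def make_http_engine(url, timeout_s=30.0, opener=urllib.request.urlopen):
    """Return a callable that forwards a payload to an inference container over HTTP."""
    def engine(payload):
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # No model is loaded in this process; the worker only relays bytes.
        with opener(request, timeout=timeout_s) as response:
            return json.loads(response.read().decode("utf-8"))
    return engine
```

The `opener` parameter exists only so the function can be exercised without a running engine; by default it uses the standard `urllib.request.urlopen`.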
Features
- Easy Data Management — Includes all of RedGW's existing features (8 areas such as KV, Map, and Queue), so managing data stays simple.
- Broad AI Operations Support — Asynchronously handles a range of AI tasks: sentence-similarity (embedding), text generation, translation, speech recognition (STT), and optical character recognition (OCR).
- Stable 3-Stage Processing — Operates securely step-by-step in the order of [Queue Registration] → [Batch Task Processing] → [Result Storage].
- Automatic Batch Processing — Efficiently processes requests by automatically grouping them when a certain number accumulates or a time limit passes.
- Server Overload Prevention (Backpressure) — Checks the queue size in real-time to prevent server crashes that can occur when requests pile up excessively.
- Secure Access Control — Issues dedicated API keys per service name (namespace) and restricts access only to allowed IPs.
- Secure Transport & Rate Control — Provides encrypted HTTPS connections and keeps any single client from flooding the service with requests.
- At-a-Glance Monitoring — Monitors queue sizes and processing speeds in real-time using visual graphs (Prometheus + Grafana).
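The backpressure item above amounts to checking queue depth before accepting work. The sketch below shows the idea with an in-memory queue. `BoundedQueue`, `QueueFull`, and `max_pending` are hypothetical names, and how the gateway actually reports a rejection to the client is not stated in this document.

```python
from collections import deque

class QueueFull(Exception):
    """Raised when the queue already holds the maximum number of pending requests."""

class BoundedQueue:
    def __init__(self, max_pending):
        self.max_pending = max_pending
        self._items = deque()

    def enqueue(self, item):
        """Accept the item only if the queue is below its limit; return the new depth."""
        if len(self._items) >= self.max_pending:
            raise QueueFull(f"{len(self._items)} requests already pending")
        self._items.append(item)
        return len(self._items)

    def dequeue(self):
        """Return the oldest item, or None when the queue is empty."""
        return self._items.popleft() if self._items else None

q = BoundedQueue(max_pending=2)
accepted, rejected = [], []
for i in range(4):
    try:
        q.enqueue(i)
        accepted.append(i)
    except QueueFull:
        rejected.append(i)
```

Rejecting early keeps the backlog bounded, so a traffic spike costs some clients a retry instead of exhausting memory.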
Product Specifications
- Version: v1.0 (Feature implementation complete)
- License: Private (Internal company project)
- Sister Project: RedGW (/en/software/redgw/)
- Execution Mode: Separated operation of Web API and Background Workers
- Storage: Redis for base data + Separate Redis for AI computation processing
- AI Engine Integration: Supports integration with specialized inference engines (vLLM, TEI, etc.) (Connect only one required engine in production)
- Security & Relay: Encrypted connections and rate limiting via Nginx
- Status Monitoring: Real-time service metrics and queue monitoring
- AI Model Independence: The gateway never loads AI models in-process; all inference is delegated to independent external engines, isolating the system from model-side failures.
Security & Compliance
- License: Private (Internal company project)
- Operating Premise: Closed Network Support. Both the AI models and runtime images are staged into the internal network ahead of deployment, with no outbound network dependency at runtime
- Auth & Access Control: Client authentication via X-API-Key header and granular read/write permission control per namespace
- Safe Isolation: Independent architecture where workers do not directly execute AI models, preventing AI failures from cascading to the gateway service
- Transport & Limits: Provides Nginx-based HTTPS connections plus per-client rate limiting to block abusive request bursts
- Security Measures Inquiry: Contact info@cubiware.co.kr (Complies with internal security guidelines)
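The access rules listed above (API key, allowed IPs, read/write permission per namespace) combine into a single check. The following is a conceptual sketch only: `KeyPolicy`, `is_allowed`, the field names, and the sample key and network values are all assumptions, and RedGX's real policy storage and evaluation order are not described in this document.

```python
import hmac
import ipaddress
from dataclasses import dataclass, field

@dataclass
class KeyPolicy:
    api_key: str
    permissions: dict                      # namespace -> set of "read" / "write"
    allowed_networks: list = field(default_factory=list)  # CIDR strings

def is_allowed(policy, presented_key, client_ip, namespace, action):
    """Return True only if key, source IP, and namespace permission all match."""
    # Constant-time comparison avoids leaking key contents through timing.
    if not hmac.compare_digest(policy.api_key.encode(), presented_key.encode()):
        return False
    address = ipaddress.ip_address(client_ip)
    if not any(address in ipaddress.ip_network(net) for net in policy.allowed_networks):
        return False
    return action in policy.permissions.get(namespace, set())

policy = KeyPolicy("secret", {"shared": {"read"}}, ["10.0.0.0/8"])
can_read = is_allowed(policy, "secret", "10.1.2.3", "shared", "read")
can_write = is_allowed(policy, "secret", "10.1.2.3", "shared", "write")
```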
Getting Started
- Define your use case — Decide which AI functions you need (embedding, text generation, translation, etc.) and shortlist candidate models
- Installation & Environment Config — Pre-import server container images and the AI models to be integrated, then configure connections
- Observation & Monitoring — Monitor real-time request inflow metrics and processing status using visualization tools (Prometheus + Grafana)
During the service integration process, we provide standard guidance on tuning the optimal batch size (batch_size) and maximum wait time (max_wait_ms) for your usage patterns.
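To make the two tuning parameters concrete, here is a minimal sketch of size-or-time batching. It illustrates the general technique rather than RedGX's worker: `collect_batch` is a hypothetical function, the standard-library `queue.Queue` stands in for the Redis queue, and starting the timer when the first item arrives is an assumption.

```python
import queue
import time

def collect_batch(q, batch_size, max_wait_ms):
    """Wait for one item, then gather more until batch_size or max_wait_ms is reached."""
    batch = [q.get()]  # block until there is at least one request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # time limit passed with a partial batch
    return batch

q = queue.Queue()
for i in range(5):
    q.put(i)

first = collect_batch(q, batch_size=3, max_wait_ms=50)   # closed by the size limit
second = collect_batch(q, batch_size=3, max_wait_ms=50)  # closed by the time limit
```

A larger `batch_size` lets the engine process more requests per call, while a smaller `max_wait_ms` caps the extra latency a lone request pays while waiting for companions.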
Recent Changes
- v1.0.0 Official Release: completed Web API brokering and integration of asynchronous AI computation requests
- Introduced a secure 3-stage (Queue-Process-Outbox) processing method to handle high-volume requests
- Added safeguards that guarantee no requests are lost during enqueue/dequeue and apply backpressure to absorb traffic spikes gracefully
Roadmap
- Create speed optimization guides for AI inference engines (TEI, vLLM, etc.) tailored to real production environments
- Improve monitoring features to detect stalled tasks and automatically retry them when unexpected AI engine downtime occurs
- Enhance memory utilities to continuously monitor graphics card memory occupancy and automatically clean up old processing results
Considering Cubiware for your organization?
We will guide you through setup and rollout tailored to your requirements and operating environment. Reach out for a demo or a proposal.