SOFTWARE

RedGX

Redis GPU eXchange

Keeps your AI services steady under heavy traffic.

v1.0 (Phases 1 to 5 implementation complete)

REST API + Native hot-path acceleration + Asynchronous isolation of external inference engines

Architecture

Client requests pass through Nginx and enter the RedGX API Router. CPU-based regular REST requests are processed immediately in the main Redis, while GPU requests are asynchronously brokered through redis-gpu's 3-stage pipeline (Inbox/Queue -> Processing -> Outbox/Cache). In a real production environment, only a single inference container and its corresponding worker are activated according to the planned service type.

The Problem We Solve

Running heavy AI inference directly on a general-purpose web or app server is risky. Loading an AI model is slow and consumes a large amount of graphics card memory (VRAM), and a sudden surge in user requests can exhaust memory (OOM) and halt the main server itself.

RedGX places each AI processing request in a queue as soon as it is received from the client. An independent background worker then sends the request to an isolated AI engine, and the finished result is stored separately in an outbox so the client can retrieve it later. This keeps the AI computation load from affecting the main service, so the service as a whole stays available around the clock.
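The queue, worker, and outbox flow described above can be sketched in a few lines. This is only an in-memory illustration, not RedGX's implementation: the class name `GpuBroker`, its method names, and the `"pending"` status are assumptions made for the example, and a real deployment would keep these stages in Redis rather than in Python objects.

```python
import uuid
from collections import deque


class GpuBroker:
    """In-memory stand-in for the Inbox/Queue -> Processing -> Outbox/Cache flow."""

    def __init__(self):
        self.inbox = deque()   # stage 1: accepted requests waiting for a worker
        self.processing = {}   # stage 2: requests currently claimed by a worker
        self.outbox = {}       # stage 3: finished results, keyed by req_id

    def submit(self, payload):
        """Accept a request immediately and hand back a req_id."""
        req_id = uuid.uuid4().hex
        self.inbox.append((req_id, payload))
        return req_id

    def work_once(self, engine):
        """Move one request through processing; `engine` is any callable.

        Returns False when the inbox is empty, True after one job is handled.
        """
        if not self.inbox:
            return False
        req_id, payload = self.inbox.popleft()
        self.processing[req_id] = payload
        try:
            self.outbox[req_id] = {"status": "done", "result": engine(payload)}
        finally:
            # The job leaves the processing stage whether or not it succeeded.
            del self.processing[req_id]
        return True

    def fetch(self, req_id):
        """Return the stored result, or an assumed "pending" marker.

        Note: in this sketch an unknown req_id also reports "pending".
        """
        return self.outbox.get(req_id, {"status": "pending"})
```

The point of the design is that `submit` never waits for the engine: the caller gets a `req_id` back at once, and the slow work happens only inside `work_once`, which can run in a separate process.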

Usage Example

GPU Embedding Async Call (curl)

# 1) Submit embedding request (returns 202 immediately with req_id)
curl -k -X POST https://gateway/api/v1/ns/shared/gpu/embed \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["Sentence 1", "Sentence 2"]}'
# → {"ok": true, "data": {"req_id": "..."}}

# 2) Query the result using the req_id from step 1 (retrieved from the Outbox after the worker finishes)
curl -k https://gateway/api/v1/ns/shared/gpu/embed/$REQ_ID \
  -H "X-API-Key: $API_KEY"
# → {"ok": true, "data": {"status": "done", "vectors": [[...], [...]]}}

All requests are authenticated using the X-API-Key header, and read/write permissions can be controlled per namespace. Responses are consistently returned in the {"ok": true, "data": {...}} format.
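For callers that prefer a script over curl, the same submit-then-query sequence might look like the Python sketch below. It relies only on what the curl example shows (the two URLs, the X-API-Key header, the `{"ok": ..., "data": ...}` envelope, and the `"done"` status). Everything else is an assumption for illustration: the function names, the polling interval and timeout, and treating any status other than `"done"` as "not ready yet". Unlike `curl -k`, `urllib` verifies TLS certificates by default.

```python
import json
import time
import urllib.request


def call_api(method, url, api_key, body=None):
    """Send one JSON request and return the decoded response envelope."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("X-API-Key", api_key)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def embed(base_url, api_key, inputs, namespace="shared",
          poll_interval=0.5, timeout=30.0,
          transport=call_api, sleep=time.sleep):
    """Submit an embedding request, then poll until its status is "done"."""
    url = f"{base_url}/api/v1/ns/{namespace}/gpu/embed"
    submitted = transport("POST", url, api_key, {"inputs": inputs})
    if not submitted.get("ok"):
        raise RuntimeError(f"submit rejected: {submitted}")
    req_id = submitted["data"]["req_id"]

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = transport("GET", f"{url}/{req_id}", api_key)
        # Assumption: any status other than "done" means "try again later".
        if result.get("ok") and result["data"].get("status") == "done":
            return result["data"]["vectors"]
        sleep(poll_interval)
    raise TimeoutError(f"request {req_id} not done after {timeout}s")
```

A call such as `embed("https://gateway", api_key, ["Sentence 1", "Sentence 2"])` would then return the list of vectors. The `transport` and `sleep` parameters exist only so the polling logic can be exercised without a live gateway.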

Standard API Interface (8 Areas — Same as RedGW)

The endpoint structure is the same as RedGW: /api/v1/ns/{namespace}/{resource}/{key}. Detailed structures can be found on the RedGW introduction page.

GPU Operation Brokering (Supported Model Examples)

Call method: send a request to POST /api/v1/ns/{ns}/gpu/{task} to obtain a request ID (req_id), then retrieve the computation result with GET /api/v1/ns/{ns}/gpu/{task}/{req_id}. RedGX can integrate a range of AI engines depending on infrastructure and requirements; in production, the standard practice is to connect only the single AI model or engine that the service needs.

Inference Server Isolation — Inviolable Principle

RedGX's workers do not load AI models directly into their internal memory. All AI computations run in a completely isolated, dedicated AI engine container, and the worker merely acts as a bridge that transmits processing requests via HTTP communication.
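A worker that only bridges to the engine can be very small. The sketch below assumes a job shaped like `{"req_id": ..., "payload": ...}`, a hypothetical engine URL supplied by the caller, and a `"failed"` status for errors; none of these names come from RedGX itself. What it illustrates is the principle: the worker imports no model code, and an engine failure becomes a stored record rather than a crash.

```python
import json
import urllib.request


def post_json(url, payload, timeout=30.0):
    """POST a JSON payload and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def handle_job(job, engine_url, post=post_json):
    """Forward one queued job to the engine and return an outbox record.

    The worker holds no model: it relays the payload over HTTP and
    records whatever comes back.
    """
    try:
        result = post(engine_url, job["payload"])
    except Exception as exc:  # engine down, timeout, malformed response, ...
        # Assumed "failed" status: the error is contained in the record
        # instead of propagating into the worker loop.
        return {"req_id": job["req_id"], "status": "failed", "error": str(exc)}
    return {"req_id": job["req_id"], "status": "done", "result": result}
```

Swapping engines then only means pointing `engine_url` at a different container and adapting the payload; the worker process itself stays the same.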

The AI inference server is either an open-source engine specialized for high-volume processing (vLLM, TEI, etc.) or a custom inference server tailored to the situation. During actual service operation, only one AI engine is connected, chosen according to the required function; the 5 functions and model names shown above are representative examples configured for integration testing.

Features

Product Specifications

Version: v1.0 (Feature implementation complete)
License: Private (Internal company project)
Sister Project: RedGW (/en/software/redgw/)
Execution Mode: Separated operation of Web API and Background Workers
Storage: Redis for base data + Separate Redis for AI computation processing
AI Engine Integration: Supports integration with specialized inference engines (vLLM, TEI, etc.) (Connect only one required engine in production)
Security & Relay: Encrypted connections and rate limiting via Nginx
Status Monitoring: Real-time service metrics and queue monitoring
AI Model Independence: The gateway never loads AI models in-process; all inference is delegated to independent external engines, isolating the system from model-side failures.

Security & Compliance

License: Private (Internal company project)
Operating Premise: Closed Network Support. Both the AI models and runtime images are staged into the internal network ahead of deployment, with no outbound network dependency at runtime
Auth & Access Control: Client authentication via X-API-Key header and granular read/write permission control per namespace
Safe Isolation: Independent architecture where workers do not directly execute AI models, preventing AI failures from cascading to the gateway service
Transport & Limits: Provides Nginx-based HTTPS connections plus per-client rate limiting to block abusive request bursts
Security Measures Inquiry: Contact info@cubiware.co.kr (Complies with internal security guidelines)
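The per-namespace read/write control listed under Auth & Access Control could be modeled roughly as follows. The permission table, the key names, and the rule that GET and HEAD count as reads while every other method counts as a write are all assumptions for the sake of the example; the page does not describe how RedGX stores or evaluates permissions.

```python
# Hypothetical permission table: API key -> namespace -> allowed operations.
PERMISSIONS = {
    "key-reader": {"shared": {"read"}},
    "key-writer": {"shared": {"read", "write"}, "team-a": {"write"}},
}

# Assumption: these HTTP methods are reads; anything else is a write.
READ_METHODS = {"GET", "HEAD"}


def is_allowed(api_key, namespace, method, permissions=PERMISSIONS):
    """Return True if the key may perform this method in the namespace."""
    needed = "read" if method.upper() in READ_METHODS else "write"
    # Unknown keys and unknown namespaces fall through to an empty set,
    # so access is denied by default.
    return needed in permissions.get(api_key, {}).get(namespace, set())
```

Under this model, submitting a GPU request (a POST) needs write permission on the namespace, while fetching its result (a GET) needs only read permission.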

Getting Started

  1. Define your use case — Decide which AI functions you need (embedding, text generation, translation, etc.) and shortlist candidate models
  2. Installation & Environment Config — Pre-import server container images and the AI models to be integrated, then configure connections
  3. Observation & Monitoring — Monitor real-time request inflow metrics and processing status using visualization tools (Prometheus + Grafana)

During the service integration process, we provide standard guidance on tuning the optimal batch size (batch_size) and maximum wait time (max_wait_ms) for your usage patterns.
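To make the two tuning parameters concrete, here is one common way a worker can use them. It assumes that batch_size caps how many requests are sent to the engine together and that max_wait_ms bounds how long the worker waits to fill a batch after the first request arrives. The page names the parameters but does not define their exact semantics, so treat this as an illustrative micro-batching loop rather than RedGX's behavior.

```python
import queue
import time


def collect_batch(q, batch_size, max_wait_ms):
    """Block for the first item, then gather more until the batch is full
    or max_wait_ms has elapsed since the first item arrived."""
    batch = [q.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # wait budget spent; send what we have
    return batch
```

The trade-off is the usual one: a larger batch_size improves GPU throughput under load, while a smaller max_wait_ms keeps latency low when traffic is light and batches rarely fill.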

Recent Changes

Roadmap

Considering Cubiware for your organization?

We will guide you through setup and rollout tailored to your requirements and operating environment. Reach out for a demo or a proposal.