
Machines & Infrastructure Specifications

Overview

The ProYaro AI Stack runs on two dedicated machines with complementary capabilities:

  1. Mac Mini M4 - Apple Silicon powerhouse for MLX inference
  2. Ubuntu Server - NVIDIA GPU server for production AI workloads

Mac Mini M4 (10.0.0.188)

Hardware Specifications

  • Model: Mac Mini (Late 2024)
  • Chip: Apple M4 Pro
  • RAM: 48GB Unified Memory
  • Storage: High-speed NVMe SSD (capacity TBD)
  • Neural Engine: 16-core (Apple Silicon integrated)
  • GPU: 20-core (M4 Pro configuration)

Operating System

  • OS: macOS Sequoia 15.7.2
  • Kernel: Darwin 24.6.0
  • Architecture: ARM64 (Apple Silicon)

AI Capabilities

MLX Framework

  • Native Apple Silicon optimization
  • Unified memory architecture (48GB shared between CPU/GPU/Neural Engine)
  • Mixed precision support: FP32, FP16, BF16, 4-bit, 8-bit quantization
  • Token generation speed: 30-50 tokens/sec (7B-14B models), 15-25 tokens/sec (30B+ models)
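
For reference, text generation on this machine goes through the mlx-lm package. A minimal sketch (the model name matches the default model listed in the next section; the prompt and parameters are illustrative):

    # Minimal sketch: local text generation with mlx-lm
    from mlx_lm import load, generate

    # Load a converted 4-bit model (name from the model list below)
    model, tokenizer = load("qwen2.5-combined-egyptian-marketing-4bit")

    prompt = "Write a short Egyptian Arabic ad for a coffee shop."
    text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
    print(text)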

Supported Models (MLX)

  1. qwen2.5-combined-egyptian-marketing-4bit (Active by default)

    • Size: ~4-8GB RAM usage
    • Specialty: Egyptian Arabic marketing content
    • Speed: ~40 tokens/sec
  2. Qwen3-Coder-30B-A3B-Instruct-4bit

    • Size: ~15-20GB RAM usage
    • Specialty: Code generation
    • Speed: ~20 tokens/sec
  3. Qwen2.5-14B-Instruct-4bit

    • Size: ~8-12GB RAM usage
    • Specialty: General instruction following
    • Speed: ~35 tokens/sec

Embeddings

  • Model: multilingual-e5-small
  • Dimensions: 384
  • Speed: Very fast (runs on the Apple Silicon GPU via MLX)
  • Languages: 100+ including Arabic
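
A minimal sketch of producing these embeddings with the mlx-embedding-models package (the registry name is an assumption and should correspond to multilingual-e5-small):

    # Minimal sketch: 384-dim multilingual embeddings via mlx-embedding-models
    from mlx_embedding_models.embedding import EmbeddingModel

    # Registry name assumed to map to multilingual-e5-small
    model = EmbeddingModel.from_registry("multilingual-e5-small")

    # E5 models expect "query: " / "passage: " prefixes
    texts = ["query: أفضل مقهى في القاهرة", "passage: A cozy coffee shop in Cairo."]
    embeddings = model.encode(texts)  # array of shape (2, 384)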

Image Generation (ComfyUI)

  • Model: Z-Image Turbo (SDXL-based, BF16)
  • CLIP: Qwen 3 4B (multilingual)
  • Resolution: Up to 2048x2048
  • Speed: ~8-15 seconds per 1024x1024 image
  • Batch size: Limited by RAM (can do 1-4 images)
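
ComfyUI exposes its queue over HTTP on port 8188: a workflow exported in API format can be submitted with a single POST. A minimal sketch (the workflow file name is a placeholder):

    # Minimal sketch: queue an API-format workflow against ComfyUI on the Mac Mini
    import json
    import urllib.request

    COMFYUI_URL = "http://10.0.0.188:8188/prompt"

    # "workflow_api.json" is a placeholder for a workflow exported via
    # ComfyUI's "Save (API Format)" option
    with open("workflow_api.json") as f:
        workflow = json.load(f)

    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(COMFYUI_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())  # returns a prompt_id for polling /history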

Video Generation (ComfyUI)

  • Model: Wan2.2-T2V (quantized to Q4)
  • CLIP: UMT5-XXL (FP16)
  • Frame limit: 41 frames maximum (M4 constraint)
  • Resolution: 480x720, 640x480, etc.
  • Speed: ~30-60 seconds for 30-frame video
  • Format: MP4

Resource Limits

Memory:

  • Total: 48GB
  • OS + System: ~8-10GB
  • MLX Model: 4-20GB (depending on model)
  • ComfyUI Models: ~8-12GB loaded
  • Available for processing: ~10-25GB

Storage:

  • Models: ~50-100GB
  • Generated content: Limited by SSD capacity
  • Workspace: Ample free space available

Thermal:

  • Active cooling (fan)
  • Can sustain continuous inference
  • May throttle under extreme sustained load (unlikely in normal use)

Performance Benchmarks

Text Generation:

  • Qwen2.5-14B-4bit: 35 tokens/sec average
  • Qwen3-Coder-30B-4bit: 20 tokens/sec average
  • Prompt processing: Extremely fast (batch processing optimized)

Image Generation:

  • Z-Image Turbo (1024x1024, 9 steps): 8-12 seconds
  • SDXL (1024x1024, 20 steps): 20-30 seconds

Video Generation:

  • Wan2.2 (30 frames, 640x480): 45-60 seconds
  • Limited by frame count (max 41)

Embeddings:

  • Single text: <10ms
  • Batch 100 texts: <500ms

Installed Services

┌─────────────────────────────────────────┐
│         Mac Mini M4 Services             │
├─────────────────────────────────────────┤
│ MLX FastAPI          :8004  (Auto-start)│
│ ComfyUI              :8188  (Manual)    │
│ a2zadd Backend       :3000  (Manual)    │
│ Frontend Dev         :5173  (Manual)    │
└─────────────────────────────────────────┘

Auto-start Services:

  • MLX FastAPI: LaunchAgent configured (com.mlx.service)
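
A LaunchAgent of this kind normally lives at ~/Library/LaunchAgents/com.mlx.service.plist. The sketch below is illustrative: the label, port, working directory, and log path come from this document, while the uvicorn entry point (main:app) is an assumption:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
      "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
      <key>Label</key>
      <string>com.mlx.service</string>
      <key>ProgramArguments</key>
      <array>
        <!-- Assumed entry point; adjust to the actual uvicorn invocation -->
        <string>/usr/bin/env</string>
        <string>python3</string>
        <string>-m</string>
        <string>uvicorn</string>
        <string>main:app</string>
        <string>--port</string>
        <string>8004</string>
      </array>
      <key>WorkingDirectory</key>
      <string>/Users/yaro/Documents/new-stack/mlx_service</string>
      <key>RunAtLoad</key>
      <true/>
      <key>KeepAlive</key>
      <true/>
      <key>StandardOutPath</key>
      <string>/tmp/mlx-service.log</string>
      <key>StandardErrorPath</key>
      <string>/tmp/mlx-service.log</string>
    </dict>
    </plist>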

Service Locations:

  • MLX: /Users/yaro/Documents/new-stack/mlx_service/
  • ComfyUI: System-wide installation
  • Backend: /Users/yaro/Documents/a2zadd/packages/backend/
  • Frontend: /Users/yaro/Documents/a2zadd/packages/frontend/

Software Stack

Runtime:

  • Python 3.11+
  • Node.js 20+
  • Bun (JavaScript runtime)

Python Packages:

  • mlx
  • mlx-lm
  • mlx-embedding-models
  • fastapi
  • uvicorn
  • numpy
  • ComfyUI (full installation)

System Tools:

  • git
  • curl
  • wget
  • homebrew
  • docker (optional, not currently used)

Ubuntu Server (10.0.0.11)

Hardware Specifications

  • CPU: Multi-core x86_64 processor (specific model TBD)
  • RAM: 32GB DDR4/DDR5
  • GPU: NVIDIA GeForce RTX 3060 (12GB GDDR6 VRAM)
  • Storage:
    • Primary: SSD for OS and Docker
    • Secondary: /mnt/storage - Large capacity mount for AI models and data

Operating System

  • OS: Ubuntu (Linux)
  • Kernel: Linux (version TBD)
  • Architecture: x86_64 (AMD64)
  • Container Runtime: Docker + Docker Compose

AI Capabilities

GPU-Accelerated Services

Whisper (Speech-to-Text):

  • Model: faster-whisper-large-v3
  • Languages: 100+ (optimized for Arabic)
  • Speed: Real-time factor ~0.1-0.3x (transcription takes ~10-30% of the audio's duration)
  • Audio formats: WAV, MP3, etc.
  • GPU: CUDA-accelerated
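
The service wraps the faster-whisper library; a minimal sketch of the equivalent direct call (the audio file name is illustrative):

    # Minimal sketch: CUDA transcription with faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    # Pin language="ar" for Arabic audio, or omit it to auto-detect
    segments, info = model.transcribe("meeting.mp3", language="ar")
    for segment in segments:
        print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")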

XTTS-v2 (Text-to-Speech):

  • Languages: Arabic, English, and 15 others (17 total)
  • Voice cloning: Yes (from sample audio)
  • Speed: ~1-3 seconds per sentence
  • Quality: 24kHz, 16-bit WAV
  • GPU: CUDA-accelerated
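
The TTS service wraps Coqui TTS; a minimal sketch of the equivalent direct call (file names are illustrative):

    # Minimal sketch: Arabic TTS with voice cloning via Coqui XTTS-v2
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

    # speaker_wav is a short reference clip of the voice to clone
    tts.tts_to_file(
        text="أهلاً وسهلاً! إزيك النهارده؟",
        speaker_wav="reference_voice.wav",
        language="ar",
        file_path="output.wav",
    )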

Embeddings:

  • Model: multilingual-e5-large
  • Dimensions: 1024
  • Languages: 100+ including Arabic
  • Speed: GPU-accelerated, batch processing
  • Max batch: 1000 texts
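
The embedding service wraps sentence-transformers; a minimal sketch of the underlying encoder call:

    # Minimal sketch: 1024-dim GPU embeddings with sentence-transformers
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-large", device="cuda")

    # E5 models expect "query: " / "passage: " prefixes for best results
    texts = [f"passage: {t}" for t in ["نص عربي للتجربة", "An English passage."]]
    embeddings = model.encode(texts, batch_size=64, normalize_embeddings=True)
    print(embeddings.shape)  # (2, 1024)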

ComfyUI (Image Generation):

  • Model: Z-Image Turbo SDXL (BF16)
  • CLIP: Qwen 3 4B
  • Resolution: Up to 2048x2048
  • Speed: ~10-20 seconds per image (GPU-dependent)
  • Queue: Yes (job-based system)
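
Jobs submitted to /prompt return a prompt_id that can be polled against /history; a minimal sketch:

    # Minimal sketch: poll ComfyUI's history endpoint until a queued job finishes
    import json
    import time
    import urllib.request

    def wait_for_job(prompt_id: str, base_url: str = "http://10.0.0.11:8188"):
        while True:
            with urllib.request.urlopen(f"{base_url}/history/{prompt_id}") as resp:
                history = json.loads(resp.read())
            if prompt_id in history:          # present once execution has finished
                return history[prompt_id]     # contains per-node outputs (images, etc.)
            time.sleep(1)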

Resource Limits

GPU:

  • Single NVIDIA GPU shared across:
    • Whisper STT
    • XTTS-v2 TTS
    • Embeddings
    • ComfyUI
  • Concurrency: Jobs queued, one GPU-heavy task at a time
  • VRAM: Shared pool

Memory:

  • Total: 32GB
  • OS + Docker: ~4-6GB
  • AI Services: ~10-15GB
  • Database + Redis: ~2-4GB
  • Available: ~8-12GB

Storage:

  • /mnt/storage: Large capacity
  • Docker volumes: ~100-200GB
  • AI models: ~50-100GB
  • Audio files: Growing (managed)
  • Database: PostgreSQL (growing)

Network:

  • Gigabit ethernet (local)
  • Internet: Upload/download speeds TBD

Performance Benchmarks

Speech-to-Text (Whisper):

  • Arabic audio (1 minute): ~10-20 seconds
  • English audio (1 minute): ~8-15 seconds
  • Real-time factor: 0.15-0.35x

Text-to-Speech (XTTS-v2):

  • Arabic (50 chars): ~1-2 seconds
  • English (50 chars): ~1-2 seconds
  • Voice cloning setup: ~5-10 seconds

Embeddings:

  • Single text: ~50-100ms
  • Batch 100 texts: ~2-5 seconds
  • Batch 1000 texts: ~15-30 seconds

Image Generation:

  • ComfyUI (1024x1024, 9 steps): 10-20 seconds
  • Depends on GPU load

Docker Stack

┌──────────────────────────────────────────┐
│      Ubuntu Docker Services               │
├──────────────────────────────────────────┤
│ proyaro-api-backend    :8000  (Healthy)  │
│ proyaro-whisper        :8001  (Healthy)  │
│ proyaro-tts            :8002  (Healthy)  │
│ proyaro-embeddings     :8003  (Healthy)  │
│ proyaro-comfyui        :8188  (Healthy)  │
│ proyaro-job-worker     (background)      │
│ proyaro-postgres       :5432  (Healthy)  │
│ proyaro-redis          :6379  (Healthy)  │
│ proyaro-frontend       :3000  (Healthy)  │
│ proyaro-caddy          :80,:443 (Healthy)│
└──────────────────────────────────────────┘

Health Monitoring:

  • All services have health checks
  • Docker Compose manages restarts
  • Logs available via docker compose logs
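
Health can also be spot-checked over HTTP from any machine on the network; a minimal sketch, assuming each service exposes a /health route (the route name is an assumption; the ports come from the table above):

    # Minimal sketch: poll each service's health endpoint (route name assumed)
    import urllib.request

    SERVICES = {
        "api-backend": 8000,
        "whisper": 8001,
        "tts": 8002,
        "embeddings": 8003,
    }

    for name, port in SERVICES.items():
        url = f"http://10.0.0.11:{port}/health"
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                print(f"{name}: {resp.status}")
        except Exception as exc:
            print(f"{name}: DOWN ({exc})")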

Software Stack

Container Images:

  • API Backend: Python 3.11 + FastAPI
  • AI Workers: NVIDIA CUDA base images
  • Database: PostgreSQL 16
  • Cache: Redis 7
  • Frontend: Node.js (production build)
  • Proxy: Caddy 2

Python Stack:

  • fastapi
  • uvicorn
  • sqlalchemy (async)
  • alembic (migrations)
  • aiohttp (async HTTP)
  • faster-whisper
  • TTS (Coqui)
  • sentence-transformers
  • torch (CUDA support)

System:

  • NVIDIA Docker runtime
  • CUDA Toolkit
  • Docker Compose V2

Comparison Matrix

Feature                 Mac Mini M4                  Ubuntu Server
----------------------  ---------------------------  ----------------------
Primary Role            Development, MLX             Production, GPU Jobs
Architecture            ARM64 (Apple Silicon)        x86_64 (AMD64)
AI Framework            MLX (Apple)                  CUDA (NVIDIA)
RAM                     48GB Unified                 32GB System
GPU/Accelerator         Apple Neural Engine + GPU    NVIDIA RTX 3060 12GB
Text Generation         MLX (Fast, local)            Via MLX (proxied)
Image Generation        ComfyUI (CPU/GPU)            ComfyUI (GPU queue)
Video Generation        ✅ Yes (max 41 frames)       ❌ No
Speech-to-Text          ❌ No                        ✅ Whisper (GPU)
Text-to-Speech          ❌ No                        ✅ XTTS-v2 (GPU)
Embeddings (384-dim)    ✅ MLX (fast)                ❌ No
Embeddings (1024-dim)   ❌ No                        ✅ E5-Large (GPU)
Job Queue               ❌ No (direct API)           ✅ Redis + Worker
Database                ❌ No                        ✅ PostgreSQL
WebSocket               ❌ No                        ✅ Yes
External Access         ❌ Internal only             ✅ HTTPS (Caddy)
Auto-restart            LaunchAgent (MLX)            Docker (all services)
Monitoring              Manual                       Docker health checks

Capacity Planning

Mac Mini M4

Current Usage:

  • MLX model loaded: ~8GB
  • System + apps: ~10GB
  • Free RAM: ~30GB
  • Capacity: Can load one large model at a time

Growth Potential:

  • Limited by 48GB RAM (careful with model selection)
  • Storage may need expansion for generated content
  • Thermal limits unlikely to be reached

Bottlenecks:

  • RAM limits (48GB total)
  • Model switching time (10-30 seconds)
  • Video frame limit (41 frames max)
  • Single-machine (no horizontal scaling)

Ubuntu Server

Current Usage:

  • Docker containers: ~20-25GB RAM
  • GPU VRAM: 12GB (RTX 3060)
  • Storage: Growing with audio files and DB

Growth Potential:

  • Can scale vertically (upgrade to 64GB+ RAM)
  • Can add more GPU for parallel processing
  • Can scale horizontally (add more Ubuntu servers)

Bottlenecks:

  • 32GB RAM (limits concurrent services)
  • Single RTX 3060 12GB GPU (serializes GPU-heavy jobs)
  • Network bandwidth (for large file uploads)
  • Storage (audio files accumulate)

Recommended Use Cases by Machine

Use Mac Mini For:

  ✅ Text generation - Fastest, lowest latency
  ✅ Rapid prototyping - Direct API access
  ✅ Image generation (single requests) - No queue
  ✅ Video generation - Only available here
  ✅ Small embeddings (384-dim) - Very fast
  ✅ Development - Full dev environment

Use Ubuntu Server For:

  ✅ Production applications - Stable, monitored
  ✅ Speech-to-text - Only available here
  ✅ Text-to-speech - Only available here
  ✅ Large embeddings (1024-dim) - Production quality
  ✅ Batch processing - Job queue system
  ✅ External access - HTTPS, authentication
  ✅ Database-backed apps - PostgreSQL available


Maintenance & Updates

Mac Mini

Regular:

  • Monitor MLX service logs: /tmp/mlx-service.log
  • Check available storage
  • Update MLX models when new versions release

Periodic:

  • macOS system updates (careful with major versions)
  • Python package updates
  • ComfyUI updates

As Needed:

  • Add new MLX models
  • Clear generated content cache

Ubuntu Server

Regular:

  • Monitor Docker container health: docker compose ps
  • Check disk space: /mnt/storage
  • Review job queue length (Redis; see the sketch after this list)
  • Database backups
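
The job-queue check can be scripted with redis-py; a minimal sketch (the queue key name is hypothetical):

    # Minimal sketch: inspect job-queue depth in Redis (key name is hypothetical)
    import redis

    r = redis.Redis(host="10.0.0.11", port=6379, decode_responses=True)
    depth = r.llen("job_queue")  # replace with the worker's actual queue key
    print(f"pending jobs: {depth}")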

Periodic:

  • Ubuntu security updates: apt update && apt upgrade
  • Docker image updates: Pull new images
  • Database maintenance (VACUUM)

As Needed:

  • Clear old audio files
  • Clear old job results
  • Optimize database queries
  • Scale GPU resources

Disaster Recovery

Mac Mini Backup

Critical Data:

  • MLX converted models (can be regenerated but slow)
  • ComfyUI workflows (JSON files)
  • Project code (Git repos)

Backup Strategy:

  • Code: Git (remote repositories)
  • Models: Can re-download/convert if needed
  • Config files: Include in Git

Recovery:

  1. Reinstall macOS (if necessary)
  2. Install Homebrew, Python, Node.js
  3. Clone repositories
  4. Install MLX and dependencies
  5. Download/convert models
  6. Start services

Ubuntu Server Backup

Critical Data:

  • PostgreSQL database (user data, jobs, results)
  • Docker volumes (models, audio files)
  • Environment configuration
  • SSL certificates (Caddy manages auto-renewal)

Backup Strategy:

  • Database: Daily pg_dump to external storage (see the sketch after this list)
  • Models: Can re-download if needed
  • Audio files: Periodic backup or accept loss
  • Config: Git repository
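
The daily database dump can run from cron on the host; a minimal sketch using docker exec against the proyaro-postgres container (the database and user names are assumptions):

    # Minimal sketch: nightly pg_dump via docker exec (db/user names are assumptions)
    import gzip
    import subprocess
    from datetime import date

    outfile = f"/mnt/storage/backups/proyaro-{date.today()}.sql.gz"

    # "proyaro" as user and database name is a placeholder; use the real credentials
    dump = subprocess.run(
        ["docker", "exec", "proyaro-postgres", "pg_dump", "-U", "proyaro", "proyaro"],
        check=True,
        capture_output=True,
    )
    with gzip.open(outfile, "wb") as f:
        f.write(dump.stdout)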

Recovery:

  1. Reinstall Ubuntu Server
  2. Install Docker + Docker Compose
  3. Restore /mnt/storage mount
  4. Clone repository
  5. Restore database dump
  6. Re-download AI models (or restore from backup)
  7. Start Docker Compose stack

Security Considerations

Mac Mini

Network Security:

  • Not exposed to internet (internal only)
  • No authentication on MLX service (trusted network)
  • macOS firewall enabled (recommended)

File Security:

  • User permissions: Standard user (yaro)
  • No sensitive data stored (all data is generated)

Ubuntu Server

Network Security:

  • Firewall: Only ports 80, 443, 22 open
  • All AI services isolated in Docker network
  • Caddy handles SSL/TLS termination

Application Security:

  • JWT authentication on all API endpoints
  • Password-protected database
  • Environment variables for secrets
  • Docker container isolation

Physical Security:

  • Server location: Secure (assumed)
  • SSH access: Password-protected (consider switching to key-based authentication)

Cost Analysis

Mac Mini M4

One-time:

  • Hardware: ~$2,500-4,000 (depending on config)
  • Software: Free (open source)

Recurring:

  • Electricity: ~5-10W idle, up to ~140W under sustained load ($10-20/month under heavy use)
  • Internet: Shared
  • Maintenance: Minimal

Total Cost of Ownership (3 years): ~$3,000-4,500

Ubuntu Server

One-time:

  • Hardware: ~$1,500-3,000 (GPU-dependent)
  • Software: Free (open source, Linux)

Recurring:

  • Electricity: ~150W idle, 300W load ($20-40/month)
  • Domain: ~$15/year (proyaro.com)
  • SSL: Free (Let's Encrypt via Caddy)
  • Hosting: $0 (self-hosted)

Total Cost of Ownership (3 years): ~$2,500-4,500

Alternative (Cloud)

Equivalent cloud costs (estimated):

  • GPU instance: $500-2,000/month
  • MLX equivalent: Not available (Apple Silicon)
  • Storage: $50-100/month
  • Data transfer: $50-200/month

Total (3 years): $21,600-82,800

Savings with self-hosted: ~$16,000-75,000 🎉


Future Expansion Possibilities

Mac Mini

  • More storage: External SSD for model/content storage
  • More models: Add specialized models (code, creative, etc.)
  • Clustering: Add more Mac Minis (requires network setup)
  • RAM: Fixed at 48GB (Apple Silicon unified memory is not user-upgradeable; more memory would require a higher-spec replacement machine)

Ubuntu Server

  • More GPUs: Add second NVIDIA GPU for parallel processing
  • More RAM: Upgrade to 128GB or 256GB
  • More storage: Expand /mnt/storage
  • Clustering: Add more Ubuntu nodes (horizontal scaling)
  • Kubernetes: Migrate from Docker Compose to K8s

Quick Specs Reference

╔═══════════════════════════════════════════════════════╗
║           MACHINE QUICK REFERENCE                     ║
╠═══════════════════════════════════════════════════════╣
║ MAC MINI M4 PRO                                       ║
║   IP: 10.0.0.188                                      ║
║   RAM: 48GB Unified Memory                            ║
║   GPU: 20-core Apple GPU + Neural Engine             ║
║   Best for: MLX text gen, rapid prototyping          ║
║   Specialty: Egyptian Arabic, video generation       ║
╠═══════════════════════════════════════════════════════╣
║ UBUNTU SERVER                                         ║
║   IP: 10.0.0.11 (external: api.proyaro.com)          ║
║   RAM: 32GB System Memory                             ║
║   GPU: NVIDIA RTX 3060 (12GB VRAM)                   ║
║   Best for: Production, STT, TTS, embeddings         ║
║   Specialty: Job queue, WebSocket, database          ║
╚═══════════════════════════════════════════════════════╝

Last Updated: 2026-01-02

ProYaro AI Infrastructure Documentation • Version 1.2