Machines & Infrastructure Specifications
Overview
ProYaro AI Stack operates on two dedicated servers with complementary capabilities:
- Mac Mini M4 - Apple Silicon powerhouse for MLX inference
- Ubuntu Server - NVIDIA GPU server for production AI workloads
Mac Mini M4 (10.0.0.188)
Hardware Specifications
Model: Mac Mini (Late 2024)
Chip: Apple M4 Pro
RAM: 48GB Unified Memory
Storage: High-speed NVMe SSD (capacity TBD)
Neural Engine: 16-core (Apple Silicon integrated)
GPU Cores: 20-core GPU (M4 Pro configuration)
Operating System
OS: macOS Sequoia 15.7.2
Kernel: Darwin 24.6.0
Architecture: ARM64 (Apple Silicon)
AI Capabilities
MLX Framework
- Native Apple Silicon optimization
- Unified memory architecture (48GB shared between CPU/GPU/Neural Engine)
- Mixed precision support: FP32, FP16, BF16, 4-bit, 8-bit quantization
- Token generation speed: 30-50 tokens/sec (7B-14B models), 15-25 tokens/sec (30B+ models)
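For local experimentation, the mlx-lm package (listed under Python Packages below) is enough to load one of these quantized models and generate text. A minimal sketch; the model path is illustrative, since the actual local path isn't documented in this section:

```python
# Minimal mlx-lm generation sketch. The model path below is a placeholder;
# substitute the actual local or Hugging Face path of the active model.
from mlx_lm import load, generate

model, tokenizer = load("qwen2.5-combined-egyptian-marketing-4bit")  # hypothetical path
prompt = "اكتب إعلانًا قصيرًا عن مقهى جديد في القاهرة"  # short Egyptian Arabic ad brief

print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```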
Supported Models (MLX)
- qwen2.5-combined-egyptian-marketing-4bit (Active by default)
  - Size: ~4-8GB RAM usage
  - Specialty: Egyptian Arabic marketing content
  - Speed: ~40 tokens/sec
- Qwen3-Coder-30B-A3B-Instruct-4bit
  - Size: ~15-20GB RAM usage
  - Specialty: Code generation
  - Speed: ~20 tokens/sec
- Qwen2.5-14B-Instruct-4bit
  - Size: ~8-12GB RAM usage
  - Specialty: General instruction following
  - Speed: ~35 tokens/sec
Embeddings
- Model: multilingual-e5-small
- Dimensions: 384
- Speed: Very fast (Apple Neural Engine)
- Languages: 100+ including Arabic
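The MLX FastAPI service on :8004 fronts this model. The exact route and schema aren't documented in this section, so the /embeddings path and payload below are assumptions, sketched for illustration:

```python
# Hypothetical client call to the MLX embeddings service. The port comes from
# this doc; the /embeddings route and request/response shape are assumptions.
import requests

resp = requests.post(
    "http://10.0.0.188:8004/embeddings",
    json={"texts": ["مرحبا بالعالم", "hello world"]},
    timeout=10,
)
resp.raise_for_status()
vectors = resp.json()  # expected: one 384-dim vector per input text
```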
Image Generation (ComfyUI)
- Model: Z-Image Turbo (SDXL-based, BF16)
- CLIP: Qwen 3 4B (multilingual)
- Resolution: Up to 2048x2048
- Speed: ~8-15 seconds per 1024x1024 image
- Batch size: Limited by RAM (can do 1-4 images)
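ComfyUI accepts workflow graphs over HTTP via POST /prompt (a standard ComfyUI endpoint). A minimal way to queue a generation from code, assuming a workflow exported with "Save (API Format)":

```python
# Queue an image job against ComfyUI on the Mac Mini. POST /prompt is
# ComfyUI's standard queueing endpoint; workflow.json is a placeholder file.
import json
import requests

with open("workflow.json") as f:
    workflow = json.load(f)

resp = requests.post("http://10.0.0.188:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
print(resp.json()["prompt_id"])  # job id; results can be fetched from /history
```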
Video Generation (ComfyUI)
- Model: Wan2.2-T2V (quantized to Q4)
- CLIP: UMT5-XXL (FP16)
- Frame limit: 41 frames maximum (M4 constraint)
- Resolution: 480x720, 640x480, etc.
- Speed: ~30-60 seconds for 30-frame video
- Format: MP4
Resource Limits
Memory:
- Total: 48GB
- OS + System: ~8-10GB
- MLX Model: 4-20GB (depending on model)
- ComfyUI Models: ~8-12GB loaded
- Available for processing: ~10-25GB
Storage:
- Models: ~50-100GB
- Generated content: Limited by SSD capacity
- Workspace: Plenty available
Thermal:
- Active cooling (fan)
- Can sustain continuous inference
- May throttle under extreme sustained load (unlikely in normal use)
Performance Benchmarks
Text Generation:
- Qwen2.5-14B-4bit: 35 tokens/sec average
- Qwen3-Coder-30B-4bit: 20 tokens/sec average
- Prompt processing: Extremely fast (batch processing optimized)
Image Generation:
- Z-Image Turbo (1024x1024, 9 steps): 8-12 seconds
- SDXL (1024x1024, 20 steps): 20-30 seconds
Video Generation:
- Wan2.2 (30 frames, 640x480): 45-60 seconds
- Limited by frame count (max 41)
Embeddings:
- Single text: <10ms
- Batch 100 texts: <500ms
Installed Services
┌─────────────────────────────────────────┐
│ Mac Mini M4 Services │
├─────────────────────────────────────────┤
│ MLX FastAPI :8004 (Auto-start)│
│ ComfyUI :8188 (Manual) │
│ a2zadd Backend :3000 (Manual) │
│ Frontend Dev :5173 (Manual) │
└─────────────────────────────────────────┘
Auto-start Services:
- MLX FastAPI: LaunchAgent configured (com.mlx.service)
Service Locations:
- MLX: /Users/yaro/Documents/new-stack/mlx_service/
- ComfyUI: System-wide installation
- Backend: /Users/yaro/Documents/a2zadd/packages/backend/
- Frontend: /Users/yaro/Documents/a2zadd/packages/frontend/
Software Stack
Runtime:
- Python 3.11+
- Node.js 20+
- Bun (JavaScript runtime)
Python Packages:
- mlx
- mlx-lm
- mlx-embedding-models
- fastapi
- uvicorn
- numpy
- ComfyUI (full installation)
System Tools:
- git
- curl
- wget
- homebrew
- docker (optional, not currently used)
Ubuntu Server (10.0.0.11)
Hardware Specifications
CPU: Multi-core x86_64 processor (specific model TBD)
RAM: 32GB DDR4/DDR5
GPU: NVIDIA GeForce RTX 3060
- VRAM: 12GB GDDR6
Storage:
- Primary: SSD for OS and Docker
- Secondary: /mnt/storage, a large-capacity mount for AI models and data
Operating System
OS: Ubuntu (Linux)
Kernel: Linux (version TBD)
Architecture: x86_64 (AMD64)
Container Runtime: Docker + Docker Compose
AI Capabilities
GPU-Accelerated Services
Whisper (Speech-to-Text):
- Model: faster-whisper-large-v3
- Languages: 100+ (optimized for Arabic)
- Speed: Real-time factor ~0.1-0.3x (faster than audio length)
- Audio formats: WAV, MP3, etc.
- GPU: CUDA-accelerated
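A typical client interaction is a multipart upload of the audio file. The /transcribe route and field names below are assumptions, since the container's actual API isn't documented in this section:

```python
# Hypothetical transcription request to the Whisper service on :8001.
# The route and multipart field names are assumptions for illustration only.
import requests

with open("sample.wav", "rb") as audio:
    resp = requests.post(
        "http://10.0.0.11:8001/transcribe",
        files={"file": ("sample.wav", audio, "audio/wav")},
        data={"language": "ar"},  # the service is optimized for Arabic
        timeout=120,
    )
resp.raise_for_status()
print(resp.json())
```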
XTTS-v2 (Text-to-Speech):
- Languages: Arabic, English, + 15 others
- Voice cloning: Yes (from sample audio)
- Speed: ~1-3 seconds per sentence
- Quality: 24kHz, 16-bit WAV
- GPU: CUDA-accelerated
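Synthesis is likewise a simple HTTP round trip that returns WAV bytes. The /tts route and JSON schema below are assumptions; the 24kHz WAV output format is from the spec above:

```python
# Hypothetical synthesis request to the XTTS-v2 service on :8002.
# The /tts route and payload shape are assumptions, not documented here.
import requests

resp = requests.post(
    "http://10.0.0.11:8002/tts",
    json={"text": "أهلاً وسهلاً", "language": "ar"},
    timeout=60,
)
resp.raise_for_status()
with open("welcome.wav", "wb") as f:
    f.write(resp.content)  # 24kHz, 16-bit WAV per the spec above
```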
Embeddings:
- Model: multilingual-e5-large
- Dimensions: 1024
- Languages: 100+ including Arabic
- Speed: GPU-accelerated, batch processing
- Max batch: 1000 texts
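Since the service caps batches at 1000 texts, larger corpora need client-side chunking. A sketch of that pattern; the :8003 port is from this doc, while the /embed route and response key are assumptions:

```python
# Chunk large corpora so no single request exceeds the 1000-text batch cap.
# The /embed route and "embeddings" response key are assumptions.
import requests

def embed_all(texts, batch_size=1000):
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = requests.post(
            "http://10.0.0.11:8003/embed",
            json={"texts": texts[i:i + batch_size]},
            timeout=60,
        )
        resp.raise_for_status()
        vectors.extend(resp.json()["embeddings"])
    return vectors
```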
ComfyUI (Image Generation):
- Model: Z-Image Turbo SDXL (BF16)
- CLIP: Qwen 3 4B
- Resolution: Up to 2048x2048
- Speed: ~10-20 seconds per image (GPU-dependent)
- Queue: Yes (job-based system)
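Because jobs queue on this instance, clients submit to /prompt and then poll /history/{prompt_id} (both standard ComfyUI endpoints) until outputs appear. A minimal poller:

```python
# Poll ComfyUI's history until a queued prompt finishes.
# GET /history/{prompt_id} stays empty until outputs are ready.
import time
import requests

def wait_for_job(prompt_id, host="http://10.0.0.11:8188", poll_s=2.0):
    while True:
        hist = requests.get(f"{host}/history/{prompt_id}", timeout=10).json()
        if prompt_id in hist:  # populated once the job completes
            return hist[prompt_id]["outputs"]
        time.sleep(poll_s)
```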
Resource Limits
GPU:
- Single NVIDIA GPU shared across:
- Whisper STT
- XTTS-v2 TTS
- Embeddings
- ComfyUI
- Concurrency: Jobs queued, one GPU-heavy task at a time
- VRAM: Shared pool
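One common way to enforce this one-GPU-task-at-a-time policy, given that Redis is already in the stack, is a shared lock around every GPU-bound job. This is an illustrative pattern, not the stack's actual worker code:

```python
# Serialize GPU-heavy work behind the single RTX 3060 with a Redis lock.
# Illustrative pattern only; the timeouts are placeholder values.
import redis

r = redis.Redis(host="10.0.0.11", port=6379)

def run_gpu_job(job_fn):
    # timeout: lock auto-expiry if a worker dies; blocking_timeout: max queue wait
    with r.lock("gpu-lock", timeout=600, blocking_timeout=3600):
        return job_fn()  # exactly one GPU-heavy task runs at a time
```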
Memory:
- Total: 32GB
- OS + Docker: ~4-6GB
- AI Services: ~10-15GB
- Database + Redis: ~2-4GB
- Available: ~8-12GB
Storage:
- /mnt/storage: Large capacity
- Docker volumes: ~100-200GB
- AI models: ~50-100GB
- Audio files: Growing (managed)
- Database: PostgreSQL (growing)
Network:
- Gigabit ethernet (local)
- Internet: Upload/download speed (TBD)
Performance Benchmarks
Speech-to-Text (Whisper):
- Arabic audio (1 minute): ~10-20 seconds
- English audio (1 minute): ~8-15 seconds
- Real-time factor: 0.15-0.35x
Text-to-Speech (XTTS-v2):
- Arabic (50 chars): ~1-2 seconds
- English (50 chars): ~1-2 seconds
- Voice cloning setup: ~5-10 seconds
Embeddings:
- Single text: ~50-100ms
- Batch 100 texts: ~2-5 seconds
- Batch 1000 texts: ~15-30 seconds
Image Generation:
- ComfyUI (1024x1024, 9 steps): 10-20 seconds
- Depends on GPU load
Docker Stack
┌──────────────────────────────────────────┐
│ Ubuntu Docker Services │
├──────────────────────────────────────────┤
│ proyaro-api-backend :8000 (Healthy) │
│ proyaro-whisper :8001 (Healthy) │
│ proyaro-tts :8002 (Healthy) │
│ proyaro-embeddings :8003 (Healthy) │
│ proyaro-comfyui :8188 (Healthy) │
│ proyaro-job-worker (background) │
│ proyaro-postgres :5432 (Healthy) │
│ proyaro-redis :6379 (Healthy) │
│ proyaro-frontend :3000 (Healthy) │
│ proyaro-caddy :80,:443 (Healthy)│
└──────────────────────────────────────────┘
Health Monitoring:
- All services have health checks
- Docker Compose manages restarts
- Logs available via docker compose logs
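For a quick sweep outside Docker's own checks, the listed ports can be probed directly. The ports come from the table above; a /health route on each service is an assumption:

```python
# Manual health sweep of the Docker services (ports from the table above;
# a /health route on every service is assumed, not documented).
import requests

SERVICES = {
    "api-backend": 8000, "whisper": 8001, "tts": 8002,
    "embeddings": 8003, "comfyui": 8188, "frontend": 3000,
}

for name, port in SERVICES.items():
    try:
        ok = requests.get(f"http://10.0.0.11:{port}/health", timeout=5).ok
    except requests.RequestException:
        ok = False
    print(f"{name:12s} {'healthy' if ok else 'DOWN'}")
```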
Software Stack
Container Images:
- API Backend: Python 3.11 + FastAPI
- AI Workers: NVIDIA CUDA base images
- Database: PostgreSQL 16
- Cache: Redis 7
- Frontend: Node.js (production build)
- Proxy: Caddy 2
Python Stack:
- fastapi
- uvicorn
- sqlalchemy (async)
- alembic (migrations)
- aiohttp (async HTTP)
- faster-whisper
- TTS (Coqui)
- sentence-transformers
- torch (CUDA support)
System:
- NVIDIA Docker runtime
- CUDA Toolkit
- Docker Compose V2
Comparison Matrix
| Feature | Mac Mini M4 | Ubuntu Server |
|---|---|---|
| Primary Role | Development, MLX | Production, GPU Jobs |
| Architecture | ARM64 (Apple Silicon) | x86_64 (AMD64) |
| AI Framework | MLX (Apple) | CUDA (NVIDIA) |
| RAM | 48GB Unified | 32GB System |
| GPU/Accelerator | Apple Neural Engine + GPU | NVIDIA RTX 3060 12GB |
| Text Generation | MLX (Fast, local) | Via MLX (proxied) |
| Image Generation | ComfyUI (CPU/GPU) | ComfyUI (GPU queue) |
| Video Generation | ✅ Yes (max 41 frames) | ❌ No |
| Speech-to-Text | ❌ No | ✅ Whisper (GPU) |
| Text-to-Speech | ❌ No | ✅ XTTS-v2 (GPU) |
| Embeddings (384-dim) | ✅ MLX (fast) | ❌ No |
| Embeddings (1024-dim) | ❌ No | ✅ E5-Large (GPU) |
| Job Queue | ❌ No (direct API) | ✅ Redis + Worker |
| Database | ❌ No | ✅ PostgreSQL |
| WebSocket | ❌ No | ✅ Yes |
| External Access | ❌ Internal only | ✅ HTTPS (Caddy) |
| Auto-restart | LaunchAgent (MLX) | Docker (all services) |
| Monitoring | Manual | Docker health checks |
Capacity Planning
Mac Mini M4
Current Usage:
- MLX model loaded: ~8GB
- System + apps: ~10GB
- Free RAM: ~30GB
- Capacity: Can load one large model at a time
Growth Potential:
- Limited by 48GB RAM (careful with model selection)
- Storage may need expansion for generated content
- Thermal limits unlikely to be reached
Bottlenecks:
- RAM limits (48GB total)
- Model switching time (10-30 seconds)
- Video frame limit (41 frames max)
- Single-machine (no horizontal scaling)
Ubuntu Server
Current Usage:
- Docker containers: ~20-25GB RAM
- GPU VRAM: 12GB (RTX 3060)
- Storage: Growing with audio files and DB
Growth Potential:
- Can scale vertically (upgrade to 64GB+ RAM)
- Can add more GPU for parallel processing
- Can scale horizontally (add more Ubuntu servers)
Bottlenecks:
- 32GB RAM (limits concurrent services)
- Single RTX 3060 12GB GPU (serializes GPU-heavy jobs)
- Network bandwidth (for large file uploads)
- Storage (audio files accumulate)
Recommended Use Cases by Machine
Use Mac Mini For:
✅ Text generation - Fastest, lowest latency
✅ Rapid prototyping - Direct API access
✅ Image generation (single requests) - No queue
✅ Video generation - Only available here
✅ Small embeddings (384-dim) - Very fast
✅ Development - Full dev environment
Use Ubuntu Server For:
✅ Production applications - Stable, monitored
✅ Speech-to-text - Only available here
✅ Text-to-speech - Only available here
✅ Large embeddings (1024-dim) - Production quality
✅ Batch processing - Job queue system
✅ External access - HTTPS, authentication
✅ Database-backed apps - PostgreSQL available
Maintenance & Updates
Mac Mini
Regular:
- Monitor MLX service logs: /tmp/mlx-service.log
- Check available storage
- Update MLX models when new versions release
Periodic:
- macOS system updates (careful with major versions)
- Python package updates
- ComfyUI updates
As Needed:
- Add new MLX models
- Clear generated content cache
Ubuntu Server
Regular:
- Monitor Docker container health: docker compose ps
- Check disk space: /mnt/storage
- Review job queue length (Redis)
- Database backups
Periodic:
- Ubuntu security updates: apt update && apt upgrade
- Docker image updates: Pull new images
- Database maintenance (VACUUM)
As Needed:
- Clear old audio files
- Clear old job results
- Optimize database queries
- Scale GPU resources
Disaster Recovery
Mac Mini Backup
Critical Data:
- MLX converted models (can be regenerated, but slowly)
- ComfyUI workflows (JSON files)
- Project code (Git repos)
Backup Strategy:
- Code: Git (remote repositories)
- Models: Can re-download/convert if needed
- Config files: Include in Git
Recovery:
- Reinstall macOS (if necessary)
- Install Homebrew, Python, Node.js
- Clone repositories
- Install MLX and dependencies
- Download/convert models
- Start services
Ubuntu Server Backup
Critical Data:
- PostgreSQL database (user data, jobs, results)
- Docker volumes (models, audio files)
- Environment configuration
- SSL certificates (Caddy manages auto-renewal)
Backup Strategy:
- Database: Daily pg_dump to external storage
- Models: Can re-download if needed
- Audio files: Periodic backup or accept loss
- Config: Git repository
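The daily pg_dump above can be as small as a cron-driven script. A sketch, using the proyaro-postgres container name from the Docker stack; the database name, user, and backup path are placeholders:

```python
# Nightly database dump, gzipped onto /mnt/storage (e.g. run from cron).
# DB name ("proyaro") and user ("postgres") are placeholder values.
import gzip
import subprocess
from datetime import datetime

out_path = f"/mnt/storage/backups/proyaro-{datetime.now():%Y%m%d}.sql.gz"

dump = subprocess.run(
    ["docker", "exec", "proyaro-postgres", "pg_dump", "-U", "postgres", "proyaro"],
    check=True, capture_output=True,
)
with open(out_path, "wb") as f:
    f.write(gzip.compress(dump.stdout))
```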
Recovery:
- Reinstall Ubuntu Server
- Install Docker + Docker Compose
- Restore /mnt/storage mount
- Clone repository
- Restore database dump
- Re-download AI models (or restore from backup)
- Start Docker Compose stack
Security Considerations
Mac Mini
Network Security:
- Not exposed to internet (internal only)
- No authentication on MLX service (trusted network)
- macOS firewall enabled (recommended)
File Security:
- User permissions: Standard user (yaro)
- No sensitive data stored (all data is generated)
Ubuntu Server
Network Security:
- Firewall: Only ports 80, 443, 22 open
- All AI services isolated in Docker network
- Caddy handles SSL/TLS termination
Application Security:
- JWT authentication on all API endpoints
- Password-protected database
- Environment variables for secrets
- Docker container isolation
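The usual shape of this setup in FastAPI is a bearer-token dependency that every route declares. A generic sketch of the pattern, not the project's actual auth code; the route path and claims are illustrative:

```python
# Generic JWT-protected FastAPI endpoint; a sketch of the common pattern,
# not this project's actual implementation. Secret comes from the environment.
import os

import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()
SECRET = os.environ["JWT_SECRET"]  # secrets live in environment variables

def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)):
    try:
        return jwt.decode(creds.credentials, SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid token")

@app.get("/api/jobs")  # illustrative route
def list_jobs(user=Depends(current_user)):
    return {"user": user.get("sub"), "jobs": []}
```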
Physical Security:
- Server location: Secure (assumed)
- SSH access: Password-protected (consider key-based)
Cost Analysis
Mac Mini M4
One-time:
- Hardware: ~$2,500-4,000 (depending on config)
- Software: Free (open source)
Recurring:
- Electricity: ~50W idle, ~100W under load ($10-20/month)
- Internet: Shared
- Maintenance: Minimal
Total Cost of Ownership (3 years): ~$3,000-4,500
Ubuntu Server
One-time:
- Hardware: ~$1,500-3,000 (GPU-dependent)
- Software: Free (open source, Linux)
Recurring:
- Electricity: ~150W idle, ~300W under load ($20-40/month)
- Domain: ~$15/year (proyaro.com)
- SSL: Free (Let's Encrypt via Caddy)
- Hosting: $0 (self-hosted)
Total Cost of Ownership (3 years): ~$2,500-4,500
Alternative (Cloud)
Equivalent cloud costs (estimated):
- GPU instance: $500-2,000/month
- MLX equivalent: Not available (Apple Silicon)
- Storage: $50-100/month
- Data transfer: $50-200/month
Total (3 years): $21,600-82,800
Savings with self-hosted: ~$16,000-75,000 🎉
Future Expansion Possibilities
Mac Mini
- More storage: External SSD for model/content storage
- More models: Add specialized models (code, creative, etc.)
- Clustering: Add more Mac Minis (requires network setup)
- RAM: Fixed at 48GB (Apple unified memory is not upgradeable); more capacity means replacing the machine
Ubuntu Server
- More GPUs: Add second NVIDIA GPU for parallel processing
- More RAM: Upgrade to 128GB or 256GB
- More storage: Expand /mnt/storage
- Clustering: Add more Ubuntu nodes (horizontal scaling)
- Kubernetes: Migrate from Docker Compose to K8s
Quick Specs Reference
╔═══════════════════════════════════════════════════════╗
║ MACHINE QUICK REFERENCE ║
╠═══════════════════════════════════════════════════════╣
║ MAC MINI M4 PRO ║
║ IP: 10.0.0.188 ║
║ RAM: 48GB Unified Memory ║
║ GPU: 20-core Apple GPU + Neural Engine ║
║ Best for: MLX text gen, rapid prototyping ║
║ Specialty: Egyptian Arabic, video generation ║
╠═══════════════════════════════════════════════════════╣
║ UBUNTU SERVER ║
║ IP: 10.0.0.11 (external: api.proyaro.com) ║
║ RAM: 32GB System Memory ║
║ GPU: NVIDIA RTX 3060 (12GB VRAM) ║
║ Best for: Production, STT, TTS, embeddings ║
║ Specialty: Job queue, WebSocket, database ║
╚═══════════════════════════════════════════════════════╝
Last Updated: 2026-01-02
ProYaro AI Infrastructure Documentation • Version 1.2