Machines & Infrastructure Specifications
Overview
ProYaro AI Stack operates on two dedicated servers with complementary capabilities:
- Mac Mini M4 - Apple Silicon powerhouse for MLX inference
- Ubuntu Server - NVIDIA GPU server for production AI workloads
Mac Mini M4 (10.0.0.188)
Hardware Specifications
Model: Mac Mini (Late 2024)
Chip: Apple M4 Pro
RAM: 48GB Unified Memory
Storage: High-speed NVMe SSD (capacity TBD)
Neural Engine: 16-core (Apple Silicon integrated)
GPU Cores: 20-core GPU (M4 Pro configuration)
Operating System
OS: macOS Sequoia 15.7.2
Kernel: Darwin 24.6.0
Architecture: ARM64 (Apple Silicon)
AI Capabilities
MLX Framework
- Native Apple Silicon optimization
- Unified memory architecture (48GB shared between CPU/GPU/Neural Engine)
- Mixed precision support: FP32, FP16, BF16, 4-bit, 8-bit quantization
- Token generation speed: 30-50 tokens/sec (7B-14B models), 15-25 tokens/sec (30B+ models)
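For local experimentation, the mlx-lm package (listed under Python Packages below) is enough to load one of these quantized models and generate text. A minimal sketch; the model path is illustrative, since the actual local path isn't documented in this section:

```python
# Minimal mlx-lm generation sketch. The model path below is a placeholder;
# substitute the actual local or Hugging Face path of the active model.
from mlx_lm import load, generate

model, tokenizer = load("qwen2.5-combined-egyptian-marketing-4bit")  # hypothetical path
prompt = "اكتب إعلانًا قصيرًا عن مقهى جديد في القاهرة"  # short Egyptian Arabic ad brief

print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```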
Supported Models (MLX)
- qwen2.5-combined-egyptian-marketing-4bit (Active by default)
  - Size: ~4-8GB RAM usage
  - Specialty: Egyptian Arabic marketing content
  - Speed: ~40 tokens/sec
- Qwen3-Coder-30B-A3B-Instruct-4bit
  - Size: ~15-20GB RAM usage
  - Specialty: Code generation
  - Speed: ~20 tokens/sec
- Qwen2.5-14B-Instruct-4bit
  - Size: ~8-12GB RAM usage
  - Specialty: General instruction following
  - Speed: ~35 tokens/sec
Embeddings
- Model: multilingual-e5-small
- Dimensions: 384
- Speed: Very fast (Apple Neural Engine)
- Languages: 100+ including Arabic
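The MLX FastAPI service on :8004 fronts this model. The exact route and schema aren't documented in this section, so the /embeddings path and payload below are assumptions, sketched for illustration:

```python
# Hypothetical client call to the MLX embeddings service. The port comes from
# this doc; the /embeddings route and request/response shape are assumptions.
import requests

resp = requests.post(
    "http://10.0.0.188:8004/embeddings",
    json={"texts": ["مرحبا بالعالم", "hello world"]},
    timeout=10,
)
resp.raise_for_status()
vectors = resp.json()  # expected: one 384-dim vector per input text
```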
Image Generation (ComfyUI)
- Model: Z-Image Turbo (SDXL-based, BF16)
- CLIP: Qwen 3 4B (multilingual)
- Resolution: Up to 2048x2048
- Speed: ~8-15 seconds per 1024x1024 image
- Batch size: Limited by RAM (can do 1-4 images)
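ComfyUI accepts workflow graphs over HTTP via POST /prompt (a standard ComfyUI endpoint). A minimal way to queue a generation from code, assuming a workflow exported with "Save (API Format)":

```python
# Queue an image job against ComfyUI on the Mac Mini. POST /prompt is
# ComfyUI's standard queueing endpoint; workflow.json is a placeholder file.
import json
import requests

with open("workflow.json") as f:
    workflow = json.load(f)

resp = requests.post("http://10.0.0.188:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
print(resp.json()["prompt_id"])  # job id; results can be fetched from /history
```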
Video Generation (ComfyUI)
- Model: Wan2.2-T2V (quantized to Q4)
- CLIP: UMT5-XXL (FP16)
- Frame limit: 41 frames maximum (M4 constraint)
- Resolution: 480x720, 640x480, etc.
- Speed: ~30-60 seconds for 30-frame video
- Format: MP4
Resource Limits
Memory:
- Total: 48GB
- OS + System: ~8-10GB
- MLX Model: 4-20GB (depending on model)
- ComfyUI Models: ~8-12GB loaded
- Available for processing: ~10-25GB
Storage:
- Models: ~50-100GB
- Generated content: Limited by SSD capacity
- Workspace: Plenty available
Thermal:
- Active cooling (fan)
- Can sustain continuous inference
- May throttle under extreme sustained load (unlikely in normal use)
Performance Benchmarks
Text Generation:
- Qwen2.5-14B-4bit: 35 tokens/sec average
- Qwen3-Coder-30B-4bit: 20 tokens/sec average
- Prompt processing: Extremely fast (batch processing optimized)
Image Generation:
- Z-Image Turbo (1024x1024, 9 steps): 8-12 seconds
- SDXL (1024x1024, 20 steps): 20-30 seconds
Video Generation:
- Wan2.2 (30 frames, 640x480): 45-60 seconds
- Limited by frame count (max 41)
Embeddings:
- Single text: <10ms
- Batch 100 texts: <500ms
Installed Services
┌─────────────────────────────────────────┐
│ Mac Mini M4 Services │
├─────────────────────────────────────────┤
│ MLX FastAPI :8004 (Auto-start)│
│ ComfyUI :8188 (Manual) │
│ a2zadd Backend :3000 (Manual) │
│ Frontend Dev :5173 (Manual) │
└─────────────────────────────────────────┘
Auto-start Services:
- MLX FastAPI: LaunchAgent configured (com.mlx.service)
Service Locations:
- MLX: /Users/yaro/Documents/new-stack/mlx_service/
- ComfyUI: System-wide installation
- Backend: /Users/yaro/Documents/a2zadd/packages/backend/
- Frontend: /Users/yaro/Documents/a2zadd/packages/frontend/
Software Stack
Runtime:
- Python 3.11+
- Node.js 20+
- Bun (JavaScript runtime)
Python Packages:
- mlx
- mlx-lm
- mlx-embedding-models
- fastapi
- uvicorn
- numpy
- ComfyUI (full installation)
System Tools:
- git
- curl
- wget
- homebrew
- docker (optional, not currently used)
Ubuntu Server (10.0.0.11)
Hardware Specifications
CPU: Multi-core x86_64 processor (specific model TBD)
RAM: 32GB DDR4/DDR5
GPU: NVIDIA GeForce RTX 3060
- VRAM: 12GB GDDR6
Storage:
- Primary: SSD for OS and Docker
- Secondary: /mnt/storage, a large-capacity mount for AI models and data
Operating System
OS: Ubuntu (Linux)
Kernel: Linux (version TBD)
Architecture: x86_64 (AMD64)
Container Runtime: Docker + Docker Compose
AI Capabilities
GPU-Accelerated Services
Whisper (Speech-to-Text):
- Model: faster-whisper-large-v3
- Languages: 100+ (optimized for Arabic)
- Speed: Real-time factor ~0.1-0.3x (faster than audio length)
- Audio formats: WAV, MP3, etc.
- GPU: CUDA-accelerated
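A typical client interaction is a multipart upload of the audio file. The /transcribe route and field names below are assumptions, since the container's actual API isn't documented in this section:

```python
# Hypothetical transcription request to the Whisper service on :8001.
# The route and multipart field names are assumptions for illustration only.
import requests

with open("sample.wav", "rb") as audio:
    resp = requests.post(
        "http://10.0.0.11:8001/transcribe",
        files={"file": ("sample.wav", audio, "audio/wav")},
        data={"language": "ar"},  # the service is optimized for Arabic
        timeout=120,
    )
resp.raise_for_status()
print(resp.json())
```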
XTTS-v2 (Text-to-Speech):
- Languages: Arabic, English, + 15 others
- Voice cloning: Yes (from sample audio)
- Speed: ~1-3 seconds per sentence
- Quality: 24kHz, 16-bit WAV
- GPU: CUDA-accelerated
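Synthesis is likewise a simple HTTP round trip that returns WAV bytes. The /tts route and JSON schema below are assumptions; the 24kHz WAV output format is from the spec above:

```python
# Hypothetical synthesis request to the XTTS-v2 service on :8002.
# The /tts route and payload shape are assumptions, not documented here.
import requests

resp = requests.post(
    "http://10.0.0.11:8002/tts",
    json={"text": "أهلاً وسهلاً", "language": "ar"},
    timeout=60,
)
resp.raise_for_status()
with open("welcome.wav", "wb") as f:
    f.write(resp.content)  # 24kHz, 16-bit WAV per the spec above
```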
Embeddings:
- Model: multilingual-e5-large
- Dimensions: 1024
- Languages: 100+ including Arabic
- Speed: GPU-accelerated, batch processing
- Max batch: 1000 texts
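Since the service caps batches at 1000 texts, larger corpora need client-side chunking. A sketch of that pattern; the :8003 port is from this doc, while the /embed route and response key are assumptions:

```python
# Chunk large corpora so no single request exceeds the 1000-text batch cap.
# The /embed route and "embeddings" response key are assumptions.
import requests

def embed_all(texts, batch_size=1000):
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = requests.post(
            "http://10.0.0.11:8003/embed",
            json={"texts": texts[i:i + batch_size]},
            timeout=60,
        )
        resp.raise_for_status()
        vectors.extend(resp.json()["embeddings"])
    return vectors
```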
ComfyUI (Image Generation):
- Model: Z-Image Turbo SDXL (BF16)
- CLIP: Qwen 3 4B
- Resolution: Up to 2048x2048
- Speed: ~10-20 seconds per image (GPU-dependent)
- Queue: Yes (job-based system)
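Because jobs queue on this instance, clients submit to /prompt and then poll /history/{prompt_id} (both standard ComfyUI endpoints) until outputs appear. A minimal poller:

```python
# Poll ComfyUI's history until a queued prompt finishes.
# GET /history/{prompt_id} stays empty until outputs are ready.
import time
import requests

def wait_for_job(prompt_id, host="http://10.0.0.11:8188", poll_s=2.0):
    while True:
        hist = requests.get(f"{host}/history/{prompt_id}", timeout=10).json()
        if prompt_id in hist:  # populated once the job completes
            return hist[prompt_id]["outputs"]
        time.sleep(poll_s)
```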
Resource Limits
GPU:
- Single NVIDIA GPU shared across:
- Whisper STT
- XTTS-v2 TTS
- Embeddings
- ComfyUI
- Concurrency: Jobs queued, one GPU-heavy task at a time
- VRAM: Shared pool
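One common way to enforce this one-GPU-task-at-a-time policy, given that Redis is already in the stack, is a shared lock around every GPU-bound job. This is an illustrative pattern, not the stack's actual worker code:

```python
# Serialize GPU-heavy work behind the single RTX 3060 with a Redis lock.
# Illustrative pattern only; the timeouts are placeholder values.
import redis

r = redis.Redis(host="10.0.0.11", port=6379)

def run_gpu_job(job_fn):
    # timeout: lock auto-expiry if a worker dies; blocking_timeout: max queue wait
    with r.lock("gpu-lock", timeout=600, blocking_timeout=3600):
        return job_fn()  # exactly one GPU-heavy task runs at a time
```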
Memory:
- Total: 32GB
- OS + Docker: ~4-6GB
- AI Services: ~10-15GB
- Database + Redis: ~2-4GB
- Available: ~8-12GB
Storage:
- /mnt/storage: Large capacity
- Docker volumes: ~100-200GB
- AI models: ~50-100GB
- Audio files: Growing (managed)
- Database: PostgreSQL (growing)
Network:
- Gigabit ethernet (local)
- Internet: Upload/download speed (TBD)
Performance Benchmarks
Speech-to-Text (Whisper):
- Arabic audio (1 minute): ~10-20 seconds
- English audio (1 minute): ~8-15 seconds
- Real-time factor: 0.15-0.35x
Text-to-Speech (XTTS-v2):
- Arabic (50 chars): ~1-2 seconds
- English (50 chars): ~1-2 seconds
- Voice cloning setup: ~5-10 seconds
Embeddings:
- Single text: ~50-100ms
- Batch 100 texts: ~2-5 seconds
- Batch 1000 texts: ~15-30 seconds
Image Generation:
- ComfyUI (1024x1024, 9 steps): 10-20 seconds
- Depends on GPU load
Docker Stack
┌──────────────────────────────────────────┐
│ Ubuntu Docker Services │
├──────────────────────────────────────────┤
│ proyaro-api-backend :8000 (Healthy) │
│ proyaro-whisper :8001 (Healthy) │
│ proyaro-tts :8002 (Healthy) │
│ proyaro-embeddings :8003 (Healthy) │
│ proyaro-comfyui :8188 (Healthy) │
│ proyaro-job-worker (background) │
│ proyaro-postgres :5432 (Healthy) │
│ proyaro-redis :6379 (Healthy) │
│ proyaro-frontend :3000 (Healthy) │
│ proyaro-caddy :80,:443 (Healthy)│
└──────────────────────────────────────────┘
Health Monitoring:
- All services have health checks
- Docker Compose manages restarts
- Logs available via docker compose logs
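For a quick sweep outside Docker's own checks, the listed ports can be probed directly. The ports come from the table above; a /health route on each service is an assumption:

```python
# Manual health sweep of the Docker services (ports from the table above;
# a /health route on every service is assumed, not documented).
import requests

SERVICES = {
    "api-backend": 8000, "whisper": 8001, "tts": 8002,
    "embeddings": 8003, "comfyui": 8188, "frontend": 3000,
}

for name, port in SERVICES.items():
    try:
        ok = requests.get(f"http://10.0.0.11:{port}/health", timeout=5).ok
    except requests.RequestException:
        ok = False
    print(f"{name:12s} {'healthy' if ok else 'DOWN'}")
```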
Software Stack
Container Images:
- API Backend: Python 3.11 + FastAPI
- AI Workers: NVIDIA CUDA base images
- Database: PostgreSQL 16
- Cache: Redis 7
- Frontend: Node.js (production build)
- Proxy: Caddy 2
Python Stack:
- fastapi
- uvicorn
- sqlalchemy (async)
- alembic (migrations)
- aiohttp (async HTTP)
- faster-whisper
- TTS (Coqui)
- sentence-transformers
- torch (CUDA support)
System:
- NVIDIA Docker runtime
- CUDA Toolkit
- Docker Compose V2
Comparison Matrix
| Feature | Mac Mini M4 | Ubuntu Server |
|---|---|---|
| Primary Role | Development, MLX | Production, GPU Jobs |
| Architecture | ARM64 (Apple Silicon) | x86_64 (AMD64) |
| AI Framework | MLX (Apple) | CUDA (NVIDIA) |
| RAM | 48GB Unified | 32GB System |
| GPU/Accelerator | Apple Neural Engine + GPU | NVIDIA RTX 3060 12GB |
| Text Generation | MLX (Fast, local) | Via MLX (proxied) |
| Image Generation | ComfyUI (CPU/GPU) | ComfyUI (GPU queue) |
| Video Generation | ✅ Yes (max 41 frames) | ❌ No |
| Speech-to-Text | ❌ No | ✅ Whisper (GPU) |
| Text-to-Speech | ❌ No | ✅ XTTS-v2 (GPU) |
| Embeddings (384-dim) | ✅ MLX (fast) | ❌ No |
| Embeddings (1024-dim) | ❌ No | ✅ E5-Large (GPU) |
| Job Queue | ❌ No (direct API) | ✅ Redis + Worker |
| Database | ❌ No | ✅ PostgreSQL |
| WebSocket | ❌ No | ✅ Yes |
| External Access | ❌ Internal only | ✅ HTTPS (Caddy) |
| Auto-restart | LaunchAgent (MLX) | Docker (all services) |
| Monitoring | Manual | Docker health checks |
Capacity Planning
Mac Mini M4
Current Usage:
- MLX model loaded: ~8GB
- System + apps: ~10GB
- Free RAM: ~30GB
- Capacity: Can load one large model at a time
Growth Potential:
- Limited by 48GB RAM (careful with model selection)
- Storage may need expansion for generated content
- Thermal limits unlikely to be reached
Bottlenecks:
- RAM limits (48GB total)
- Model switching time (10-30 seconds)
- Video frame limit (41 frames max)
- Single-machine (no horizontal scaling)
Ubuntu Server
Current Usage:
- Docker containers: ~20-25GB RAM
- GPU VRAM: 12GB (RTX 3060)
- Storage: Growing with audio files and DB
Growth Potential:
- Can scale vertically (upgrade to 64GB+ RAM)
- Can add more GPU for parallel processing
- Can scale horizontally (add more Ubuntu servers)
Bottlenecks:
- 32GB RAM (limits concurrent services)
- Single RTX 3060 12GB GPU (serializes GPU-heavy jobs)
- Network bandwidth (for large file uploads)
- Storage (audio files accumulate)
Recommended Use Cases by Machine
Use Mac Mini For:
✅ Text generation - Fastest, lowest latency
✅ Rapid prototyping - Direct API access
✅ Image generation (single requests) - No queue
✅ Video generation - Only available here
✅ Small embeddings (384-dim) - Very fast
✅ Development - Full dev environment
Use Ubuntu Server For:
✅ Production applications - Stable, monitored
✅ Speech-to-text - Only available here
✅ Text-to-speech - Only available here
✅ Large embeddings (1024-dim) - Production quality
✅ Batch processing - Job queue system
✅ External access - HTTPS, authentication
✅ Database-backed apps - PostgreSQL available
Maintenance & Updates
Mac Mini
Regular:
- Monitor MLX service logs: /tmp/mlx-service.log
- Check available storage
- Update MLX models when new versions release
Periodic:
- macOS system updates (careful with major versions)
- Python package updates
- ComfyUI updates
As Needed:
- Add new MLX models
- Clear generated content cache
Ubuntu Server
Regular:
- Monitor Docker container health: docker compose ps
- Check disk space: /mnt/storage
- Review job queue length (Redis)
- Database backups
Periodic:
- Ubuntu security updates: apt update && apt upgrade
- Docker image updates: Pull new images
- Database maintenance (VACUUM)
As Needed:
- Clear old audio files
- Clear old job results
- Optimize database queries
- Scale GPU resources
Disaster Recovery
Mac Mini Backup
Critical Data:
- MLX converted models (can be regenerated, but slowly)
- ComfyUI workflows (JSON files)
- Project code (Git repos)
Backup Strategy:
- Code: Git (remote repositories)
- Models: Can re-download/convert if needed
- Config files: Include in Git
Recovery:
- Reinstall macOS (if necessary)
- Install Homebrew, Python, Node.js
- Clone repositories
- Install MLX and dependencies
- Download/convert models
- Start services
Ubuntu Server Backup
Critical Data:
- PostgreSQL database (user data, jobs, results)
- Docker volumes (models, audio files)
- Environment configuration
- SSL certificates (Caddy manages auto-renewal)
Backup Strategy:
- Database: Daily pg_dump to external storage
- Models: Can re-download if needed
- Audio files: Periodic backup or accept loss
- Config: Git repository
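The daily pg_dump above can be as small as a cron-driven script. A sketch, using the proyaro-postgres container name from the Docker stack; the database name, user, and backup path are placeholders:

```python
# Nightly database dump, gzipped onto /mnt/storage (e.g. run from cron).
# DB name ("proyaro") and user ("postgres") are placeholder values.
import gzip
import subprocess
from datetime import datetime

out_path = f"/mnt/storage/backups/proyaro-{datetime.now():%Y%m%d}.sql.gz"

dump = subprocess.run(
    ["docker", "exec", "proyaro-postgres", "pg_dump", "-U", "postgres", "proyaro"],
    check=True, capture_output=True,
)
with open(out_path, "wb") as f:
    f.write(gzip.compress(dump.stdout))
```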
Recovery:
- Reinstall Ubuntu Server
- Install Docker + Docker Compose
- Restore /mnt/storage mount
- Clone repository
- Restore database dump
- Re-download AI models (or restore from backup)
- Start Docker Compose stack
Security Considerations
Mac Mini
Network Security:
- Not exposed to internet (internal only)
- No authentication on MLX service (trusted network)
- macOS firewall enabled (recommended)
File Security:
- User permissions: Standard user (yaro)
- No sensitive data stored (all data is generated)
Ubuntu Server
Network Security:
- Firewall: Only ports 80, 443, 22 open
- All AI services isolated in Docker network
- Caddy handles SSL/TLS termination
Application Security:
- JWT authentication on all API endpoints
- Password-protected database
- Environment variables for secrets
- Docker container isolation
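The usual shape of this setup in FastAPI is a bearer-token dependency that every route declares. A generic sketch of the pattern, not the project's actual auth code; the route path and claims are illustrative:

```python
# Generic JWT-protected FastAPI endpoint; a sketch of the common pattern,
# not this project's actual implementation. Secret comes from the environment.
import os

import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()
SECRET = os.environ["JWT_SECRET"]  # secrets live in environment variables

def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)):
    try:
        return jwt.decode(creds.credentials, SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid token")

@app.get("/api/jobs")  # illustrative route
def list_jobs(user=Depends(current_user)):
    return {"user": user.get("sub"), "jobs": []}
```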
Physical Security:
- Server location: Secure (assumed)
- SSH access: Password-protected (consider key-based)
Cost Analysis
Mac Mini M4
One-time:
- Hardware: ~$2,500-4,000 (depending on config)
- Software: Free (open source)
Recurring:
- Electricity: ~50W idle, ~100W under load ($10-20/month)
- Internet: Shared
- Maintenance: Minimal
Total Cost of Ownership (3 years): ~$3,000-4,500
Ubuntu Server
One-time:
- Hardware: ~$1,500-3,000 (GPU-dependent)
- Software: Free (open source, Linux)
Recurring:
- Electricity: ~150W idle, ~300W under load ($20-40/month)
- Domain: ~$15/year (proyaro.com)
- SSL: Free (Let's Encrypt via Caddy)
- Hosting: $0 (self-hosted)
Total Cost of Ownership (3 years): ~$2,500-4,500
Alternative (Cloud)
Equivalent cloud costs (estimated):
- GPU instance: $500-2,000/month
- MLX equivalent: Not available (Apple Silicon)
- Storage: $50-100/month
- Data transfer: $50-200/month
Total (3 years): $21,600-82,800
Savings with self-hosted: ~$16,000-75,000 🎉
Future Expansion Possibilities
Mac Mini
- More storage: External SSD for model/content storage
- More models: Add specialized models (code, creative, etc.)
- Clustering: Add more Mac Minis (requires network setup)
- RAM: Fixed at 48GB (Apple unified memory is not upgradeable); more capacity means replacing the machine
Ubuntu Server
- More GPUs: Add second NVIDIA GPU for parallel processing
- More RAM: Upgrade to 128GB or 256GB
- More storage: Expand /mnt/storage
- Clustering: Add more Ubuntu nodes (horizontal scaling)
- Kubernetes: Migrate from Docker Compose to K8s
Quick Specs Reference
╔═══════════════════════════════════════════════════════╗
║ MACHINE QUICK REFERENCE ║
╠═══════════════════════════════════════════════════════╣
║ MAC MINI M4 PRO ║
║ IP: 10.0.0.188 ║
║ RAM: 48GB Unified Memory ║
║ GPU: 20-core Apple GPU + Neural Engine ║
║ Best for: MLX text gen, rapid prototyping ║
║ Specialty: Egyptian Arabic, video generation ║
╠═══════════════════════════════════════════════════════╣
║ UBUNTU SERVER ║
║ IP: 10.0.0.11 (external: api.proyaro.com) ║
║ RAM: 32GB System Memory ║
║ GPU: NVIDIA RTX 3060 (12GB VRAM) ║
║ Best for: Production, STT, TTS, embeddings ║
║ Specialty: Job queue, WebSocket, database ║
╚═══════════════════════════════════════════════════════╝
Last Updated: 2026-01-02
ProYaro AI Infrastructure Documentation • Version 1.2