Aleks Gotsa

AI/ML Engineer

I build LLM applications, agent workflows, and the routing and evaluation infrastructure that makes them reliable in production. My work sits at the seam between research-style techniques (distillation, LoRA, RAG) and production realities such as cost routing, fallback layers, and monitoring without a dedicated SRE.

Currently shipping at GAZDA in Uzhhorod, where I've built an MCP-based automation layer wiring GPT-4 and Claude into internal tooling, and put ~40 production agent workflows in front of the business. On the side I run Tiny Minds, a 3-person experimental AI lab where I'm building a teacher-expert distillation runtime, and recently shipped Cortex, an open-source multi-pass RAG agent with per-claim verification.

Now 2026

2026 · May
🔧 Tiny LLM v0.6 in progress at Tiny Minds. Planning-expert distillation mode wired up, teacher backend with --distill flag, 79/79 tests passing; end-to-end benchmark at 96.25% across 80 prompts.
2026 · Apr
🚀 Shipped Cortex, an open-source multi-pass RAG agent with per-claim verification. ~6.7K LOC, MCP server, public retrospective in repo.
2026 · Mar
🧪 Started Cortex and Tiny Minds in parallel: Cortex as a self-contained 6-stage research engine, Tiny Minds as a longer-arc lab for distillation and routing.
2026
🎓 Started B.Sc. CS at University of the People, a formal credential alongside the build.
2025
⚙️ At GAZDA, crossed ~40 production agent workflows across data validation, content generation, and ops. Fallback and monitoring layer holding without a dedicated SRE.
2024 · Jul
🛠️ Joined GAZDA as Applied AI Engineer. Started building the MCP-based automation layer connecting GPT-4 and Claude to internal tools.

Projects 02

[01]

Cortex · Shipped · MIT

Multi-pass RAG agent with per-claim verification

A six-stage async research engine — plan → gather → detect gaps → synthesize → verify → remember — exposed via REST + SSE, a CLI, and an MCP server with four tools. About 6.7K LOC, ~30–90s end-to-end with live stage streaming.
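As an illustration only, a staged async loop with per-stage streaming could be sketched as below. The stage names come from the description above; the stub stages, the `State` shape, and the `run_pipeline` and `collect` helpers are hypothetical and not taken from the Cortex codebase.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable

# Stage order as described above; all other names here are assumptions.
STAGES = ["plan", "gather", "detect_gaps", "synthesize", "verify", "remember"]

State = dict[str, Any]
StageFn = Callable[[State], Awaitable[State]]


def make_stub(name: str) -> StageFn:
    """Build a placeholder stage that only records that it ran."""
    async def stage(state: State) -> State:
        await asyncio.sleep(0)  # stand-in for real model or retrieval calls
        return {**state, "trace": state["trace"] + [name]}
    return stage


async def run_pipeline(
    question: str, stages: dict[str, StageFn]
) -> AsyncIterator[tuple[str, State]]:
    """Run the stages in order, yielding (stage_name, state) after each one."""
    state: State = {"question": question, "trace": []}
    for name in STAGES:
        state = await stages[name](state)
        yield name, state  # a server could forward this as one streamed event


async def collect(question: str) -> list[str]:
    """Drive the pipeline with stub stages and return the emitted stage names."""
    stages = {name: make_stub(name) for name in STAGES}
    return [name async for name, _ in run_pipeline(question, stages)]
```

Yielding after every stage is what lets a client show progress while the later, slower stages are still running.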

The load-bearing piece is the per-claim verifier: it re-reads every cited source and returns confirmed / weakened / unsupported verdicts, catching synthesizer over-claims that no amount of extra retrieval can fix. Cost-routed by task shape — bounded JSON to Haiku, open-ended semantic work to Sonnet — for a ~3–4× cost reduction.
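A rough sketch of the two ideas in this paragraph, the three-way verdict and routing by task shape, might look like the following. The verdict labels are from the text; the `Task` type, the `bounded_json` flag, the tier strings, and both helper functions are hypothetical, and the tier strings are labels rather than real model identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    """The three per-claim outcomes named in the text."""
    CONFIRMED = "confirmed"
    WEAKENED = "weakened"
    UNSUPPORTED = "unsupported"


@dataclass(frozen=True)
class Task:
    name: str
    bounded_json: bool  # True when the output is a small, schema-constrained object


# Hypothetical tier labels, not real model identifiers.
CHEAP_TIER = "haiku"
STRONG_TIER = "sonnet"


def route(task: Task) -> str:
    """Send bounded JSON work to the cheap tier and open-ended work to the strong tier."""
    return CHEAP_TIER if task.bounded_json else STRONG_TIER


def claims_to_revise(verdicts: dict[str, Verdict]) -> list[str]:
    """Return the claims whose cited source did not fully confirm them."""
    return [claim for claim, verdict in verdicts.items()
            if verdict is not Verdict.CONFIRMED]
```

For example, `route(Task("extract_fields", bounded_json=True))` picks the cheap tier, while a free-form synthesis task would go to the strong one.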

Python 3.12 · FastAPI · Next.js · Qdrant + BGE · Anthropic SDK · FastMCP

[02]

Tiny Mind · In Progress

Teacher-expert distillation runtime with trace-based routing · at Tiny Minds

A local-first multi-service runtime with confidence-thresholded escalation: request → router → local expert → teacher fallback. Router, expert, memory, and teacher run as separate FastAPI processes with SQLite-backed memory.
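The escalation path above can be sketched in-process as follows. This is a minimal illustration, not the runtime itself: in the real system each role is a separate service, and the `Answer` type, the callable signatures, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Answer:
    text: str
    source: str               # name of the local expert, or "teacher"
    expert_confidence: float  # score the local expert reported, in [0, 1]


# Hypothetical interfaces: an expert returns (text, confidence),
# the router returns an expert name, the teacher returns text.
Expert = Callable[[str], tuple[str, float]]
Router = Callable[[str], str]
Teacher = Callable[[str], str]


def handle(
    prompt: str,
    route: Router,
    experts: dict[str, Expert],
    teacher: Teacher,
    threshold: float = 0.8,  # assumed value; not stated in the source
) -> Answer:
    """request -> router -> local expert -> teacher fallback."""
    name = route(prompt)
    text, confidence = experts[name](prompt)
    if confidence >= threshold:
        return Answer(text, source=name, expert_confidence=confidence)
    # Below the threshold: escalate to the teacher and keep the expert's score.
    return Answer(teacher(prompt), source="teacher", expert_confidence=confidence)
```

Keeping the expert's confidence on escalated answers makes it possible to log which prompts fell through, which is the kind of trace a distillation loop could later train on.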

Three local experts (style, planning, retrieval) with per-expert confidence scoring and an isolation evaluation harness. All three score 20/20 in isolation after the v0.5 hotfixes; the end-to-end benchmark is at 96.25% across 80 prompts. The LoRA fine-tuning pipeline on SmolLM2-135M runs end-to-end for the teacher-trace-to-expert distillation loop; the thesis, that small experts can absorb teacher work without quality collapse, is pending empirical validation.

Python · FastAPI · Next.js · SmolLM2-135M · LoRA / PEFT · SQLite

Contact Open

Reach me at gotsaaleks@gmail.com. I read everything and reply to most.