AI ECOSYSTEM
FIELD GUIDE

Ranjit Kohli / YYZdata
May 2026 Edition
Protocols · Models · Data · Compute
// FULL AI APPLICATION STACK — READ BOTTOM TO TOP
YOUR APP / USER
Web UI / Mobile / API
Business Logic
Output: insights, alerts, actions

ORCHESTRATION
Agent / Orchestrator
LangChain / LangGraph
Task Router / Planner

PROTOCOLS
MCP (agent ↔ tools)
+
A2A (agent ↔ agent)
+
AG-UI (agent ↔ frontend)

MODELS
LLM (reasoning/text)
+
Vision Model (images/video)
+
Embedding Model (search)
+
Classifier (labels)

DATA LAYER
Data Lake (raw, all types)
Data Warehouse (structured, analytics)
Vector DB (semantic search)

COMPUTE
Cloud GPU (train / big inference)
Edge GPU / Jetson (local server)
Qualcomm SOC / NPU (device/robot)
§ 02 — Model Types: What Each One Does
LLM — Large Language Model
Text in, text out. Reasoning, generation, summarization.

Trained on massive text. Uses a transformer architecture to predict the next token. Good at writing, reasoning, code, Q&A, summarization, and chat. An LLM doesn't "know" things — it predicts what text should come next based on patterns learned during training.

Think of it as a very well-read generalist that can talk about almost anything but doesn't automatically know what's happening right now.

Claude · GPT-4o · Llama 3 · Gemini
Vision Model
Images / video in, labels or text out.

Trained on image data. Can detect objects, classify scenes, count things, read text in images (OCR), track movement, and identify anomalies. Dedicated vision models use a different architecture from text LLMs: convolutional nets or vision transformers (ViT) that process images as grids of pixel patches.

Great for: retail shrinkage detection, factory defect inspection, store foot traffic, security camera analysis.

YOLO · SAM · GPT-4V · Gemini Vision
Embedding Model
Text in, numbers out. Powers semantic search.

Converts text into a dense vector — a list of numbers that encodes meaning. Similar-meaning text produces similar vectors. This is how AI can search for "customer complaints about billing" and find relevant records even if those words aren't in the document.

The backbone of RAG (Retrieval-Augmented Generation) systems. You need this anytime you want AI to search your own data intelligently.

text-embedding-3 · BGE · CLIP
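The "similar meaning → similar vectors" idea can be made concrete with cosine similarity. This is a minimal, dependency-free sketch: the hand-picked 3-dimensional lists below stand in for real embeddings (which have ~1536 dimensions), and the variable names are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models emit ~1536 dimensions).
billing_complaint = [0.9, 0.1, 0.0]
invoice_dispute   = [0.8, 0.2, 0.1]   # similar meaning -> similar vector
store_hours_q     = [0.0, 0.1, 0.9]   # unrelated topic -> distant vector

print(cosine_similarity(billing_complaint, invoice_dispute))  # close to 1
print(cosine_similarity(billing_complaint, store_hours_q))    # close to 0
```

A vector database is essentially this comparison run efficiently over millions of stored vectors.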
Traditional ML Model
Structured data in, predictions out.

The original AI. Decision trees, random forests, gradient boosting (XGBoost), logistic regression. Trained on labeled tabular data. Not neural networks. Excellent for fraud detection, credit risk scoring, churn prediction, spend classification, and anything with clean structured data.

Customer spend analysis and default risk — this is the right tool. Faster, cheaper, and more explainable than LLMs for those tasks.

XGBoost · scikit-learn · LightGBM
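For intuition about "structured data in, predictions out," here is a hand-rolled logistic regression trained by gradient descent on synthetic data. In practice you would reach for XGBoost or scikit-learn; the feature names and numbers here are invented for illustration.

```python
import math

def train_logistic(rows, labels, lr=0.1, epochs=200):
    """Minimal logistic regression: a stand-in for XGBoost/scikit-learn
    on the same kind of structured-data task."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1 / (1 + math.exp(-z))      # predicted probability
            err = p - y                     # gradient of log-loss wrt z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-z))

# Synthetic features: [late_payments, spend_variance]; label 1 = defaulted.
rows   = [[0, 0.1], [1, 0.2], [5, 0.9], [6, 0.8], [0, 0.2], [7, 0.95]]
labels = [0, 0, 1, 1, 0, 1]
w, b = train_logistic(rows, labels)

print(predict(w, b, [6, 0.9]))  # many late payments: high default probability
print(predict(w, b, [0, 0.1]))  # clean history: low default probability
```

The whole model is a handful of floats, which is why these models are cheap to run and easy to explain compared to LLMs.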
Multimodal Model
Text + image + audio in, mixed output.

An LLM that also has vision (and sometimes audio). It can look at a photo and describe it, answer questions about a chart, read a document image, or analyze surveillance footage. Modern frontier models (GPT-4o, Gemini, Claude) are all multimodal.

When you need one model to handle both text reasoning AND visual analysis, this is it.

GPT-4o · Gemini 2.0 · Claude 4 · LLaVA
Small Language Model (SLM)
Lightweight LLM — runs on device or edge.

A compressed or purpose-built LLM that can run on a laptop, phone, or edge chip without a cloud connection. Much lower compute cost. Less general capability, but if fine-tuned on your specific domain, can be surprisingly powerful for narrow tasks.

Relevant for you: running models on a Qualcomm SOC for robotics or embedded devices. Phi-3, Llama 3.2 1B, Gemma 2B all qualify.

Phi-3 · Gemma · Llama 3.2 1B
§ 03 — The Data Layer: Where Your Information Lives
Database
Live transactions, real-time ops

Structured rows and columns. Your system of record — orders, customers, inventory, payments. Optimized for read/write speed, not analytics. PostgreSQL, MySQL, Oracle.

OLTP · Real-time
Data Warehouse
Historical analytics, structured queries

Cleaned, structured data organized for business questions. Optimized for complex queries across time. "What was average customer spend by region over Q1?" This is where your customer spend analysis and default risk models draw from.

Snowflake · BigQuery · Redshift
Data Lake
Raw everything — structured + unstructured

Store everything first, figure out schema later. Text files, logs, audio, video, JSON, CSV — all raw. Low cost. Used for AI training data, ML pipelines, and storing data you don't know how to use yet. Sentiment analysis of customer conversations starts here.

S3 · Azure ADLS · Databricks
Vector Database
Semantic similarity search

Stores embeddings (number vectors). Answers questions like "find me the 20 customer reviews most similar to this complaint." Enables RAG — giving an LLM access to your private data without retraining it. Critical for AI search applications.

Pinecone · Weaviate · pgvector
§ 04 — Protocols & Frameworks: How Things Talk to Each Other
MCP — Model Context Protocol
Anthropic (Nov 2024) · De facto industry standard

The "USB-C for AI." Before MCP, connecting an AI model to an external tool (a database, a CRM, a file system) required custom code for every single pair. MCP standardizes that connection — you write one MCP server for your tool, and any MCP-compatible AI can use it.

Agent is the client. Tool (database, API, calendar, Slack) is the server. MCP is the wire between them.

Think: Your AI agent asks "What's this customer's order history?" → MCP routes that to your CRM → CRM returns the data → agent reasons with it.
Adopted by: OpenAI, Google, AWS, IBM, Microsoft
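MCP messages are JSON-RPC 2.0 under the hood. A request for the CRM example above might look roughly like the following sketch; the tool name (`get_order_history`) and customer ID are hypothetical, chosen to match the example, not taken from any real server.

```python
import json

# Hypothetical MCP tool call: the agent (client) asks the CRM (server)
# to run one of the tools it advertises. Framing is JSON-RPC 2.0.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_order_history",            # tool name the server exposes
        "arguments": {"customer_id": "C-1042"}, # tool-specific arguments
    },
}
print(json.dumps(request, indent=2))
```

The server runs the tool and returns a JSON-RPC response with the result, which the agent then reasons over.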
A2A — Agent to Agent (Google)
Google · April 2025 · Open standard

MCP handles agent ↔ tool. A2A handles agent ↔ agent. When you have a system with a planner agent, a data retrieval agent, and a reporting agent, A2A is how they hand off tasks to each other — even if they're built on different frameworks or running in different places.

Before A2A: agents built on LangChain couldn't talk to agents built on Google ADK. A2A fixes this. It's the inter-agent routing layer.

Your question on "AAA" — this is A2A. That was the term you were reaching for.
JSON-RPC over HTTPS · Discovery via well-known URLs
LangChain / LangGraph
Orchestration framework · Python/JS

LangChain is the most popular framework for building AI applications. It gives you building blocks: connect an LLM, add memory, plug in tools, chain operations together. LangGraph is the extension for multi-step, stateful agent workflows — where the agent needs to loop, branch, or pass state between steps.

If MCP is the protocol, LangChain is the plumbing. Your application logic and agent behavior live here.

Open source · Chains · Agents · Memory · RAG
RAG — Retrieval-Augmented Generation
Pattern, not a product

RAG is a technique, not a tool. When a user asks a question, instead of relying only on what the LLM learned during training, RAG first searches your data (via a vector database), retrieves the relevant pieces, and feeds them into the LLM as context. The LLM then answers using your actual data.

This is how you give an LLM access to your company's private documents, customer records, or real-time data without retraining the entire model.

Flow: Question → Embed the question → Search vector DB → Retrieve top matches → LLM reads matches + answers.
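That flow can be sketched end to end with a toy corpus. Here, bag-of-words counts stand in for a real embedding model and a Python list stands in for the vector database; the documents and wording are invented.

```python
import math
from collections import Counter

# Toy RAG: a list is the "vector DB", word counts are the "embeddings".
DOCS = [
    "Customer complained that the billing statement was wrong.",
    "Store hours are 9am to 9pm on weekdays.",
    "Refund issued after a duplicate charge on the invoice.",
]

def embed(text):
    """Stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def retrieve(question, k=2):
    """Search the 'vector DB' and return the top-k matches."""
    q = embed(question)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

question = "complaints about billing"
context = retrieve(question)
context_text = "\n".join(context)
prompt = f"Answer using only this context:\n{context_text}\n\nQ: {question}"
print(prompt)  # this assembled prompt is what gets sent to the LLM
```

The LLM never sees the whole corpus, only the retrieved chunks, which is what keeps RAG cheap and grounded.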
§ 05 — Your Three Use Cases, Architected
USE CASE A: Sentiment Analysis + Ongoing Insight Engine
01
Ingest Raw Data

Pull customer conversations, reviews, support tickets, call transcripts from all sources.

Data Lake (S3/Azure) — raw unstructured text
02
Clean + Embed

Normalize text, chunk it, convert each chunk to a vector using an embedding model.

Embedding Model + Vector DB (Pinecone/Weaviate)
03
Classify Sentiment

Run each chunk through a classifier or fine-tuned LLM: positive/negative/neutral + topic.

Fine-tuned LLM or Classifier Model
04
Agent Monitors + Acts

Orchestrator agent watches for trend shifts, surfaces insights, generates recommended actions.

LangGraph Agent + MCP → CRM / Slack / Dashboard
05
Loop Updates

Scheduled pipeline re-runs on new data. Insight dashboard updates automatically.

Airflow / cron + streaming pipeline
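A toy version of steps 01 through 04: a keyword lookup stands in for the fine-tuned classifier in step 03, and a threshold check stands in for the monitoring agent in step 04. The keywords, threshold, and sample texts are all invented.

```python
# Placeholder keyword sets; a real system would use a trained classifier.
NEGATIVE = {"broken", "refund", "angry", "late", "wrong"}
POSITIVE = {"great", "love", "fast", "helpful"}

def classify(text):
    """Stand-in for the fine-tuned LLM/classifier in step 03."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def run_pipeline(raw_conversations):
    """Steps 01-04: ingest, classify, watch for a trend, surface results."""
    results = [{"text": t, "sentiment": classify(t)} for t in raw_conversations]
    negatives = sum(1 for r in results if r["sentiment"] == "negative")
    if negatives / len(results) > 0.5:          # step 04: agent acts on a trend
        print("ALERT: negative sentiment spike")
    return results

batch = [
    "Delivery was late and the item arrived broken",
    "Love the new store layout, great staff",
]
out = run_pipeline(batch)
```

Step 05 is just this function re-run on each new batch by a scheduler such as Airflow or cron.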
USE CASE B: Customer Spend + Risk Analysis (Structured Data)
01
Source Data

Customer transaction history, account data, payment records — structured rows.

Database → ETL → Data Warehouse (Snowflake)
02
Feature Engineering

Build features: avg spend, spend variance, days since last payment, product mix, etc.

SQL + dbt + Python (pandas)
03
Train Risk Model

Train an ML model (XGBoost) on labeled data: defaulted / not defaulted. Outputs probability score.

XGBoost / scikit-learn — traditional ML
04
Query with LLM

LLM + RAG over warehouse. "What's avg spend for customers with risk score > 0.7?" → SQL → answer.

LLM + Text-to-SQL + Data Warehouse
05
Dashboard

Scores, segments, risk buckets, spend tiers surfaced in BI tool or agent interface.

Tableau / Superset / custom React UI
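Step 02 (feature engineering) sketched in plain Python. In the real pipeline this logic lives in SQL/dbt over the warehouse; the customer IDs and amounts below are made up.

```python
from statistics import mean, pvariance

# Toy transaction history keyed by customer (step 01 output).
transactions = {
    "C-1042": [120.0, 80.0, 95.0, 300.0],
    "C-2001": [40.0, 42.0, 39.0],
}

def build_features(txns):
    """Turn raw amounts into model-ready features (step 02)."""
    return {
        cust: {
            "avg_spend": mean(amounts),
            "spend_variance": pvariance(amounts),
            "txn_count": len(amounts),
        }
        for cust, amounts in txns.items()
    }

features = build_features(transactions)
print(features["C-1042"])  # one feature row per customer
```

These rows are exactly what the XGBoost model in step 03 trains on: one row per customer, one column per feature.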
USE CASE C: Storefront Intelligence — Vision + Behavior Analysis
01
Cameras + Sensors

Video feeds from store cameras. Edge device captures and preprocesses frames locally.

IP Cameras + Edge device (Jetson / Qualcomm RB5)
02
Object Detection

Vision model runs on-device: detects people, tracks movement paths, identifies products being handled.

YOLO / SAM running on-edge NPU
03
Anomaly Detection

Separate model watches for anomalies — items not scanned, concealment behavior, dwell patterns near exits.

Classifier / anomaly model — trained on labeled shrinkage events
04
Events → Cloud

Only flagged events (not raw video) sent to cloud for review. Saves bandwidth. Privacy-preserving.

Edge → cloud pipeline (MQTT / HTTP)
05
Insights + Alerts

LLM agent synthesizes patterns: "Shrinkage events cluster near aisle 4, 5–7pm Fridays." Triggers alert or restock recommendation.

LLM Agent + MCP → Slack / POS system
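Steps 03 and 04 as a toy filter: a stand-in anomaly score decides which events leave the edge device, so raw video never does. The zones, threshold, and scoring rules below are invented placeholders for a trained anomaly model.

```python
import json

THRESHOLD = 0.8  # illustrative cutoff, tuned in a real deployment

def anomaly_score(event):
    """Placeholder for the trained anomaly model running on the edge NPU."""
    score = 0.0
    if event["zone"] == "exit" and event["dwell_seconds"] > 30:
        score += 0.5
    if not event["item_scanned"]:
        score += 0.4
    return score

def to_cloud(events):
    """Step 04: serialize only flagged events; drop everything else on-device."""
    flagged = [e for e in events if anomaly_score(e) >= THRESHOLD]
    return [json.dumps({"id": e["id"], "score": anomaly_score(e)}) for e in flagged]

events = [
    {"id": 1, "zone": "aisle-4", "dwell_seconds": 10, "item_scanned": True},
    {"id": 2, "zone": "exit", "dwell_seconds": 45, "item_scanned": False},
]
payloads = to_cloud(events)  # only the flagged event is serialized
```

Shipping a few hundred bytes of JSON per flagged event, instead of continuous video, is what makes the design both cheap and privacy-preserving.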
§ 06 — Compute: Where Models Actually Run
PC / Laptop NPU
Qualcomm Snapdragon X / Apple M-series

5–80 TOPS. Runs small models (SLMs) locally. Private, offline, no cloud cost. For prototyping, edge demos, or privacy-sensitive use cases. Limited to smaller models.

Edge Server / Robot
NVIDIA Jetson AGX Orin / Qualcomm RB5

75–275 TOPS. Runs full vision models and medium LLMs locally. Powers robots, smart cameras, industrial automation. Real-time with no cloud round-trip. 10–30W power draw.

On-Prem / Private GPU
NVIDIA A100 / H100 on-prem cluster

High-end inference or fine-tuning you want to keep private. More capital cost upfront, lower per-token cost over time. Relevant for regulated industries or IP-sensitive workloads.

Cloud Data Center
AWS / Azure / GCP — H100 / A100 clusters

Effectively unlimited scale. Where frontier model training happens (runs costing millions of dollars). Pay per token / per hour. Where most production AI inference lives today. 300–700W per GPU.

§ 07 — Agents: What They Actually Are
What Is an AI Agent?

An agent is an LLM that can take actions — not just answer questions. It has access to tools (search, database, API, code runner), can decide which tool to call, observe the result, and then decide what to do next. It loops until the task is complete.

Single-agent: one LLM doing everything. Multi-agent: specialized agents (planner, researcher, writer, validator) handing tasks to each other via A2A. More reliable for complex workflows.

Simple mental model: A chatbot responds. An agent acts. An agent with tools is like giving an intern access to your computer, your databases, and a set of approved actions.
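The loop is the whole trick. A toy version with a hard-coded "policy" in place of the LLM shows the structure: choose a tool, observe the result, repeat until a terminal action. The tool names and the task are invented for illustration.

```python
# Toy tools the agent can call; real ones would hit a DB, API, or Slack.
TOOLS = {
    "lookup_orders": lambda cust: ["order-17", "order-23"],
    "send_alert": lambda msg: f"sent: {msg}",
}

def decide(task, observations):
    """Stand-in for the LLM's tool choice, based on what it has seen so far."""
    if not observations:
        return ("lookup_orders", task["customer"])
    return ("send_alert", f"{task['customer']} has {len(observations[-1])} orders")

def run_agent(task, max_steps=5):
    """The agent loop: act, observe, decide again, until done."""
    observations = []
    for _ in range(max_steps):
        tool, arg = decide(task, observations)
        result = TOOLS[tool](arg)
        observations.append(result)
        if tool == "send_alert":           # terminal action ends the loop
            return result
    return None

print(run_agent({"customer": "C-1042"}))
```

Swap `decide` for an LLM call and `TOOLS` for MCP servers and this is, structurally, a production agent.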
Agent Memory & State

Context window — what the agent can "see" right now. Limited. Gets expensive fast.

Short-term memory — conversation history stored in the prompt. Fades when the session ends.

Long-term memory — stored in a vector DB or database. Persists. The agent can retrieve relevant past interactions using semantic search.

Tools / skills — what the agent can do: search web, query DB, write file, call API, send Slack message, run code. Each tool is exposed via MCP.

ReAct pattern · Plan + Execute · Reflection loops
§ 08 — Quick Reference Glossary
Transformer
The architecture behind most modern AI. Learns relationships between words/tokens at scale. The "T" in GPT.
Token
The unit LLMs process — roughly ¾ of a word. "Hello world" ≈ 2 tokens. Models charge per token and have context limits measured in tokens.
Context Window
How much text an LLM can "read" at once. GPT-4o: 128k tokens. Claude: 200k. Bigger window = can read more docs at once, but costs more.
Vector / Embedding
A list of numbers (e.g. 1536 floats) that encodes the meaning of text. Similar meaning → similar numbers. Foundation of semantic search.
Fine-tuning
Further training a pre-trained model on your specific data. Like giving a generalist intern domain-specific training. Results in a specialized model.
RAG
Retrieval-Augmented Generation. Search your data first → feed results to LLM → LLM answers using your actual data. No retraining needed.
Inference
Running a trained model to get predictions. Training is expensive (days/weeks/millions). Inference is cheaper (milliseconds per query).
TOPS
Tera Operations Per Second. How fast a chip can run AI math. More TOPS = can run bigger models or process faster. The NPU in Qualcomm's Snapdragon X Elite delivers 45 TOPS.
NPU
Neural Processing Unit. A chip specialized for AI math (matrix multiplication). Found in Qualcomm SOCs, Apple Silicon. Far more efficient than CPU for AI workloads.
SOC
System on a Chip. CPU + GPU + NPU + memory on one chip. Qualcomm Snapdragon is an SOC. Power-efficient. Used in phones, robots, edge devices.
MoE — Mixture of Experts
Model architecture where only a subset of the network activates per query. Allows very large models to run efficiently. Used in Mixtral and Gemini; GPT-4 is widely reported to use it as well.
Hallucination
When a model generates confident but false information. Not lying — it doesn't know what it doesn't know. RAG and grounding reduce this.
Structured Data
Rows and columns. SQL databases, spreadsheets, CSVs. Has a schema. Easy for ML models to process directly.
Unstructured Data
Text, images, audio, video, PDFs. No schema. Requires AI/ML to extract structure. Most of the world's data is unstructured.
Data Lakehouse
Hybrid of lake + warehouse. Store everything (lake) but add structure and governance for analytics (warehouse). Databricks, Snowflake, BigQuery all do this.
Orchestration
Coordinating multiple AI components — models, agents, tools, pipelines — so they work together reliably. LangChain, LangGraph, and Apache Airflow are orchestration tools.
Model Onboarding
The process of adapting a model to run efficiently on specific hardware — optimizing formats, quantizing weights, benchmarking performance. YYZdata's core service.
Quantization
Reducing model precision (e.g., 32-bit → 4-bit) to shrink size and speed up inference. Required to run larger models on edge hardware.
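A toy symmetric 8-bit quantization of one weight list shows the core trade: smaller storage in exchange for a bounded rounding error. Real schemes add per-channel scales, zero-points, and 4-bit packing; the weights below are made up.

```python
# One "tensor" of fp32 weights (values invented for illustration).
weights = [0.52, -1.73, 0.004, 0.91, -0.33]

scale = max(abs(w) for w in weights) / 127          # map [-max, max] onto [-127, 127]
quantized = [round(w / scale) for w in weights]     # stored as int8: 4x smaller than fp32
dequantized = [q * scale for q in quantized]        # approximate originals at inference

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(quantized)   # small integers instead of 32-bit floats
print(max_error)   # rounding error bounded by half the scale step
```

Cutting each weight from 32 bits to 8 (or 4) is what lets a model that needs a data-center GPU fit on an edge NPU.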