Evaluation Overview — Axiom Cortex™ (No Public API)

What it is: Axiom Cortex™ is a proprietary, internal evaluation service inside the Nearshore IT Co-Pilot™ platform. There is no public API. All evaluations run inside our platform to align and predict talent performance using 44 psychometric + NLP signals on top of LLMs with semantic chunking in RAG and staged/multi-step prompting. We apply language-fairness calibration (judge ideas and reasoning, not accent/phrasing) and perform expert review of flags before roll-up. Results map to BARS (Behaviorally Anchored Rating Scales = ratings tied to observable behaviors).

Inputs (plain English)

Role & level (e.g., Senior Backend L3, domain context)
Tech stack (languages, frameworks, cloud)
Work artifacts (code/PRs/tickets/design notes)
Scenario tasks (staged reasoning + coding prompts)
Communication samples (written/async)
Operational signals (security, reliability, cost awareness)

Method core

Semantic chunking in RAG — split artifacts into meaningful chunks; retrieval guided by role/stack so prompts aren’t one-size-fits-all.
Staged / multi-step prompting — plan → subgoals → verification; require chain integrity before scoring.
Language-fairness calibration — normalize signal distributions across language cohorts so phrasing/accents don’t depress scores.
Expert review of flags before roll-up — only flagged anomalies go to experts; decisions apply before final aggregation.

The 44 Signals (executive view)

A. Alignment (5)
1) Problem–schema match • 2) Stack coverage • 3) Complexity bandwidth • 4) Domain transfer • 5) Requirements fidelity

B. Analysis (7)
6) Chain integrity • 7) Error-mode awareness • 8) Counterfactuals • 9) Causal linkage • 10) Evidence binding • 11) Numeracy checks • 12) Ambiguity handling

C. Synthesis (6)
13) Composability • 14) Architectural tradeoffs • 15) Edge-case coverage • 16) Observability planning • 17) Dependency hygiene • 18) Migration/rollback planning

D. Code Quality (6)
19) Semantic diff quality • 20) Static risks • 21) Runtime realism • 22) Test intent/coverage • 23) Debug vectoring • 24) Complexity control

E. Communication & Collaboration (6)
25) Instruction uptake • 26) Audience targeting • 27) Decision log clarity • 28) Review acumen • 29) Ticket hygiene • 30) Documentation atomics

F. Systems & Ops (5)
31) Reliability thinking (SLOs) • 32) Cost–performance • 33) Release safety • 34) Infra-as-code • 35) Telemetry interpretation

G. Risk & Security (5)
36) Sensitive-data handling • 37) License hygiene • 38) Supply-chain care • 39) Threat-modeling reflex • 40) Secrets/access discipline

H. Robustness & Agreement (4)
41) Adversarial resistance • 42) Cross-model agreement • 43) Self-consistency • 44) Hallucination index

Processing & formulas (transparent math, plain English)

We keep formulas readable here and reserve proofs/derivations for Scientific Foundations.

1) Normalization (z-scores)
Each raw signal (s_i) is normalized within role/level cohorts:
z_i = (s_i − μ_role,level) / σ_role,level (robust μ/σ with outlier guards).
This keeps scores comparable across roles and seniorities.

2) L2-aware weighting
We decompose communication into semantic vs. form carriers with weights (\alpha) and (\beta):
Score_comm = α·SemContent + β·Form, with β → 0 as L2 uncertainty rises (estimated from stable markers).
Meaning: grammar/fluency noise is down-weighted; ambiguous content is still penalized.

3) Cross-lingual semantic fidelity (FSD)
We compare answer embeddings to an Ideal Answer Blueprint distribution via a Fréchet-style distance.
Lower FSD ⇒ closer to the target concept even with Spanish-influenced English.

4) Optimal transport (W2) with code-switch mask
Token alignment uses W2 distance with neutral cost for common bilingual markers (e.g., “pues”, “o sea”).
This prevents code-switch tokens from inflating “distance” when the substantive idea matches.

5) DIF checks & correction
For each rubric item, we test Differential Item Functioning across language cohorts at matched ability.
Items with significant DIF are adjusted or removed; if unresolved, the model fails closed for that item.

6) Aggregation (monotone link)
Signals roll up through constrained aggregators:

Isotonic regression / monotone lattice ensure stronger evidence doesn’t produce weaker scores.
Skill graph priors (network psychometrics) stabilize multi-stack profiles.

7) Uncertainty & calibration
Every composite score includes uncertainty (bootstrap CIs / calibrated posteriors).
We monitor Expected Calibration Error (ECE) and produce reliability diagrams by cohort.

8) BARS mapping (human-readable)
Final scores map to BARS (ratings tied to observable behaviors) so feedback is specific, fair, and actionable.

9) Decision layer (constrained utility)
Recommendations maximize expected utility subject to fairness/reliability gates:
maximize E[U | evidence] s.t. DIF ≤ δ, G ≥ G_min, P[Collab < τ_c] ≤ ε_c
Lagrangian relaxations provide gate-aware choices with justifications.

Outputs you see (inside the platform)

Per-dimension evidence with human-readable rationales
Uncertainty bounds and cohort-level calibration notes
Gate status (fairness/reliability) and any expert-review outcomes
BARS-anchored feedback and promotion signals (L1–L4)

Publications & research

SSRN: Redesigning Human Capacity in Nearshore IT Staff Augmentation (5165433) — DOI 10.2139/ssrn.5165433
SSRN Working Paper: abstract_id=5253470
SSRN Working Paper: abstract_id=5188490
Google Scholar — TeamStation AI: https://scholar.google.com/citations?user=aNol-ycAAAAJ

More: Publications page

Platform context (no external endpoints)

Axiom Cortex™ powers the Nearshore IT Co-Pilot™—one platform to hire, equip, secure, and pay LATAM engineers under one SLA. There is no external API.

Authority anchors: