TL;DR: In template-strict retail banking workflows, bigger generalist models aren’t automatically “safer” or more accurate in practice. A compact 4B specialist trained with SFT delivered much stronger format + lexical adherence than GPT-5 (2-shot) (e.g., BLEU 0.2685 vs 0.0137) because it encodes the workflow directly in the weights, reducing verbosity and format drift. We trained and evaluated the specialist end-to-end using AlignTune’s unified SFT + standardized evaluation pipeline, making results reproducible and deployment-ready.
In the push toward enterprise AI, it’s common to assume that larger generalist models (for example, GPT-5) are always the safest choice for accuracy. Our results on retail banking support, a high-volume and highly procedural domain, suggest a more nuanced reality: when outputs must follow strict templates, specialist alignment can outperform generalist prompting.
Key finding: On a standardized retail banking support task dataset, a compact 4B model aligned using SFT achieved higher format and lexical adherence than GPT-5 (2-shot) under our evaluation protocol.
Implementation: Training and evaluation for this study were executed using AlignTune (Lexsi.ai), leveraging its unified SFT interface and standardized evaluation workflow for consistent metric reporting and reproducibility.
1. The problem: the “chattiness” penalty in banking
Retail banking assistants are not judged primarily on creativity or open-ended helpfulness. They’re judged on procedural precision and template compliance across a narrow set of intents (activation, limits, verification, blocking, etc.).
In these workflows, general-purpose models often introduce two failure modes:
- Verbosity: responses expand beyond what the workflow expects
- Format drift: answers deviate from required structure, terminology, or ordering
Even when the response is “helpful,” verbosity and drift make integration brittle, especially in automated support flows where downstream systems depend on predictable formatting.
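To make the brittleness concrete, here is a minimal sketch of the kind of downstream format check that a verbose answer silently breaks. The template pattern and example replies are hypothetical, not taken from the benchmark:

```python
import re

# Hypothetical downstream check: automation expects a short, structured
# reply (one header line plus a numbered step list), with no preamble.
TEMPLATE = re.compile(
    r"^To (?P<action>[a-z ]+), follow these steps:\n"
    r"(?:\d+\. .+\n?)+$"
)

def is_template_compliant(response: str) -> bool:
    """Return True only if the response matches the rigid template."""
    return TEMPLATE.match(response) is not None

concise = ("To activate your card, follow these steps:\n"
           "1. Log in.\n"
           "2. Open Cards.\n"
           "3. Tap Activate.")
chatty = ("Great question! There are a few ways to do this. "
          "First, you could log in and then...")
```

The concise reply passes; the "helpful" chatty one fails the check even though its content may be correct, which is exactly the integration cost described above.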
2. Introducing AlignTune
AlignTune is Lexsi.ai’s modular toolkit for post-training alignment of language models. It provides a unified interface for SFT and RLHF-style methods, designed to make specialization reproducible and operational across production constraints.
At a high level, AlignTune delivers:
- A unified training interface across common alignment methods
- Standardized configuration and evaluation surfaces
- Reproducible training workflows with consistent reporting
- Compatibility across widely used backends (TRL, Unsloth) with workflow consistency
How AlignTune is used in this case study
- We apply SFT only to adapt a base 4B model to retail-banking response templates and procedural style.
- We do not apply DPO in this study because the objective is largely strict format/protocol adherence, not preference nuance.
- Training and evaluation run through AlignTune’s unified trainer + evaluation pipeline, reducing tooling-driven variance and keeping comparisons consistent across baselines.
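The post does not show AlignTune's configuration schema, so the following is a purely hypothetical sketch of what a unified SFT run specification could look like. Every field name, model identifier, and hyperparameter here is an assumption for illustration, not AlignTune's actual API:

```yaml
# Hypothetical config sketch -- illustrative only, not AlignTune's real schema.
method: sft
base_model: <4B-base-model-id>
backend: trl                       # or unsloth; both backends are supported
dataset:
  name: bitext-retail-banking
  split_strategy: class_balanced   # equal intent coverage (see Section 3)
training:
  epochs: 3
  learning_rate: 2.0e-5
evaluation:
  metrics: [bleu, rouge_l, chrf, bertscore]
  shots: 0                         # the specialist is evaluated 0-shot
```

The operational point is that the same specification drives both training and evaluation, which is what keeps comparisons consistent across baselines.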
3. Training pipeline and evaluation
We leveraged the Bitext Retail Banking dataset to convert a raw 4B base model into a procedural specialist using a single high-impact stage.
Supervised Fine-Tuning (SFT)
Goal: Encode procedural adherence directly into the weights.
Setup: We created class-balanced splits to ensure the model learns all intents consistently, from routine balance inquiries to high-priority security protocols.
Why this matters: By internalizing workflow structure in the model weights, we reduce or eliminate dependence on heavy in-context prompting at inference time.
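The class-balanced split described above can be sketched in a few lines. The record fields (`intent`, `text`) are assumptions about the dataset layout, not the Bitext schema verbatim:

```python
import random
from collections import defaultdict

def stratified_split(examples, test_frac=0.1, seed=0):
    """Split examples per intent so every intent keeps the same
    train/test ratio -- the class-balanced setup used for SFT."""
    by_intent = defaultdict(list)
    for ex in examples:
        by_intent[ex["intent"]].append(ex)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_intent.values():
        rng.shuffle(group)
        n_test = max(1, int(len(group) * test_frac))  # at least one per intent
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Because the split is computed per intent, rare but high-priority intents (e.g., card blocking) cannot be crowded out of either split by high-volume routine inquiries.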
Metrics: what we measure and why
We evaluate with two complementary metric families:
1) Lexical overlap (template/terminology adherence)
- BLEU, ROUGE, ChrF compare outputs against expert references using word/phrase overlap.
- These are especially relevant for banking where exact terminology, ordering, and template fidelity matter.
- ChrF is more tolerant of minor wording changes because it operates on character n-grams, while still rewarding structural compliance.
2) Semantic similarity (meaning preservation)
- BERTScore measures whether the response preserves the intended meaning even when phrasing differs (e.g., “account frozen” vs “access blocked”).
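To give intuition for the lexical family, here is a deliberately simplified n-gram precision. It is not the full BLEU or ChrF definition (no brevity penalty, a single n, no F-score averaging); it only illustrates why verbose rewording scores low against a terse reference:

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2,
                    chars: bool = False) -> float:
    """Fraction of candidate n-grams that also appear in the reference.
    chars=False ~ BLEU-flavored word n-grams (strict on exact wording);
    chars=True  ~ ChrF-flavored character n-grams (more forgiving)."""
    def grams(text: str) -> Counter:
        units = list(text) if chars else text.split()
        return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / total

reference = "Your card has been blocked."
verbose = "Sure, happy to help! Your card has now been blocked for your safety."
```

An exact match scores 1.0, while the verbose paraphrase scores far lower despite conveying the same fact, mirroring the generalist penalty discussed in Section 4.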
Protocol note: GPT-5 is evaluated with 2-shot prompting; the SFT specialist is evaluated 0-shot because the task format is encoded in the weights. In template-heavy settings, few-shot exemplars can add noise and distribution shift rather than help.
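The protocol difference is just a matter of how the request is assembled. The sketch below uses a generic OpenAI-style message list; the system instruction and exemplar contents are hypothetical:

```python
def build_messages(query, exemplars=None,
                   system="You are a retail banking support assistant. "
                          "Answer using the bank's response template."):
    """Build a chat request: 0-shot for the SFT specialist (format lives
    in the weights), k-shot for the generalist baseline."""
    messages = [{"role": "system", "content": system}]
    for ex_query, ex_answer in (exemplars or []):
        messages.append({"role": "user", "content": ex_query})
        messages.append({"role": "assistant", "content": ex_answer})
    messages.append({"role": "user", "content": query})
    return messages
```

The specialist request is two messages; the 2-shot generalist request carries four extra exemplar messages on every call, which is both the cost overhead and the potential source of distribution shift noted above.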

4. Experimental validation
We benchmarked the 4B SFT specialist against a strong generalist baseline: GPT-5 (2-shot).
4.1 The generalist mismatch: format-sensitive tasks punish verbosity
The most striking outcome was how difficult it was for a generalist model to reliably conform to rigid banking templates under this evaluation setup.
Despite 2-shot prompting, GPT-5 shows low lexical overlap with reference templates (BLEU 0.0137). This pattern is consistent with verbosity and format variance, both of which are penalized by template-faithful evaluation.
4.2 SFT materially improves procedural adherence
The SFT specialist achieves strong lexical and semantic alignment with references (BLEU 0.2685, ROUGE-L 0.4128, BERTScore ~0.914), indicating reliable adherence to expected response structure and domain terminology.
5. The specialist advantage: reliability without “prompt engineering”
A common pain point in deploying generalist LLMs is the fragility of prompt engineering. Teams frequently spend weeks refining few-shot examples to keep outputs stable, only to see drift when intents expand or templates change.
This benchmark shows the operational value of specialization:
- Intrinsic protocol knowledge: The 4B model internalizes banking workflows in its weights, reducing dependence on long prompts and context windows.
- Deterministic formatting: Outputs remain stable across variations in user phrasing, which is critical for transactional flows.
- Operational simplicity: Fewer input tokens translate into lower inference cost and latency, improving end-user experience and system throughput.
In short, we move from “prompt babysitting” to “procedural reliability.”
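The cost point can be made concrete with back-of-envelope arithmetic. All token counts below are hypothetical; real values depend on the tokenizer, system prompt, and exemplar lengths:

```python
# Hypothetical per-request input token counts (illustrative only).
SYSTEM_TOKENS = 40      # shared system instruction
EXEMPLAR_TOKENS = 180   # two few-shot examples (generalist only)
QUERY_TOKENS = 30       # the user's message

generalist_input = SYSTEM_TOKENS + EXEMPLAR_TOKENS + QUERY_TOKENS  # 250 tokens
specialist_input = SYSTEM_TOKENS + QUERY_TOKENS                    # 70 tokens

savings = 1 - specialist_input / generalist_input
print(f"Input-token savings per request: {savings:.0%}")  # prints 72%
```

At high support volumes, a per-request reduction of this size compounds directly into serving cost and latency, independent of any quality difference.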
6. Qualitative examples (behavioral patterns)
6.1 Base model: false-positive safety refusal and intent confusion
The base model can fail in two ways:
- It triggers a false positive refusal when it sees security-adjacent terms like “password,” misclassifying a support request as a threat.
- It can also misunderstand domain-specific intents (e.g., treating “open an account” as a generic account type rather than a bank onboarding flow).
The practical result: it either refuses to help, or responds in a way that’s misaligned with the banking workflow.
6.2 GPT-5: helpful but ambiguous, with template drift
GPT-5 is often broadly helpful, but it can struggle with domain ambiguity. For example, it may interpret “password” as a general security query rather than a platform-specific onboarding flow.
In addition, its generalist tendency toward explanation and context can create format drift. Even when the content is correct, the response often deviates from the strict template, which is costly for automation.


6.3 SFT specialist: intent certainty and procedural execution
The SFT model performs well because it learns the mapping between user phrasing and the bank’s internal intent taxonomy.
For example, it learns that “set up a password” corresponds to a specific onboarding flow (e.g., “set_up_password”). It then produces the expected output reliably: correct URL, correct sequence, and the prescribed procedural style, closely matching the reference response.
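Behaviorally, the specialist approximates a stable phrasing-to-intent-to-template mapping, even though no literal lookup table exists in the model. The intent labels, URL, and templates below are hypothetical stand-ins, not the dataset's actual references:

```python
# Hypothetical intent taxonomy and response templates (illustrative only).
INTENT_OF = {
    "set up a password": "set_up_password",
    "create a password": "set_up_password",
    "block my card": "block_card",
}

TEMPLATES = {
    "set_up_password": (
        "To set up your password:\n"
        "1. Go to https://bank.example/settings\n"
        "2. Select 'Security'.\n"
        "3. Choose 'Set up password' and follow the prompts."
    ),
    "block_card": (
        "To block your card:\n"
        "1. Open the app.\n"
        "2. Select the card.\n"
        "3. Tap 'Block card'."
    ),
}

def respond(user_text: str) -> str:
    """Behavioral sketch of the specialist: phrasing -> intent -> template."""
    intent = INTENT_OF[user_text.lower().rstrip("?.! ")]
    return TEMPLATES[intent]
```

The value of SFT is that this mapping holds across paraphrases ("set up a password", "create a password") without any of the routing logic being maintained outside the model.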
7. Conclusion
Retail banking workflows reward protocol adherence and template stability more than open-ended generality. In this benchmark, specialization via SFT yields a compact 4B model that is:
- More consistent on template-sensitive outputs than a prompted general-purpose baseline
- Operationally simpler (less prompt scaffolding, lower inference overhead)
- Better aligned to production requirements for transactional support systems
For enterprises building AI systems that must behave like reliable software, post-training specialization is often the shortest path to stability. AlignTune is built to make that path repeatable.





