[{"data":1,"prerenderedAt":571},["ShallowReactive",2],{"article-alternates":3,"article-\u002Fen\u002Fai\u002Fprompt-versioning-and-ab-testing-llm-operations-discipline":13},{"i18nKey":4,"paths":5},"ai-004-2026-06",{"de":6,"en":7,"es":8,"fr":9,"it":10,"ru":11,"tr":12},"\u002Fde\u002Fai\u002Fprompt-versionierung-und-a-b-tests","\u002Fen\u002Fai\u002Fprompt-versioning-and-ab-testing-llm-operations-discipline","\u002Fes\u002Fai\u002Fversionado-prompts-pruebas-a-b-llm","\u002Ffr\u002Fai\u002Fversionamento-prompt-e-test-ab-disciplina-operativa-llm","\u002Fit\u002Fai\u002Fversionamento-prompt-e-test-ab-discipline-operativa-llm","\u002Fru\u002Fai\u002Fversiyonlama-t","\u002Ftr\u002Fai\u002Fprompt-versiyonlama-ve-a-b-testi-llm-operasyonun-disiplini",{"_path":7,"_dir":14,"_draft":15,"_partial":15,"_locale":16,"title":17,"description":18,"publishedAt":19,"modifiedAt":19,"category":14,"i18nKey":4,"tags":20,"readingTime":26,"author":27,"body":28,"_type":565,"_id":566,"_source":567,"_file":568,"_stem":569,"_extension":570},"ai",false,"","Prompt Versioning and A\u002FB Testing: The Discipline of LLM Operations","Building prompt eval pipelines with Promptfoo and LangSmith. Methods for preventing regression in production LLM workflows and measuring cost-quality tradeoffs.","2026-06-04",[21,22,23,24,25],"llm-operations","prompt-engineering","evaluation","mlops","ai-testing",8,"Roibase",{"type":29,"children":30,"toc":557},"root",[31,39,46,51,56,72,78,116,121,332,361,374,379,385,390,395,400,428,434,439,451,456,461,467,472,485,498,512,518,523,528,542,546,551],{"type":32,"tag":33,"props":34,"children":35},"element","p",{},[36],{"type":37,"value":38},"text","Every team running LLMs in production lives the same cycle: you iterate on a prompt, output improves, then performance tanks in another use case. You revert the change, the first scenario breaks. Versionless prompt iteration is an infinite regression loop. Pulling responses from the Claude API and saying \"looks good\" isn't product operations — it isn't software engineering. In 2026, any team not testing prompts like code loses confidence with every deploy. Promptfoo, LangSmith, and evaluation frameworks bring this discipline: seeing prompt changes' impact quantified, A\u002FB testing them, being able to roll back.",{"type":32,"tag":40,"props":41,"children":43},"h2",{"id":42},"why-prompt-versioning-became-non-negotiable",[44],{"type":37,"value":45},"Why Prompt Versioning Became Non-Negotiable",{"type":32,"tag":33,"props":47,"children":48},{},[49],{"type":37,"value":50},"LLM output isn't deterministic. The same prompt produces different responses at different times (as long as temperature > 0). This randomness makes the observation \"it works today\" unreliable. One step further: if you don't know what happens to old test cases when you change a prompt, you can't tell whether you've improved or just traded off. Example: you add \"show more data\" to your blog-writing workflow prompt, output gets richer but stretches to 400 tokens. Token cost rises 30%, latency hits 1.2 seconds. If you don't catch this before deployment, you find out in production and rollback takes two weeks.",{"type":32,"tag":33,"props":52,"children":53},{},[54],{"type":37,"value":55},"Versioning discipline answers these questions: which metric did this prompt change improve, which did it harm? How much accuracy difference versus the old version? If we ship this change, what's the monthly cost increase? If you can't answer, you're guessing, not iterating. Promptfoo and LangSmith turn those questions into metric tables. Every prompt is a commit, every test run a report. When regression appears, you know which line you changed — like git diff.",{"type":32,"tag":33,"props":57,"children":58},{},[59,61,70],{"type":37,"value":60},"At Roibase, we commit prompt versions to Git in n8n + Claude API workflows. Every change is a PR, every PR runs the eval suite. Promptfoo fails regression checks, no merge. Without this discipline, our ",{"type":32,"tag":62,"props":63,"children":67},"a",{"href":64,"rel":65},"https:\u002F\u002Fwww.roibase.com.tr\u002Fen\u002Fgeo",[66],"nofollow",[68],{"type":37,"value":69},"Generative Engine Optimization",{"type":37,"value":71}," work can't keep citation accuracy stable — every prompt tweak can drop brand mentions, and if we miss it, recovery is three weeks.",{"type":32,"tag":40,"props":73,"children":75},{"id":74},"building-an-eval-pipeline-with-promptfoo",[76],{"type":37,"value":77},"Building an Eval Pipeline with Promptfoo",{"type":32,"tag":33,"props":79,"children":80},{},[81,83,90,92,98,100,106,108,114],{"type":37,"value":82},"Promptfoo is an open-source test framework: you define prompts in YAML, store test cases in CSV\u002FJSON, run it and get a metric table. Model agnostic — OpenAI, Anthropic, local LLaMA, all accessed through the same interface. Setup is simple: ",{"type":32,"tag":84,"props":85,"children":87},"code",{"className":86},[],[88],{"type":37,"value":89},"npm install -g promptfoo",{"type":37,"value":91},", then ",{"type":32,"tag":84,"props":93,"children":95},{"className":94},[],[96],{"type":37,"value":97},"promptfoo init",{"type":37,"value":99},". Creates two files: ",{"type":32,"tag":84,"props":101,"children":103},{"className":102},[],[104],{"type":37,"value":105},"promptfooconfig.yaml",{"type":37,"value":107}," (prompt definition + provider config) and ",{"type":32,"tag":84,"props":109,"children":111},{"className":110},[],[112],{"type":37,"value":113},"test-cases.json",{"type":37,"value":115}," (input-output pairs).",{"type":32,"tag":33,"props":117,"children":118},{},[119],{"type":37,"value":120},"Example config:",{"type":32,"tag":122,"props":123,"children":127},"pre",{"className":124,"code":125,"language":126,"meta":16,"style":16},"language-yaml shiki shiki-themes github-dark","prompts:\n  - \"You are a marketing analyst. Answer this question: {{query}}\"\nproviders:\n  - anthropic:messages:claude-3-5-sonnet-20241022\ntests:\n  - vars:\n      query: \"What are Q4 2025 e-commerce conversion trends?\"\n    assert:\n      - type: contains\n        value: \"conversion rate\"\n      - type: cost\n        threshold: 0.05\n","yaml",[128],{"type":32,"tag":84,"props":129,"children":130},{"__ignoreMap":16},[131,149,164,177,190,203,220,239,251,274,292,313],{"type":32,"tag":132,"props":133,"children":136},"span",{"class":134,"line":135},"line",1,[137,143],{"type":32,"tag":132,"props":138,"children":140},{"style":139},"--shiki-default:#85E89D",[141],{"type":37,"value":142},"prompts",{"type":32,"tag":132,"props":144,"children":146},{"style":145},"--shiki-default:#E1E4E8",[147],{"type":37,"value":148},":\n",{"type":32,"tag":132,"props":150,"children":152},{"class":134,"line":151},2,[153,158],{"type":32,"tag":132,"props":154,"children":155},{"style":145},[156],{"type":37,"value":157},"  - ",{"type":32,"tag":132,"props":159,"children":161},{"style":160},"--shiki-default:#9ECBFF",[162],{"type":37,"value":163},"\"You are a marketing analyst. Answer this question: {{query}}\"\n",{"type":32,"tag":132,"props":165,"children":167},{"class":134,"line":166},3,[168,173],{"type":32,"tag":132,"props":169,"children":170},{"style":139},[171],{"type":37,"value":172},"providers",{"type":32,"tag":132,"props":174,"children":175},{"style":145},[176],{"type":37,"value":148},{"type":32,"tag":132,"props":178,"children":180},{"class":134,"line":179},4,[181,185],{"type":32,"tag":132,"props":182,"children":183},{"style":145},[184],{"type":37,"value":157},{"type":32,"tag":132,"props":186,"children":187},{"style":160},[188],{"type":37,"value":189},"anthropic:messages:claude-3-5-sonnet-20241022\n",{"type":32,"tag":132,"props":191,"children":193},{"class":134,"line":192},5,[194,199],{"type":32,"tag":132,"props":195,"children":196},{"style":139},[197],{"type":37,"value":198},"tests",{"type":32,"tag":132,"props":200,"children":201},{"style":145},[202],{"type":37,"value":148},{"type":32,"tag":132,"props":204,"children":206},{"class":134,"line":205},6,[207,211,216],{"type":32,"tag":132,"props":208,"children":209},{"style":145},[210],{"type":37,"value":157},{"type":32,"tag":132,"props":212,"children":213},{"style":139},[214],{"type":37,"value":215},"vars",{"type":32,"tag":132,"props":217,"children":218},{"style":145},[219],{"type":37,"value":148},{"type":32,"tag":132,"props":221,"children":223},{"class":134,"line":222},7,[224,229,234],{"type":32,"tag":132,"props":225,"children":226},{"style":139},[227],{"type":37,"value":228},"      query",{"type":32,"tag":132,"props":230,"children":231},{"style":145},[232],{"type":37,"value":233},": ",{"type":32,"tag":132,"props":235,"children":236},{"style":160},[237],{"type":37,"value":238},"\"What are Q4 2025 e-commerce conversion trends?\"\n",{"type":32,"tag":132,"props":240,"children":241},{"class":134,"line":26},[242,247],{"type":32,"tag":132,"props":243,"children":244},{"style":139},[245],{"type":37,"value":246},"    assert",{"type":32,"tag":132,"props":248,"children":249},{"style":145},[250],{"type":37,"value":148},{"type":32,"tag":132,"props":252,"children":254},{"class":134,"line":253},9,[255,260,265,269],{"type":32,"tag":132,"props":256,"children":257},{"style":145},[258],{"type":37,"value":259},"      - ",{"type":32,"tag":132,"props":261,"children":262},{"style":139},[263],{"type":37,"value":264},"type",{"type":32,"tag":132,"props":266,"children":267},{"style":145},[268],{"type":37,"value":233},{"type":32,"tag":132,"props":270,"children":271},{"style":160},[272],{"type":37,"value":273},"contains\n",{"type":32,"tag":132,"props":275,"children":277},{"class":134,"line":276},10,[278,283,287],{"type":32,"tag":132,"props":279,"children":280},{"style":139},[281],{"type":37,"value":282},"        value",{"type":32,"tag":132,"props":284,"children":285},{"style":145},[286],{"type":37,"value":233},{"type":32,"tag":132,"props":288,"children":289},{"style":160},[290],{"type":37,"value":291},"\"conversion rate\"\n",{"type":32,"tag":132,"props":293,"children":295},{"class":134,"line":294},11,[296,300,304,308],{"type":32,"tag":132,"props":297,"children":298},{"style":145},[299],{"type":37,"value":259},{"type":32,"tag":132,"props":301,"children":302},{"style":139},[303],{"type":37,"value":264},{"type":32,"tag":132,"props":305,"children":306},{"style":145},[307],{"type":37,"value":233},{"type":32,"tag":132,"props":309,"children":310},{"style":160},[311],{"type":37,"value":312},"cost\n",{"type":32,"tag":132,"props":314,"children":316},{"class":134,"line":315},12,[317,322,326],{"type":32,"tag":132,"props":318,"children":319},{"style":139},[320],{"type":37,"value":321},"        threshold",{"type":32,"tag":132,"props":323,"children":324},{"style":145},[325],{"type":37,"value":233},{"type":32,"tag":132,"props":327,"children":329},{"style":328},"--shiki-default:#79B8FF",[330],{"type":37,"value":331},"0.05\n",{"type":32,"tag":33,"props":333,"children":334},{},[335,337,343,345,351,353,359],{"type":37,"value":336},"Run ",{"type":32,"tag":84,"props":338,"children":340},{"className":339},[],[341],{"type":37,"value":342},"promptfoo eval",{"type":37,"value":344}," and it sends requests to Claude API, runs outputs against assertions. ",{"type":32,"tag":84,"props":346,"children":348},{"className":347},[],[349],{"type":37,"value":350},"contains",{"type":37,"value":352}," assertion is simple — checks if the specified term appears in output. ",{"type":32,"tag":84,"props":354,"children":356},{"className":355},[],[357],{"type":37,"value":358},"cost",{"type":37,"value":360}," assertion monitors token usage — fails if threshold is exceeded. These two assertions alone answer: \"Does the prompt change produce the right terminology, and is there cost bloat?\"",{"type":32,"tag":33,"props":362,"children":363},{},[364,366,372],{"type":37,"value":365},"More powerful: ",{"type":32,"tag":84,"props":367,"children":369},{"className":368},[],[370],{"type":37,"value":371},"llm-rubric",{"type":37,"value":373},". You route output to another LLM (e.g., GPT-4o) for scoring. Example: \"Does this text portray the brand positively?\" — GPT-4o scores on a 1-5 scale. Compare average scores across all test cases before and after a prompt change — if regression exists, you see it quantified.",{"type":32,"tag":33,"props":375,"children":376},{},[377],{"type":37,"value":378},"At Roibase, our blog-writing pipeline has 30+ test cases — each a different keyword + category combination. Promptfoo runs nightly in CI\u002FCD, collecting metrics: average readingTime, internal link count, headline length. If a new prompt version drops readingTime below 7 (target is 7-8), it fails. We see it before merge.",{"type":32,"tag":40,"props":380,"children":382},{"id":381},"production-observability-with-langsmith",[383],{"type":37,"value":384},"Production Observability with LangSmith",{"type":32,"tag":33,"props":386,"children":387},{},[388],{"type":37,"value":389},"Promptfoo is perfect for local testing but doesn't see what happens in production. LangSmith (LangChain team's product) fills that gap: logs every LLM call, traces latency\u002Ftokens\u002Fcost, captures errors. Python\u002FJS SDKs available, also callable from n8n HTTP nodes. Traces appear in the web UI — which prompt produced which output, how many tokens, how many seconds, all on one screen.",{"type":32,"tag":33,"props":391,"children":392},{},[393],{"type":37,"value":394},"LangSmith's critical feature: convert production traces into datasets and eval against them. Example: you generated 500 blog posts over a week, 10% needed manual edits due to \"insufficient internal links.\" Filter those 50 traces in LangSmith, save as \"regression test dataset.\" Now when you change prompts, test against this dataset — see if you're recreating past failures.",{"type":32,"tag":33,"props":396,"children":397},{},[398],{"type":37,"value":399},"Another feature: human feedback annotation. In LangSmith UI, you thumbs up\u002Fdown each trace. Over time, high-feedback-score traces become your \"golden dataset.\" Test new prompt versions against it — if golden set performance drops, don't deploy. It's manual but scalable. At Roibase, our editorial team reviews 20-30 outputs per week in LangSmith, annotates them. This data becomes the eval pipeline's ground truth.",{"type":32,"tag":33,"props":401,"children":402},{},[403,405,411,413,419,420,426],{"type":37,"value":404},"Token cost tracking is also embedded. Each trace shows ",{"type":32,"tag":84,"props":406,"children":408},{"className":407},[],[409],{"type":37,"value":410},"total_tokens",{"type":37,"value":412},", ",{"type":32,"tag":84,"props":414,"children":416},{"className":415},[],[417],{"type":37,"value":418},"prompt_tokens",{"type":37,"value":412},{"type":32,"tag":84,"props":421,"children":423},{"className":422},[],[424],{"type":37,"value":425},"completion_tokens",{"type":37,"value":427},". Configure model pricing (Anthropic's per-token rate), LangSmith auto-calculates cost. Dashboard shows \"total LLM cost last 30 days\" graph. If that trend breaks after a prompt change, rollback is the reason.",{"type":32,"tag":40,"props":429,"children":431},{"id":430},"measuring-cost-quality-tradeoffs",[432],{"type":37,"value":433},"Measuring Cost-Quality Tradeoffs",{"type":32,"tag":33,"props":435,"children":436},{},[437],{"type":37,"value":438},"Production LLM operations' most critical balance: should you use a more capable (more expensive) model, or longer prompts for better output? Claude Opus 3.5 or Sonnet 3.5? Temperature 0.7 or 0.3? Every decision is a tradeoff. Deciding without measurement is gambling. An eval pipeline quantifies it.",{"type":32,"tag":33,"props":440,"children":441},{},[442,444,449],{"type":37,"value":443},"Example scenario: your blog pipeline uses Claude 3.5 Sonnet, averaging 1500 output tokens, $0.015\u002Frequest. Would switching to Opus improve quality? A\u002FB test in Promptfoo: send the same 50 test cases to both models, run outputs through GPT-4o with ",{"type":32,"tag":84,"props":445,"children":447},{"className":446},[],[448],{"type":37,"value":371},{"type":37,"value":450}," assertion. Result: Opus average quality score 4.2, Sonnet 3.9. 8% difference. Cost: Opus $0.045\u002Frequest, 3× more expensive. Decision: does 8% quality improvement justify 3× cost increase? If editorial workload drops 20% (less manual editing needed), ROI is positive. If the difference doesn't reach users, stick with Sonnet.",{"type":32,"tag":33,"props":452,"children":453},{},[454],{"type":37,"value":455},"Different tradeoff: prompt length. Add 200 tokens of context to system prompt and output gets more specific, but every request costs 200 more tokens. At 10K requests\u002Fmonth, that's 2M tokens = $6 extra cost (Sonnet input pricing). What's the return on that $6? Check annotation data in LangSmith: thumbs-down rate before was 15%, after is 8%. Is a 7% quality improvement worth $6? The team decides, but data exists — no guessing.",{"type":32,"tag":33,"props":457,"children":458},{},[459],{"type":37,"value":460},"Temperature is another tradeoff. Temperature 0 is deterministic but monotone. Temperature 0.7 is creative but sometimes off-topic. Test 0.0, 0.3, 0.7 versions in Promptfoo with assertion: \"internal link count 1-2?\" and \"readingTime 7-8?\". Temperature 0.7 fails 20% of test cases (links become 0 or 3), 0.3 fails 5%. Decision: stick at 0.3, production stability > creativity.",{"type":32,"tag":40,"props":462,"children":464},{"id":463},"regression-prevention-and-rollback-strategy",[465],{"type":37,"value":466},"Regression Prevention and Rollback Strategy",{"type":32,"tag":33,"props":468,"children":469},{},[470],{"type":37,"value":471},"Without prompt versioning, regression takes two weeks to notice. By then, production has generated 1000 bad outputs. When you notice, you don't know which version to roll back to. The eval pipeline ends this chaos: every commit is tested, fail means no merge. Regression never reaches production.",{"type":32,"tag":33,"props":473,"children":474},{},[475,477,483],{"type":37,"value":476},"At Roibase, our Git workflow: ",{"type":32,"tag":84,"props":478,"children":480},{"className":479},[],[481],{"type":37,"value":482},"main",{"type":37,"value":484}," branch is production prompt. Changes happen on feature branches, PR opened. GitHub Actions CI triggers Promptfoo eval. Eval passes, reviewer approves, merge. Eval fails, PR blocks. This discipline means zero production prompt regressions in six months — all caught at PR stage.",{"type":32,"tag":33,"props":486,"children":487},{},[488,490,496],{"type":37,"value":489},"Rollback mechanism: every production trace in LangSmith is tagged with its prompt version. If post-deploy problems appear (e.g., internal link ratio drops), filter LangSmith's last 100 traces, check which commit hash produced them. Find that commit in Git, ",{"type":32,"tag":84,"props":491,"children":493},{"className":492},[],[494],{"type":37,"value":495},"git revert",{"type":37,"value":497}," it, open a new PR. Revert PR also passes eval — you verify the old version is still valid. Merge, deploy. Rollback is done in 15 minutes.",{"type":32,"tag":33,"props":499,"children":500},{},[501,503,510],{"type":37,"value":502},"Another strategy: canary deployment. Send the new prompt version to 10% of production traffic, keep 90% on the old version. Watch both versions' metrics side-by-side in LangSmith: latency, cost, thumbs up\u002Fdown ratio. After 24 hours, if the new version outperforms at 10%, scale to 50%, then 100%. Poor performance drops it to 0%, rollback. This strategy relies on ",{"type":32,"tag":62,"props":504,"children":507},{"href":505,"rel":506},"https:\u002F\u002Fwww.roibase.com.tr\u002Fen\u002Ffirstparty",[66],[508],{"type":37,"value":509},"First-Party Data & Measurement Architecture",{"type":37,"value":511}," — if production events are readable in real time, canary works; if not, it doesn't.",{"type":32,"tag":40,"props":513,"children":515},{"id":514},"integrating-the-eval-pipeline-into-team-process",[516],{"type":37,"value":517},"Integrating the Eval Pipeline into Team Process",{"type":32,"tag":33,"props":519,"children":520},{},[521],{"type":37,"value":522},"Setting up eval tooling is easy; adoption is hard. Without team adoption, the tool is dead. At Roibase, we built adoption through: (1) At least one prompt iteration PR expected per sprint. (2) PR review checklist includes \"Promptfoo eval passed?\" (3) Weekly LLM ops meeting reviews LangSmith dashboard — which traces got thumbs down, why? (4) Quarterly prompt audit: all production prompts tested against regression dataset, refactored if performance drops.",{"type":32,"tag":33,"props":524,"children":525},{},[526],{"type":37,"value":527},"The team initially resisted, saying \"writing evals is extra work.\" By sprint two they noticed: without eval, each change takes 3 days to test (manually), with eval it's 10 minutes. Manual testing misses edge cases, eval suite doesn't. Adoption grew. Now engineers write test cases first, then iterate the prompt — TDD mindset. This discipline raised prompt quality 40% (by annotation data).",{"type":32,"tag":33,"props":529,"children":530},{},[531,533,540],{"type":37,"value":532},"Another adoption lever: cost reporting. We opened the LangSmith dashboard to our CFO, showed monthly LLM spend. CFO asked, \"how do we optimize this?\" Answer: eval pipeline tests model\u002Ftemperature\u002Fprompt-length tradeoffs, putting the most efficient config in production. Next quarter we cut costs 15% (with zero quality regression). CFO saw data, approved tooling budget. Moved to LangSmith Plus (team plan, unlimited traces). Now all LLM workflows are in LangSmith — not just content generation, also our SQL generation workflow in ",{"type":32,"tag":62,"props":534,"children":537},{"href":535,"rel":536},"https:\u002F\u002Fwww.roibase.com.tr\u002Fen\u002Fverianalizi",[66],[538],{"type":37,"value":539},"Data Analysis & Insights Engineering",{"type":37,"value":541},".",{"type":32,"tag":543,"props":544,"children":545},"hr",{},[],{"type":32,"tag":33,"props":547,"children":548},{},[549],{"type":37,"value":550},"Prompt versioning and eval discipline aren't optional in 2026 — they're foundational to production LLM operations. Use Promptfoo to prevent regression, LangSmith to observe production, eval to measure cost-quality tradeoffs. Every prompt change is a hypothesis, eval results are validation. If you don't have a rollback mechanism, don't deploy. Without team adoption, tooling is dead — embed it in process, decide with data. Now act: take your current LLM workflow, write 10 test cases, set up Promptfoo, run the first eval. When you catch the first regression, you'll see the discipline's value.",{"type":32,"tag":552,"props":553,"children":554},"style",{},[555],{"type":37,"value":556},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}",{"title":16,"searchDepth":166,"depth":166,"links":558},[559,560,561,562,563,564],{"id":42,"depth":151,"text":45},{"id":74,"depth":151,"text":77},{"id":381,"depth":151,"text":384},{"id":430,"depth":151,"text":433},{"id":463,"depth":151,"text":466},{"id":514,"depth":151,"text":517},"markdown","content:en:ai:prompt-versioning-and-ab-testing-llm-operations-discipline.md","content","en\u002Fai\u002Fprompt-versioning-and-ab-testing-llm-operations-discipline.md","en\u002Fai\u002Fprompt-versioning-and-ab-testing-llm-operations-discipline","md",1782079495816]