I also pay my bills as an AI Engineer specializing in NLP and Large Language Models.
Lately, I've been diving into evaluation methods for generative AI and agentic systems, making sure we're building smart stuff with confidence. At my current job, I sit somewhere between engineering, research, and consulting.
<aside> <img src="/icons/info-alternate_gray.svg" alt="/icons/info-alternate_gray.svg" width="40px" />
Getting Started
Head to Instructions for the setup guide!
</aside>

OPINION
Sounds so 2025, but this could have been a headline from Byte Magazine in 1975, when C (introduced in 1972) was gaining popularity.
Imagine Assembly programmers worried their jobs would become obsolete.
Instead of manually writing Assembly code, it became possible to write code in a new language called C. The compiler would then generate the Assembly for you. Imagine that.

J O U R N A L
The most common question I hear when teams set up their evaluation framework: "How many samples do we need?"
This question also comes in different forms, like:
"How much time do we need from our domain experts to annotate the dataset?"
In large enterprises, domain experts are usually the company's most expensive resource. Getting one hour of their time is serious money.
So we need to justify exactly how much of their time we need to borrow to annotate data for our brand-new AI project.
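One way to put a number on that question, as a rough starting point: if the evaluation boils down to estimating a pass rate (what fraction of outputs our experts judge acceptable), the standard sample-size formula for a proportion gives a ballpark. This is a minimal sketch, not a full answer; the function name and defaults (95% confidence, worst-case p = 0.5) are my own illustrative choices.

```python
import math

def required_sample_size(margin_of_error: float,
                         confidence_z: float = 1.96,
                         p: float = 0.5) -> int:
    """Samples needed to estimate a pass rate within +/- margin_of_error.

    Uses the normal-approximation formula n = z^2 * p * (1 - p) / e^2.
    p = 0.5 is the worst case (maximum variance), so it is a safe default
    when we have no prior idea of the true pass rate.
    """
    n = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n)

# A +/- 5% margin at 95% confidence needs ~385 annotated samples;
# relaxing to +/- 10% drops that to ~97.
print(required_sample_size(0.05))  # 385
print(required_sample_size(0.10))  # 97
```

Multiply the result by the minutes an expert spends per sample and you have a defensible estimate of the hours you are asking them for.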