Hello 👋, I’m Lena Shakurova, Conversational AI Advisor & CEO and Founder of parslabs.org and chatbotly.co

LinkedIn: https://www.linkedin.com/in/lena-shakurova/

🌎 Amsterdam

📌 If you want another tool to be added, send me an email at [email protected]

<aside> 🕑

Last updated on 16.06.2025

I will try to keep this resource updated as I learn about new eval tools. New tools will be marked “WIP” until I get time to test them.

</aside>

<aside> 👌

If you need extra help with LLM evals, I offer audits, consultations, and full setup support to help your team build a proper LLM evaluation framework, so you can release to production with more confidence. Just reply to this email and I’ll send you more details.

Send me a DM on LinkedIn or book a free intro call and I’ll send you more details about our LLM eval setup service :)

</aside>

**How to evaluate LLM-based apps: Build with evidence, monitor what matters**

To know if you're improving, you must measure.

During development, evaluation shows whether prompt changes or model tweaks help or harm. Guessing slows you down.

But LLM evals are hard. Change one word in a prompt: does it help? Add a new instruction: do past use cases still pass?

LLMs are non-deterministic: the same input might produce five different outputs. How do you decide what's “good” or “bad”? Do you rely on human judgment, unit tests, or automatic scoring? And how do you catch silent regressions when nothing breaks, but quality slips?
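
One common way to handle non-determinism is to sample the same prompt several times and track a pass rate instead of a single pass/fail. The sketch below is only an illustration: `fake_model`, `mentions_three_days`, and `pass_rate` are hypothetical names, and `fake_model` is a stand-in you would replace with a real LLM call.

```python
import random
from typing import Callable


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: returns one of several phrasings."""
    answers = [
        "Your order ships in 3 days.",
        "It will ship within 3 days.",
        "Shipping takes 3 days.",
        "I cannot help with that.",  # an occasional bad output
    ]
    return rng.choice(answers)


def mentions_three_days(output: str) -> bool:
    """A simple automatic check: does the answer contain the expected fact?"""
    return "3 days" in output


def pass_rate(
    prompt: str,
    check: Callable[[str], bool],
    n_samples: int = 20,
    seed: int = 0,
) -> float:
    """Sample the model n_samples times and return the fraction that pass the check."""
    rng = random.Random(seed)  # seeded only so this toy example is repeatable
    passed = sum(check(fake_model(prompt, rng)) for _ in range(n_samples))
    return passed / n_samples


rate = pass_rate("When will my order ship?", mentions_three_days)
print(f"pass rate: {rate:.0%}")
```

Comparing this number before and after a prompt change is one way to notice a silent regression: nothing crashes, but the pass rate drops.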

In production, monitoring becomes critical. You need alerts when something fails, like the bot refusing basic tasks or drifting off-topic. Test sets help prevent this. Cover edge cases and simulate unexpected inputs: incomplete data, foreign languages, or hostile users.
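
As a rough sketch of what such a test set could look like, the example below runs a few edge cases against a bot and raises an alert flag when the pass rate falls under a threshold. Everything here is an assumption made for illustration: `toy_bot` is a hypothetical placeholder for your application, and the checks and the 0.9 threshold are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    user_input: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def toy_bot(user_input: str) -> str:
    """Hypothetical bot; replace with a call to your own application."""
    text = user_input.strip()
    if not text:
        return "Could you tell me a bit more about what you need?"
    if "idiot" in text.lower():
        return "I'm here to help. What can I do for you?"
    return f"Here is what I found about: {text}"


# Edge cases mirroring the text: incomplete data, foreign language, hostile user.
TEST_SET: List[EvalCase] = [
    EvalCase("incomplete data", "   ", lambda out: "more" in out.lower()),
    EvalCase("foreign language", "¿Dónde está mi pedido?", lambda out: len(out) > 0),
    EvalCase("hostile user", "You idiot bot", lambda out: "idiot" not in out.lower()),
]


def run_test_set(bot: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case through the bot and record whether its check passed."""
    return {case.name: case.check(bot(case.user_input)) for case in cases}


def should_alert(results: Dict[str, bool], min_pass_rate: float = 0.9) -> bool:
    """Flag an alert when the share of passing cases drops below the threshold."""
    rate = sum(results.values()) / len(results)
    return rate < min_pass_rate


results = run_test_set(toy_bot, TEST_SET)
for name, ok in results.items():
    print(f"{name}: {'pass' if ok else 'FAIL'}")
print("alert:", should_alert(results))
```

In a real setup the same test set could run on every prompt change during development and on a schedule in production, with `should_alert` wired to whatever notification channel the team already uses.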

This page lists tools to evaluate, test, and monitor LLMs through every stage of development and deployment.

This document includes:

No-code tools for LLM evaluation (including open source)
Python libraries
Moderation checks
Voice evaluation tools

No-code tools for LLM evaluation

<aside> 💡

Check the “Gallery” view for screenshots and the “Open Source” view to see all open source tools.

</aside>

Python libraries for LLM evaluation

Moderation checks

Voice evaluation tools

Does your team need help setting up an LLM evaluation workflow?

<aside> 👌

If you need extra help with LLM evals, I offer:

Audits of your current test setup
Consultations on the best evaluation approach for your use case
Full setup support to build a proper LLM evaluation framework for your project

All focused on helping your team release to production with more confidence.

📧 Send me a DM on LinkedIn or book a free intro call and I’ll send you more details about our LLM eval setup service :)

</aside>