You have access to AI agents. Now what?
The technical side of building an agent is well documented. You can set a goal, write instructions, connect tools, and start chatting. But the harder part happens before you open the editor. It happens when you sit down and answer: What should this agent do, and how will I know it is doing it well?
This article walks through the thinking process behind building an effective AI agent, from identifying a use case to evaluating results.
Start with a problem, not a technology
The most common mistake is starting with "I want to build an AI agent" instead of "I have a problem that an AI agent could solve."
Before you build anything, answer these questions:
- What task takes too long or happens inconsistently today? Look for repetitive, knowledge-heavy tasks where the answer exists somewhere but is hard to find or slow to assemble. Shift summaries, quality reports, data lookups, and rework guidance are good examples.
- Who does this task today, and what do they struggle with? Talk to the person. Understand where they lose time, where they make mistakes, and what information they wish they had faster.
- What would "good" look like if this task were done perfectly every time? If you cannot describe the ideal output, you are not ready to build an agent.
A strong use case has three traits:
- Clear inputs. The agent needs data to work with. If the data does not exist in your tables, connectors, or documents yet, start there first.
- Defined output. You can describe what a correct response looks like, and you can tell the difference between a good one and a bad one.
- Real operational value. Solving this problem saves time, reduces errors, or surfaces information that drives better decisions.
Example: A quality manager spends 30 minutes every morning pulling defect records, calculating yield, and writing a summary email. The data already lives in Tulip Tables. The output format is consistent. This is a strong use case. That is exactly what the Quality Report Agent does.
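To make the use case concrete, here is a minimal sketch of the calculation the Quality Report Agent automates. The record fields (`units_produced`, `defects`) and values are illustrative assumptions, not the actual Tulip Table schema.

```python
# Hypothetical defect records for one morning's report.
records = [
    {"station": "Assembly-1", "units_produced": 480, "defects": 12},
    {"station": "Assembly-2", "units_produced": 450, "defects": 27},
]

total_units = sum(r["units_produced"] for r in records)
total_defects = sum(r["defects"] for r in records)
yield_pct = 100 * (total_units - total_defects) / total_units

summary = (
    f"Produced {total_units} units with {total_defects} defects "
    f"({yield_pct:.1f}% yield)."
)
print(summary)
# → Produced 930 units with 39 defects (95.8% yield).
```

The agent's job is exactly this: pull the records, do the arithmetic, and phrase the result consistently, every morning, without the 30 minutes of manual work.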
Define the goal before you write a single word of the prompt
A goal is not a feature description. It is a one-sentence statement that answers: What role does this agent play, and what value does it deliver?
Write the goal as if you are writing a job description for a new team member. Be specific about:
- Who the agent serves. An agent for operators sounds different from an agent for engineers.
- What the agent does. Use action verbs. "Summarize," "analyze," "generate," "guide," "review."
- What the agent does NOT do. Boundaries matter. If the agent should not create records, modify data, or make assumptions, say that up front.
Weak goal: "Help with quality stuff."
Strong goal: "Generate concise, data-driven daily quality summaries for manufacturing teams by analyzing production, defect, inspection, and action data for a specific date."
The difference is that the strong goal tells you exactly what success looks like. You can test against it.
Design the prompt like a set of work instructions
Think of your prompt the same way you think about work instructions on the shop floor. A good work instruction tells the operator:
- What to do (task)
- What information to use (inputs)
- What the result should look like (outputs)
- What rules to follow (constraints)
- What to do when something goes wrong (edge cases)
Your AI agent prompt follows the same logic:
Task
State the action clearly. "Calculate and summarize the following quality performance indicators" is better than "look at quality data and figure something out."
Inputs
Tell the agent exactly where to find data. Reference specific table names and IDs. Name the fields it should query. Define how to filter by date, shift, or station. Do not assume the agent will figure out which table to use.
Outputs
Describe the format. Should the response be a table? A bulleted list? A paragraph under 150 words? Should it include charts? Specify the structure so the output is consistent every time.
Constraints
Constraints prevent the agent from going off track. Common constraints include:
- Use only data from connected sources. Do not make assumptions.
- If data is missing, say so explicitly.
- Do not exceed a word limit.
- Use clear language that shop floor personnel can understand.
- Only reference functions or articles from a specific approved list.
Capabilities and reminders
Reinforce the most important behaviors at the end. Think of this section as the "remember this" note you put at the bottom of a work instruction. If the agent tends to drift, this is where you pull it back.
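The five sections above can be assembled into one structured prompt. The sketch below shows the shape; the table name, field names, and word limit are illustrative assumptions, not a prescribed Tulip format.

```python
# A structured agent prompt built from the sections described above.
# "Defect Log", the field names, and the 150-word limit are examples.
PROMPT = """
## Task
Calculate and summarize daily quality performance indicators.

## Inputs
Query the "Defect Log" table. Filter records where the "Date" field
matches the requested date. Use the "Station" and "Defect Type" fields.

## Outputs
A summary under 150 words: total defects, yield, and top defect type.

## Constraints
- Use only data from connected sources. Do not make assumptions.
- If data is missing for the requested date, say so explicitly.
- Use clear language that shop floor personnel can understand.

## Reminders
Never estimate values that are not present in the table.
""".strip()
```

Keeping each section labeled makes the prompt easy to review and easy to tune: when a test fails, you know which section to tighten.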
Tip: The more specific and structured your prompt, the more reliable the agent. Vague prompts produce vague results. Treat prompt writing as engineering, not creative writing.
Choose the right tools for the job
AI agents in Tulip use tools to access data and perform actions. Tools are grouped by function:
- Data table tools let the agent query, read, update, or count records in Tulip Tables.
- App management tools let the agent inspect apps, steps, triggers, and variables.
- Station management tools let the agent work with shop floor stations.
- Educational tools let the agent search the knowledge base and Library.
- Connector functions let the agent interact with external systems like ERPs or databases.
- User management tools let the agent look up users, roles, and permissions.
Not every agent needs every tool. Give the agent only the tools it needs to complete its task. More tools do not make a better agent; they create more opportunities for the agent to take an unexpected path.
Ask yourself:
- Does this agent need to read data, or also write data?
- Does it need access to apps and triggers, or just tables?
- Does it connect to external systems, or only work within Tulip?
Match the tools to the task. Nothing more.
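One way to enforce "match the tools to the task" is to treat tool selection as an explicit, reviewable allowlist. The tool names below are illustrative stand-ins, not Tulip's actual tool identifiers.

```python
# Hypothetical catalog of available tools, grouped by function.
ALL_TOOLS = {
    "query_table_records",    # data table: read
    "update_table_record",    # data table: write
    "list_apps",              # app management
    "search_knowledge_base",  # educational
    "run_connector_function", # external systems
}

# A read-only reporting agent needs exactly one of them.
agent_tools = {"query_table_records"}

# Guard against granting a tool that was never vetted.
assert agent_tools <= ALL_TOOLS
```

Writing the selection down this way forces the three questions above to be answered deliberately: this agent reads data, never writes it, and never leaves Tulip.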
Think about evaluation before you deploy
Building the agent is half the work. Knowing whether it works is the other half.
Evaluation means testing your agent against a set of prompts where you already know what the correct answer should be. This is not optional. Without evaluation, you are guessing.
Build an evaluation table
Create a simple table with these columns:
- Prompt: The exact question or request you will send to the agent.
- Expected result: What the correct response should look like.
- Actual result: What the agent returned.
- Pass or fail: Did the actual result match the expected result?
- Notes: Context on why it passed or failed.
Start with 5 to 10 test prompts that cover:
- The normal case. A straightforward request the agent should handle easily.
- Edge cases. What happens when data is missing? When the user asks for a date with no records? When a field is null?
- Out-of-scope requests. What does the agent do when asked something it should not answer? Does it refuse clearly, or does it hallucinate?
- Ambiguous input. What happens when the user's question is vague? Does the agent ask for clarification or guess?
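The four categories above translate directly into a small test harness. In this sketch, `ask_agent` is a hypothetical stand-in for however you send a prompt to your agent; the canned responses exist only so the example runs.

```python
# Stub for illustration; replace with a real call to your agent.
def ask_agent(prompt: str) -> str:
    canned = {
        "What was the yield on 2024-03-01?": "Yield was 95.8%.",
        "What was the yield on 1999-01-01?": "No records found for that date.",
    }
    return canned.get(prompt, "I can only answer quality questions.")

test_cases = [
    # (prompt, check on the actual result)
    ("What was the yield on 2024-03-01?", lambda r: "95.8%" in r),       # normal case
    ("What was the yield on 1999-01-01?", lambda r: "No records" in r),  # edge case
    ("Write me a poem", lambda r: "only answer" in r),                   # out of scope
]

for prompt, check in test_cases:
    result = ask_agent(prompt)
    status = "PASS" if check(result) else "FAIL"
    print(f"{status}: {prompt}")
```

Each row mirrors a row in the evaluation table: the prompt, an expected result expressed as a check, and a pass/fail verdict you can rerun after every prompt change.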
Iterate based on results
If the agent fails a test:
- Check the prompt. Is the instruction clear enough? Did you define constraints for this scenario?
- Check the tools. Does the agent have access to the data it needs?
- Check the data. Is the data in your tables accurate and complete?
Prompt tuning is an iterative process. Expect to go through several rounds before the agent performs consistently. Each failure teaches you something about what the prompt is missing.
Example from practice: The Expression Editor agent was evaluated against prompts like "Add 7 days to a datetime variable" and "Extract a string between two identifiers." Each test case had a known correct expression. The agent passed when its output matched exactly. This kind of structured testing makes the agent trustworthy.
Common pitfalls to avoid
Building too broad. An agent that tries to do everything does nothing well. One agent, one clear responsibility. If you need multiple capabilities, build multiple agents.
Skipping the data foundation. AI agents process data that already exists. If your tables are empty, your fields are inconsistent, or your records are outdated, the agent will reflect that. Clean data in, useful output out.
Writing prompts like conversations. A prompt is not a chat message. It is a specification. Write it with the same precision you would use for a trigger or an automation.
Ignoring constraints. Without constraints, the agent will try to be helpful in ways you did not intend. It might synthesize information, assume root causes, or respond to questions outside its scope. Constraints are your safety rails.
Not testing with real users. Your evaluation table tests correctness. But you also need to watch real operators and engineers use the agent. They will ask questions you did not anticipate. Those questions become your next round of prompt improvements.
A practical framework to get started
Use this checklist before you build your next agent:
- I can describe the problem this agent solves in one sentence.
- I know who the primary user is and what they need.
- The data the agent needs already exists and is accurate.
- I can describe what a correct response looks like.
- I have written a goal, task, inputs, outputs, and constraints.
- I have selected only the tools the agent needs.
- I have built an evaluation table with at least 5 test cases.
- I have tested edge cases and out-of-scope requests.
- I have iterated on the prompt based on test results.
- I have observed a real user interacting with the agent.
If you can check every box, your agent is ready for production.