GenAI Series - The Art and Science of Prompt Engineering: A Comprehensive Guide
Beyond Basic Queries: Advanced Methods to Unlock AI's Full Potential
I have been pretty skeptical of prompt engineering as a discipline. In one of my last posts, “Prompt Engineering: Why Your Tech Career Shouldn't Rest on Vibes and Creative Writing”, you can see my bias against prompt engineering as a discipline and my disregard for people who call themselves “Prompt Engineers”. But that is not to say that skipping it is an option for ML engineers. In the words of some political commentators, “Two things can be true at once”. In the case of prompt engineering, both are true: it is a valuable skill, and we don't need a separate job role called prompt engineer. So, this is where we are with prompt engineering. I think of it as veering somewhere between a science and an art form, something that can make or break your model outputs, and something you should know about if you want to put these large language models into production.
In this post, I will talk about prompt engineering, best practices, the various techniques used while creating prompts, and how you could steer the AI behaviour to your advantage.
What is Prompt Engineering?
So, we start with what Prompt engineering is. Is it just giving the model an instruction in natural language, or is it something more? In my view, prompt engineering is the process of “designing” and “refining” inputs that effectively guide AI systems to produce your desired outputs. Think of it as learning to speak the language of AI - the better you communicate your intentions, the better results you'll get.
So, how do we do that? Let's dive into the fundamentals, explore various techniques, and learn how to masterfully craft prompts that get the best out of these powerful systems.
Basic Principles and Best Practices
1. Be Clear and Specific
The cardinal rule of prompt engineering is clarity. Vague prompts lead to vague responses. The more specific you are about what you want, the better the model can tailor its response.
Example:
Vague: "Tell me about stars."
Better: "Explain the life cycle of a main sequence star like our Sun, focusing on the key stages from formation to eventual fate."
The second example spells out exactly what you are looking for, while the first is generic. Adding specific details narrows the scope and guides the model toward the information you actually want.
Here is how the outputs look with a large model like Claude 3.7 Sonnet:
2. Provide Context
Context is king in AI interactions. Giving relevant background information significantly improves the quality and relevance of responses.
Example:
Limited: "Write code to sort a list."
Better: "I'm a Python beginner learning data structures. Write Python code to sort a list of integers using the merge sort algorithm with explanatory comments."
The additional context allows the AI to tailor its response to your knowledge level and specific needs. As you can see, the first prompt is likely to just reach for Python's built-in sorted() function, while the second one produces a more “useful” answer.
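For reference, here is a minimal sketch of the kind of answer the second prompt might produce; the exact code and comments will vary from run to run:

```python
def merge_sort(items):
    """Sort a list of integers using merge sort (O(n log n))."""
    # Base case: a list of 0 or 1 elements is already sorted.
    if len(items) <= 1:
        return items

    # Split the list in half and sort each half recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves back together in order.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]
```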
3. Structure Your Prompts
Well-structured prompts help AI models organize their responses. Consider using bullet points, numbered lists, or clear sections to guide the structure of the response you want.
Example:
Unstructured: "Help me with my career change to Machine Learning."
Structured: "I'm considering a career change from marketing to data science. Please address:
Three transferable skills from Software Engineering that apply to Machine Learning
Essential technical skills I should develop (in order of priority)
Suggested timeline for transition (6-month milestones)"
This structure not only helps the model organize its thoughts but also makes the information more digestible for you.
4. Use Examples (Few-Shot Prompting)
One of the most potent techniques in prompt engineering is showing examples of the desired outputs, especially for complex or specific tasks.
Example:
Classify these sentences as positive, neutral, or negative:
Example 1: "The food was delicious." - POSITIVE
Example 2: "The restaurant was quite crowded." - NEUTRAL
Example 3: "The service was terrible." - NEGATIVE
Now classify: "The movie had beautiful cinematography but a confusing plot."
By showing the model exactly what you expect, you greatly increase the chances of getting a response in the “format” and “style” you want.
5. Iterative Refinement
Iterative refinement is one of the basic concepts of everything in the world of Software Engineering and Machine Learning. You don't create the most accurate model on the first try; it is all about putting the bricks together piece by piece to build your castle. So, one of the most important insights about prompt engineering is that it's rarely perfect on the first try. The most effective prompts are the output of an iterative process: refining prompts based on the responses you get. And while working iteratively, it makes a lot of sense to store and document your prompt attempts. You can create a prompt versioning system using just a text file with fields for:
Prompt name and version
Goal of the attempt
Model and configuration (temperature, token limit, etc.)
The full prompt text
The resulting output
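As a minimal sketch, a single entry in such a file might look like the following; the field names and values are just illustrative, not a fixed schema:

```python
# One illustrative prompt-log entry; the fields mirror the list above.
prompt_log_entry = {
    "name": "support-ticket-summarizer",
    "version": "v3",
    "goal": "Summarize tickets into one sentence plus a severity label",
    "model": "claude-3.7-sonnet",  # whichever model you actually tested
    "config": {"temperature": 0.2, "max_tokens": 512},
    "prompt": "Summarize the following support ticket ...",
    "output": "Paste the model's response here for later comparison",
}
```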
This documentation becomes invaluable when:
You need to revisit your work after a break
You want to test performance across different model versions
You're debugging unexpected behaviors
You're collaborating with others on prompt design
Treating prompt engineering as a systematic, documented, iterative process transforms it from an art of guesswork into a science of incremental improvement—just like we approach software development and machine learning model training.
Prompting Strategies: When and How to Use Them
Let's explore various prompting techniques that have emerged from research and practice:
1. Zero-Shot Prompting
When to use: For straightforward tasks where the model likely has sufficient background knowledge.
How it works: Simply ask the model to perform a task without examples.
Example:
Summarize the key arguments in favor of renewable energy.
This approach was formalized in the groundbreaking paper "Language Models are Few-Shot Learners", which demonstrated that large language models can perform tasks without specific training examples.
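In code, a zero-shot prompt is just the task sent as a single user message. Here is a minimal sketch using the Anthropic Python SDK; the model name is a placeholder, and the same pattern works with any provider's API:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Zero-shot: no examples, just the task statement.
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # placeholder; use any model you have access to
    max_tokens=500,
    messages=[
        {"role": "user",
         "content": "Summarize the key arguments in favor of renewable energy."}
    ],
)
print(response.content[0].text)
```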
2. Few-Shot Prompting
This is very useful for tasks requiring specific formatting, style, or approach that might not be immediately obvious. Here, we provide examples of the desired input-output pattern before asking for a new output.
Example:
Convert these sentences to past tense:
Input: I walk to the store.
Output: I walked to the store.
Input: She runs every morning.
Output: She ran every morning.
Input: They build sandcastles at the beach.
Output:
The research shows that providing just a few examples can dramatically improve performance on specialized tasks, even when the model hasn't been specifically fine-tuned for them.
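In code, few-shot prompting is the same API call with the examples prepended to the prompt. A minimal sketch, again with a placeholder model name:

```python
import anthropic

client = anthropic.Anthropic()

# The examples demonstrate the input/output pattern; the last line is left open.
few_shot_prompt = """Convert these sentences to past tense:

Input: I walk to the store.
Output: I walked to the store.

Input: She runs every morning.
Output: She ran every morning.

Input: They build sandcastles at the beach.
Output:"""

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # placeholder model name
    max_tokens=50,
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.content[0].text)  # expected: "They built sandcastles at the beach."
```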
3. Chain-of-Thought Prompting
This is very useful for complex reasoning tasks, mathematical problems, or multi-step analyses. Here, we just prompt the model to work through a problem step by step and show its reasoning.
Example:
Solve this math problem step by step:
If a store sells 150 shirts at $25 each and has overhead costs of $1,500, what is the profit?
This technique was formalized in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., 2022), which demonstrated significant improvements in complex reasoning tasks. By instructing the model to reason step-by-step, we can often get more accurate results, especially for problems requiring logical deduction or calculation.
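For reference, a response that follows the requested step-by-step format might look something like this (the wording will vary, but the arithmetic should not):
Step 1: Revenue = 150 shirts × $25 = $3,750
Step 2: Profit = revenue minus overhead = $3,750 - $1,500 = $2,250
Answer: The profit is $2,250.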
4. Self-Consistency
This one is a pretty interesting approach for tasks that need high accuracy, like mathematics or logical reasoning, though it is not always scalable because of the extra compute it requires. Here, we generate multiple independent solutions to the same problem through slightly different prompts (or higher-temperature sampling), then take the most common answer.
For example, we ask the same math problem multiple times with some sampling strategy and choose the answer that the majority of outputs agree on.
Research shows that sampling multiple reasoning paths and taking the majority answer significantly improves performance on reasoning tasks. This approach was introduced in "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., 2022).
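A minimal sketch of self-consistency, assuming a hypothetical ask_llm(prompt, temperature) helper that returns just the model's final answer:

```python
from collections import Counter

def self_consistent_answer(prompt, ask_llm, n_samples=5):
    """Sample several reasoning paths and return the most common final answer."""
    answers = []
    for _ in range(n_samples):
        # Non-zero temperature so each sample follows a different reasoning path.
        answers.append(ask_llm(prompt, temperature=0.8))
    # Majority vote across the sampled answers.
    return Counter(answers).most_common(1)[0][0]

# Usage (ask_llm is whatever wrapper you already have around your model API):
# best = self_consistent_answer("Solve step by step: ...", ask_llm)
```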
5. Tree of Thoughts
This is an extension of Chain-of-Thought that explores multiple reasoning paths simultaneously, allowing for backtracking and evaluation of different approaches.
This technique introduces a powerful way to tackle complex problems by exploring various solution paths and selecting the most promising ones - similar to how humans might brainstorm different approaches to a difficult problem.
Example:
Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave.
The question is...
There are five houses of different colors next to each other. In each house lives a man. Each man has a unique nationality, an exclusive favorite drink, a distinct favorite brand of cigarettes and keeps specific pets. Use all the clues below to fill the grid and answer the question: "Who owns the fish?"
- The Brit lives in the Red house.
- The Swede keeps Dogs as pets.
- The Dane drinks Tea.
- The Green house is exactly to the left of the White house.
- The owner of the Green house drinks Coffee.
- The person who smokes Pall Mall rears Birds.
- The owner of the Yellow house smokes Dunhill.
- The man living in the centre house drinks Milk.
- The Norwegian lives in the first house.
- The man who smokes Blends lives next to the one who keeps Cats.
- The man who keeps Horses lives next to the man who smokes Dunhill.
- The man who smokes Blue Master drinks Beer.
- The German smokes Prince.
- The Norwegian lives next to the Blue house.
- The man who smokes Blends has a neighbour who drinks Water.
In this example, the model explores multiple reasoning paths systematically and compares the results, leading to a more reliable conclusion than a single-path approach. This approach was formalized in “Large Language Model Guided Tree-of-Thought”.
I used Claude 3 Opus for this. The ToT approach gives the right answer, that the German owns the fish, whereas just asking the question zero-shot gave the Norwegian as the answer.
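Under the hood, Tree of Thoughts is usually implemented as a small search loop around the model: propose a few candidate next steps, score them, keep the best, and expand again. A rough sketch, assuming hypothetical propose_steps and score_state helpers that each wrap a model call:

```python
def tree_of_thought(problem, propose_steps, score_state,
                    depth=3, beam_width=2, branching=3):
    """Breadth-first search over partial reasoning paths, keeping the best few."""
    frontier = [""]  # each state is the reasoning text produced so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            # Ask the model for a few alternative next steps from this state.
            for step in propose_steps(problem, state, n=branching):
                candidates.append(state + "\n" + step)
        # Score each partial solution (via the model or a heuristic),
        # then keep only the most promising ones; this is the pruning/backtracking.
        candidates.sort(key=lambda s: score_state(problem, s), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]  # the best complete reasoning path found
```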
6. ReAct (Reasoning and Acting)
This one combines reasoning with action-taking, allowing the model to interact with external tools or APIs as part of its problem-solving process.
This approach, introduced in the paper “ReAct: Synergizing Reasoning and Acting in Language Models”, essentially allows the model to "think" about what it needs to do, take an action to gather information or perform operations, observe the results, and then continue reasoning. It's particularly powerful for tasks that require both intellectual reasoning and practical information gathering. Current LLMs like Claude 3.7 Sonnet do this well: Claude can generate and run JavaScript code, do web searches, and use other provided MCP tools.
Example:
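An illustrative ReAct-style trace (the question, tool names, and observations here are hypothetical):
Question: What is the population of the capital of Australia?
Thought: I need to find the capital of Australia, then look up its population.
Action: search("capital of Australia")
Observation: The capital of Australia is Canberra.
Thought: Now I need Canberra's population.
Action: search("Canberra population")
Observation: Canberra has roughly 450,000 residents.
Thought: I now have enough to answer.
Answer: The capital of Australia is Canberra, with a population of roughly 450,000.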
7. Step-Back Prompting
What it is: Encourages the model to "step back" and consider a general context or principle before addressing a specific problem.
Step-back prompting is a technique for improving performance by prompting the LLM to first consider a general question related to the specific task at hand and then feeding the answer to that general question into a subsequent prompt for the specific task.
This technique helps the model access broader knowledge and principles before diving into specifics, often leading to more accurate and insightful responses.
Example:
Before analyzing this specific coding bug, let's step back and consider: What are the general principles of debugging memory leaks in Python applications?
[Model provides general principles of debugging memory leaks]
Now, with these principles in mind, let's examine this specific code snippet that's causing a memory leak:
```python
def process_data(data_stream):
    results = []
    for chunk in data_stream:
        processed = transform(chunk)
        results.append(processed)
    return results
```
What's likely causing the memory leak and how would you fix it?
This is like a two-step prompt with RAG: in the first prompt you generate information that could be useful for your task, and in the second you provide that retrieved information as context while asking the real question.
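A minimal sketch of that two-step flow, assuming a hypothetical ask_llm(prompt) wrapper around whichever model API you use:

```python
def step_back_answer(specific_question, step_back_question, ask_llm):
    """First ask the broader question, then feed its answer into the specific one."""
    # Step 1: ask the general, "stepped back" question.
    principles = ask_llm(step_back_question)

    # Step 2: ask the specific question with the general answer as context.
    prompt = (
        f"Here are some general principles to keep in mind:\n{principles}\n\n"
        f"With these principles in mind, answer the following:\n{specific_question}"
    )
    return ask_llm(prompt)
```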
Output Control: Temperature, Top-K, and Top-P Settings
I am pretty sure you have also been confused by the different parameters exposed by these model APIs. Here, we will try to understand these parameters, which are crucial for fine-tuning AI outputs to match your specific needs:
1. Temperature
What it is: Temperature controls the randomness in token selection. Lower values make responses more deterministic and focused, while higher values increase creativity and diversity.
Range: Usually 0 to 1
When to adjust:
Lower temperature (0-0.3) for factual content, code generation, structured data, or when accuracy is critical
Mid temperature (0.4-0.7) for general content creation, explanations, or conversational responses
Higher temperature (0.8-1.0) for creative writing, brainstorming, or generating diverse alternatives
A low temperature setting mirrors a low softmax temperature (T), emphasizing a single, preferred token with high certainty. A higher temperature setting allows for more diversity, making a wider range of tokens more acceptable.
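To make that concrete, here is a small sketch of how temperature rescales the softmax over next-token logits; the logit values are made up:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpening or flattening them by temperature."""
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.1]                   # hypothetical next-token logits
print(softmax_with_temperature(logits, 0.2))    # sharply peaked on the top token
print(softmax_with_temperature(logits, 1.0))    # the unscaled, more diverse distribution
```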
2. Top-K Sampling
What it is: Restricts token selection to only the K most likely next tokens. Higher values allow for more diversity, while lower values keep responses more focused.
Range: Usually from 1 to 100
The higher the top-K, the more creative and varied the model's output; the lower the top-K, the more restricted and factual the model's output. A top-K of 1 is equivalent to greedy decoding or temperature = 0.
3. Top-P (Nucleus) Sampling
What it is: Restricts token selection to the smallest set of tokens whose cumulative probability exceeds probability P. This is a more dynamic approach than Top-K.
Range: 0 to 1
Top-P sampling selects the smallest set of top tokens whose cumulative probability reaches the threshold P. Values for P range from 0 (greedy decoding) to 1 (all tokens in the LLM's vocabulary).
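Here is a rough sketch of how top-K and top-P narrow the candidate pool before sampling; the probabilities are made up and assumed to be sorted in descending order:

```python
import numpy as np

probs = np.array([0.45, 0.25, 0.15, 0.08, 0.05, 0.02])  # hypothetical, sorted descending

# Top-K: keep only the K most likely tokens, then renormalize.
k = 3
top_k = probs[:k] / probs[:k].sum()

# Top-P (nucleus): keep the smallest prefix whose cumulative probability reaches P.
p = 0.9
cutoff = np.searchsorted(np.cumsum(probs), p) + 1
top_p = probs[:cutoff] / probs[:cutoff].sum()

print(top_k)  # 3 tokens survive
print(top_p)  # the first 4 tokens survive (0.45 + 0.25 + 0.15 + 0.08 = 0.93 >= 0.9)
```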
Practical Combinations
These parameters often interact with each other. Here are some practical combinations for different scenarios:
For factual, accurate responses:
Temperature: 0.1-0.2
Top-K: 20-30
Top-P: 0.9
For balanced creative writing:
Temperature: 0.7
Top-K: 40
Top-P: 0.95
For maximum creativity (brainstorming):
Temperature: 0.9
Top-K: 50
Top-P: 0.99
For coding tasks:
Temperature: 0.0-0.1
Top-K: 10
Top-P: 0.9
As a general starting point, a temperature of 0.2, top-P of 0.95, and top-K of 30 will give you relatively coherent results that can be creative but not excessively so.
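As a sketch, here is how those starting values might be passed with the Anthropic Python SDK; parameter names and supported options differ between providers (OpenAI's API, for example, does not expose top-K), so check the docs for whichever API you use:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # placeholder; use whichever model you have access to
    max_tokens=800,
    temperature=0.2,  # mostly deterministic, slightly creative
    top_p=0.95,
    top_k=30,
    messages=[{"role": "user",
               "content": "Explain the life cycle of a main sequence star."}],
)
print(response.content[0].text)
```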
Conclusion
Mastering prompt engineering is a journey that combines technical knowledge with creativity and experimentation. The techniques I've outlined here provide a solid foundation, but the real learning comes from hands-on practice and iteration.
As you experiment with these techniques, remember that different models may respond differently to the same prompts, and what works for one task might not work for another. Stay curious, keep refining your approaches, and maintain a mindset of continuous improvement. The field of prompt engineering is still evolving rapidly, with researchers and practitioners constantly discovering new techniques and best practices.
By staying informed about these developments and building your intuition through practice, you'll be well-equipped to harness the full potential of generative AI models.
I'd love to hear about your experiences with prompt engineering! What techniques have worked best for you? What challenges have you encountered?