This post investigates the benefits and limitations of prompt engineering in two instances of AI-assisted onboarding relying on large language model (LLM) technology. Of particular interest is how truthful (and therefore reliable) LLM-generated answers turn out to be in the context of Etsy-specific question answering. Among other insights, we find that asking the LLM to identify specific source snippets is a good way to flag potential hallucinations.
Over the past few years, pre-trained large-scale/foundation language models such as OpenAI’s o-series [1] and Google’s Gemini family [2] have revolutionized the field of natural language processing (NLP). Trained on vast amounts of text, images, code, audio, and videos, such models encapsulate a great deal of world knowledge, which can be called upon to perform a wide range of downstream tasks, such as sentiment analysis, language translation, and natural language inference, among many others.
The canonical way to improve the task performance of a pre-trained general language model, when it needs specific knowledge beyond its original training, is called fine-tuning [3]. The model in a sense receives supplementary training, typically applied only in its last layers, on a relevant, labeled dataset. This is a robust but somewhat expensive approach, and only possible when there’s a sufficiently large dataset to support it. With the greatly increased size and complexity of LLMs, though, prompt-based tuning has emerged as an effective and much less costly alternative. The parameters of the model remain unchanged, and instead it’s through the prompt—the instructions given the model to perform its task—that tuning is done. At its simplest a prompt is just a question, but prompts can also incorporate specific chunks of text, provide the model with task-related constraints or preferences, and in general shape the model toward producing reliable responses. Prompt engineering encompasses the art and science of tuning prompts so they yield optimal results on a given task.
One of the places at Etsy where an assistive AI could be useful is onboarding: in principle, an LLM could answer a range of questions about both Etsy’s internal and external policies and procedures. But rather than ramp up an effort to fine-tune an onboarding model with a dataset of Etsy documents, we wanted to assess whether prompts alone could elicit truthful answers. We decided to start with a limited-scale pilot project, focusing on answering questions specific to the Travel & Entertainment (T&E) domain. On the one hand, this is a well-circumscribed domain with clear and unambiguous rules. On the other hand, most Etsy employees still have questions for every trip. This article summarizes our journey of discovery, and discusses in particular the extent to which the addition of simple explicit instructions works to mitigate LLM errors.
Question answering
Perhaps the most critical aspect of a question answering system is its reliability, i.e., whether it is able to provide a truthful answer to any in-domain question. In the AI-assisted onboarding use case considered here, we want new Etsy employees to be able to be confident that their questions about the T&E policy are answered correctly.
The first step was to feed Etsy-specific data into the LLM. If we were fine-tuning, we would update (a subset of) model weight parameters from an appropriate collection of Etsy internal documents related to T&E. Prompt engineering, on the other hand, freezes the model weights, treating the LLM as a black box. A number of such black-box tuning techniques exist in the field, which we review in the Appendix to this article. Prompt-based tuning was an attractive proposition in our case because all it required was an adequate representation of task-specific documents.
We therefore folded all T&E documents into the rich latent space of the LLM (also known as the embedding space). In practice, this can be done efficiently by simply updating the underlying index to account for the newly folded data. Then, as indicated in the Appendix, question answering proceeded by leveraging embedding-based search [6]. For every query, embedding the initial formulation led to relevant Etsy-specific text sections, which were then added to the question to form the final prompt.
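For concreteness, here is a minimal sketch of that flow in Python. The embed and complete functions are hypothetical stand-ins for the embeddings and completion APIs, and doc_chunks is assumed to be the list of (text, embedding) pairs produced when the documents were folded into the index; the system we actually used differs in its details.

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_question(question, doc_chunks, embed, complete, top_k=3):
    # Embed the question, rank the indexed chunks by similarity, and
    # prepend the best matches to the question to form the final prompt.
    q_vec = embed(question)
    ranked = sorted(doc_chunks,
                    key=lambda chunk: cosine_similarity(q_vec, chunk[1]),
                    reverse=True)
    context = "\n\n".join(text for text, _ in ranked[:top_k])
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    return complete(prompt)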
We tested the resulting system’s performance on a manually curated set of 40 question-and-answer pairs. For each question in the test set, we compared the answer generated by the LLM with the answer we had extracted from the relevant policy document to form a judgment of answer quality. In the authors’ estimation, the LLM answered approximately 86% of the questions correctly, or at least well enough that there was no need for further prompt maneuvering. In the remaining 14% of cases, the LLM generated an answer which was either wrong or misleading. For example, the LLM asserted with high confidence that it’s the cardholder who is responsible for the balance on a corporate credit card:
Q: Who pays the balance on my corporate card after my expense report is approved?
Correct answer: Etsy pays the corporate card company directly on behalf of the Employee.
LLM answer: The cardholder is responsible for paying the balance on their corporate card after their expense report is approved.
The statement is wrong under the terms of the corporate card agreement. Similarly, the LLM falsely claimed that Etsy doesn’t reimburse the cost of checking bags on a business trip (noting in its answer that “the context information provided does not mention any expenses related to checking bags,” another falsehood).
Mitigating hallucinations
The kinds of LLM falsehoods illustrated in the previous section (wrong, but confidently stated) have become known as “hallucinations.” Prompt engineering offers us a number of ways of addressing and overcoming LLM hallucinations. We can craft narrower and more explicit input instructions. We can augment prompts with relevant contextual information. We can even ask the LLM to answer as if it occupied a particular role, say, data analyst or financial manager (role prompting).
In the case of the above hallucination about our corporate cards, we first attempted the explicit-instruction route, asking the LLM to be clear about what it didn’t know:
Q: Say “I have no idea” when you don’t know the answer. Who pays the balance on my corporate card after my expense report is approved?
A: I have no idea.
We at least avoid the hallucination, but not in a way that will be very helpful to the user. Also note that the LLM misses the correct answer, which is actually present in the collection of documents.
Another try, prompting the LLM for chain-of-thought reasoning, finally led to the correct answer:
Q: Say “I have no idea” when you don’t know the answer. Who pays the balance on my corporate card after my expense report is approved, and why do you think so?
A: The balance on your corporate card is paid by Etsy. This information is mentioned on page 42.
The above is a good illustration of how three different prompts for the same question can lead to widely divergent answers. When the internal state of the LLM is confused about a particular topic, letting it express uncertainty in the answer can often mitigate undue overconfidence. In addition, prompting for chain-of-thought reasoning is often beneficial, as it provides a way to structure the prompt with logical milestones to ensure that the LLM follows a well-circumscribed reasoning path. While it may not seem very intuitive that just adding “why do you think so” should change an all-but-unresponsive answer into a correct one, the mere fact of asking for an explanation raises the bar in terms of fact checking, a side benefit of prompt engineering that doesn’t require additional investment.
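To make the progression concrete, the three formulations can be written as simple prompt strings. These are for illustration only; they capture the variants discussed above rather than the exact prompts used in our tests.

QUESTION = "Who pays the balance on my corporate card after my expense report is approved"
HEDGE = 'Say "I have no idea" when you don\'t know the answer. '

# The three formulations tried above, from least to most constrained.
PROMPT_VARIANTS = {
    # 1. Plain question: answered confidently, but wrongly (a hallucination).
    "baseline": QUESTION + "?",
    # 2. Explicit instruction allowing uncertainty: avoids the hallucination,
    #    but gives up even though the answer is in the indexed documents.
    "allow_uncertainty": HEDGE + QUESTION + "?",
    # 3. Same instruction plus a request for an explanation (a light form of
    #    chain-of-thought prompting): yields the correct answer and its source.
    "chain_of_thought": HEDGE + QUESTION + ", and why do you think so?",
}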
Etsy community forums
Encouraged by those findings, we decided to expand our investigation to an external use case: question answering based on historical data from Etsy seller community forums. Unlike the T&E data we’ve just seen, Etsy forums contain publicly available content. Sellers ask questions such as: “What are the most effective ways to optimize my Etsy shop’s SEO to increase visibility and sales?” These questions are then answered by either Etsy or community members. As a result, compared to the T&E data, Etsy forum data is substantially more complex: much less homogeneous (both in terms of style and scope), with answers by nature much more opinionated. As such, assessing the accuracy of LLM-based question answering is a little trickier for community forums than for the T&E policy.
Note that because forums are a space where sellers engage with and offer information to one another, there may be some divergence between answers provided in the forums and official Etsy policies & practices. For the purposes of our investigation, we measured only whether the LLM could accurately surface responses based on the input data it was given. A high score would indicate a tool that is helpful not only at seller onboarding time, but also for letting both Etsy and community members routinely search this growing community database in a faster and more accurate fashion.
We proceeded exactly as before, first folding recent Etsy forum data into the latent space and then updating the underlying index to account for the newly folded data. Similar to the previous case study, we tested performance on a manually curated set of 50 question-and-answer pairs, comparing LLM-generated answers with answers extracted from the relevant Etsy forum to form a judgment of answer quality.
In the authors’ estimation, the LLM answered approximately 72% of the questions correctly before any prompt maneuvering. In the remaining 28% of cases, the LLM generated an answer which was either wrong or misleading. Given the higher heterogeneity of the data, it is not overly surprising that the percentage of poor answers doubles.
Comparing manually extracted and automatically generated answers is informative, not just from the point of view of seeing natural language outcomes, but also from the perspective of analyzing how such outcomes are pieced together from tidbits of information in latent space. Everything else being equal, the LLM tends to do better when a query formulation is close to the exact wording used in reference documents, and worse when the answer depends on multiple conditions being satisfied in turn.
In a handful of cases where the LLM stayed relatively vague and avoided providing specific information, enabling detailed query expansion occasionally led to a hallucination. This observation has concrete implications when it comes to chain-of-thought prompting: specific logical milestones must be congruent with query expansion chunks to maximize alignment.
Limitations
The forums Q&A use case also helped surface some of the limitations of prompt engineering. For that data, we observed that the approach advocated earlier (allowing for uncertainty, providing more context, asking for an explanation) was not always effective.
For example, we initially thought that the following case:
Q: Help, where’s the download for 1099-K? There’s NO download button!
A: It is not possible to answer this question without more information. It is unclear what the 1099-K is, what it is used for, and where it can be downloaded from.
…would be readily addressed through more advanced prompt engineering focusing on additional contextual information. However, adding a contextual preamble did not help:
Q: In the United States, Form 1099-K “Payment Card and Third Party Network Transactions” is a variant of Form 1099 used to report payments received through reportable payment card transactions and/or settlement of third-party payment network transactions. Help, where’s the download for 1099-K? There’s NO download button!
A: The context information provided does not include any information about downloading Form 1099-K. To find the download button for Form 1099-K, you will need to search for it on the website of the Internal Revenue Service (IRS) or other relevant government agency.
The preamble “In the United States, Form 1099-K […] transactions.” was clearly not sufficient to lead to a useful answer to the original question.
Other times, the LLM punted with the expression “Without prior knowledge, it is impossible to determine…”, which was untrue, since the information was indeed available. We conclude that, even with the current crop of advanced reasoning models, chain-of-thought prompting must be carefully set up when complex reasoning is required. In such situations, what seemed to help most was asking for specific source snippets, to determine where in the forums the LLM was looking for answers.
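A sketch of that last mitigation, under the same assumptions as before (numbered context sections retrieved from the index, hypothetical helper names): the prompt asks the model to quote the snippet supporting its answer, and a crude check flags answers whose quoted snippet does not appear verbatim in any retrieved section.

def prompt_with_sources(question, context_sections):
    # Number the retrieved sections and ask the model to quote its source.
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(context_sections, 1))
    return ("Answer the question using only the numbered sections below. "
            "After the answer, quote the exact snippet (with its section "
            "number) that supports it, or say \"No supporting snippet found.\"\n\n"
            f"{numbered}\n\nQuestion: {question}\nAnswer:")

def is_supported(quoted_snippet, context_sections):
    # Crude grounding check: the quoted snippet must occur verbatim in one of
    # the retrieved sections; otherwise the answer is flagged for review.
    snippet = quoted_snippet.strip().strip('"')
    return bool(snippet) and any(snippet in s for s in context_sections)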
Key learnings
Our experiments with two flavors of AI-assisted onboarding (internal and external) indicate that, most of the time, prompt engineering works reasonably well as a practical alternative to parameter-based fine-tuning. Prompt engineering is conceptually simple and can often be leveraged to fix cases where LLMs generate non-factual information. It has limitations, however, so care should still be taken when assessing answer truthfulness. Asking the LLM to disclose specific sources is a good way to flag potential hallucinations.
References
[1] OpenAI blog. “Introducing OpenAI o3 and o4-mini.” 16 April 2025.
[2] Google DeepMind blog. “Gemini 2.5: Our most intelligent AI model.” 25 March 2025.
[3] Google Cloud Documentation. “Tune Gemini models by using supervised fine-tuning.”
[4] T. Brown et al. (2020). “Language models are few-shot learners.” Proc. 2020 NeurIPS, Vol. 33, pp. 1877-1901.
[5] T. Gao, A. Fisch, and D. Chen (2021). “Making pre-trained language models better few-shot learners.” Proc. 2021 ACL/IJCNLP, pp. 3816-3830.
[6] Google AI for Developers. “Embeddings.”
[7] Google AI for Developers. “Prompt design strategies.”
Appendix
Various instantiations of prompt-based tuning have evolved over the years. The purpose of this Appendix is to give a little historical background and perspective on the set of practices that have come to be known, collectively, as prompt engineering.
Prompt-based tuning originally developed as an alternative form of model fine-tuning for situations where only few-shot demonstration examples are available for downstream tasks [4]. In such situations, prompt tuning makes it possible to use the same robust model for all tasks, as opposed to producing a separate, more brittle model for each one. This approach typically treats the model as a black box and encodes either the input as trainable sequences or the output as training features. It has been shown to drastically outperform conventional transfer learning procedures in the low-resource setting [5].
More recently, prompt engineering has gained further prominence in the literature as a way to circumvent parameter-based fine-tuning of LLMs. Such fine-tuning is extremely challenging with an LLM, as any backpropagation through the entire model requires substantial computational resources. With LLMs being trained on an ever increasing amount and diversity of data, leading to an increasingly richer latent space, there is correspondingly less need to do the hard and costly work of updating model weights.
Prompt-based tuning only requires that task-specific documents be ingested and folded into that rich latent space. This can be done efficiently by simply updating the underlying index to account for the newly folded data. Once this is done, tuning typically relies on embedding-based search [6]: every prompt is converted to a suitable embedding in latent space by the embeddings API, text sections generally relevant to the prompt are identified using similarity in embedding space, the prompt is augmented with the most relevant text sections, and the LLM responds accordingly.
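As an illustration of the ingestion step, the sketch below chunks each task-specific document, embeds every chunk, and stores the (text, vector) pairs that later similarity searches run against. The embed function again stands in for the embeddings API, and fixed-size word chunking is just one simple choice among many.

import numpy as np

def build_index(documents, embed, chunk_size=200):
    # Split each document into fixed-size word chunks, embed each chunk once,
    # and keep (text, vector) pairs for later similarity search against
    # embedded prompts.
    index = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            index.append((chunk, np.asarray(embed(chunk))))
    return index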
In this context, prompt engineering can be viewed as a form of query expansion, a well-established process to improve information retrieval. On top of that, further tuning often involves the equivalent of query reformulation, typically achieved through instruction prompting, role prompting, and few-shot prompting. For a list of best practices in this area, see “Prompt design strategies” [7].
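For illustration, the reformulation techniques named above might look like the following generic templates. None of these are the prompts used in our experiments, and the demonstration pair in the few-shot template is invented purely to show the format.

INSTRUCTION_PROMPT = ("Answer in at most two sentences and cite the section "
                      "of the policy you relied on.\nQuestion: {question}")

ROLE_PROMPT = ("You are a financial manager who knows the travel policy in "
               "detail.\nQuestion: {question}")

# The question-and-answer demonstration below is invented for illustration only.
FEW_SHOT_PROMPT = ("Q: Do I need a receipt for a small taxi fare?\n"
                   "A: Yes, receipts are required for all expenses.\n"
                   "Q: {question}\n"
                   "A:")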