Optimize query responses with user feedback using Amazon Bedrock embedding and few-shot prompting


Bettering reaction high quality for consumer queries is very important for AI-driven programs, particularly the ones that specialize in consumer pleasure. For instance, an HR chat-based assistant will have to strictly practice corporate insurance policies and reply the use of a undeniable tone. A deviation from that may be corrected by way of suggestions from customers. This put up demonstrates how Amazon Bedrock, mixed with a consumer suggestions dataset and few-shot prompting, can refine responses for upper consumer pleasure. Via the use of Amazon Titan Text Embeddings v2, we show a statistically vital development in reaction high quality, making it a treasured instrument for programs looking for correct and personalised responses.

Fresh research have highlighted the price of suggestions and prompting in refining AI responses. Prompt Optimization with Human Feedback proposes a scientific strategy to finding out from consumer suggestions, the use of it to iteratively fine-tune fashions for progressed alignment and robustness. In a similar fashion, Black-Box Prompt Optimization: Aligning Large Language Models without Model Training demonstrates how retrieval augmented chain-of-thought prompting complements few-shot finding out by way of integrating related context, enabling higher reasoning and reaction high quality. Construction on those concepts, our paintings makes use of the Amazon Titan Text Embeddings v2 fashion to optimize responses the use of to be had consumer suggestions and few-shot prompting, reaching statistically vital enhancements in consumer pleasure. Amazon Bedrock already supplies an automatic prompt optimization function to robotically adapt and optimize activates with out further consumer enter. On this weblog put up, we exhibit the best way to use OSS libraries for a extra custom designed optimization in response to consumer suggestions and few-shot prompting.

We’ve advanced a realistic resolution the use of Amazon Bedrock that robotically improves chat assistant responses in response to consumer suggestions. This resolution makes use of embeddings and few-shot prompting. To show the effectiveness of the answer, we used a publicly to be had consumer suggestions dataset. Then again, when making use of it inside of an organization, the fashion can use its personal suggestions knowledge supplied by way of its customers. With our check dataset, it presentations a three.67% build up in consumer pleasure ratings. The important thing steps come with:

  1. Retrieve a publicly to be had consumer suggestions dataset (for this case, Unified Feedback Dataset on Hugging Face).
  2. Create embeddings for queries to seize semantic an identical examples, the use of Amazon Titan Textual content Embeddings.
  3. Use an identical queries as examples in a few-shot instructed to generate optimized activates.
  4. Examine optimized activates in opposition to direct large language model (LLM) calls.
  5. Validate the advance in reaction high quality the use of a paired pattern t-test.

The next diagram is an outline of the device.

End-to-end workflow diagram showing how user feedback and queries are processed through embedding, semantic search, and LLM optimization

The important thing advantages of the use of Amazon Bedrock are:

  • 0 infrastructure control – Deploy and scale with out managing advanced device finding out (ML) infrastructure
  • Price-effective – Pay just for what you utilize with the Amazon Bedrock pay-as-you-go pricing fashion
  • Undertaking-grade safety – Use AWS integrated safety and compliance options
  • Simple integration – Combine seamlessly present programs and open supply equipment
  • More than one fashion choices – Get admission to more than a few foundation models (FMs) for various use instances

The next sections dive deeper into those steps, offering code snippets from the pocket book let’s say the method.

Necessities

Necessities for implementation come with an AWS account with Amazon Bedrock get entry to, Python 3.8 or later, and configured Amazon credentials.

Information assortment

We downloaded a consumer suggestions dataset from Hugging Face, llm-blender/Unified-Feedback. The dataset accommodates fields reminiscent of conv_A_user (the consumer question) and conv_A_rating (a binary ranking; 0 approach the consumer doesn’t love it and 1 approach the consumer likes it). The next code retrieves the dataset and specializes in the fields wanted for embedding era and suggestions research. It may be run in an Amazon Sagemaker pocket book or a Jupyter pocket book that has get entry to to Amazon Bedrock.

# Load the dataset and specify the subset
dataset = load_dataset("llm-blender/Unified-Comments", "synthetic-instruct-gptj-pairwise")

# Get admission to the 'educate' cut up
train_dataset = dataset["train"]

# Convert the dataset to Pandas DataFrame
df = train_dataset.to_pandas()

# Flatten the nested dialog constructions for conv_A and conv_B safely
df['conv_A_user'] = df['conv_A'].observe(lambda x: x[0]['content'] if len(x) > 0 else None)
df['conv_A_assistant'] = df['conv_A'].observe(lambda x: x[1]['content'] if len(x) > 1 else None)

# Drop the unique nested columns if they're now not wanted
df = df.drop(columns=['conv_A', 'conv_B'])

Information sampling and embedding era

To control the method successfully, we sampled 6,000 queries from the dataset. We used Amazon Titan Textual content Embeddings v2 to create embeddings for those queries, remodeling textual content into high-dimensional representations that let for similarity comparisons. See the next code:

import random import bedrock # Take a pattern of 6000 queries 
df = df.shuffle(seed=42).make a selection(vary(6000)) 
# AWS credentials
consultation = boto3.Consultation()
area = 'us-east-1'
# Initialize the S3 consumer
s3_client = boto3.consumer('s3')

boto3_bedrock = boto3.consumer('bedrock-runtime', area)
titan_embed_v2 = BedrockEmbeddings(
    consumer=boto3_bedrock, model_id="amazon.titan-embed-text-v2:0")
    
# Serve as to transform textual content to embeddings
def get_embeddings(textual content):
    reaction = titan_embed_v2.embed_query(textual content)
    go back reaction  # This will have to go back the embedding vector

# Observe the serve as to the 'instructed' column and retailer in a brand new column
df_test['conv_A_user_vec'] = df_test['conv_A_user'].observe(get_embeddings)

Few-shot prompting with similarity seek

For this phase, we took the next steps:

  1. Pattern 100 queries from the dataset for trying out. Sampling 100 queries is helping us run more than one trials to validate our resolution.
  2. Compute cosine similarity (measure of similarity between two non-zero vectors) between the embeddings of those check queries and the saved 6,000 embeddings.
  3. Make a choice the highest okay an identical queries to the check queries to function few-shot examples. We set Okay = 10 to steadiness between the computational potency and variety of the examples.

See the next code:

# Step 2: Outline cosine similarity serve as
def compute_cosine_similarity(embedding1, embedding2):
embedding1 = np.array(embedding1).reshape(1, -1) # Reshape to 2D array
embedding2 = np.array(embedding2).reshape(1, -1) # Reshape to 2D array
go back cosine_similarity(embedding1, embedding2)[0][0]

# Pattern question embedding
def get_matched_convo(question, df):
    query_embedding = get_embeddings(question)
    
    # Step 3: Compute similarity with each and every row within the DataFrame
    df['similarity'] = df['conv_A_user_vec'].observe(lambda x: compute_cosine_similarity(query_embedding, x))
    
    # Step 4: Kind rows in response to similarity rating (descending order)
    df_sorted = df.sort_values(by way of='similarity', ascending=False)
    
    # Step 5: Filter out or get most sensible matching rows (e.g., most sensible 10 suits)
    top_matches = df_sorted.head(10) 
    
    # Print most sensible suits
    go back top_matches[['conv_A_user', 'conv_A_assistant','conv_A_rating','similarity']]

This code supplies a few-shot context for each and every check question, the use of cosine similarity to retrieve the nearest suits. Those instance queries and suggestions function further context to lead the instructed optimization. The next serve as generates the few-shot instructed:

import boto3
from langchain_aws import ChatBedrock
from pydantic import BaseModel

# Initialize Amazon Bedrock consumer
bedrock_runtime = boto3.consumer(service_name="bedrock-runtime", region_name="us-east-1")

# Configure the fashion to make use of
model_id = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
model_kwargs = {
"max_tokens": 2048,
"temperature": 0.1,
"top_k": 250,
"top_p": 1,
"stop_sequences": ["nnHuman"],
}

# Create the LangChain Chat object for Bedrock
llm = ChatBedrock(
consumer=bedrock_runtime,
model_id=model_id,
model_kwargs=model_kwargs,
)

# Pydantic fashion to validate the output instructed
magnificence OptimizedPromptOutput(BaseModel):
optimized_prompt: str

# Serve as to generate the few-shot instructed
def generate_few_shot_prompt_only(user_query, nearest_examples):
    # Be sure that df_examples is a DataFrame
    if no longer isinstance(nearest_examples, pd.DataFrame):
    lift ValueError("Anticipated df_examples to be a DataFrame")
    # Assemble the few-shot instructed the use of nearest matching examples
    few_shot_prompt = "Listed below are examples of consumer queries, LLM responses, and suggestions:nn"
    for i in vary(len(nearest_examples)):
    few_shot_prompt += f"Person Question: {nearest_examples.loc[i,'conv_A_user']}n"
    few_shot_prompt += f"LLM Reaction: {nearest_examples.loc[i,'conv_A_assistant']}n"
    few_shot_prompt += f"Person Comments: {'👍' if nearest_examples.loc[i,'conv_A_rating'] == 1.0 else '👎'}nn"
    
    # Upload the consumer question for which the optimized instructed is needed
    few_shot_prompt += f"In accordance with those examples, generate a basic optimized instructed for the next consumer question:nn"
    few_shot_prompt += f"Person Question: {user_query}n"
    few_shot_prompt += "Optimized Steered: Supply a transparent, well-researched reaction in response to correct knowledge and credible assets. Keep away from pointless data or hypothesis."
    
    go back few_shot_prompt

The get_optimized_prompt serve as plays the next duties:

  1. The consumer question and an identical examples generate a few-shot instructed.
  2. We use the few-shot instructed in an LLM name to generate an optimized instructed.
  3. Make sure that the output is within the following layout the use of Pydantic.

See the next code:

# Serve as to generate an optimized instructed the use of Bedrock and go back best the instructed the use of Pydantic
def get_optimized_prompt(user_query, nearest_examples):
    # Generate the few-shot instructed
    few_shot_prompt = generate_few_shot_prompt_only(user_query, nearest_examples)
    
    # Name the LLM to generate the optimized instructed
    reaction = llm.invoke(few_shot_prompt)
    
    # Extract and validate best the optimized instructed the use of Pydantic
    optimized_prompt = reaction.content material # Mounted to get entry to the 'content material' characteristic of the AIMessage object
    optimized_prompt_output = OptimizedPromptOutput(optimized_prompt=optimized_prompt)
    
    go back optimized_prompt_output.optimized_prompt

# Instance utilization
question = "Is the United States greenback weakening over the years?"
nearest_examples = get_matched_convo(question, df_test)
nearest_examples.reset_index(drop=True, inplace=True)

# Generate optimized instructed
optimized_prompt = get_optimized_prompt(question, nearest_examples)
print("Optimized Steered:", optimized_prompt)

The make_llm_call_with_optimized_prompt serve as makes use of an optimized instructed and consumer question to make the LLM (Anthropic’s Claude Haiku 3.5) name to get the overall reaction:

# Serve as to make the LLM name the use of the optimized instructed and consumer question
def make_llm_call_with_optimized_prompt(optimized_prompt, user_query):
    start_time = time.time()
    # Mix the optimized instructed and consumer question to shape the enter for the LLM
    final_prompt = f"{optimized_prompt}nnUser Question: {user_query}nResponse:"

    # Make the decision to the LLM the use of the mixed instructed
    reaction = llm.invoke(final_prompt)
    
    # Extract best the content material from the LLM reaction
    final_response = reaction.content material  # Extract the reaction content material with out including any labels
    time_taken = time.time() - start_time
    go back final_response,time_taken

# Instance utilization
user_query = "The right way to develop avocado indoor?"
# Think 'optimized_prompt' has already been generated from the former step
final_response,time_taken = make_llm_call_with_optimized_prompt(optimized_prompt, user_query)
print("LLM Reaction:", final_response)

Comparative analysis of optimized and unoptimized activates

To match the optimized instructed with the baseline (on this case, the unoptimized instructed), we outlined a serve as that returned a consequence with out an optimized instructed for all of the queries within the analysis dataset:

def get_unoptimized_prompt_response(df_eval):
    # Iterate over the dataframe and make LLM calls
    for index, row in tqdm(df_eval.iterrows()):
        # Get the consumer question from 'conv_A_user'
        user_query = row['conv_A_user']
        
        # Make the Bedrock LLM name
        reaction = llm.invoke(user_query)
        
        # Retailer the reaction content material in a brand new column 'unoptimized_prompt_response'
        df_eval.at[index, 'unoptimized_prompt_response'] = reaction.content material  # Extract 'content material' from the reaction object
    
    go back df_eval

The next serve as generates the question reaction the use of similarity seek and intermediate optimized instructed era for all of the queries within the analysis dataset:

def get_optimized_prompt_response(df_eval):
    # Iterate over the dataframe and make LLM calls
    for index, row in tqdm(df_eval.iterrows()):
        # Get the consumer question from 'conv_A_user'
        user_query = row['conv_A_user']
        nearest_examples = get_matched_convo(user_query, df_test)
        nearest_examples.reset_index(drop=True, inplace=True)
        optimized_prompt = get_optimized_prompt(user_query, nearest_examples)
        # Make the Bedrock LLM name
        final_response,time_taken = make_llm_call_with_optimized_prompt(optimized_prompt, user_query)
        
        # Retailer the reaction content material in a brand new column 'unoptimized_prompt_response'
        df_eval.at[index, 'optimized_prompt_response'] = final_response  # Extract 'content material' from the reaction object
    
    go back df_eval

This code compares responses generated with and with out few-shot optimization, putting in place the knowledge for analysis.

LLM as choose and analysis of responses

To quantify reaction high quality, we used an LLM as a choose to attain the optimized and unoptimized responses for alignment with the consumer question. We used Pydantic right here to ensure the output sticks to the required development of 0 (LLM predicts the reaction gained’t be favored by way of the consumer) or 1 (LLM predicts the reaction will probably be favored by way of the consumer):

# Outline Pydantic fashion to put in force predicted suggestions as 0 or 1
magnificence FeedbackPrediction(BaseModel):
    predicted_feedback: conint(ge=0, le=1)  # Most effective permit values 0 or 1

# Serve as to generate few-shot instructed
def generate_few_shot_prompt(df_examples, unoptimized_response):
    few_shot_prompt = (
        "You might be an unbiased choose comparing the standard of LLM responses. "
        "In accordance with the consumer queries and the LLM responses supplied beneath, your activity is to decide whether or not the reaction is excellent or unhealthy, "
        "the use of the examples supplied. Go back 1 if the reaction is excellent (thumbs up) or 0 if the reaction is unhealthy (thumbs down).nn"
    )
    few_shot_prompt += "Under are examples of consumer queries, LLM responses, and consumer suggestions:nn"
    
    # Iterate over few-shot examples
    for i, row in df_examples.iterrows():
        few_shot_prompt += f"Person Question: {row['conv_A_user']}n"
        few_shot_prompt += f"LLM Reaction: {row['conv_A_assistant']}n"
        few_shot_prompt += f"Person Comments: {'👍' if row['conv_A_rating'] == 1 else '👎'}nn"
    
    # Give you the unoptimized reaction for suggestions prediction
    few_shot_prompt += (
        "Now, overview the next LLM reaction in response to the examples above. Go back 0 for unhealthy reaction or 1 for excellent reaction.nn"
        f"Person Question: {unoptimized_response}n"
        f"Predicted Comments (0 for 👎, 1 for 👍):"
    )
    go back few_shot_prompt

LLM-as-a-judge is a capability the place an LLM can choose the accuracy of a textual content the use of sure grounding examples. We have now used that capability right here to pass judgement on the variation between the outcome won from optimized and un-optimized instructed. Amazon Bedrock introduced an LLM-as-a-judge capability in December 2024 that can be utilized for such use instances. Within the following serve as, we show how the LLM acts as an evaluator, scoring responses in response to their alignment and pleasure for the entire analysis dataset:

# Serve as to expect suggestions the use of few-shot examples
def predict_feedback(df_examples, df_to_rate, response_column, target_col):
    # Create a brand new column to retailer predicted suggestions
    df_to_rate[target_col] = None
    
    # Iterate over each and every row within the dataframe to price
    for index, row in tqdm(df_to_rate.iterrows(), overall=len(df_to_rate)):
        # Get the unoptimized instructed reaction
        take a look at:
            time.sleep(2)
            unoptimized_response = row[response_column]

            # Generate few-shot instructed
            few_shot_prompt = generate_few_shot_prompt(df_examples, unoptimized_response)

            # Name the LLM to expect the suggestions
            reaction = llm.invoke(few_shot_prompt)

            # Extract the anticipated suggestions (assuming the fashion returns '0' or '1' as suggestions)
            predicted_feedback_str = reaction.content material.strip()  # Blank and extract the anticipated suggestions

            # Validate the suggestions the use of Pydantic
            take a look at:
                feedback_prediction = FeedbackPrediction(predicted_feedback=int(predicted_feedback_str))
                # Retailer the anticipated suggestions within the dataframe
                df_to_rate.at[index, target_col] = feedback_prediction.predicted_feedback
            except for (ValueError, ValidationError):
                # In case of invalid knowledge, assign default price (e.g., 0)
                df_to_rate.at[index, target_col] = 0
        except for:
            cross

    go back df_to_rate

Within the following instance, we repeated this procedure for 20 trials, shooting consumer pleasure ratings each and every time. The whole rating for the dataset is the sum of the consumer pleasure rating.

df_eval = df.drop(df_test.index).pattern(100)
df_eval['unoptimized_prompt_response'] = "" # Create an empty column to retailer responses
df_eval = get_unoptimized_prompt_response(df_eval)
df_eval['optimized_prompt_response'] = "" # Create an empty column to retailer responses
df_eval = get_optimized_prompt_response(df_eval)
Name the serve as to expect suggestions
df_with_predictions = predict_feedback(df_eval, df_eval, 'unoptimized_prompt_response', 'predicted_unoptimized_feedback')
df_with_predictions = predict_feedback(df_with_predictions, df_with_predictions, 'optimized_prompt_response', 'predicted_optimized_feedback')

# Calculate accuracy for unoptimized and optimized responses
original_success = df_with_predictions.conv_A_rating.sum()*100.0/len(df_with_predictions)
unoptimized_success  = df_with_predictions.predicted_unoptimized_feedback.sum()*100.0/len(df_with_predictions) 
optimized_success = df_with_predictions.predicted_optimized_feedback.sum()*100.0/len(df_with_predictions) 

# Show effects
print(f"Unique luck: {original_success:.2f}%")
print(f"Unoptimized Steered luck: {unoptimized_success:.2f}%")
print(f"Optimized Steered luck: {optimized_success:.2f}%")

End result research

The next line chart presentations the efficiency development of the optimized resolution over the unoptimized one. Inexperienced spaces point out sure enhancements, while crimson spaces display unfavourable adjustments.

Detailed performance analysis graph comparing optimized vs unoptimized solutions, highlighting peak 12% improvement at test case 7.5

As we accrued the results of 20 trials, we noticed that the imply of pleasure ratings from the unoptimized instructed used to be 0.8696, while the imply of pleasure ratings from the optimized instructed used to be 0.9063. Subsequently, our means outperforms the baseline by way of 3.67%.

In spite of everything, we ran a paired pattern t-test to match pleasure ratings from the optimized and unoptimized activates. This statistical check validated whether or not instructed optimization considerably progressed reaction high quality. See the next code:

from scipy import stats
# Pattern consumer pleasure ratings from the pocket book
unopt = [] #20 samples of ratings for the unoptimized promt
choose = [] # 20 samples of ratings for the optimized promt]
# Paired pattern t-test
t_stat, p_val = stats.ttest_rel(unopt, choose)
print(f"t-statistic: {t_stat}, p-value: {p_val}")

After operating the t-test, we were given a p-value of 0.000762, which is lower than 0.05. Subsequently, the efficiency spice up of optimized activates over unoptimized activates is statistically vital.

Key takeaways

We discovered the next key takeaways from this resolution:

  • Few-shot prompting improves question reaction – The use of extremely an identical few-shot examples results in vital enhancements in reaction high quality.
  • Amazon Titan Textual content Embeddings allows contextual similarity – The fashion produces embeddings that facilitate superb similarity searches.
  • Statistical validation confirms effectiveness – A p-value of 0.000762 signifies that our optimized manner meaningfully complements consumer pleasure.
  • Stepped forward industry have an effect on – This manner delivers measurable industry price via progressed AI assistant efficiency. The three.67% build up in pleasure ratings interprets to tangible results: HR departments can be expecting fewer coverage misinterpretations (decreasing compliance dangers), and customer support groups would possibly see a vital relief in escalated tickets. The answer’s skill to incessantly be told from suggestions creates a self-improving device that will increase ROI over the years with out requiring specialised ML experience or infrastructure investments.

Obstacles

Even supposing the device presentations promise, its efficiency closely will depend on the provision and quantity of consumer suggestions, particularly in closed-domain programs. In situations the place just a handful of suggestions examples are to be had, the fashion would possibly combat to generate significant optimizations or fail to seize the nuances of consumer personal tastes successfully. Moreover, the present implementation assumes that consumer suggestions is dependable and consultant of broader consumer wishes, which would possibly no longer at all times be the case.

Subsequent steps

Long run paintings may just center of attention on increasing the program to fortify multilingual queries and responses, enabling broader applicability throughout various consumer bases. Incorporating Retrieval Augmented Generation (RAG) tactics may just additional give a boost to context dealing with and accuracy for advanced queries. Moreover, exploring techniques to handle the constraints in low-feedback situations, reminiscent of artificial suggestions era or switch finding out, may just make the manner extra tough and flexible.

Conclusion

On this put up, we demonstrated the effectiveness of question optimization the use of Amazon Bedrock, few-shot prompting, and consumer suggestions to noticeably give a boost to reaction high quality. Via aligning responses with user-specific personal tastes, this manner alleviates the desire for pricey fashion fine-tuning, making it sensible for real-world programs. Its flexibility makes it appropriate for chat-based assistants throughout more than a few domain names, reminiscent of ecommerce, customer support, and hospitality, the place high quality, user-aligned responses are very important.

To be told extra, seek advice from the next sources:


In regards to the Authors

Tanay Chowdhury is a Information Scientist on the Generative AI Innovation Heart at Amazon Internet Services and products.

Parth Patwa is a Information Scientist on the Generative AI Innovation Heart at Amazon Internet Services and products.

Yingwei Yu is an Implemented Science Supervisor on the Generative AI Innovation Heart at Amazon Internet Services and products.



Source link

Leave a Comment