Use custom metrics to evaluate your generative AI application with Amazon Bedrock


With Amazon Bedrock Evaluations, you can evaluate foundation models (FMs) and Retrieval Augmented Generation (RAG) systems, whether they are hosted on Amazon Bedrock or any other model or RAG system hosted elsewhere, including Amazon Bedrock Knowledge Bases or multi-cloud and on-premises deployments. We recently announced the general availability of the large language model (LLM)-as-a-judge technique in model evaluation and the new RAG evaluation tool, also powered by an LLM-as-a-judge behind the scenes. These tools are already empowering organizations to systematically evaluate FMs and RAG systems with enterprise-grade tooling. We also mentioned that these evaluation tools don't have to be limited to models or RAG systems hosted on Amazon Bedrock; with the bring your own inference (BYOI) responses feature, you can evaluate models or applications anywhere, as long as you follow the input formatting requirements for either offering.

The LLM-as-a-judge technique powering these evaluations enables automated, human-like evaluation quality at scale, using FMs to assess quality and responsible AI dimensions without manual intervention. With built-in metrics like correctness (factual accuracy), completeness (response thoroughness), faithfulness (hallucination detection), and responsible AI metrics such as harmfulness and answer refusal, you and your team can evaluate models hosted on Amazon Bedrock and knowledge bases natively, or use BYOI responses from your custom-built systems.

Amazon Bedrock Evaluations offers an extensive list of built-in metrics for both evaluation tools, but there are times when you might want to define these evaluation metrics differently, or create entirely new metrics that are relevant to your use case. For example, you might want to define a metric that evaluates an application response's adherence to your specific brand voice, or classify responses according to a custom categorical rubric. You might want to use numerical scoring or categorical scoring for various purposes. For these reasons, you need a way to use custom metrics in your evaluations.

Now with Amazon Bedrock, you can develop custom evaluation metrics for both model and RAG evaluations. This capability extends the LLM-as-a-judge framework that drives Amazon Bedrock Evaluations.

In this post, we demonstrate how to use custom metrics in Amazon Bedrock Evaluations to measure and improve the performance of your generative AI applications according to your specific business requirements and evaluation criteria.

Overview

Custom metrics in Amazon Bedrock Evaluations offer the following features:

  • Simplified getting started experience – Pre-built starter templates are available on the AWS Management Console based on our industry-tested built-in metrics, with options to create metrics from scratch for specific evaluation criteria.
  • Flexible scoring systems – Support is available for both quantitative (numerical) and qualitative (categorical) scoring to create ordinal metrics, nominal metrics, or even use evaluation tools for classification tasks.
  • Streamlined workflow management – You can save custom metrics for reuse across multiple evaluation jobs or import previously defined metrics from JSON files.
  • Dynamic content integration – With built-in template variables (for example, {{prompt}}, {{prediction}}, and {{context}}), you can seamlessly inject dataset content and model outputs into evaluation prompts.
  • Customizable output control – You can use our recommended output schema for consistent results, with advanced options to define custom output formats for specialized use cases.

Custom metrics give you unprecedented control over how you measure AI system performance, so you can align evaluations with your specific business requirements and use cases. Whether assessing factuality, coherence, helpfulness, or domain-specific criteria, custom metrics in Amazon Bedrock enable more meaningful and actionable evaluation insights.

In the following sections, we walk through the steps to create a job with model evaluation and custom metrics using both the Amazon Bedrock console and the Python SDK and APIs.

Supported data formats

In this section, we review some important data formats.

Judge prompt uploading

To upload your previously saved custom metrics into an evaluation job, follow the JSON format in the following examples.

The following code illustrates a definition with a numerical scale:

{
    "customMetricDefinition": {
        "metricName": "my_custom_metric",
        "directions": "The entire tradition metric suggested together with no less than one {{enter variable}}",
        "ratingScale": [
            {
                "definition": "first rating definition",
                "value": {
                    "floatValue": 3
                }
            },
            {
                "definition": "second rating definition",
                "value": {
                    "floatValue": 2
                }
            },
            {
                "definition": "third rating definition",
                "value": {
                    "floatValue": 1
                }
            }
        ]
    }
}

The following code illustrates a definition with a string scale:

{
    "customMetricDefinition": {
        "metricName": "my_custom_metric",
        "directions": "The entire tradition metric suggested together with no less than one {{enter variable}}",
        "ratingScale": [
            {
                "definition": "first rating definition",
                "value": {
                    "stringValue": "first value"
                }
            },
            {
                "definition": "second rating definition",
                "value": {
                    "stringValue": "second value"
                }
            },
            {
                "definition": "third rating definition",
                "value": {
                    "stringValue": "third value"
                }
            }
        ]
    }
}

The following code illustrates a definition with no scale:

{
    "customMetricDefinition": {
        "metricName": "my_custom_metric",
        "directions": "The entire tradition metric suggested together with no less than one {{enter variable}}"
    }
}

For more information on defining a judge prompt with no scale, see the best practices section later in this post.

Model evaluation dataset format

When using LLM-as-a-judge, only one model can be evaluated per evaluation job. Consequently, you must provide a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLM-as-a-judge in model evaluation. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}
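
For illustration, the following is a hypothetical BYOI record in this format (pretty-printed here for readability; in the JSONL file, each record occupies a single line). The prompt, responses, and model identifier are placeholder values.

{
    "prompt": "What is the capital of France?",
    "referenceResponse": "The capital of France is Paris.",
    "category": "Geography",
    "modelResponses": [
        {
            "response": "Paris is the capital of France.",
            "modelIdentifier": "my-custom-model"
        }
    ]
}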

RAG evaluation dataset format

We updated the evaluation job input dataset format to be even more flexible for RAG evaluation. Now, you can bring referenceContexts, which are the expected retrieved passages, so you can compare your actual retrieved contexts to your expected retrieved contexts. You can find the new referenceContexts field in the updated JSONL schema for RAG evaluation:

{
    "conversationTurns": [{
        "prompt": {
            "content": [{
                "text": string
            }]
        },
        "referenceResponses": [{
            "content": [{
                "text": string
            }]
        }],
        "referenceContexts"?: [{
            "content": [{
                "text": string
            }]
        }],
        "output": {
            "text": string,
            "modelIdentifier"?: string,
            "knowledgeBaseIdentifier": string,
            "retrievedPassages": {
                "retrievalResults": [{
                    "name"?: string,
                    "content": {
                        "text": string
                    },
                    "metadata"?: {
                        [key: string]: string
                    }
                }]
            }
        }
    }]
}
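
For illustration, the following is a hypothetical BYOI record in this format (pretty-printed here; each record occupies a single line in the JSONL file). The question, passages, and knowledge base identifier are placeholder values.

{
    "conversationTurns": [{
        "prompt": {
            "content": [{"text": "What is the capital of France?"}]
        },
        "referenceResponses": [{
            "content": [{"text": "The capital of France is Paris."}]
        }],
        "referenceContexts": [{
            "content": [{"text": "Paris has been the capital of France since 987."}]
        }],
        "output": {
            "text": "Paris is the capital of France.",
            "knowledgeBaseIdentifier": "my-rag-system",
            "retrievedPassages": {
                "retrievalResults": [{
                    "content": {"text": "Paris is the capital and most populous city of France."}
                }]
            }
        }
    }]
}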

Variables for data injection into judge prompts

To make sure that your data is injected into the judge prompts in the right place, use the variables from the following tables. We have also included a guide to show you where the evaluation tool will pull data from your input file, if applicable. If you bring your own inference responses to the evaluation job, we use that data from your input file; if you don't bring your own inference responses, we call the Amazon Bedrock model or knowledge base and prepare the responses for you.

The following table summarizes the variables for model evaluation.

Plain name | Variable | Input dataset JSONL key | Required or optional
Prompt | {{prompt}} | prompt | Optional
Response | {{prediction}} | For a BYOI job: modelResponses.response. If you don't bring your own inference responses, the evaluation job calls the model and prepares this data for you. | Required
Ground truth response | {{ground_truth}} | referenceResponse | Optional

The following table summarizes the variables for RAG evaluation (retrieve only).

Plain name | Variable | Input dataset JSONL key | Required or optional
Prompt | {{prompt}} | prompt | Optional
Ground truth response | {{ground_truth}} | For a BYOI job: output.retrievedResults.retrievalResults. If you don't bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you. | Optional
Retrieved passage | {{context}} | For a BYOI job: output.retrievedResults.retrievalResults. If you don't bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you. | Required
Ground truth retrieved passage | {{reference_contexts}} | referenceContexts | Optional

The following table summarizes the variables for RAG evaluation (retrieve and generate).

Plain name | Variable | Input dataset JSONL key | Required or optional
Prompt | {{prompt}} | prompt | Optional
Response | {{prediction}} | For a BYOI job: output.text. If you don't bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you. | Required
Ground truth response | {{ground_truth}} | referenceResponses | Optional
Retrieved passage | {{context}} | For a BYOI job: output.retrievedResults.retrievalResults. If you don't bring your own inference responses, the evaluation job calls the Amazon Bedrock knowledge base and prepares this data for you. | Optional
Ground truth retrieved passage | {{reference_contexts}} | referenceContexts | Optional

Prerequisites

To use the LLM-as-a-judge model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

Create a model evaluation job with custom metrics using Amazon Bedrock Evaluations

Complete the following steps to create a job with model evaluation and custom metrics using Amazon Bedrock Evaluations:

  1. On the Amazon Bedrock console, choose Evaluations in the navigation pane and choose the Models tab.
  2. In the Model evaluation section, on the Create dropdown menu, choose Automatic: model as a judge.
  3. For the Model evaluation details, enter an evaluation name and an optional description.
  4. For Evaluator model, choose the model you want to use for automatic evaluation.
  5. For Inference source, select the source and choose the model you want to evaluate.

For this example, we chose Claude 3.5 Sonnet as the evaluator model, Bedrock models as our inference source, and Claude 3.5 Haiku as our model to evaluate.

  6. The console will display the default metrics for the evaluator model you chose. You can select other metrics as needed.
  7. In the Custom Metrics section, we create a new metric called "Comprehensiveness." Use the template provided and modify it based on your metric. You can use the following variables to define the metric, where only {{prediction}} is mandatory:
    1. prompt
    2. prediction
    3. ground_truth

The following is the metric we defined in full:

Your role is to judge the comprehensiveness of an answer based on the question and 
the prediction. Assess the quality, accuracy, and helpfulness of the language model response,
 and use these to judge how comprehensive the response is. Award higher scores to responses
 that are detailed and thoughtful.

Carefully evaluate the comprehensiveness of the LLM response for the given query (prompt)
 against all specified criteria. Assign a single overall score that best represents the 
comprehensiveness, and provide a brief explanation justifying your score, referencing 
specific strengths and weaknesses observed.

When evaluating the response quality, consider the following rubrics:
- Accuracy: Factual correctness of the information provided
- Completeness: Coverage of important aspects of the query
- Clarity: Clear organization and presentation of the information
- Helpfulness: Practical utility of the response to the user

Evaluate the following:

Query:
{{prompt}}

Response to evaluate:
{{prediction}}

  8. Create the output schema and any additional metrics. Here, we define a scale that awards maximum points (10) if the response is very comprehensive, and 1 if the response is not comprehensive at all.
  9. For Datasets, enter your input and output locations in Amazon S3.
  10. For Amazon Bedrock IAM role – Permissions, select Use an existing service role and choose a role.
  11. Choose Create and wait for the job to complete.

Considerations and best practices

When using the output schema of the custom metrics, note the following:

  • If you use the built-in output schema (recommended), don't add your grading scale into the main judge prompt. The evaluation service automatically concatenates your judge prompt instructions with your defined output schema rating scale and some structured output instructions (unique to each judge model) behind the scenes. This is so the evaluation service can parse the judge model's results, display them on the console in graphs, and calculate average values of numerical scores.
  • The fully concatenated judge prompts are visible in the Preview window if you are using the Amazon Bedrock console to construct your custom metrics. Because judge LLMs are inherently stochastic, there might be some responses we can't parse, display on the console, and use in your average score calculations. However, the raw judge responses are always loaded into your S3 output file, even if the evaluation service can't parse the response score from the judge model.
  • If you don't use the built-in output schema feature (we recommend you use it rather than ignore it), then you are responsible for providing your rating scale in the judge prompt instructions body. However, the evaluation service won't add structured output instructions and won't parse the results to show graphs; you will see the full judge output plaintext results on the console without graphs, and the raw data will still be in your S3 bucket.

Create a model evaluation job with custom metrics using the Python SDK and APIs

To use the Python SDK to create a model evaluation job with custom metrics, follow these steps (or refer to our example notebook):

  1. Set up the required configurations, which should include your model identifier for the default metrics and custom metrics evaluator, an IAM role with appropriate permissions, Amazon S3 paths for input data containing your inference responses, and an output location for results:
    import boto3
    import time
    from datetime import datetime
    
    # Configure model settings
    evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    generator_model = "amazon.nova-lite-v1:0"
    custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    role_arn = "arn:aws:iam:::role/"
    BUCKET_NAME = ""
    
    # Specify S3 locations
    input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
    output_path = f"s3://{BUCKET_NAME}/evaluation_output/"
    
    # Create Bedrock client
    # NOTE: You can change the Region name to the Region of your choosing.
    bedrock_client = boto3.client('bedrock', region_name="us-east-1") 

  2. To define a custom metric for model evaluation, create a JSON structure with a customMetricDefinition key. Include your metric's name, write detailed evaluation instructions incorporating template variables (such as {{prompt}} and {{prediction}}), and define your ratingScale array with assessment values using either numerical scores (floatValue) or categorical labels (stringValue). This properly formatted JSON schema enables Amazon Bedrock to evaluate model outputs consistently according to your specific criteria.
    comprehensiveness_metric ={
        "customMetricDefinition": {
            "title": "comprehensiveness",
            "directions": """Your position is to pass judgement on the comprehensiveness of an 
    solution according to the query and the prediction. Assess the standard, accuracy, 
    and helpfulness of language mannequin reaction, and use those to pass judgement on how complete
     the reaction is. Award upper rankings to responses which might be detailed and considerate.
    
    In moderation evaluation the comprehensiveness of the LLM reaction for the given question (suggested)
     in opposition to all specified standards. Assign a unmarried total ranking that easiest represents the 
    comprehensivenss, and supply a temporary rationalization justifying your ranking, referencing 
    particular strengths and weaknesses seen.
    
    When comparing the reaction high quality, imagine the next rubrics:
    - Accuracy: Factual correctness of knowledge equipped
    - Completeness: Protection of necessary sides of the question
    - Readability: Transparent group and presentation of knowledge
    - Helpfulness: Sensible software of the reaction to the person
    
    Review the next:
    
    Question:
    {{suggested}}
    
    Reaction to judge:
    {{prediction}}""",
            "ratingScale": [
                {
                    "definition": "Very comprehensive",
                    "value": {
                        "floatValue": 10
                    }
                },
                {
                    "definition": "Mildly comprehensive",
                    "value": {
                        "floatValue": 3
                    }
                },
                {
                    "definition": "Not at all comprehensive",
                    "value": {
                        "floatValue": 1
                    }
                }
            ]
        }
    }

  3. To create a model evaluation job with custom metrics, use the create_evaluation_job API and include your custom metric in the customMetricConfig section, specifying both built-in metrics (such as Builtin.Correctness) and your custom metric in the metricNames array. Configure the job with your generator model, evaluator model, and the correct Amazon S3 paths for the input dataset and output results.
    # Create the model evaluation job
    model_eval_job_name = f"model-evaluation-custom-metrics{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    model_eval_job = bedrock_client.create_evaluation_job(
        jobName=model_eval_job_name,
        jobDescription="Review mannequin efficiency with tradition comprehensiveness metric",
        roleArn=role_arn,
        applicationType="ModelEvaluation",
        inferenceConfig={
            "fashions": [{
                "bedrockModel": {
                    "modelIdentifier": generator_model
                }
            }]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automatic": {
                "datasetMetricConfigs": [{
                    "taskType": "General",
                    "dataset": {
                        "name": "ModelEvalDataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Coherence",
                        "Builtin.Relevance",
                        "Builtin.FollowingInstructions",
                        "comprehensiveness"
                    ]
                }],
                "customMetricConfig": {
                    "customMetrics": [
                        comprehensiveness_metric
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [{
                            "modelIdentifier": custom_metrics_evaluator_model
                        }]
                    }
                },
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                }
            }
        }
    )
    
    print(f"Created mannequin analysis task: {model_eval_job_name}")
    print(f"Activity ID: {model_eval_job['jobArn']}")

  4. After submitting the evaluation job, monitor its status with get_evaluation_job and access the results at your specified Amazon S3 location when the job is complete, including the standard and custom metric performance data. A minimal polling sketch follows this list.
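
The following is a minimal status-polling sketch, assuming the bedrock_client and model_eval_job variables from the previous steps; the sleep interval and the set of terminal status values shown are illustrative.

    # Minimal polling sketch (assumes bedrock_client and model_eval_job from the previous steps)
    job_arn = model_eval_job['jobArn']
    
    while True:
        job = bedrock_client.get_evaluation_job(jobIdentifier=job_arn)
        status = job['status']
        print(f"Current status: {status}")
        # Stop polling once the job reaches a terminal state
        if status in ("Completed", "Failed", "Stopped"):
            break
        time.sleep(60)
    
    print(f"Evaluation job finished with status: {status}")
    print(f"Results were written under: {output_path}")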

Create a RAG system evaluation with custom metrics using Amazon Bedrock Evaluations

In this example, we walk through a RAG system evaluation with a combination of built-in metrics and custom evaluation metrics on the Amazon Bedrock console. Complete the following steps:

  1. On the Amazon Bedrock console, choose Evaluations in the navigation pane.
  2. On the RAG tab, choose Create.
  3. For the RAG evaluation details, enter an evaluation name and an optional description.
  4. For Evaluator model, choose the model you want to use for automatic evaluation. The evaluator model selected here will be used to calculate default metrics if selected. For this example, we chose Claude 3.5 Sonnet as the evaluator model.
  5. Include any optional tags.
  6. For Inference source, select the source. Here, you have the option to choose between Bedrock Knowledge Bases and Bring your own inference responses. If you're using Amazon Bedrock Knowledge Bases, you will need to choose a previously created knowledge base or create a new one. For BYOI responses, you can bring the prompt dataset, context, and output from a RAG system. For this example, we chose Bedrock Knowledge Base as our inference source.
  7. Specify the evaluation type, response generator model, and built-in metrics. You can choose between a combined retrieval and response evaluation or a retrieval-only evaluation, with options to use default metrics, custom metrics, or both for your RAG evaluation. The response generator model is only required when using an Amazon Bedrock knowledge base as the inference source. For the BYOI configuration, you can proceed without a response generator. For this example, we selected Retrieval and response generation as our evaluation type and chose Nova Lite 1.0 as our response generator model.
  8. In the Custom Metrics section, choose your evaluator model. We selected Claude 3.5 Sonnet v1 as our evaluator model for custom metrics.
  9. Choose Add custom metrics.
  10. Create your new metric. For this example, we create a new custom metric for our RAG evaluation called information_comprehensiveness. This metric evaluates how thoroughly and completely the response addresses the query using the retrieved information. It measures the extent to which the response extracts and incorporates relevant information from the retrieved passages to provide a comprehensive answer.
  11. You can choose between importing a JSON file, using a preconfigured template, or creating a custom metric with full configuration control. For example, you can choose the preconfigured templates for the default metrics and change the scoring system or rubric. For our information_comprehensiveness metric, we choose the custom option, which allows us to enter our evaluator prompt directly.
  12. For Instructions, enter your prompt. For example:
    Your role is to evaluate how comprehensively the response addresses the query 
    using the retrieved information. Assess whether the response provides a thorough 
    treatment of the subject by effectively utilizing the available retrieved passages.
    
    Carefully evaluate the comprehensiveness of the RAG response for the given query
     against all specified criteria. Assign a single overall score that best represents
     the comprehensiveness, and provide a brief explanation justifying your score, 
    referencing specific strengths and weaknesses observed.
    
    When evaluating response comprehensiveness, consider the following rubrics:
    - Coverage: Does the response utilize the key relevant information from the retrieved
     passages?
    - Depth: Does the response provide sufficient detail on important aspects from the
     retrieved information?
    - Context utilization: How effectively does the response leverage the available
     retrieved passages?
    - Information synthesis: Does the response combine retrieved information to create
     a thorough treatment?
    
    Evaluate the following:
    
    Query: {{prompt}}
    
    Retrieved passages: {{context}}
    
    Response to evaluate: {{prediction}}

  13. Enter your output schema to define how the custom metric results will be structured, visualized, normalized (if applicable), and explained by the model.

If you use the built-in output schema (recommended), don't add your rating scale into the main judge prompt. The evaluation service automatically concatenates your judge prompt instructions with your defined output schema rating scale and some structured output instructions (unique to each judge model) behind the scenes so that your judge model results can be parsed. The fully concatenated judge prompts are visible in the Preview window if you are using the Amazon Bedrock console to construct your custom metrics.

  14. For Dataset and evaluation results S3 location, enter your input and output locations in Amazon S3.
  15. For Amazon Bedrock IAM role – Permissions, select Use an existing service role and choose your role.
  16. Choose Create and wait for the job to complete.

Start a RAG evaluation job with custom metrics using the Python SDK and APIs

To use the Python SDK to create a RAG evaluation job with custom metrics, follow these steps (or refer to our example notebook):

  1. Set up the required configurations, which should include your model identifier for the default metrics and custom metrics evaluator, an IAM role with appropriate permissions, your knowledge base ID, Amazon S3 paths for input data containing your inference responses, and an output location for results:
    import boto3
    import time
    from datetime import datetime
    
    # Configure knowledge base and model settings
    knowledge_base_id = ""
    evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    generator_model = "amazon.nova-lite-v1:0"
    custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    role_arn = "arn:aws:iam:::position/"
    BUCKET_NAME = ""
    
    # Specify S3 places
    input_data = f"s3://{BUCKET_NAME}/evaluation_data/enter.jsonl"
    output_path = f"s3://{BUCKET_NAME}/evaluation_output/"
    
    # Configure retrieval settings
    num_results = 10
    search_type = "HYBRID"
    
    # Create Bedrock client
    # NOTE: You can change the Region name to the Region of your choosing
    bedrock_client = boto3.client('bedrock', region_name="us-east-1") 

  2. To define a custom metric for RAG evaluation, create a JSON structure with a customMetricDefinition key. Include your metric's name, write detailed evaluation instructions incorporating template variables (such as {{prompt}}, {{context}}, and {{prediction}}), and define your ratingScale array with assessment values using either numerical scores (floatValue) or categorical labels (stringValue). This properly formatted JSON schema enables Amazon Bedrock to evaluate responses consistently according to your specific criteria.
    # Define our custom information_comprehensiveness metric
    information_comprehensiveness_metric = {
        "customMetricDefinition": {
            "title": "information_comprehensiveness",
            "directions": """
            Your position is to judge how comprehensively the reaction addresses the 
    question the use of the retrieved knowledge. 
            Assess whether or not the reaction supplies a radical remedy of the topic
    by way of successfully using the to be had retrieved passages.
    
    In moderation evaluation the comprehensiveness of the RAG reaction for the given question
    in opposition to all specified standards. 
    Assign a unmarried total ranking that easiest represents the comprehensiveness, and 
    supply a temporary rationalization justifying your ranking, referencing particular strengths
    and weaknesses seen.
    
    When comparing reaction comprehensiveness, imagine the next rubrics:
    - Protection: Does the reaction make the most of the important thing related knowledge from the 
    retrieved passages?
    - Intensity: Does the reaction supply enough element on necessary sides from 
    the retrieved knowledge?
    - Context usage: How successfully does the reaction leverage the to be had 
    retrieved passages?
    - Knowledge synthesis: Does the reaction mix retrieved knowledge to 
    create a radical remedy?
    
    Review the use of the next:
    
    Question: {{suggested}}
    
    Retrieved passages: {{context}}
    
    Reaction to judge: {{prediction}}
    """,
            "ratingScale": [
                {
                    "definition": "Very comprehensive",
                    "value": {
                        "floatValue": 3
                    }
                },
                {
                    "definition": "Moderately comprehensive",
                    "value": {
                        "floatValue": 2
                    }
                },
                {
                    "definition": "Minimally comprehensive",
                    "value": {
                        "floatValue": 1
                    }
                },
                {
                    "definition": "Not at all comprehensive",
                    "value": {
                        "floatValue": 0
                    }
                }
            ]
        }
    }

  3. To create a RAG evaluation job with custom metrics, use the create_evaluation_job API and include your custom metric in the customMetricConfig section, specifying both built-in metrics (Builtin.Correctness) and your custom metric in the metricNames array. Configure the job with your knowledge base ID, generator model, evaluator model, and the correct Amazon S3 paths for the input dataset and output results.
    # Create the evaluation job
    retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    retrieve_generate_job = bedrock_client.create_evaluation_job(
        jobName=retrieve_generate_job_name,
        jobDescription="Review retrieval and technology with tradition metric",
        roleArn=role_arn,
        applicationType="RagEvaluation",
        inferenceConfig={
            "ragConfigs": [{
                "knowledgeBaseConfig": {
                    "retrieveAndGenerateConfig": {
                        "type": "KNOWLEDGE_BASE",
                        "knowledgeBaseConfiguration": {
                            "knowledgeBaseId": knowledge_base_id,
                            "modelArn": generator_model,
                            "retrievalConfiguration": {
                                "vectorSearchConfiguration": {
                                    "numberOfResults": num_results
                                }
                            }
                        }
                    }
                }
            }]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automatic": {
                "datasetMetricConfigs": [{
                    "taskType": "General",
                    "dataset": {
                        "name": "RagDataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Helpfulness",
                        "information_comprehensiveness"
                    ]
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                },
                "customMetricConfig": {
                    "customMetrics": [
                        information_comprehensiveness_metric
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [{
                            "modelIdentifier": custom_metrics_evaluator_model
                        }]
                    }
                }
            }
        }
    )
    
    print(f"Created analysis task: {retrieve_generate_job_name}")
    print(f"Activity ID: {retrieve_generate_job['jobArn']}")

  4. After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the Amazon S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions, including custom metrics. The sketch that follows shows one way to list the result files.
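
The following is a minimal sketch for locating the output files once the job is complete, assuming the BUCKET_NAME and output_path variables from the setup step; the S3 client and prefix handling shown are illustrative.

    # Minimal results-listing sketch (assumes BUCKET_NAME and output_path from the setup step)
    s3_client = boto3.client('s3', region_name="us-east-1")
    
    # Derive the key prefix from the configured output path
    prefix = output_path.replace(f"s3://{BUCKET_NAME}/", "")
    response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=prefix)
    
    # Print the result objects, which include the raw per-record judge outputs
    for obj in response.get('Contents', []):
        print(obj['Key'])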

Custom metrics are only available for LLM-as-a-judge. At the time of writing, we don't accept custom AWS Lambda functions or endpoints for code-based custom metric evaluators. Human-based model evaluation has supported custom metric definition since its launch in November 2023.

Clean up

To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of this post.
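
As one example, the following is a minimal sketch for emptying and deleting the evaluation bucket with boto3; it assumes the BUCKET_NAME variable from the earlier setup and that you no longer need any of its contents. Delete notebook instances and any other resources through their respective consoles or APIs.

import boto3

# Empty and delete the evaluation bucket (assumes BUCKET_NAME from the setup steps)
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)
bucket.objects.all().delete()  # remove evaluation inputs and outputs
bucket.delete()                # remove the now-empty bucket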

Conclusion

The addition of custom metrics to Amazon Bedrock Evaluations empowers organizations to define their own evaluation criteria for generative AI systems. By extending the LLM-as-a-judge framework with custom metrics, businesses can now measure what matters for their specific use cases alongside built-in metrics. With support for both numerical and categorical scoring systems, these custom metrics enable consistent assessment aligned with organizational standards and goals.

As generative AI becomes increasingly integrated into business processes, the ability to evaluate outputs against custom-defined criteria is essential for maintaining quality and driving continuous improvement. We encourage you to explore these new capabilities through the Amazon Bedrock console and the API examples provided, and discover how personalized evaluation frameworks can enhance your AI systems' performance and business impact.


About the Authors

Shreyas Subramanian is a Principal Data Scientist and helps customers use generative AI and deep learning to solve their business challenges with AWS services. Shreyas has a background in large-scale optimization and ML, and in the use of ML and reinforcement learning for accelerating optimization tasks.

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting-edge innovations in foundation models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.


