Evaluations

Evaluating accuracy in LawY

As legal professionals increasingly experiment with AI research tools, it is difficult for lawyers to understand the comparative accuracy of the different products available to them.

We understand the legal profession demands the highest standards of accuracy. As your trusted AI legal research assistant, we're committed to continually improving.

This is why, in addition to manual evaluation by our team of lawyers, we've introduced an automated system that evaluates our AI answers alongside those from other leading platforms.

Published: 8th August 2025

The methodology

We are excited to introduce an automated evaluation system that benchmarks AI answers from LawY and other AI platforms against a human-verified source of truth. This helps us deliver continually higher levels of accuracy, shape the direction of LawY's development, and offer our users transparency.

300 sample questions

We randomly selected 300 legal questions spanning all areas of law, focused primarily on Australia and the United Kingdom.

3 platforms to compare

Each question was answered by 3 different AI solutions: LawY, Gemini 2.5 Pro, and ChatGPT 4.1.

1 source of truth

'Golden Answers' were AI answers reviewed and corrected by an experienced 'lawyer-in-the-loop'.

LLM as a judge

An LLM compared each AI solution's answer against the 'Golden Answer', evaluating accuracy and providing comparative commentary.

Tally the result

We tallied the final results for each AI solution, giving each platform a total score that can be easily compared; a minimal sketch of this pipeline appears below.
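
To illustrate how these steps fit together, here is a minimal sketch of such an evaluation loop in Python. Everything in it is an illustrative assumption: the `llm.complete` interface, the judge prompt, and the data layout are hypothetical, not LawY's actual implementation.

```python
import json

# Hypothetical judge prompt: the LLM compares a candidate answer against
# the lawyer-verified Golden Answer and returns a structured verdict.
JUDGE_PROMPT = """You are assessing a legal answer against a lawyer-verified 'Golden Answer'.

Golden Answer:
{golden}

Candidate answer:
{candidate}

Respond with JSON only: {{"correct": true or false, "commentary": "<comparative notes>"}}"""


def judge(llm, golden: str, candidate: str) -> dict:
    """Ask the judge LLM whether a candidate answer matches the Golden Answer."""
    # `llm.complete` is an assumed interface that returns the model's raw text;
    # we also assume the reply is valid JSON, as the prompt requests.
    reply = llm.complete(JUDGE_PROMPT.format(golden=golden, candidate=candidate))
    return json.loads(reply)


def run_benchmark(questions: list[dict], platforms: dict, llm) -> dict:
    """Tally correct answers per platform across the sample questions.

    `questions` holds {"question": ..., "golden_answer": ...} records;
    `platforms` maps a platform name to a callable that returns its answer.
    """
    tally = {name: 0 for name in platforms}
    for q in questions:
        for name, answer_fn in platforms.items():
            verdict = judge(llm, q["golden_answer"], answer_fn(q["question"]))
            tally[name] += int(verdict["correct"])
    return tally
```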

Objectivity: We intentionally excluded all Golden Answers from LawY's knowledge base of verified answers during testing. This prevented any unfair advantage and ensured our evaluations reflect genuine performance.

Verification: To confirm the accuracy of the LLM's evaluations during these comparisons, our team of lawyers verified the LLM's judgments on 183 of LawY's answers within their respective areas of legal expertise.

Reasoning

Why this methodology is appropriate

This methodology provides a practical way for users to compare LawY's performance against other leading platforms, helping them determine the most suitable tool for their needs. While there are limitations (noted below), automated testing remains the only efficient and cost-effective way to benchmark these tools against one another regularly.

We're committed to publishing benchmarking results regularly to offer you transparency and to continue to improve LawY so that you have unparalleled accuracy.

Limitations

The limitations of this methodology

LLMs, like any AI tool, are prone to occasional errors. In this benchmarking process, some AI answers from every platform contained inaccuracies or failed to properly reflect information in the source of truth. Similarly, since we use an LLM to compare the answers, the judge's assessments may sometimes be inaccurate.

The source-of-truth answers were selected at random and were accurate as of their verification date in LawY. However, laws may have changed since these verifications were completed. Such changes could cause the LLM judge to incorrectly mark AI answers from one or more of the evaluated platforms.

The results

Our latest evaluation results show that LawY's AI answers were measured as the most correct when compared with leading platforms, including Gemini 2.5 Pro and ChatGPT 4.1. 'Correct' was defined as how closely each test answer aligned with the lawyer-verified 'Golden Answer'.

Note: 'Golden Answers' were excluded from the LawY database for the duration of testing, to ensure fair evaluations.

LawY: 77% correct (232/300)
Gemini 2.5 Pro: 61% correct (182/300)
ChatGPT 4.1: 57% correct (172/300)
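
As a worked example of the scoring arithmetic (a sketch using only the tallies reported above), each percentage is the platform's correct count divided by the 300 questions, rounded to the nearest whole number:

```python
# Overall tallies from the results above (300 questions in total).
tallies = {"LawY": 232, "Gemini 2.5 Pro": 182, "ChatGPT 4.1": 172}
TOTAL = 300

for platform, correct in tallies.items():
    print(f"{platform}: {correct}/{TOTAL} = {correct / TOTAL:.0%}")

# Output:
# LawY: 232/300 = 77%
# Gemini 2.5 Pro: 182/300 = 61%
# ChatGPT 4.1: 172/300 = 57%
```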

See the full results for Australia and the United Kingdom.

Breakdown

View the results for Australia

LawY: 76% correct (208/272)
Gemini 2.5 Pro: 61% correct (166/272)
ChatGPT 4.1: 58% correct (157/272)

Breakdown

View the results for the United Kingdom

LawY: 86% correct (24/28)
Gemini 2.5 Pro: 57% correct (16/28)
ChatGPT 4.1: 54% correct (15/28)