ChatGPT and Generative AI Legal Research Guide

Comparison of Responses of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard

When asked the same question (prompt), ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard all produce different results. They are all distinct systems with varying capabilities, and it's important to be aware of these differences and consider them when evaluating the reliability, accuracy, and usefulness of the answers you receive.

It may be beneficial to try multiple systems to generate an answer that meets your specific needs.

Systems Comparison: Providing Information About a Statute

To test the ability of various systems to provide information about a specific statute, researchers asked them about A.R.S. 13-3603.02. This Arizona statute prohibits abortion based on certain criteria, such as sex selection and race.

Prompt: What is A.R.S. 13-3603.02 about?

ChatGPT-3.5 Response

Grade = F-

The system got the topic of the statute wrong (voyeurism instead of abortion) and then made up the text of the statute. It did, however, provide the correct URL for the statute. 


ChatGPT-4 Response

Grade = F+

In its quest to live up to the hype about providing accurate information (and not hallucinating), the system initially denied any knowledge of the statute. Then, when informed about the subject of the statute, it recommended searching the Arizona legislature's website.

Though the system received a low grade (F+) for its performance, it did receive some credit (the +) for admitting when it didn't know the answer.


Bing Chat Response

Grade = A+

Bing Chat did much better than both versions of ChatGPT. The system correctly identified the statute's topic and provided the accurate text and URL.


Bard Response

Grade = C

Bard correctly summarized the statute, but was unable to provide the text of the statute or its URL.

Systems Comparison: Analyzing a Legal Fact Pattern and Writing a Legal Memorandum

By examining how each system responds to this task, we can gain insight into their ability to interpret and analyze complex legal information, as well as their potential to assist in legal research and analysis.

Prompt: I have a question involving Arizona DUI law. Please write a 1000 word legal memorandum on the following. The facts are that a woman named Drunka was arrested and charged with a DUI after a police officer found her drunk and sleeping in the back seat of her car while parked illegally on the side of the road. Her keys were in the ignition. She is not planning to deny she was drunk. Please provide a detailed 1000 word answer about her likelihood of conviction based on these facts. In your answer, please cite relevant statutes and at least 3 relevant Arizona cases. Please put the legal citations in Bluebook format, if possible.


ChatGPT-3.5 Response

Grade = C-

The system did a poor job of researching and providing accurate information about relevant Arizona primary materials. Although ChatGPT accurately cited and summarized the statute, the case names and citations were all incorrect. The citation for State v. Zaragoza retrieved an irrelevant murder case, although there is a State v. Zaragoza DUI case with the same facts/holding described by ChatGPT. Similarly, although the citation for State v. Loveall retrieved a non-DUI case with different party names, it seems likely that ChatGPT meant to reference the State v. Love case, which has the same facts/holding it described. The citation for State v. Peterson also retrieved a case with different party names that wasn't about DUI.

Despite these shortcomings, the system successfully analyzed the fact pattern and produced a legal memorandum that correctly explains the statute and some relevant case law on the subject (despite the botched citations) and then applies them to the facts. It may not be the best-written legal memorandum ever, but some might consider it a useful first draft to start with.

There is, of course, more work for the researcher. In addition to finding actual Arizona DUI cases with facts/holdings similar to those described in the memo, the researcher would have to conduct research to verify and update all information, and to make sure that no additional relevant information or primary materials are missed.


ChatGPT-4 Response

Grade = B

Similar to ChatGPT-3.5, ChatGPT-4 successfully analyzed a fact pattern and produced a legal memorandum that correctly explains the statute and some relevant case law on the subject and then applies it to the facts. It may not be the best-written legal memorandum ever, but some might consider it a useful first draft to start with.

ChatGPT-4 did a better job of researching and providing accurate information about relevant Arizona primary materials. The citation for the statute was accurate, and the relevant parts were summarized correctly. As for the cases, only Tellez was incorrect. For the others, ChatGPT gave accurate names and citations, and their facts and holdings were correctly summarized.

There is still, of course, more work to do since the researcher would have to conduct research to verify and update all information, and to make sure that no additional relevant information or primary materials are missed.


Bing Chat Response

Grade = C-

The system did a good job of researching and providing accurate information and relevant primary materials! It correctly cited and summarized the statute and two of the leading cases (Love and Zaragoza) for this area of law.

However, it refused to generate a legal memorandum or attempt to apply the law to the facts to reach a conclusion. Additionally, it provided information about legal issues (blood test and evidence of impairment) not relevant to the original prompt.


Bard Response

Grade = D-

The system did a poor job of researching and providing accurate information about relevant Arizona primary materials. Although Bard accurately cited and summarized the statute, the case names and citations were all made up. Even if the cases existed, they were not particularly relevant to the fact pattern.

Even though the system generated a memo of sorts, the information it provided is too general to be a useful first draft to start with.