ChatGPT and Generative AI Legal Research Guide

Large Language Model Training Data

The data used to train large language models, such as ChatGPT, includes a diverse range of legal documents and texts that are publicly available on the internet. Note, however, that AI systems may have different cut-off dates for their training data, and only some can search the internet for current information.

In his April 20, 2023 article, “What Makes LLM-Based AI So Smart? Well, Turns Out This Blog Played A Part, Along with Other Legal Sites,” Bob Ambrogi reports that The Washington Post has “lifted the cover off this black box.”

In collaboration with the Allen Institute for AI, The Washington Post analyzed Google's C4 dataset, a huge collection of data from 15 million websites. This dataset has been used to train prominent English-language AI models such as Google's T5 and Facebook's LLaMA. The analysis categorized the websites by content type, such as journalism and entertainment.

Ambrogi reviewed the data and listed some of the top legal information sites that were included in the training data:

Although these legal information sites were included in the training data, the cut-off dates for that data vary between AI systems, and only some systems can search the internet for current information. Users should keep these differences in mind when using AI systems for legal research and should always verify the information provided against up-to-date, authoritative sources.