AI Is Bad At Bluebooking (Part 988) (Updated 6/1)

Kim Krawiec June 1, 2025

As reported by Reuters here:

An attorney defending artificial-intelligence company Anthropic in a copyright lawsuit over music lyrics told a California federal judge on Thursday that her law firm Latham & Watkins was responsible for an incorrect footnote in an expert report caused by an AI "hallucination."

Ivana Dukanovic said in a court filing that the expert had relied on a legitimate academic journal article, but Dukanovic created a citation for it using Anthropic's chatbot Claude, which made up a fake title and authors in what the attorney called "an embarrassing and unintentional mistake."

Although it is not about AI hallucinations, this recent paper by Matthew Dahl shows that LLMs are not very good even at technical conformance with Bluebook rules:

Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models

Legal practice requires careful adherence to procedural rules. In the United States, few are more complex than those found in The Bluebook: A Uniform System of Citation. Compliance with this system’s 500+ pages of byzantine formatting instructions is the raison d’être of thousands of student law review editors and the bête noire of lawyers everywhere. To evaluate whether large language models (LLMs) are able to adhere to the procedures of such a complicated system, we construct an original dataset of 866 Bluebook tasks and test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. We show (1) that these models produce fully compliant Bluebook citations only 69%-74% of the time and (2) that in-context learning on the Bluebook’s underlying system of rules raises accuracy only to 77%. These results caution against using off-the-shelf LLMs to automate aspects of the law where fidelity to procedure is paramount.

Can confirm that LLMs make similar mistakes with less byzantine citation styles as well.

Update 6/1: Here is a database that tracks legal decisions in cases "where generative AI produced hallucinated content – typically fake citations, but also other types of arguments." As of today it includes 129 decisions.
