From Extraction to Retrieval: A Graph-Enhanced Framework for Spatio-Temporal Reasoning in Historical Legal Documents
Author(s)
Liu, Yifeng
Advisor
Williams, Sarah
Abstract
Accurate understanding and querying of historical legal property documents remains a significant challenge for urban planning research. These records typically exist only in analog format: scanned images with inconsistent quality, archaic language, and no structured metadata. This limitation severely hinders systematic analysis of how discriminatory housing practices, particularly racial covenants, shaped city development patterns. While researchers have begun applying Generative AI systems to assist with legal documentation work, fundamental challenges persist in retrieval accuracy, model hallucination, and reliable extraction of structured facts from unstructured historical text.

Using racial covenant analysis as a test case, this thesis addresses two interconnected challenges. Information Extraction: How can structured spatio-temporal information be accurately extracted from degraded historical documents? Reasoning at Scale: How can retrieval systems maintain accuracy when answering complex queries requiring multi-hop reasoning across thousands of documents with temporal and spatial constraints?

To address the first challenge, a document processing pipeline was developed and applied to 569 historical deed records from a test subset of the Massachusetts Covenants Project, spanning 1861–1930 in North Middlesex County. The pipeline integrates optical character recognition, named entity extraction, and geographic information systems to transform scanned deeds into structured spatio-temporal data. The case study demonstrates successful extraction of policy-relevant keywords and accurate geolocation, achieving 64.9% complete accuracy across all documents (or 76.3% when considering only documents containing extractable geographic information), validating the effectiveness of this approach for systematically analyzing discriminatory housing patterns.
The pipeline has been released as an open-source tool and is currently being packaged for deployment by MassHousing, a state affordable housing agency, to support ongoing collaborative research.

While the pipeline produces structured data, querying this information at scale presents its own challenges. To address this second problem, a Graph RAG (Graph-based Retrieval-Augmented Generation) framework is adapted and optimized, with its knowledge graph structure designed to specifically encode the spatio-temporal relationships inherent in historical deeds. Through controlled experiments on 2,000 synthetic documents mirroring real deed characteristics, it is demonstrated that the spatio-temporal-optimized Graph RAG achieves an overall F1 score of 0.598, compared to 0.007 for traditional vector-based RAG, an absolute improvement of 0.591 in F1 score. Notably, vector RAG performance degrades significantly when scaling from 100 to 2,000 documents, while the proposed method maintains stable accuracy. Performance was further benchmarked across a five-level query complexity hierarchy, with Graph RAG achieving particularly strong results on multi-hop queries (F1: 0.923) and temporal reasoning tasks, highlighting the critical role of graph construction and query parsing for complex spatio-temporal tasks.

These dual contributions establish a replicable framework for historical document analysis in urban planning research, comprising an open-source processing pipeline achieving 64.9% complete accuracy (76.3% on valid samples) and a retrieval architecture that maintains performance at scale. The methods generalize to legal document review, housing policy analysis, and other domains requiring structured, scalable reasoning over large archival collections.
Date issued
2026-02
Department
Massachusetts Institute of Technology. Department of Urban Studies and Planning
Publisher
Massachusetts Institute of Technology