dc.contributor.author: Gelbard, Andrew
dc.contributor.author: Hamilton, Lei
dc.date.accessioned: 2025-09-10T13:47:08Z
dc.date.available: 2025-09-10T13:47:08Z
dc.date.issued: 2025-09-10
dc.identifier.uri: https://hdl.handle.net/1721.1/162628
dc.description.abstract: The accurate classification of government documents according to their sensitivity (e.g., UNCLASSIFIED, SECRET, TOP SECRET) is critical for national security, yet it has historically relied on time-intensive manual review. The current manual classification process consumes millions of labor hours annually within the U.S. government, diverting skilled personnel from essential analytical tasks. This research explores automating this security classification task using recently available declassified materials from the DISC dataset [1], addressing practical challenges such as noisy Optical Character Recognition (OCR) output, imbalanced data distributions, and potential leakage of explicit classification markers within document text. The dataset contains declassified government documents sourced from the Digital National Security Archive, providing authentic textual examples representative of actual classification scenarios. We evaluate both traditional machine learning approaches and transformer-based language models on classifying documents across multiple sensitivity levels. Our results show that transformer-based models, particularly DeBERTa, improve identification of the minority but critical TOP SECRET class, achieving recall over 70% and balanced overall performance (macro F1 score of 0.75), while traditional methods reach similar overall accuracy but struggle with minority-class recall. Despite these promising findings, we caution that the conclusions drawn here remain constrained by limited training data size and inherent uncertainties in human-labeled documents. We emphasize the need for larger, rigorously preprocessed datasets and suggest future research that integrates authoritative classification guidelines directly into model training, potentially via retrieval-augmented methods.
This work thus contributes a foundational, reproducible framework that demonstrates significant potential for machine-assisted security classification, guiding future research and practical applications in the information security domain. [en_US]
dc.description.sponsorship: The Department of the Air Force Artificial Intelligence Accelerator [en_US]
dc.language.iso: en_US [en_US]
dc.subject: Air Force Artificial Intelligence Accelerator [en_US]
dc.subject: Artificial Intelligence [en_US]
dc.subject: Derivative Security Classification [en_US]
dc.title: Artificial Intelligence for Derivative Security Classification: Applications to DoD [en_US]
dc.type: Technical Report [en_US]
dc.contributor.department: Lincoln Laboratory [en_US]

