Handwriting Identification using Random Forests and Score-based Likelihood Ratios
Abstract
Handwriting analysis is conducted through the expertise of Forensic Document Examiners (FDEs) by visually comparing writing samples. Through their training and years of experience, FDEs are able to recognize critical characteristics of writing to evaluate the evidence of writership. In recent years, there have been incentives to further investigate how to quantify the similarity between two written documents to support the conclusions drawn by FDEs.
One way to extract information from these documents is to define various features within handwritten samples. Using an automatic algorithm in the ‘handwriter‘ package in R, a sample can be split into “graphs”, which are small units of writing (han, 2020). These graphs are sorted into 40 exemplar groups, or “clusters”. Each cluster consists of graphs with similar structures found in documents throughout a database with many writers. The number of graphs per cluster in each document written by a given person acts as a quantitative feature of the handwriting. Previous work on these features focused on quantifying the probability that a questioned document was written by one of the writers in a closed set, where all potential sources of the handwriting are assumed to be known (Crawford, 2020). This project uses experimental data collected at CSAFE to study how classification tools can be used to assess the within-writer versus between-writer hypotheses in an open set of writers.
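The feature described above, the number of graphs per cluster normalized by document length, can be sketched as follows. This is an illustrative Python sketch, not the handwriter package's implementation; the function name and the toy cluster labels are assumptions.

```python
from collections import Counter

NUM_CLUSTERS = 40  # number of exemplar clusters, as in the abstract


def cluster_proportions(cluster_labels):
    """Turn the cluster label assigned to each graph in a document
    into a length-40 vector of proportions (the document's feature
    vector). cluster_labels holds one integer in [0, 40) per graph."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return [counts.get(k, 0) / total for k in range(NUM_CLUSTERS)]


# Hypothetical document whose six graphs fell into clusters 0, 0, 3, 3, 3, 7
doc = [0, 0, 3, 3, 3, 7]
props = cluster_proportions(doc)
# half of the graphs landed in cluster 3, so props[3] is 0.5
```

Normalizing by the total number of graphs lets documents of different lengths be compared on the same scale.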
Specifically, a statistical model can be used to study the proportion of graphs categorized within each of the 40 clusters for each document. Then, given two questioned handwritten documents, it is possible to quantify how similar the proportions across clusters are using a distance measure, such as the per-cluster difference in proportions or the Euclidean distance between the proportion vectors. Since writers exhibit similar writing patterns over time and across documents, the proportions of graphs classified to these clusters are expected to be comparable when the documents were written by the same person; conversely, the proportions will be less similar when the documents do not share the same source. Outputs from trained random forest algorithms are used as dissimilarity scores. After estimating densities for a collection of these dissimilarity scores, multiple score-based likelihood ratios can be computed. These scores discriminate clearly between handwriting features from the same writer and from different writers, to varying degrees across document types and writers.
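The two steps above, a distance between proportion vectors and a score-based likelihood ratio, can be sketched in a few lines. This is a simplified illustration only: Gaussian fits stand in for whatever density estimator the study actually uses, and a single distance replaces the random-forest dissimilarity score; all function names and sample values here are assumptions.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two cluster-proportion vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def score_based_lr(score, same_writer_scores, diff_writer_scores):
    """SLR = density of the observed score under the same-writer score
    distribution divided by its density under the different-writer
    distribution. Gaussian fits stand in for the density estimates."""
    def fit(xs):
        mu = sum(xs) / len(xs)
        sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / (len(xs) - 1))
        return mu, sd

    mu_s, sd_s = fit(same_writer_scores)
    mu_d, sd_d = fit(diff_writer_scores)
    return normal_pdf(score, mu_s, sd_s) / normal_pdf(score, mu_d, sd_d)


# Hypothetical reference scores: same-writer pairs tend to score low,
# different-writer pairs high; a low observed score then yields SLR > 1,
# supporting the same-writer hypothesis.
same = [0.10, 0.15, 0.20, 0.12]
diff = [0.50, 0.60, 0.55, 0.65]
slr = score_based_lr(0.13, same, diff)
```

An SLR above 1 indicates the observed score is more probable under the same-writer hypothesis; below 1, more probable under the different-writer hypothesis.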
Findings from this statistical research provide insight into another way to quantify the similarity between two questioned documents when the set of possible sources is open.