Handwriting Identification using Random Forests and Score-based Likelihood Ratios
Abstract
Handwriting analysis is conducted through the expertise of Forensic Document Examiners (FDEs) by visually comparing writing samples. Through their training and years of experience, FDEs are able to recognize critical characteristics of writing to evaluate the evidence of writership. In recent years, there have been incentives to further investigate how to quantify the similarity between two written documents to support the conclusions drawn by FDEs.
One way to extract information from these documents is to define various features within handwritten samples. Using an automatic algorithm in the ‘handwriter‘ package in R, a sample can be split into “graphs”, which are small units of writing (han, 2020). These graphs are sorted into 40 exemplar groups, or “clusters”. Each cluster consists of graphs with similar structures found in documents throughout a database with many writers. The number of graphs per cluster in each document written by a given person acts as a quantitative feature of the handwriting. Previous work on these features focused on quantifying the probability that a questioned document was written by one of the writers in a closed set, where all potential sources of the handwriting are assumed to be known (Crawford, 2020). This project uses experimental data collected at CSAFE to study how classification tools can be used to assess the within-writer versus between-writer hypotheses in an open set of writers.
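The feature described above, the number of graphs per cluster normalized by document length, can be sketched as follows. This is an illustrative Python sketch, not the handwriter package's implementation; the function name and the toy cluster labels are assumptions.

```python
from collections import Counter

NUM_CLUSTERS = 40  # number of exemplar clusters, as in the abstract


def cluster_proportions(cluster_labels):
    """Turn the cluster label assigned to each graph in a document
    into a length-40 vector of proportions (the document's feature
    vector). cluster_labels holds one integer in [0, 40) per graph."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return [counts.get(k, 0) / total for k in range(NUM_CLUSTERS)]


# Hypothetical document whose six graphs fell into clusters 0, 0, 3, 3, 3, 7
doc = [0, 0, 3, 3, 3, 7]
props = cluster_proportions(doc)
# half of the graphs landed in cluster 3, so props[3] is 0.5
```

Normalizing by the total number of graphs lets documents of different lengths be compared on the same scale.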
Specifically, a statistical model can be used to study the proportion of graphs categorized within each of the 40 clusters for each document. Then, given two questioned handwritten documents, it is possible to quantify how similar the proportions across clusters are using a distance measure, such as the per-cluster difference in proportions or the Euclidean distance between the proportion vectors. Since writers exhibit similar writing patterns over time and across documents, the proportions of graphs classified to these clusters are expected to be comparable when the documents were written by the same person; conversely, the proportions will be less similar when the documents do not share the same source. Outputs from trained random forest algorithms are used as dissimilarity scores. After estimating densities for a collection of these dissimilarity scores, multiple score-based likelihood ratios can be computed. These scores discriminate clearly between handwriting features from the same writer and from different writers, to varying degrees across document types and writers.
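The two steps above, a distance between proportion vectors and a score-based likelihood ratio, can be sketched in a few lines. This is a simplified illustration only: Gaussian fits stand in for whatever density estimator the study actually uses, and a single distance replaces the random-forest dissimilarity score; all function names and sample values here are assumptions.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two cluster-proportion vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def score_based_lr(score, same_writer_scores, diff_writer_scores):
    """SLR = density of the observed score under the same-writer score
    distribution divided by its density under the different-writer
    distribution. Gaussian fits stand in for the density estimates."""
    def fit(xs):
        mu = sum(xs) / len(xs)
        sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / (len(xs) - 1))
        return mu, sd

    mu_s, sd_s = fit(same_writer_scores)
    mu_d, sd_d = fit(diff_writer_scores)
    return normal_pdf(score, mu_s, sd_s) / normal_pdf(score, mu_d, sd_d)


# Hypothetical reference scores: same-writer pairs tend to score low,
# different-writer pairs high; a low observed score then yields SLR > 1,
# supporting the same-writer hypothesis.
same = [0.10, 0.15, 0.20, 0.12]
diff = [0.50, 0.60, 0.55, 0.65]
slr = score_based_lr(0.13, same, diff)
```

An SLR above 1 indicates the observed score is more probable under the same-writer hypothesis; below 1, more probable under the different-writer hypothesis.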
Findings from this statistical research provide insight into another way to quantify the similarity between two questioned documents when the set of possible sources is open.