Classifying and Decoding Historical Texts and Images using CNNs!

Aishwarya Singh, 07 May 2019

Overview

A team of researchers has built a CNN model that can analyze and classify ancient graffiti
A multinomial logistic regression model gave an average AUC of 0.82 on the graffiti dataset, while a 2D-CNN reached an accuracy of 0.94
The researchers have open-sourced the dataset to involve the entire community in this research

Introduction

Historical texts and images hold a special fascination for me. Ancient writings and codes have a mysterious aura about them, and as a data scientist that’s something I’m automatically drawn to. Could machine learning help us read them? Could our algorithms really decode texts written thousands of years ago?

A team of researchers from the National Technical University of Ukraine and Huizhou University’s School of Information Science and Technology has designed an algorithm with the aim of detecting, isolating and classifying ancient graffiti. Sounds really interesting, right? Stay with me, because we’re about to take a dive into the technique they’ve used!

Current handwriting recognition techniques achieve very high accuracy on handwritten text, but they do not perform as well on graffiti images. According to the team, one reason could be the difference in quality between text written on paper and text carved into stone. The quality of stone-carved handwriting is comparatively poor, which is understandable given that we’re talking about inscriptions made centuries ago!

The first step was to preprocess the data and then train a model on it. Broadly, two datasets were used: CGCL and notMNIST. The CGCL dataset consists of carved Glagolitic and Cyrillic letters (CGCL) from graffiti in the St. Sophia Cathedral of Kyiv. These images were assembled and preprocessed to provide glyphs for recognition and prediction.

The CGCL dataset consists of 4000 images covering 34 types of letters (classes). A second dataset, notMNIST, was used as a point of comparison for the results obtained with CGCL. The notMNIST dataset consists of glyphs drawn from publicly available fonts, in 10 classes.

A multinomial logistic regression model was applied to a subset of 10 classes. The AUC-ROC for individual letters was approximately 0.92 for notMNIST and 0.60 for CGCL (refer to the image below). The averaged AUC values were 0.99 and 0.82 for notMNIST and CGCL respectively. A 2D-CNN showed an accuracy of 0.94 on CGCL and 0.91 on notMNIST.
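To make the difference between a per-letter AUC and an averaged AUC concrete, here is a minimal sketch in plain Python. It is not the researchers' code: the function names are hypothetical, the three-class probabilities are made up for illustration, and the unweighted (macro) average is an assumption, since the article does not say how the averaged values were computed.

```python
from typing import Dict, List, Sequence


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(
    probabilities: Sequence[Sequence[float]],
    true_classes: Sequence[str],
    class_names: Sequence[str],
) -> Dict[str, float]:
    """Score each class against all the others, one class at a time."""
    per_class: Dict[str, float] = {}
    for k, name in enumerate(class_names):
        scores = [row[k] for row in probabilities]               # P(class k) per sample
        labels = [1 if c == name else 0 for c in true_classes]   # class k vs the rest
        per_class[name] = binary_auc(scores, labels)
    return per_class


def macro_average(per_class: Dict[str, float]) -> float:
    """Unweighted mean of the per-class AUC values (assumed averaging scheme)."""
    return sum(per_class.values()) / len(per_class)


# Made-up predicted probabilities for six glyphs over three letter classes.
class_names: List[str] = ["A", "B", "C"]
true_classes: List[str] = ["A", "A", "B", "B", "C", "C"]
probabilities: List[List[float]] = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],  # an "A" that the classifier scores higher as "B"
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],  # a "B" that the classifier scores higher as "A"
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]

per_class = one_vs_rest_auc(probabilities, true_classes, class_names)
macro = macro_average(per_class)
print(per_class)
print(round(macro, 3))
```

In this toy example two of the three letters score lower than the third, and the single averaged figure sits between them, which is why it is worth looking at the per-letter values and not only the summary number.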

These details are covered in much more depth in the paper the team published on arxiv.org: Open Source Dataset and Machine Learning Techniques for Automatic Recognition of Historical Graffiti.

Our take on this

At 4000 images across 34 classes (roughly 118 images per letter on average), the dataset is relatively small, so expectations from this algorithm are still tempered. It’s a good start, but ancient hieroglyphics had complex codes and texts, not something that one single algorithm will be able to crack open in a matter of weeks.

Having said that, this study shows that the potential is there. While interpretability remains a question, I am looking forward to getting my hands on the dataset and performing some cool exploratory analysis.



