Classifying and Decoding Historical Texts and Images using CNNs!

Aishwarya Singh Last Updated : 07 May, 2019


Overview

A team of researchers has built a CNN model that can analyze and classify ancient graffiti
A multinomial logistic regression model gave an average AUC of 0.82 on the graffiti dataset, while a 2D-CNN reached an accuracy of 0.94
The researchers have open-sourced the dataset to involve the entire community in this research

Introduction

Historical texts and images hold a special fascination for me. Ancient writings and codes have a mysterious aura about them, and as a data scientist that’s something I’m automatically drawn to. Could machine learning be the answer? Could our algorithms really decode texts written thousands of years ago?

A team of researchers from the National Technical University of Ukraine and Huizhou University’s School of Information Science and Technology has designed an algorithm for detecting, isolating and classifying ancient graffiti. Sounds really interesting, right? Stay with me, because we’re about to take a dive into the technique they’ve used!

Current handwriting recognition techniques give very high accuracy on handwritten text, but do not perform nearly as well on graffiti images. According to the team, one reason could be the difference in quality between text written on paper and text carved into stone. The quality of stone-carved handwriting is comparatively poor, which is understandable given that we’re talking about carvings made centuries ago!

The first step was to preprocess the data and then train a model on it. Broadly, two datasets were used: CGCL and notMNIST. The CGCL dataset consists of carved Glagolitic and Cyrillic letters (hence the name) taken from graffiti in the St. Sophia Cathedral of Kyiv. These images were assembled and preprocessed to provide glyphs for recognition and prediction.

The CGCL dataset consists of 4000 images covering 34 types of letters (classes). A second dataset, notMNIST, was used as a point of comparison for the results obtained with CGCL. The notMNIST dataset is built from publicly available fonts and has 10 classes.
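To make the preprocessing step a little more concrete, here is a minimal sketch of turning a photographed letter into a fixed-size glyph that a model can consume. This is not the team's actual pipeline: the function name `to_glyph`, the 28 x 28 target size, the nearest-neighbour resize and the random stand-in image are all assumptions made for illustration.

```python
import numpy as np

def to_glyph(image: np.ndarray, size: int = 28) -> np.ndarray:
    """Convert an H x W (grayscale) or H x W x 3 (RGB) uint8 image
    into a size x size float glyph with values in [0, 1]."""
    img = image.astype(np.float64)
    if img.ndim == 3:
        # Average the colour channels to get a single grayscale channel.
        img = img.mean(axis=2)
    h, w = img.shape
    # Nearest-neighbour resize: choose a source row/column for each target pixel.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = img[np.ix_(rows, cols)]
    # Scale pixel intensities from [0, 255] to [0, 1].
    return resized / 255.0

# Hypothetical stand-in for a photographed glyph: a random 120 x 90 RGB image.
rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(120, 90, 3), dtype=np.uint8)

glyph = to_glyph(photo)
features = glyph.reshape(-1)  # flatten into one feature vector for a linear model
```

A real pipeline would likely also crop each letter out of the wall photograph and clean up the background, which this sketch skips.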

A multinomial logistic regression model was applied to a subset of 10 classes. The AUC-ROC for individual letters was approximately 0.92 for notMNIST and 0.60 for CGCL (refer to the image below). The averaged AUC values were 0.99 and 0.82 for notMNIST and CGCL respectively. A 2D-CNN showed an accuracy of 0.94 on CGCL and 0.91 on notMNIST.
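For readers who want to see how per-letter and averaged AUC values like these are computed, here is a small sketch using only NumPy. It trains a multinomial (softmax) logistic regression on synthetic data and scores it with one-vs-rest AUC. The data, the three classes, the learning rate and all function names are assumptions for illustration. Nothing here reproduces the paper's code or numbers.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, n_classes, lr=0.1, epochs=300):
    """Multinomial logistic regression trained with batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)
        grad = (P - Y) / n  # gradient of the mean cross-entropy w.r.t. the logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def binary_auc(scores, positives):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = scores[positives]
    neg = scores[~positives]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic stand-in for flattened glyphs: three Gaussian blobs in 20 dimensions.
rng = np.random.default_rng(0)
n_classes, n_per_class, dim = 3, 200, 20
centers = rng.normal(0.0, 1.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(0.0, 1.5, size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

# Simple shuffled train/test split.
order = rng.permutation(len(y))
train, test = order[:450], order[450:]

W, b = fit_softmax_regression(X[train], y[train], n_classes)
probs = softmax(X[test] @ W + b)

# One-vs-rest AUC for each class, then the unweighted (macro) average.
per_class_auc = [binary_auc(probs[:, k], y[test] == k) for k in range(n_classes)]
macro_auc = float(np.mean(per_class_auc))
accuracy = float((probs.argmax(axis=1) == y[test]).mean())
print(per_class_auc, macro_auc, accuracy)
```

The per-class values correspond to the "individual letters" figures and the macro average to the "averaged" figure. Whether the paper averages in exactly this way is an assumption here.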

These details are covered in much more depth in the paper published by the team on arxiv.org – Open Source Dataset and Machine Learning Techniques for Automatic Recognition of Historical Graffiti.

Our take on this

4000 images is a relatively small dataset (roughly 118 images per class on average across 34 classes), so expectations from this algorithm should stay tempered. Ancient hieroglyphics had complex codes and texts, not something that one single algorithm will be able to crack open in a matter of weeks.

Having said that, this study still shows that the potential is there. It’s a good start, and while interpretability remains a question, I am looking forward to getting my hands on the dataset and performing some cool exploratory analysis.


Aishwarya Singh

An avid reader and blogger who loves exploring the endless world of data science and artificial intelligence. Fascinated by the limitless applications of ML and AI; eager to learn and discover the depths of data science.
