Your Code Leaves Fingerprints, and Machine Learning can now Identify it

Pranav Dar Last Updated : 07 May, 2019


Overview

Researchers have designed an algorithm that can identify the author of a piece of code
The current system has shown an accuracy of 83% when tested on 600 individual programmers
It can be used to identify plagiarism in code, or to detect the author of a piece of malware

Introduction

Natural Language Processing is a challenging field because of how unstructured the data within it is. Finding and analyzing hidden patterns among the noise is where a data scientist earns the big bucks.

Recent progress in this field has included identifying the authors who wrote a certain piece of literature. This has been automated to quite an extent, but how can we apply it to a programmer’s code? Code is also a collection of text and numbers, albeit arranged in a very different manner.

A couple of researchers from Drexel University and the George Washington University have revealed (in this brilliant WIRED article) that code, just like literature, can also be analyzed to identify and pinpoint the author. They will be presenting their work at the DefCon hacking conference later this week.

So how did these researchers design their system? First, the algorithm identifies the features present in samples of code. The two researchers then narrowed the features down to only those that helped them distinguish individual developers. This cut down the number of features significantly.
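The narrowing step described above can be sketched with a simple information gain filter. This is only an illustration under stated assumptions: the article does not say which selection criterion the researchers used, and the feature names, the toy samples, and the `min_gain` threshold below are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of author labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(samples, authors, feature):
    """Reduction in uncertainty about the author once we know
    whether `feature` is present in a sample."""
    groups = defaultdict(list)
    for sample, author in zip(samples, authors):
        groups[feature in sample].append(author)
    remainder = sum(len(g) / len(authors) * entropy(g) for g in groups.values())
    return entropy(authors) - remainder


def select_features(samples, authors, min_gain=0.5):
    """Keep only features whose information gain reaches `min_gain`
    (an arbitrary threshold chosen for this sketch)."""
    all_features = set().union(*samples)
    gains = {f: information_gain(samples, authors, f) for f in all_features}
    return {f: g for f, g in gains.items() if g >= min_gain}


# Toy data: each sample is the set of (hypothetical) style features in one file.
samples = [
    {"tabs", "for_loop", "snake_case"},
    {"tabs", "for_loop", "snake_case"},
    {"spaces", "for_loop", "camelCase"},
    {"spaces", "for_loop", "camelCase"},
]
authors = ["alice", "alice", "bob", "bob"]

selected = select_features(samples, authors)
print(sorted(selected))
```

In this toy data, `for_loop` appears in every file, so it says nothing about who wrote the code and is dropped. The indentation and naming features split the two authors cleanly and are kept.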

The researchers created an “abstract syntax tree”, which was used to recognize the code’s underlying structure. As you can imagine, the algorithm requires a few examples/samples to train. In this research paper, the researchers, along with others, showed that it’s possible to identify the programmer using just their compiled binary code.
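To get a feel for what an abstract syntax tree exposes, here is a minimal sketch that uses Python's built-in `ast` module to turn source code into structural features (node type counts and tree depth). It is an assumption-laden illustration, not the researchers' pipeline: the article does not describe their parser or feature set, and the `ast_features` helper and the two snippets are made up for this example.

```python
import ast
from collections import Counter


def ast_features(source):
    """Return (node type counts, maximum tree depth) for a piece of Python source."""
    tree = ast.parse(source)
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))

    def depth(node):
        # Depth of a node is 1 plus the depth of its deepest child (0 if it has none).
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    return node_counts, depth(tree)


# Two ways of summing a list: same result, different structure.
loop_style = "total = 0\nfor x in [1, 2, 3]:\n    total += x\n"
builtin_style = "total = sum([1, 2, 3])\n"

loop_counts, loop_depth = ast_features(loop_style)
builtin_counts, builtin_depth = ast_features(builtin_style)

print(loop_counts["For"], builtin_counts["For"])
print(loop_counts["Call"], builtin_counts["Call"])
```

Both snippets compute the same value, yet the first tree contains a `For` and an `AugAssign` node while the second contains a single `Call`. Structural habits like these survive renaming variables or reformatting whitespace, which is why tree-based features are useful for telling authors apart.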

The researchers picked up code samples from Google’s annual Code Jam competition to test their algorithm. It achieved an impressive 96% accuracy when analyzing 100 individual coders (each had eight code samples). But the accuracy dropped a bit to 83% when the number of programmers was increased to 600.

The curious (though not altogether surprising) finding was that it was far easier to recognize experienced programmers from their code than newcomers. I imagine this must be because of the number of samples present, plus the fact that each programmer embeds their own unique style in each piece of their code.

Our take on this

I previously covered DeepCode’s efforts to clean up a programmer’s code, but this latest project is a whole different beast. It tackles a variety of problems, like plagiarism and identity theft. It could also help in cyber security by identifying who created a specific piece of malware.

I believe we are still a fair bit away from seeing this algorithm being used in practical scenarios given how complex the problem is. Let’s wait and watch where this study leads us in the near future.

Subscribe to AVBytes here to get regular data science, machine learning and AI updates in your inbox!

Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.
