Steal the code (ethically) and get better at ML/AI research

Jyotisman Last Updated : 13 Jun, 2021

8 min read

This article was published as a part of the Data Science Blogathon

Okay! I know that stealing is a crime and we should never get involved into it.

But as an AI/ML & Data Science enthusiast, you must have come across numerous research papers and studies (mostly published in journals) where the research has been excellent and you have been wanting to know how it was performed.

If you are a beginner or an intermediate in this field, you might have mostly done ML projects and courses conventionally. That’s VERY MUCH fine.

But, once you get going in the field of AI/ML, you must keep a glance over research papers and publications to elevate your standards and broaden your opportunities.

The various sections of a research paper quite vividly describe the work done by the computer-scientists, researchers, and engineers from various academic institutions and research groups across the world. A lot of work that we see in practice today originated as findings and methods published in such papers.

BUT WHAT ABOUT THE ACTUAL CODE?

Though the work is often described lucidly in research publications, a big limitation is that the original code is not made publicly accessible. And this comes due to many reasons. The research groups feel that it is only the results that are the most important for their publication. They just give an overview of how the work was carried out in the ‘methodology’ section and lines of code are kept hidden.

But the good news is that there are plenty of ways to access the codes and make the best use of research publications. We list out and describe some of the platforms and techniques to make this possible.

1. Paper with codes – A Gold Mine!

There are academicians and research groups who do publish their codes and files along with their research papers and publications. To find out such, you must visit paperswithcode.com, a platform that gathers Machine Learning papers and codes from journals and publishers across the world. The best part is that everything is free and openly licensed under CC-BY-SA. Even the public can contribute with an edit button.

There are so many things you can take away from here –

A state-of-the-art collection: A specially curated section of tasks (under topics like CV, NLP, Medical, Time Series, etc) which contain a collection of ML codes, papers, and models. These are compared & the best one under a given dataset is benchmarked. You get to know a great deal about what research is published in various journals.

You can additionally check for sota-benchmarking of ML models here.

Portals for various disciplines: Apart from just dealing with AI, there are special portals of Machine Learning, Computer Science, Statistics, etc. The opportunities just get endless from here.

Datasets from research: Around 4000+ standardized Machine Learning datasets are gathered and made publicly available. These can be filtered under various modalities, tasks, and languages. The most interesting is to see a listing of published papers under each dataset. Each of these paper highlights some research carried out using the dataset. So, this platform makes it easy for you to compare your own work on the given dataset with the results published earlier.

Methods for you to implement: Have you heard of things like GoogLeNet under Convolutional Neural Network and Transformer-XL under Natural Language Processing? If not DO NOT worry at all. These are some of the things the world has been working on and maybe become the thing of summer AI projects in few years. By discovering and knowing such concepts beforehand, you certainly broaden your scope and outlook. The methods section of this platform has a list of topics under every discipline like CV and NLP. Every method has many numbers of published papers under it for you to see and learn.

The platform also has a section for libraries and trends. Most importantly it advocates the openness and reproducibility of research papers and the codes used in them. This comes at a stage when the world is rapidly advancing with the use of ML/AI and thus publicization of new findings and resources (by keeping transparency) can have an encouraging impact. There’s an ML reproducibility challenge here that you must have a look on.

Paper with code is based on Facebook AI Research and even the primary data sources of the portal and free and open to download. Get all their information here.

2. Researchcode.com

Welcome to a search engine of research papers and publications with codes. This is an awesome tool to search ML/AI papers from various disciplines and it comes with a Google Chrome extension and an option to add new papers along with code. The search results can be sorted as per year, popularity, and relevance. Using the Chrome extension (download here), you get suggestion while browsing.

Upon making a log-in, you can add a paper to the site. For this, you need to provide all details and like name, link, and abstract. You must necessarily add the link to the code and its details.

3. CatalyzeX – A browser extension and portal

To catalyze your ML research, here’s another browser extension. Download this (from here) to get suggestions while browsing on Scholar, arXiv, Twitter, etc.

Additionally, you can browse through a curated list of papers with model/code under topics like Cancer Detection, Music Generation, Chatbots, etc

4. TheJournal.Club

Here’s an awesome collection of papers on various topics of Machine Learning and AI. But the most interesting part of this platform is – the clubs. These are a kind of user groups (like Facebook groups) where people discuss, share their opinions about various research publications, and hold webinars/meetings.

You also get an additional list of authors, datasets and codes on this platform. The ‘codes’ tab lists various ML/AI topics under the sun and the papers that implement these topics are highlighted.

5. Papers + Code – MIT-IBM Watson AI Lab

This is another list of peer-reviewed papers that come along with code, released by a joint collaboration between MIT and IBM Watson. They believe in producing reproducible and innovative results.

6. Automatic paper-to-code – an insightful approach

Imagine if we could automatically convert the deep learning research papers to executable code. Sounds like an interesting NLP project, right?

Say hi to the IBM Deep Learning IDE, proudly developed at IBM Research India. This seemingly untrue technology ingests deep learning algorithms in published research papers and recreates them in source code for inclusion in libraries for multiple deep learning frameworks (Tensorflow, Keras, Caffe). It uses the flowcharts and tables in a research paper to do the same.

The motivation came from the fact that it happens to be very difficult for even experienced engineers to reproduce the code in a given Deep Learning or AI/ML paper. In majority of publications, the code was not made available by the authors.

Automatic paper-to-code – an insightful approach ML/AI research — Showing how code is auto-generated from research papers by this technology. (Sourced from original research paper for education use only)

The IDE also has an interesting drag-and-drop UI editor, which can be used to manually edit and perfect the extracted design, and generate the source code in real-time. Currently, we are in the process of building a model zoo consisting of design and source code for models from 5,000 core deep learning research papers from arXiv. We are hoping to share this dataset soon with the larger research community to use and improve.

Additionally, check out another source code retrieval study here.

THE CONVENTIONAL APPROACHES YOU MUST KNOW

Above given are some platforms and methods for obtaining the source code. But the story doesn’t end here as you can also extract the code yourself. Here are some of the must-know approaches –

1. CHECK THE PAPER THOROUGHLY –

This may sound obvious but with the amount of text being huge in a research paper, the link to the code is often left hidden. If not a direct link, many papers describe where the source code may be extracted from or at least give a hint in their methodology section.

In case you don’t get anything, you will surely get the idea of how you can reproduce the study yourself. By doing so, you actually get better with your skills and know-how.

2. CODE (AND DATA) AVAILABILITY STATEMENT | SUPPLEMENTARY INFORMATION –

The source code (or even the dataset) is often not regarded as the main part of a paper.

So information relating to it is provided under code availability statement or supplementary information portion.

Where to get them? The availability statement is mentioned towards the very end of the research paper and/or at the publisher’s page (that you reach after clicking the DOI link of the paper)

Same is for supplementary information. This is often a separate part of the paper which is the not included in the main paper (like Appendix). You may get this at the very end of the combined document file or as a separate document found at the publisher’s page.

(Beware! You are most likely to overlook the above things…)

3. SEACH ON GITHUB (+ CHECK FOR REPOSITORIES) –

This is one of the things you do probably, isn’t it? When it comes to sharing code, it is mostly that GitHub is used to share the files. You may search on GitHub by using the DOI link, the paper’s name, the author’s name, or even the journal’s name. You can also search using the topic or sub-topic name. You will most likely get a repository having source code for carrying your research.

4. REACH OUT TO THE AUTHORS –

Probably you didn’t find luck in any other option. Don’t worry! Every paper has a corresponding author whose role is to receive correspondence and queries like this. Find out the details of that author and write to him/her on email.

NOTE THAT it’s up to the author to make the things available to you. If you are able to convince the author(s) of your intentions, they will most likely trust you and this has happened many times. That is how research used to work till now. Here are some possible scenarios in which they do so –

-> You find out some errors or lapses in their work and you want to verify/prove it or help them convincingly.

-> You suggest some additions which are of their interest. Then you may collaborate with them to improve their work (and maybe publish as another paper)

-> You can reproduce the code in a better/efficient way, in another programming language, or through another approach, etc.

-> For educational purposes, for giving your opinion, for sharing it, etc.

REPRODUCIBILITY & OPENNESS – THE NEW NORMAL?

Before we wrap, let’s have a word about the main issue – lack of transparency and openness. But this is obvious!

If you or me are working on something interesting for months together (probably a year) and we publish our study and findings as a paper, why should I make my code public? Some college guy may just grab and make better use of it with minimal efforts.

The only way to deal with this is by shifting to a new normal where sharing is caring, literally. This will start with you, me, and everyone around. Be it the research paper itself (that you have to access with a high-price subscription), the data used or the code utilized, things should be given and received openly.

There are huge benefits to this. The research gets fast-tracked, beginners get learning and motivation. Also, reproducibility is ensured.

What if I publish something that is really good, but it just worked for me and only for once? What if I made a mistake and it gets hidden unknowingly?

If others are also able to recreate those results (ie reproduce it), we ensure that research is full-proof and of high class. Scientists are left with no chance to make any mistakes as they must verify everything before publication. People who reproduce it also get a lot of things to learn.

Check out the ML reproducibility challenge here where you can participate to recreate the results found in various papers. Read another blog on this. Learn about executable research articles here.

About Author

Hello, this is Jyotisman Rath from Bhubaneswar, and it’s the toughest challenges that always excite me. Alongside pursuing I-MTech in Chemical Engineering, I owe a great interest in the field of AI/ML and Data Science. I look to integrate these with my research interests and real-world scenarios and I believe technology is much more than just coding.

Would love to see you on LinkedIn and Instagram. Mail me here for any queries.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

AI/ML Reasearch blogathon

Jyotisman

Beginner Interview Prep Researchers & Scientists

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

raveena aher

Hey thanx for sharing this blog over here. It seems useful to start career in clinical research. We will look foreward for more updates.

Vinayak Shanawad

Excellent article, thanks a lot for sharing sir.

Babu Vishwanath

really an interesting blog, keep on posting your blogs and posts bro

Siddharth

Very helpful, Indeed.

Reading list

Introduction to Deep Learning

Feed Forward Networks

Feed Forward Networks

Gradient Descent

Loss Function

Activation Functions

Introduction to Neural networks

Forward and Backward Propagation

Optimizers

Learning Rate Schedulers

NN on Structured Data

Improving the Deep Learning Model

Deep Learning Model Optimization

Unsupervised Deep Learning

AutoDL

Model Deployment

Introduction to PyTorch

Steal the code (ethically) and get better at ML/AI research

BUT WHAT ABOUT THE ACTUAL CODE?

1. Paper with codes – A Gold Mine!

2. Researchcode.com

3. CatalyzeX – A browser extension and portal

4. TheJournal.Club

5. Papers + Code – MIT-IBM Watson AI Lab

6. Automatic paper-to-code – an insightful approach

THE CONVENTIONAL APPROACHES YOU MUST KNOW

REPRODUCIBILITY & OPENNESS – THE NEW NORMAL?

About Author

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv