Introduction to Embedchain – A Data Platform Tailored for LLMs

Ajay Kumar Reddy 08 Nov, 2023 • 8 min read

Introduction

Tools like LangChain and LangFlow have made it easier to build applications with Large Language Models. However, even though building applications and choosing among different Large Language Models has become simpler, the data-uploading step remains time-consuming for developers, because data coming from various sources must be converted into plain text before it can be injected into vector stores. This is where Embedchain comes in: it makes it simple to upload data of any type and start querying the LLM instantly. In this article, we will explore how to get started with Embedchain.

Learning Objectives

  • Understand the significance of Embedchain in simplifying the process of managing and querying data for Large Language Models (LLMs)
  • Learn how to effectively integrate and upload unstructured data into Embedchain, enabling developers to work with various data sources seamlessly
  • Know the different Large Language Models and Vector Stores supported by Embedchain
  • Discover how to add various data sources, such as web pages and videos, to the vector store, thus understanding data ingestion

This article was published as a part of the Data Science Blogathon.

What is Embedchain?

Embedchain is a Python/JavaScript library with which a developer can seamlessly connect multiple data sources with Large Language Models. Embedchain allows us to upload, index, and retrieve unstructured data. The unstructured data can be of any type, such as text, a URL to a website or YouTube video, an image, etc.

Embedchain makes it simple to upload this unstructured data with a single command, creating vector embeddings for it and letting us start querying the data instantly with the connected LLM. Behind the scenes, Embedchain takes care of loading the data from the source, chunking it, creating vector embeddings for the chunks, and finally storing them in a vector store.
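For orientation, the end-to-end flow typically looks like the short sketch below. This is a minimal illustration, not the article's exact setup: it assumes the default configuration, where Embedchain falls back to OpenAI and expects an OPENAI_API_KEY in the environment.

from embedchain import Pipeline as App

app = App()  # default LLM and vector store (ChromaDB); assumes OPENAI_API_KEY is set
app.add("https://en.wikipedia.org/wiki/Alphabet_Inc.")  # load, chunk, embed, and store the page
print(app.query("What does Alphabet Inc. do?"))  # illustrative question answered from the stored chunks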

"

Creating First App with Embedchain

In this section, we will install the embedchain package and create an app with it. The first step would be using the pip command to install the package as shown below:

!pip install embedchain

!pip install embedchain[huggingface-hub]
  • The first statement will install the embedchain Python package
  • The next line will install huggingface-hub; this Python package is required if we want to use any models provided by Hugging Face

Now we will be creating an environment variable to store the Hugging Face Inference API Token as below. We can obtain the Inference API Token by signing in to the Hugging Face website and then generating a token.

import os

os.environ["HUGGINGFACE_ACCESS_TOKEN"] = "Hugging Face Inferenece API Token"

The Embedchain library will use the token provided above to run inference with Hugging Face models. Next, we must create a YAML file defining the model we want to use from Hugging Face. A YAML file can be considered a simple key-value store where we define the configurations for our LLM application. These configurations can include which LLM model or which embedding model we are going to use (to learn more about the YAML configuration, see the Embedchain documentation). Below is an example YAML file:

config = """
llm:
  provider: huggingface
  config:
    model: 'google/flan-t5-xxl'
    temperature: 0.7
    max_tokens: 1000
    top_p: 0.8


embedder:
  provider: huggingface
  config:
    model: 'sentence-transformers/all-mpnet-base-v2'
"""


with open('huggingface_model.yaml', 'w') as file:
    file.write(config)
  • We are creating a YAML file from Python itself and storing it in a file named huggingface_model.yaml.
  • In this YAML file, we define our model parameters and even the embedding model being used.
  • In the above, we have specified the provider as huggingface and the flan-t5 model with different configurations/parameters, including the temperature of the model, the max_tokens (i.e. the output length), and the top_p value.
  • For the embedding model, we are using a popular embedding model from Hugging Face called all-mpnet-base-v2, which will be responsible for creating embedding vectors for our data.

YAML Configuration

Next, we will create an app with the above YAML configuration file.

from embedchain import Pipeline as App

app = App.from_config(yaml_path="huggingface_model.yaml")
  • Here we import the Pipeline object as an App from embedchain. The Pipeline object is responsible for creating LLM apps from different configurations like the one we defined above.
  • The App will create an LLM with the models specified in the YAML file. We can feed data from different data sources into this app, and call the query method on the same app to query the LLM about the data provided.
  • Now, let’s add some data.
app.add("https://en.wikipedia.org/wiki/Alphabet_Inc.")
  • The app.add() method will take in data and add it to the vector store.
  • Embedchain takes care of collecting the data from the web page, splitting it into chunks, and then creating the embeddings for the data.
  • The data will then be stored in a vector database. The default database used in Embedchain is ChromaDB.
  • In this example, we are adding the Wikipedia page of Alphabet, the parent company of Google, to the App.

Let’s query our App based on the uploaded data:
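The sketch below shows how this looks in code; the two questions are illustrative examples about the Alphabet Inc. page, not the exact queries shown in the article's run.

answer = app.query("Who founded Alphabet Inc.?")  # illustrative question
print(answer)

answer = app.query("When was Alphabet Inc. created?")  # illustrative question
print(answer)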

"

Using the query() method, we asked our App, i.e. the flan-t5 model, two questions related to the data that was added, and the model answered them correctly. In this way, we can add multiple data sources to the model by passing them to the add() method; internally they will be processed, embeddings will be created for them, and they will be added to the vector store. We can then query the data with the query() method.

Configuring App with a Different Model and Vector Store

In the previous example, we saw how to prepare an application that adds a website as the data and the Hugging Face model as the underlying Large Language Model. In this section, we will use a different model and a different vector database to see how flexible Embedchain can be. For this example, we will use Zilliz Cloud as our vector database, hence we need to download the respective Python client as shown below:

!pip install --upgrade embedchain[milvus]

!pip install pytube
  • The above will download the pymilvus Python package, with which we can interact with Zilliz Cloud.
  • The pytube library is needed so that YouTube videos can be converted to text and stored in the Vector Store.
  • Next, we can create a free account with Zilliz Cloud. After creating the free account, go to the Zilliz Cloud dashboard and create a Cluster.

After creating the Cluster we can obtain the credentials to connect to it as shown below:

"

OpenAI API Key

Copy the Public Endpoint and the Token and store them somewhere safe, as they will be needed to connect to the Zilliz Cloud Vector Store. For the Large Language Model, this time we will use the OpenAI GPT model, so we will also need an OpenAI API Key to move forward. After obtaining all the keys, create the environment variables as shown below:

os.environ["OPENAI_API_KEY"]="Your OpenAI API Key"

os.environ["ZILLIZ_CLOUD_TOKEN"]= "Your Zilliz Cloud Token"

os.environ["ZILLIZ_CLOUD_URI"]= "Your Zilliz Cloud Public Endpoint"

The above will store all the required credentials for Zilliz Cloud and OpenAI as environment variables. Now it’s time to define our app, which can be done as follows:

from embedchain.vectordb.zilliz import ZillizVectorDB

app = App(db=ZillizVectorDB())

app.add("https://www.youtube.com/watch?v=ZnEgvGPMRXA")
  • Here we first import the ZillizVectorDB class provided by embedchain.
  • Then, when creating our new app, we pass ZillizVectorDB() to the db parameter of the App() function.
  • As we have not specified any LLM, the default LLM, OpenAI GPT-3.5, is chosen.
  • Now our app is defined with OpenAI as the LLM and Zilliz as the Vector Store.
  • Next, we add a YouTube video to our app using the add() method.
  • Adding a YouTube video is as simple as passing the URL to the add() function; all the video-to-text conversion is abstracted away by Embedchain, thus making it simple.

Zilliz Cloud

Now, the video is first converted to text, then split into chunks and converted into vector embeddings by the OpenAI embedding model. These embeddings are then stored inside Zilliz Cloud. If we go to Zilliz Cloud and check inside our cluster, we can find a new collection named “embedchain_store”, where all the data that we add to our app is stored:

"

A new collection is created under the name “embedchain_store”, and this collection contains the data that we added in the previous step. Now we will query our app.
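As before, querying is a single call; the question below is an illustrative example about the video's topic, not the exact query used in the article.

answer = app.query("What is the video about?")  # illustrative question about the Windows 11 update video
print(answer)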

"

The video that was added to the app is about the new Windows 11 update. When we ask the app a question about something mentioned in the video, it answers correctly. In these two examples, we have seen how to use different Large Language Models and different databases with Embedchain, and we have uploaded data of different types, i.e. a webpage and a YouTube video.

Supported LLMs and Vector Stores by Embedchain

Embedchain has been growing rapidly since its release, adding support for a large variety of Large Language Models and Vector Databases. The supported Large Language Models are listed below:

  • Hugging Face Models
  • OpenAI
  • Azure OpenAI
  • Anthropic
  • Llama2
  • Cohere
  • JinaChat
  • Vertex AI
  • GPT4All

Apart from supporting a wide range of Large Language Models, Embedchain also supports many vector databases, listed below:

  • ChromaDB
  • ElasticSearch
  • OpenSearch
  • Zilliz
  • Pinecone
  • Weaviate
  • Qdrant
  • LanceDB

Apart from these, Embedchain will add support for more Large Language Models and Vector Databases in the future.

Conclusion

While building applications with large language models, the main challenge is dealing with data coming from different data sources. All the data sources eventually need to be converted into a single format before being converted into embeddings, and every data source has its own way of being handled: there are separate libraries for handling videos, others for handling websites, and so on. In this article, we looked at a solution to this challenge, the Embedchain Python package, which does all the heavy lifting for us, allowing us to integrate data from any data source without worrying about the underlying conversion.

Key Takeaways

Some of the key takeaways from this article include:

  • Embedchain supports a large set of Large Language Models, thus allowing us to work with any of them.
  • Also, Embedchain integrates with many popular Vector Stores.
  • A simple add() method can be used to store data of any type in the vector store.
  • Embedchain makes it easier to switch between LLMs and Vector DBs and provides simple methods to add and query the data.

Frequently Asked Questions

Q1. What is Embedchain?

A. Embedchain is a Python tool that allows users to add data of any type and have it stored in a Vector Store, thus allowing us to query it with any supported Large Language Model.

Q2. How do we use different Vector Stores in Embedchain?

A. A vector database of our choice can be given to the app we are developing either through the config.yaml file or directly by passing the database to the “db” parameter of the App() class, as shown in the sketch below.
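For illustration, this is roughly how a vector database section looks in the YAML configuration. The snippet below is a minimal sketch for the default ChromaDB provider; exact field names may differ between Embedchain versions.

vectordb:
  provider: chroma
  config:
    collection_name: 'my-collection'
    dir: db
    allow_reset: true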

Q3. Will the data be persisted locally?

A. Yes. When using a local vector database like ChromaDB, calling the add() method converts the data into vector embeddings and stores them in the vector database, which is persisted locally under the folder “db”.

Q4. Is it necessary to create a config.yaml for working with different Databases / LLMs?

A. No, it is not. We can configure our application by passing the configurations directly to the App() parameters, or instead use a config.yaml to generate an App from the YAML file. A config.yaml file is useful for replicating results or for sharing the configuration of our application with someone else, but it is not mandatory.

Q5. What are the supported data sources by Embedchain?

A. Embedchain supports data coming from different data sources, including CSV, JSON, Notion, mdx files, docx, web pages, YouTube videos, PDFs, and many more. Embedchain abstracts away how each of these data sources is handled, making it easy to add any data, as in the sketch below.
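For example, the data type can also be passed explicitly to the add() method. The calls below are a sketch with placeholder paths, assuming an app has already been created:

app.add("annual_report.pdf", data_type="pdf_file")  # placeholder local PDF path
app.add("meeting_notes.docx", data_type="docx")  # placeholder local Word document
app.add("https://en.wikipedia.org/wiki/Alphabet_Inc.", data_type="web_page")  # explicit web page type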

References

To learn more about Embedchain and its architecture, please refer to the official documentation page and GitHub repository.

  • https://docs.embedchain.ai
  • https://github.com/embedchain/embedchain

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

