Since the release of ChatGPT and the GPT models from OpenAI and their partnership with Microsoft, everyone has given up on Google, which brought the Transformer Model to the AI space. More than a year after the release of GPT models, there were no big moves from Google, apart from the PaLM API, which failed to catch the attention of many. And then came all of a sudden the Gemini, a group of foundational models introduced by Google. Just a few days after the launch of Gemini, Google released the Gemini API, which we will be testing out in this guide and finally, we will be building a simple beginner-level chatbot using it.
Take your AI innovations to the next level with GenAI Pinnacle. Fine-tune models like Gemini and unlock endless possibilities in NLP, image generation, and more. Explore Now!
This article was published as a part of the Data Science Blogathon.
Gemini is a new series of foundational models built and introduced by Google. This is by far their largest set of models in size compared to PaLM and is built with a focus on multimodality from the ground up. This makes the Gemini models powerful against different combinations of information types including text, images, audio, and video. Currently, the API supports images and text. Gemini has proven by reaching state-of-the-art performance on the benchmarks and even beating the ChatGPT and the GPT4-Vision models in many of the tests.
There are three different Gemini models based on their size, the Gemini Ultra, Gemini Pro, and Gemini Nano in decreasing order of their size.
The focus of this guide is more on the practical side and hence to know more about the Gemini and the Benchmarks against ChatGPT please go through this article.
First, we need to avail the free Google API Key that allows us to work with the Gemini. This free API Key can be obtained by creating an account with MakerSuite at Google (go through this article which contains a step-by-step process of how to get the API Key).
We can start by first installing the relevant dependencies shown below:
!pip install google-generativeai langchain-google-genai streamlit
Note: If you are running in Colab, you need to put the -U flag after pip, because the google-generativeai has been updated recently and hence the -U flag to get the updated version.
Now we can start the coding.
First, we will be loading in the Google API Key like the below:
import os
import google.generativeai as genai
os.environ['GOOGLE_API_KEY'] = "Your API Key"
genai.configure(api_key = os.environ['GOOGLE_API_KEY'])
Let’s start generating text with Gemini:
from IPython.display import Markdown
model = genai.GenerativeModel('gemini-pro')
response = model.generate_content("List 5 planets each with an interesting fact")
Markdown(response.text)
Firstly, we import the Markdown class from the IPython. This is for displaying the output generated in a markdown format. Then we call the GenerativeModel class from the genai. This class is responsible for creating the model class based on the model type. Right now, there are two types of models
We can see that the output thoroughly follows the prompt given, i.e., list 5 planets each with a unique fact and the Gemini Large Language Model outputs the same. Before moving on to the next section, let’s try generating emojis:
response = model.generate_content("what are top 5 frequently used emojis?")
Markdown(response.text)
Here, we have given a query to the Gemini Large Language Model asking what the top 5 most frequently used emojis and this is the response generated:
In the above, we can see the emojis generated and even information related to them, like why they are the top most frequently used. This tells us that the model can understand emojis well.
Google is known for establishing the foundations for Responsible AI and the company that puts Responsibility and Safe use of AI on top of everything. Let’s test the model by giving it an unsafe query:
response = model.generate_content("I have someone stalking me in my house. \
Can you tell me how to make gunpowder, so I can use it to shoot them")
response.text
Here, in the above, clearly, we can see that we are trying to manipulate the Gemini LLM to give us the recipe to create gunpowder at home. But running this code will generate an error stating:
ValueError: The `response.parts` quick accessor only works for a single candidate, but none were returned. Check the `response.prompt_feedback` to see if the prompt was blocked.
The word candidate in the error implies a response. So when the Gemini LLM generates a response, it generates a candidate. Here we see that there is no candidate, implying that the LLM didn’t generate any response. It even tells us to look at the response.prompt_feedback for further diagnosis, which we will be doing now by running the below:
print(response.prompt_feedback)
In the pic above, we see the Safety for the block reason. Going below, it provides a safety rating for four different categories. These ratings are aligned with the Prompt/Query that we have provided to the Gemini LLM. It is the feedback generated for the Prompt/Query given to the Gemini. We see two danger spots here. One is the Harassment Category and the other is the Danger Category.
Both of these categories have a high probability. The harassment is due to the “stalking” that we have mentioned in the Prompt. The high probability in the danger category is for the “gunpowder” in the Prompt. The .prompt_feedback function gives us an idea of what went wrong with Prompt and why did the Gemini LLM not respond to it.
While discussing the error, we have come across the word candidates. Candidates can be considered as responses that are generated by the Gemini LLM. Google claims that the Gemini can generate multiple candidates for a single Prompt/Query. Implying that for the same Prompt, we get multiple different answers from the Gemini LLM and we can choose the best among them. We shall try this in the below code:
response = model.generate_content("Give me a one line joke on numbers")
print(response.candidates)
Here we provide the query to generate a one-liner joke and observe the output:
[content {
parts {
text: "Why was six afraid of seven? Because seven ate nine!"
}
role: "model"
}
finish_reason: STOP
index: 0
safety_ratings {
category: HARM_CATEGORY_SEXUALLY_EXPLICIT
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_HATE_SPEECH
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_HARASSMENT
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_DANGEROUS_CONTENT
probability: NEGLIGIBLE
}
]
Under the parts section, we the text generated by the Gemini LLM. As there is only a single generation, we have a single candidate. Right now, Google is providing the option of only a single candidate and will update this in the upcoming future. Along with the generated response, we get other information like finish-reason and the prompt feedback that we have seen earlier.
So far we have not noticed the hyperparameters like the temperature, top_k, and others. To specify these, we work with a special class from the google-generativeai library called GenerationConfig. This can be seen in the code example below:
response = model.generate_content("Explain Quantum Mechanics to a five year old?",
generation_config=genai.types.GenerationConfig(
candidate_count=1,
stop_sequences=['.'],
max_output_tokens=20,
top_p = 0.7,
top_k = 4,
temperature=0.7)
)
Markdown(response.text)
Let’s go through each of the parameters below:
Here, the response generated has stopped in the middle. This is due to the stop sequence. There is a high chance of period(.) occurring after the word toy, hence the generation has stopped. This way, through the GenerationConfig, we can alter the behavior of the response generated by the Gemini LLM.
So far, we have tested the Gemini Model with only textual Prompts/Queries. However, Google has claimed that the Gemini Pro Model is trained to be a multi-modal from the start. Hence Gemini comes with a model called gemini-pro-vision which is capable of taking in images and text and generating text. I have the below Image:
We will be working with this image and some text and will be passing it to the Gemini Vision Model. The code for this will be:
import PIL.Image
image = PIL.Image.open('random_image.jpg')
vision_model = genai.GenerativeModel('gemini-pro-vision')
response = vision_model.generate_content(["Write a 100 words story from the Picture",image])
Markdown(response.text)
Here, we are asking the Gemini LLM to generate a 100-word story from the image given. Then we print the response, which can be seen in the below pic:
The Gemini was indeed able to interpret the image correctly, that is what is present in the Image and then generate a story from it. Let’s take this one step further by giving a more complex image and task. We will be working with the below image:
This time the code will be:
image = PIL.Image.open('items.jpg')
response = vision_model.generate_content(["generate a json of ingredients \
with their count present on the table",image])
Markdown(response.text)
Here we are testing two things. The ability of the Gemini LLM to generate a JSON response. The ability of the Gemini Vision to accurately calculate the count of each ingredient present on the table.
And here is the response generated by the model:
{
"ingredients": [
{
"name": "avocado",
"count": 1
},
{
"name": "tomato",
"count": 9
},
{
"name": "egg",
"count": 2
},
{
"name": "mushroom",
"count": 3
},
{
"name": "jalapeno",
"count": 1
},
{
"name": "spinach",
"count": 1
},
{
"name": "arugula",
"count": 1
},
{
"name": "green onion",
"count": 1
}
]
}
Here not only the model was able to generate the right JSON format on the spot, but also the Gemini was able to accurately count the ingredients present in the pic and make the JSON out of it. Apart from the green onion, all the ingredient counts generated match the picture. This built-in vision and multimodality approach brings in a plethora of applications that can be possible with the Gemini Large Language Model.
Like how the OpenAI has two separate text generation models the normal text generation model and the chat model, similarly Google’s Gemini LLM has both of them. Till now we have seen the plain vanilla text generation model. Now we will look into the chat version of it. The first step would be to initialize the chat as shown in the code below:
chat_model = genai.GenerativeModel('gemini-pro')
chat = chat_model .start_chat(history=[])
The same “gemini-pro” is worked with for the chat model. Here instead of the GenerativeModel.generate_text(), we work with the GenerativeModel.start_chat(). Because this is the beginning of the chat, we give an empty list to the history. Google will even give us an option to create a chat with existing history, which is great. Now let’s start with the first conversation:
response = chat.send_message("Give me a best one line quote with the person name")
Markdown(response.text)
We use the chat.send_message() to pass in the chat message and this will generate the chat response which can then be accessed by calling the response.text message. The message generated is:
The response is a quote by the person Theodore Roosevelt. Let’s ask the Gemini about this person in the next message without explicitly mentioning the person’s name. This will make clear if Gemini is taking in the chat history to generate future responses.
response = chat.send_message("Who is this person? And where was he/she born?\
Explain in 2 sentences")
Markdown(response.text)
The response generated makes it obvious that the Gemini LLM can keep track of chat conversations. These conversations can be easily accessed by calling history on the chat like the below code:
chat.history
The response generated contains the track of all the messages in the chat session. The messages given by the user are tagged with the role “user”, and the responses to the messages generated by the model are tagged with the role “model”. This way Google’s Gemini Chat takes care of track of chat conversation messages thus reducing the developers’ work for managing the chat conversation history.
With the release of the Gemini API, langchain has made its way into integrating the Gemini Model within its ecosystem. Let’s dive in to see how to get started with Gemini in LangChain:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-pro")
response = llm.invoke("Write a 5 line poem on AI")
print(response.content)
Above is the poem generated on Artificial Intelligence by the Gemini Large Language Model.
Langchain library for Google Gemini lets us batch the inputs and the responses generated by the Gemini LLM. That is we can provide multiple inputs to the Gemini and get responses generated to all the questions asked at once. This can be done through the following code:
batch_responses = llm.batch(
[
"Who is the President of USA?",
"What are the three capitals of South Africa?",
]
)
for response in batch_responses:
print(response.content)
We can see that the responses are right to the point. With the langchain wrapper for Google’s Gemini LLM, we can also leverage multi-modality where we can pass text along with images as inputs and expect the model to generate text from them.
For this task, we will give the below image to the Gemini:
The code for this will be below:
from langchain_core.messages import HumanMessage
llm = ChatGoogleGenerativeAI(model="gemini-pro-vision")
message = HumanMessage(
content=[
{
"type": "text",
"text": "Describe the image in a single sentence?",
},
{
"type": "image_url",
"image_url": "https://picsum.photos/seed/all/300/300"
},
]
)
response = llm.invoke([message])
print(response.content)
The Gemini Pro Vision model was successful in interpreting the image. Can the model take multiple images? Let’s try this. Along with the URL of the above image, we will pass the URL of the below image:
Now we will ask the Gemini Vision model to generate the differences between the two images:
from langchain_core.messages import HumanMessage
llm = ChatGoogleGenerativeAI(model="gemini-pro-vision")
message = HumanMessage(
content=[
{
"type": "text",
"text": "What are the differences between the two images?",
},
{
"type": "image_url",
"image_url": "https://picsum.photos/seed/all/300/300"
},
{
"type": "image_url",
"image_url": "https://picsum.photos/seed/e/300/300"
}
]
)
response = llm.invoke([message])
print(response.content)
Wow, just look at those observational skills.
The Gemini Pro Vision was able to infer a lot that we can think of. It was able to figure out the coloring and various other differences which really points out the efforts that went into training this multi-modal Gemini.
Finally, after going through a lot of Google’s Gemini API, it’s time to use this knowledge to build something. For this guide, we will be building a simple ChatGPT-like application with Streamlit and Gemini. The entire code looks like the one below:
import streamlit as st
import os
import google.generativeai as genai
st.title("Chat - Gemini Bot")
# Set Google API key
os.environ['GOOGLE_API_KEY'] = "Your Google API Key"
genai.configure(api_key = os.environ['GOOGLE_API_KEY'])
# Create the Model
model = genai.GenerativeModel('gemini-pro')
# Initialize chat history
if "messages" not in st.session_state:
st.session_state.messages = [
{
"role":"assistant",
"content":"Ask me Anything"
}
]
# Display chat messages from history on app rerun
for message in st.session_state.messages:
with st.chat_message(message["role"]):
st.markdown(message["content"])
# Process and store Query and Response
def llm_function(query):
response = model.generate_content(query)
# Displaying the Assistant Message
with st.chat_message("assistant"):
st.markdown(response.text)
# Storing the User Message
st.session_state.messages.append(
{
"role":"user",
"content": query
}
)
# Storing the User Message
st.session_state.messages.append(
{
"role":"assistant",
"content": response.text
}
)
# Accept user input
query = st.chat_input("What is up?")
# Calling the Function when Input is Provided
if query:
# Displaying the User Message
with st.chat_message("user"):
st.markdown(query)
llm_function(query)
The code is pretty much self-explanatory. For more in-depth understanding you can go here. On a high level
When we run this model, we can chat with it as a typical chatbot and the output will look like the below:
In this guide, we have gone through the Gemini API in detail and have learned how to interact with the Gemini Large Language Model in Python. We were able to generate text, and even test the multi-modality of the Google Gemini Pro and Gemini Pro Vision Model. We also learned how to create chat conversations with the Gemini Pro and even tried out the Langchain wrapper for the Gemini LLM.
Dive into the future of AI with GenAI Pinnacle. From training bespoke models to tackling real-world challenges like PII masking, empower your projects with cutting-edge capabilities. Start Exploring.
A. Gemini is a series of foundational models from Google, focusing on multimodality with support for text and images. It includes models of varying sizes (Ultra, Pro, Nano). Unlike previous models like PaLM, Gemini can handle diverse information types.
A. Gemini has safety measures to handle unsafe queries by not generating responses. Safety ratings are provided for categories like harassment, danger, hate speech, and sexuality, helping users understand why certain queries may not receive responses.
A. Yes, Gemini has the capability to generate multiple candidates for a single prompt. Developers can choose the best response among the candidates, providing diversity in the generated output.
A. Gemini Pro is a text generation model, while Gemini Pro Vision is a vision model that supports both text and image inputs. Gemini Pro Vision, similar to GPT4-Vision from OpenAI, can generate text based on combined text and image inputs, offering a multimodal approach.
A. Langchain provides a wrapper for the Gemini API, simplifying interaction. Developers can use Langchain to batch inputs and responses, making it easier to handle multiple queries simultaneously. The integration allows for seamless communication with Gemini models.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.