Open-weight models are driving the latest excitement in the AI landscape. Running powerful models locally improves privacy, cuts costs, and enables offline use. But the open-source models are far and few! But Google‘s Gemma 4 is here to change that!
This guide walks through what Gemma 4 is, would explores its variants, and outlines the hardware needed for its performance. You’ll also see how to test your setup and build a Second Brain AI project powered by Google’s Gemma 4. We’ll also use Claude Code CLI to streamline development and integrate workflows.
Gemma is Google’s family of open-weight language models, and Gemma 4 marks a significant step forward. It brings stronger reasoning, better efficiency, and broader multimodal support, handling not just text but also images, with some variants extending to audio and video. The models are designed to run locally, making them practical for privacy-sensitive and offline use cases.
Read more: Gemma 4: Hands-On
There are 4 different Gemma 4 variants. These include E2B, E4B, 26B A4B, and 31B. The E2B and E4B are abbreviated to the meaning of effective parameters. These models are appropriate to edge devices. The 26B A4B is based on Mixture-of-Experts (MoE) architecture. The Dense architecture is used in the 31B.
| Model | Effective/Active Params | Total Params | Architecture | Context Window |
|---|---|---|---|---|
| E2B | 2.3B effective | 5.1B with embeddings | Dense + PLE | 128K tokens |
| E4B | 4.5B effective | 8B with embeddings | Dense + PLE | 128K tokens |
| 26B-A4B | 3.8B active | 25.2B total | Mixture-of-Experts (MoE) | 256K tokens |
| 31B | 30.7B active | 30.7B total | Dense Transformer | 256K tokens |
The MoE structure permits effectiveness. Only particular professionals come into play over some task. This renders bigger models to be manageable. The Dense architecture employs all the parameters. All the Gemma 4 variations have their own advantages.
Ollama gives a simple approach. It assists in the easy running of local LLMs. Ollama is user-friendly. Its installation is simple. It manages models efficiently. Ollama 4 is locally available in Gemma 4 with Ollama.
Install Ollama on your PC. Install the application using the Ollama official site. Drag the application to your Applications. Open Ollama from there. It operates in your menu bar.
Then download Gemma 4 models. Open your terminal. Enter the ollama pull command. Indicate the appropriate tags.
gemma4:e2bgemma4:e4bgemma4:26bgemma4:31bThis fetches the model files. You now have Gemma 4 locally with Ollama.

Take into account the hardware of your PC. Gemma 4 variants have varying needs.
Run the model from your terminal. Use the ollama run command.
ollama run gemma4:e2b (Replace e2b with your chosen variant).
The model will load. You can then enter prompts.

Example Prompts:
Observe the response times. The bigger models are slower. It is easy to interact with Gemma 4 locally with Ollama.
We are going to create a Second Brain that is powered by AI. This project provides answers to your local files. It also summarizes documents. This part demonstrates its development. Claude Code CLI will be our coding assistant. Notably, we shall set Claude Code CLI to work with Gemma 4 locally and Ollama as its large language model. This renders our whole development and project local and private.
Claude Code CLI is an agentic coding tool. It operates directly in your terminal. It helps with code generation, debugging, and refactoring.
Installation:
Claude Code CLI works on macOS (10.15+), Linux, and Windows (10+ via WSL/Git Bash). It needs a minimum of 4GB RAM. 8GB or more is better.
For macOS and Linux, the recommended native installer is:
curl -fsSL https://claude.ai/install.sh | bash
For macOS users, Homebrew is an option:
brew install --cask claude-code
Connecting Claude Code CLI to Gemma 4 via Ollama:
After installing Claude Code CLI and pulling your desired gemma4 model with Ollama, you can launch Claude Code CLI, instructing it to use Gemma 4:
ollama launch claude --model gemma4:e4b (Replace e4b with your chosen variant).
This command tells Claude Code CLI to direct its LLM requests to your local Ollama instance, specifically using the gemma4 model you have pulled. No Anthropic API key is needed when operating in this fully local setup.
We use Claude Code CLI to write Python code. This code then interacts with Gemma 4 locally with Ollama.
1. Project Initialization & Structure with Claude Code CLI
Open your terminal. Navigate to your desired project directory. Ensure Claude Code CLI is active using ollama launch claude --model gemma4:26b (Replace e4b with your chosen variant) command.
I am using gemma4:26b locally without any cloud support, let’s see how it goes.

Prompt Claude Code CLI to create the basic structure:
“Generate a Python project structure for a "Second Brain" application. Include directories for 'data', 'scripts', 'vector_store', and a main 'app.py' file.”

It gave me a reply that no structure was created.

2. Document Loader & Chunker Script (using Claude Code CLI)
Now, we need a script to process documents.
“Write a Python script in 'scripts/data_processor.py'. This script should use 'langchain_community.document_loaders' (specifically 'PyMuPDFLoader' for PDFs and 'TextLoader' for TXT) and 'langchain.text_splitter.RecursiveCharacterTextSplitter'. It loads documents from the 'data' directory. Chunks are 1000 characters with 100 overlap. Each chunk must retain its original source and page metadata. Make sure the script handles multiple file types and returns the processed chunks as a list of LangChain 'Document' objects.“

3. Embedding & Vector Store Script (using Claude Code CLI)
Next, we generate embeddings and save them.
“Create a Python script in 'scripts/vector_db_manager.py'. This script should take a list of LangChain 'Document' objects. It generates embeddings using the Ollama embedding model ('OllamaEmbeddings' from 'langchain_community.embeddings', model 'gemma4:e2b'). Then, it persists them into a ChromaDB instance in the 'vector_store' directory. It must also have a function to load an existing ChromaDB.”

4. RAG Query Function (using Claude Code CLI)
Now, for the core question-answering.
“Develop a Python function in 'app.py' called 'query_second_brain(query_text: str)'. This function loads the ChromaDB. It retrieves the top 3 relevant chunks. It then uses 'langchain_openai.ChatOpenAI' (configured for Ollama's API: 'base_url="http://localhost:11434/v1"', 'model="gemma4:e2b"') to answer 'query_text' using the retrieved chunks as context. Use a clear RAG prompt structure. Show the full function.”

Finally, a summarization feature.
“Add a Python function to 'app.py' called 'summarize_document(file_path: str)'. This function should load the document, pass its content to the local Gemma 4 model via Ollama, and return a concise summary. Use a suitable prompt for summarization.”

Through each step, Claude Code CLI, powered by Gemma 4, generated the code. Gemma 4 itself performed the core AI tasks of the project.
NOTE: On running the final code that is python app.py as suggested by Gemma4, I ran into an error.

I tried to fix it, providing the exact error for several iterations but this local model was not able to fix the code, the claude code broke several time by just providing the summary but no changes.

In one prompt it just gave us the code content asked to create a file on your own.
We then decided to switch to the cloud version of gemma4:31b which is available in Ollama cloud in a free tier. Just use this command
ollama launch claude --model gemma4:31b-cloud
It will prompt you to sign in on the browser just do it and you are ready to code.

We gave it a simple prompt
❯ analyse the @second_brain/ project and make a full plan to make the project functional
Then the Gemma 4 31b cloud model analysed the full project, corrected every code. It took almost 7 minutes to do this work but it completed each and every broken code and verified the full working of the app.
On opening the app looks like this.

We uploaded a sample text file and ran the ingestion pipeline.

Now, let’s test the chat feature using the local Gemma4 model and ollama local endpoint for answering:

After numerous iterations, I believe that running models locally and using them for code generation with a popular tool like Claude Code still has a long way to go. While local LLMs running on personal PCs are promising, they face significant constraints regarding hardware requirements, inference latency, and intelligence limitations. Ultimately, to get complex work done efficiently, we had to switch back to cloud-deployed models.
From installation to exploring its different variants, it’s clear this model family is built for practical, real-world use. Running it locally gives you control over data, reduces dependency on external APIs, and opens the door to building faster, more private workflows.
The Second Brain project highlights what’s possible when you combine Gemma 4 with tools like Claude Code CLI. This hybrid setup blends strong reasoning with efficient development, making it easier to build intelligent systems that work in production.
A. When used locally with Ollama, Gemma 4 guarantees privacy of data, lowers API fees, and gives offline access to strong AI services.
A. Gemma 4 variation is the most suitable variant depending on the RAM of your Mac. E2B/E4B suit 8GB+ RAM. 26B/31B variants need 16GB-24GB+ VRAM.
A. A personal knowledge system is an AI-powered Second Brain. It answers questions and summarizes local documents using local LLMs.