This is How Andrej Karpathy Uses LLMs

Shaik Hamzah Shareef Last Updated : 09 Mar, 2025
22 min read

In his latest video, “How I use LLMs: Andrej Karpathy,” the renowned AI expert pulls back the curtain on the evolving world of LLMs. Serving as a follow-up to his earlier video “Deep Dive into LLMs” from the General Audience Playlist on his YouTube channel, this presentation explores how the initial text-based chat interface released by OpenAI sparked a revolution in AI interaction. Karpathy explains how the ecosystem has rapidly transformed from a simple text-based system into a rich, multi-modal experience, integrating advanced tools and functionalities. This article is inspired by his technical demonstrations, advanced tool integrations, and personal insights, offering readers an in-depth look at the future of AI.

Evolving Landscape of LLMs

Karpathy begins by mapping out the rapidly expanding ecosystem of LLMs. While ChatGPT remains the pioneering force, he highlights emerging competitors such as Gemini, Copilot, Claude, Grok, and even international players like DeepSeek and LeChat. Each model offers unique features, pricing tiers, and experiences.

“ChatGPT is like the original gangster of conversational AI, but the ecosystem has grown into a diverse playground of experimentation and specialization,” he explains.

Later in the video, Karpathy also provided some links where you can compare and analyze the performance of these models:

Using these links, you can keep track of the various models that are currently publicly available.

Beyond Text: Embracing Multi-Modality

Let us now explore multi-modality in detail below:

Text Generation

When it comes to generating text, models like ChatGPT truly excel, especially in creative tasks such as writing haikus, poems, cover letters, resumes, and even email replies. As Karpathy puts it, our interactions with these models appear as lively “chat bubbles” that encapsulate a dynamic conversation between you and the AI.

Breaking Down the Magic: Tokens and Context

Every time you input a query, the model dissects your text into smaller building blocks called tokens. You can explore this process yourself using tools like OpenAI’s Tokenizer or Tiktokenizer. These tokens form a sequential stream, often referred to as the token sequence or context window, which acts as the AI’s working memory.

Under the hood, additional structure is woven into both the input and output sequences. Special tokens mark where the user’s turn ends and where the assistant’s reply begins, which helps the model keep track of who said what across the conversation.

Tokenization Algorithms and Special Tokens

Modern language models typically use Byte-Pair Encoding (BPE) to split words into subwords. For instance, the word “university” might be broken down into “uni”, “vers”, and “ity.” This process ensures that even rare or complex words are represented in a way that the model can process efficiently.
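To see this in practice, here is a minimal sketch (not from Karpathy’s video) using OpenAI’s open-source tiktoken library, the same machinery behind the tokenizer tools mentioned above. The exact subword split depends on the vocabulary, so common words may come back as a single token:

```python
import tiktoken

# Load the BPE encoding used by GPT-4-class models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Transformers are eating the world!")
print(tokens)                              # list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the subword pieces those IDs map back to
```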

Some important special tokens include:

  • <|endoftext|>: Marks the end of a sequence.
  • <|user|> and <|assistant|>: Distinguish between user input and the AI’s output.

Karpathy illustrated this beautifully with a diagram showing how a fresh chat begins with an empty token stream. Once you type your query, the model takes over, appending its own stream of tokens. This continuous flow, known as the context window, represents the working memory that guides the AI’s response.
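As a rough illustration of how such a token stream is assembled, here is a hedged sketch; the special-token names are placeholders, since every model family defines its own chat format:

```python
# Illustrative only: real chat templates differ per model family, and the special-token
# names below are stand-ins rather than any provider's exact format.
def render_chat(messages):
    stream = ""
    for role, text in messages:
        stream += f"<|{role}|>{text}"
    return stream + "<|endoftext|>"

chat = [
    ("user", "Write a haiku about autumn."),
    ("assistant", "Crisp leaves drift earthward / ..."),
]
# This single string is what the tokenizer turns into the context window.
print(render_chat(chat))
```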

Two Pillars of AI Training: Pre-Training and Post-Training


“I like to think of the model as a one-terabyte zip file: it’s full of compressed knowledge from the internet, but it’s the human touch in post-training that gives it a soul,” he explains.

Transformer Architecture

At the heart of LLMs lies the Transformer architecture. Key elements include:

  • Self-Attention Mechanism: This mechanism allows the model to weigh the importance of different tokens in a sequence. It calculates attention scores so that the model can focus on relevant parts of the input while generating responses.
  • Positional Encoding: Since transformers lack inherent sequential information, positional encodings are added to tokens to preserve the order of words.
  • Feed-Forward Networks and Layer Normalization: These components help process the attention outputs and stabilize training.
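To make the self-attention bullet above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (no masking or multi-head splitting, purely illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # project each token into query/key/value
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how relevant every token is to every other
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # attention weights: each row sums to 1
    return weights @ V                               # each output mixes values by relevance

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                              # 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (4, 8): one contextualized vector per token
```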

To really grasp how these models generate text, it’s crucial to understand the two major phases of their training:

Pre-Training: Compressing the Internet into Parameters

In this phase, the model processes vast amounts of data from books and websites to code repositories and academic papers. Think of it as compressing the world’s knowledge into a “zip file” of parameters:

  • Data Scale and Sources: Models like GPT-4 digest trillions of tokens, equivalent to millions of books or billions of web pages.
  • Transformer Architecture: These networks learn relationships between words by processing tokens in sequence.
  • Parameter Compression: The knowledge is stored in neural network parameters, acting as a “lossy zip file”. This means that while the model retains general knowledge, some niche details might be omitted.
  • Probabilistic Nature: Since the model predicts the next token based on likelihoods, it sometimes generates outputs that aren’t entirely accurate, commonly referred to as hallucinations.
  • Cost and Limitations: Pre-training is extremely expensive, taking months of computation and costing tens of millions of dollars. This process also leads to knowledge cutoffs, meaning the model’s information is only as current as its last training update.

Post-Training: Specializing for Human Interaction

After pre-training, the model undergoes post-training (or supervised fine-tuning) where it learns to interact with humans:

  • Human-Labeled Data: Conversations are fine-tuned with curated examples where prompts are paired with ideal responses.
  • Persona Adoption: The model learns to adopt specific roles, be it a teacher, assistant, or customer support agent, making its interactions more natural. In addition to memory, users can set custom instructions to adjust the AI’s tone, style, and level of formality. This feature is especially useful for tasks like language learning or content creation, where consistency in voice is essential.
  • Task Specialization: Enhanced performance in areas like Q&A, code generation, and creative writing is achieved through targeted training.
  • Reducing Hallucinations: Although not entirely eliminated, post-training helps to reinforce factual accuracy.

Karpathy also pointed out that as our conversations with these models grow longer, it’s often beneficial to start a new chat when switching topics. This resets the context window, ensuring that the model’s responses remain accurate and efficient.

Model Selection: Finding the Right Balance

When choosing a model, it’s essential to consider the trade-offs between cost and performance:

  • Free Tiers: Offer basic capabilities suited for simple tasks like drafting emails or creative writing.
  • Paid Tiers: Provide advanced features, including broader knowledge, faster inference, and access to tools like internet search and code execution. For instance, a developer debugging complex code might opt for GPT-4 despite the higher cost, while a student summarizing a textbook chapter might find a free-tier model sufficient.

An interesting personal tip comes from experimenting with multiple models. For example, when asking Gemini for a cool city recommendation, I got Zermatt as an answer, a suggestion I found quite appealing. Gemini’s interface includes a model selector in the top left, which allows you to upgrade to more advanced tiers for improved performance. The same applies to Grok: instead of relying on Grok 2, I prefer to use Grok 3 since it’s the most advanced version available. In fact, I often pay for several models and ask them the same question, treating them as my personal “LLM council.” This way, I can compare responses and decide which model best suits my needs, whether I’m planning a vacation or tackling a technical problem.

The key takeaway is to experiment with different providers and pricing tiers for the specific challenges you’re working on. By doing so, you can find the model that fits your workflow best and even leverage multiple models to get a well-rounded perspective.

Decoding and Sampling Techniques

When generating text, the model doesn’t simply choose the highest-probability token every time. Instead, it uses various decoding strategies:

  • Nucleus Sampling (Top-p Sampling): The model selects from a subset of tokens whose cumulative probability meets a threshold.
  • Top-k Sampling: Limits the selection to the top k most likely tokens.
  • Beam Search: Explores multiple possible token sequences in parallel to find the most coherent output.
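As a toy illustration of the first two strategies, here is a short sketch of top-k and nucleus filtering over a made-up next-token distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["the", "cat", "sat", "flew", "zzz"])
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])    # toy next-token probabilities from the model

def top_k(probs, k):
    idx = np.argsort(probs)[::-1][:k]               # keep only the k most likely tokens
    return idx, probs[idx] / probs[idx].sum()       # renormalize over the kept tokens

def top_p(probs, p):
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]     # smallest prefix whose total mass reaches p
    return keep, probs[keep] / probs[keep].sum()

idx, renorm = top_p(probs, 0.8)
print(vocab[idx], renorm)                           # candidate tokens after nucleus filtering
print(rng.choice(vocab[idx], p=renorm))             # sample one next token from the nucleus
```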

Enhancing Functionality with External Tools

Modern LLMs don’t just generate text; they can also integrate external tools to boost their capabilities:

  • Internet Search: Fetches up-to-date information to overcome knowledge cutoffs.

“When I read The Wealth of Nations, the model helps me understand the nuances by summarizing chapters and answering my clarifying questions. It’s like having a knowledgeable study partner,” he remarks.

  • Python Interpreter: Executes code for calculations, data analysis, and visualizations. He shows how this capability can be used to plot trends, such as extrapolating company valuations over time, while also cautioning users to verify any assumptions made by the AI in its generated code.

“When a multiplication problem becomes too tricky to solve in your head, the model simply writes a Python script and runs it. It’s like having a junior data analyst at your fingertips,” Karpathy explains.

  • File Uploads: Allows for the processing of documents like PDFs or spreadsheets, enabling detailed summaries and data extraction.

Image Generation and Video Integration

Karpathy demonstrates that LLMs are evolving beyond text. He shows how images are generated by coupling a captioning system with a dedicated image-generation model (such as ideogram.ai) to create visuals on demand. This technique, he notes, “stitches up” two separate models so that the user experience remains seamless even when the underlying processes are distinct.

“The image output isn’t done fully in the model. It’s like a beautiful collaboration between text-to-image captioning and a separate image generator,” he remarks.

Additionally, Karpathy introduces video capabilities where the model “sees” via a camera feed. In one demonstration, he points the camera at everyday objects, a book cover and a detailed map, and the model correctly identifies and comments on each item. All of this is covered in more detail later in the article.

Voice and Audio Capabilities

Voice interaction is a major highlight of the video. Karpathy explains that on mobile devices, users can simply speak to the model, which then converts audio to text for processing. Beyond simple transcription, advanced modes allow the model to generate audio responses in various “personas” from Yoda’s wise cadence to a gruff pirate accent.

“Don’t type stuff out, use your voice. It’s super fast and sometimes even more fun when the AI speaks back to you in a characterful tone,” he said.

He further differentiates between “fake audio” (where voice is converted to text and back) and “true audio,” which tokenizes audio natively. True audio processing represents a leap forward by eliminating intermediary steps, making interactions more fluid and natural. This, too, is covered in more detail later in the article.

Everyday Interactions and Practical Problem Solving

Karpathy shares several practical examples from calculating caffeine content in a beverage to interactive troubleshooting of code. These everyday use cases highlight how seamlessly integrated AI tools can enhance productivity and decision-making in daily life.

“I once asked ChatGPT about how much caffeine is in a shot of Americano. It quickly recalled that it’s roughly 63 milligrams, a simple yet powerful example of everyday AI assistance,” he explains.

Advanced Data Analysis and Visualization

Beyond everyday tasks, the integration of a Python interpreter transforms the AI into a competent data analyst. Whether it’s generating trend lines from financial data or debugging complex code, these capabilities offer tremendous value for both professionals and hobbyists.

“Imagine having a junior data analyst who not only writes code for you but also visualizes data trends in real time. That’s the power of integrated tool use,” Karpathy asserts.

Thinking Models: When to Let the AI “Ponder”

One of the most fascinating advancements in modern LLMs is the emergence of “thinking models.” These models are designed to tackle complex problems by effectively “thinking out loud” much like a human solving a tough puzzle.

The Training Journey: From Pre-Training to Reinforcement Learning

Karpathy explains that the development of LLMs involves multiple stages:

  • Pre-Training: The model ingests vast amounts of data from the internet, learning to predict the next token in a sequence.
  • Supervised Fine-Tuning: Human-curated conversations help shape the model’s responses into a more interactive, friendly dialogue.
  • Reinforcement Learning (RL): Here’s where it gets really interesting. The model practices on a large collection of problems ranging from math puzzles to coding challenges that resemble textbook exercises. Through this practice, it begins to discover effective “thinking strategies.” These strategies mimic an inner monologue, where the model explores different ideas, backtracks, and revisits its assumptions to arrive at a solution.

Discovering the “Thinking” Process

The reinforcement learning stage is relatively recent, emerging only in the past couple of years and is viewed as a breakthrough. It’s the stage where the model learns to “think” before delivering an answer. Instead of rushing to the final token, a thinking model may generate a series of internal reasoning steps that guide it toward a more accurate solution.

DeepSeek was the first to publicly discuss this concept, presenting a paper on incentivizing reasoning capabilities in LLMs via reinforcement learning, a paper Karpathy explored in his earlier video. This breakthrough in RL allows models to refine their internal reasoning, a process that was previously too difficult to hard-code by human labelers.

Concrete Example

Here’s a concrete example from Karpathy’s own experience:

He was once stuck on a programming problem involving a gradient check failure in an optimization of a multi-layer perceptron. He copied and pasted the code and asked for advice. Initially, GPT-4o, the flagship model from OpenAI, responded without extended thinking. It listed several potential issues and debugging tips, but none of these suggestions pinpointed the core problem. The model merely offered general advice rather than solving the issue.

He later switched to one of OpenAI’s thinking models available through the model dropdown. OpenAI’s thinking models, which include variants labeled o1, o3-mini, o3-mini-high, and o1 pro (the latter being the most advanced and available to premium subscribers), are tuned with reinforcement learning. When he asked the same question, the thinking model took its time, emitting a detailed sequence of internal reasoning steps (summaries of its “thought process”). After about a minute of deliberation, it pinpointed that the parameters were mismatched during packing and unpacking, and the correct solution emerged from that series of reflective steps.

You can read more about the reasoning model o3 here.

The LLM Council

He doesn’t rely on just one model. He often asks the same question across multiple models, treating them as his personal “LLM council.” For instance, while one model might solve a problem quickly with a standard response, another, more advanced thinking model may take a few extra minutes but deliver a highly accurate, well-reasoned answer. This approach is especially useful for tasks like complex math problems or intricate code debugging.

He also experimented with other models:

  • Claude: When he gave Claude the same prompt, it correctly identified the issue and solved it, albeit using a different approach from the other models.
  • Gemini: Gemini delivered the correct answer too, sometimes without needing any extra “thinking” time.
  • Grok 3: Grok 3 also provided a solid solution after a period of internal “pondering” over the problem.
  • Perplexity.ai (DeepSeek R1): This model even reveals snippets of its internal reasoning (raw thoughts) if you expand them, offering a window into its problem-solving process.

For everyday queries like travel recommendations, a non-thinking model might be preferable for its speed. However, for deep, technical, or critical tasks, switching to a thinking model can significantly improve accuracy and performance.

When to Use Thinking Models

Thinking models are most beneficial for challenging tasks:

  • Complex Math Problems: When simple arithmetic isn’t enough.
  • Intricate Code Debugging: For cases where subtle issues might be hidden in layers of logic.
  • Deep Reasoning Tasks: Problems that require a series of thought processes to reach the correct answer.

In short: things that are very simple might not actually benefit from extended thinking, but problems that are genuinely deep and hard can benefit a lot.

For everyday queries like travel recommendations or quick fact-checks, a standard, non-thinking model might be preferable due to its faster response times. However, if accuracy is paramount and the problem is inherently complex, switching to a thinking model is well worth the extra wait.

Tool Use: Internet Search and Deep Research

Modern LLMs overcome static knowledge limitations by integrating with external tools:

Internet Search: Accessing Real-Time Information

Up to this point, our interaction with LLMs has been limited to text generated from the “zip file” of pre-trained data. However, real-world applications demand that these models access fresh, up-to-date information. That’s where internet search comes in.

While traditional LLM interactions rely solely on pre-trained knowledge (a “zip file” of static data), the integration of internet search transforms these models into dynamic information hubs. Instead of manually sifting through search results and dodging distracting ads, the model can now actively retrieve up-to-date information, integrate it into its working memory, and answer your queries accurately.

How It Works

  • Triggering a Search: When the model recognizes that your query involves recent or evolving information, it emits a special search token. This signals the application to halt normal token generation and launch a web search.
  • Executing the Search: The model-generated query is used to search the internet. The system visits multiple webpages, extracts relevant text, and compiles the information.
  • Integrating Results: The retrieved content is then injected into the model’s context window, its working memory, so the AI can provide an answer enriched with real-time data and proper citations.

For instance, if you ask, “When are new episodes of White Lotus Season 3 coming out?” the model will detect that this information isn’t in its pre-trained data. It will then search the web, load the resulting articles into the context, and provide you with the latest schedule along with links for verification.
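The exact mechanics are provider-specific, but the overall pattern can be sketched roughly as follows; the `<|search|>` marker and both helper functions are hypothetical stand-ins for the real tool-calling machinery:

```python
# A rough sketch of the search-tool loop described above. Real products wire this up
# through their own tool-call APIs rather than a literal string prefix.
def answer_with_search(question, llm_generate, web_search):
    draft = llm_generate(question)                        # model may decide it needs fresh data
    if draft.startswith("<|search|>"):
        query = draft.removeprefix("<|search|>").strip()  # the model-written search query
        pages = web_search(query)                         # fetch and extract text from results
        context = "\n\n".join(pages)
        # Re-run generation with the retrieved pages injected into the context window.
        return llm_generate(f"Context:\n{context}\n\nQuestion: {question}")
    return draft                                          # no search needed: answer directly
```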

Model-Specific Behaviors

Different models have varying levels of internet search integration:

  • Claude: At the time of Karpathy’s demonstration, Claude didn’t support integrated web search. It relies solely on its training data (with a knowledge cutoff around April 2024), so it will simply state that it doesn’t know.
  • Gemini: Gemini 2.0 Pro Experimental, for example, may not have full access to real-time info, whereas a variant like Gemini 2.0 Flash shows sources and related content, indicating a built-in search tool.
  • ChatGPT: In some instances, ChatGPT will automatically detect when a search is needed; in other cases, you may need to explicitly select the “search the web” option.
  • Perplexity.ai: Known for its robust search integration, Perplexity often retrieves and displays real-time data along with citations, making it a popular choice for queries that resemble Google searches.

Real-World Use Cases

I frequently use the internet search tool for various types of queries:

  • Current Events and Trends: For instance, when checking whether the market is open on Presidents’ Day, Perplexity quickly confirms that it’s closed.
  • Niche Information: Questions like “Where was White Lotus Season 3 filmed?” or “Does Vercel offer PostgreSQL?” benefit from the latest online data.
  • Dynamic Updates: Inquiries about the Apple launch, stock movements (e.g., “Why is the Palantir stock going up?”), or even specifics like “What toothpaste does Bryan Johnson use?” are all well-suited for search tools, as these details can change over time.
  • Trending Topics: When I see buzz on Twitter about USAID or the latest travel advisories, a quick search gives me a digest of the current context without having to manually click through multiple links.

Practical Tips

  • Be Explicit: Sometimes, it helps to prompt the model directly by saying “Search the web for…” to ensure it retrieves real-time data.
  • Cross-Verify: Always check the provided citations to confirm the accuracy of the information.
  • Model Selection: Not every model is equipped with internet search. Depending on your needs, choose one that supports real-time data (e.g., ChatGPT with the search option or Perplexity.ai) or be prepared to switch between models to get a comprehensive answer.

Deep Research: Comprehensive Reports via Integrated Search and Reasoning

Deep research empowers LLMs to go beyond superficial answers by combining extensive internet searches with advanced reasoning. This process allows the model to gather, process, and synthesize information from a wide array of sources almost as if it were generating a custom research paper on any topic.

How It Works

When you activate deep research (typically a feature available on higher-tier subscriptions, such as $200/month), the model embarks on an extended process:

  • Initiation: You provide a detailed prompt. For example, consider this prompt:
    “CAAKG is one of the health actives in Bryan Johnson’s Blueprint at 2.5 grams per serving. Can you do research on CAAKG? Tell me about why it might be found in the longevity mix, its possible efficacy in humans or animal models, potential mechanisms of action, and any concerns or toxicity issues.”
  • Clarifying Questions: Before diving into research, the model may ask for clarifications such as whether to focus on human clinical studies, animal models, or both to fine-tune its search strategy.
  • Multi-Source Querying: The model then issues multiple internet search queries. It scans academic papers, clinical studies, and reputable web pages, accumulating the text from numerous sources. These documents are then inserted into its context window, a giant working memory that holds thousands of tokens.
  • Synthesis: Once the research phase is complete (which can take around 10 minutes for complex queries), the model synthesizes the gathered data into a coherent report. It generates detailed summaries, includes citations for verification, and even highlights key points such as proposed mechanisms of action, efficacy studies in various models (worms, drosophila, mice, and ongoing human trials), and potential safety concerns.

Technical Aspects

  • Iterative Searching: Deep research leverages iterative internet searches and internal “thinking” steps. The model uses reinforcement learning strategies to decide which sources are most relevant and how to weave them into a structured response.
  • Context Accumulation: As the model retrieves information, each document’s content is added to the context window. This massive repository of tokens allows the model to reference multiple sources simultaneously.
  • Citation Integration: The final report comes with citations, enabling you to verify each piece of information. This is crucial given that the model’s outputs are probabilistic and can sometimes include hallucinations or inaccuracies.
  • Chain-of-Thought Processing: Throughout the process, the model may reveal snippets of its internal reasoning (if you expand them), offering insight into how it connected different pieces of data to form its conclusions.

Examples in Practice

  • Supplement Research: In the example prompt above about CAAKG, the model processes dozens of research articles, clinical studies, and review papers. It then produces a detailed report outlining:
    • Why CAAKG might be included in the longevity mix.
    • Its efficacy as demonstrated in both human and animal models.
    • Proposed mechanisms of action.
    • Any potential concerns or toxicity issues.
  • Industry Comparisons: He also used deep research for broader comparisons, such as researching life-extension experiments in mice. The model provided an extensive overview, discussing various longevity experiments while compiling data from multiple sources.
  • LLM Lab Analysis: In another use case, he asked for a table comparing LLM labs in the USA, including funding levels and company size. Although the resulting table was hit-or-miss (with some omissions like xAI and unexpected inclusions like Hugging Face), it still provided a valuable starting point for further inquiry.

Practical Considerations

  • First Draft, Not Final: Always treat the deep research output as a first draft. Use the provided citations as a guide for further reading and follow-up questions.
  • Varying Quality: Different platforms offer deep research with varying levels of depth. For instance, his experience shows that ChatGPT’s offering is currently the most thorough, while Perplexity.ai and Grok provide briefer summaries.
  • Extended Processing Time: Be prepared for long processing times (sometimes 10 minutes or more) as the model gathers and synthesizes large amounts of data.

File Uploads for Documents and Multimedia

File uploads empower LLMs to extend their context by integrating external documents and multimedia files directly into their working memory. For example, if you’re curious about a recent paper from the Arc Institute on a language model trained on DNA, you can simply drag and drop the PDF, even one as large as 30 MB, into the model’s interface. Typically, the model converts the document into text tokens, often discarding non-text elements like images. Once in the token window, you can ask for a summary, pose detailed questions, or dive into specific sections of the document. This makes it possible to “read” a paper together with the AI and explore its content interactively.
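Conceptually, the upload step boils down to extracting the document’s text and pasting it into the prompt. Here is a hedged sketch using the third-party pypdf package; the file name and truncation limit are placeholders:

```python
from pypdf import PdfReader   # pip install pypdf

def pdf_to_prompt(path, question, max_chars=100_000):
    reader = PdfReader(path)
    # Keep only the extractable text; images and figures are dropped, as described above.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Crude truncation so the document fits in the model's context window.
    return f"{text[:max_chars]}\n\nBased on the document above: {question}"

prompt = pdf_to_prompt("paper.pdf", "Summarize the methodology used in this study.")
```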

“Uploading a document is like handing the AI your personal library. It can then sift through the information and help you understand the finer details, exactly what you need when tackling complex research papers,” Karpathy said during his talk.

Real-World Examples and Use Cases

Consider the scenario where you’re reviewing a groundbreaking study on genomic sequence analysis. By uploading the PDF directly into the system, you can ask the model, “Can you summarize the methodology used in this study?” The model will convert the paper into tokens, process the key sections, and provide you with a coherent summary, complete with citations. This approach is not limited to academic papers; it also works with product manuals, legal documents, and even lengthy reports like blood test results.

For instance, I recently uploaded my 20‑page blood test report. The model transcribed the results, enabling me to ask, “What do these cholesterol levels indicate about my health?” This two-step process, first verifying the transcription accuracy and then asking detailed questions, ensures that the insights are as reliable as possible.

Python Interpreter: Dynamic Code Execution and Data Analysis

Modern LLMs now incorporate an integrated Python interpreter, transforming them into dynamic, interactive coding assistants. This feature enables the model to generate, execute, and even debug Python code in real time, acting as a “junior data analyst” right within your conversation.

“The Python interpreter integration is a game-changer. Instead of switching between a chat window and your IDE, you get your code, its output, and even visual plots all in one seamless experience,” Karpathy explained during a demonstration.

How It Works in Practice

When you pose a complex problem, say, debugging a multi-layer perceptron where the gradient check is failing, the model can automatically produce Python code to diagnose the issue. For example, you might ask, “Can you help me debug this gradient check failure?” The model generates code that simulates the error scenario, executes it, and then returns detailed output, such as error messages and variable states, directly within the chat.

In another case, I needed to plot sales trends for a project. I simply requested, “Generate a plot of the sales data for 2023,” and the model wrote and executed the necessary Python script. The resulting graph was immediately displayed, complete with annotations and trends, saving me the hassle of manual coding.
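For a sense of what the interpreter produces behind the scenes, here is the kind of short matplotlib script it might write for that request; the monthly figures are made-up placeholder data:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 160, 155, 175]        # hypothetical monthly sales, in thousands of units

plt.plot(months, sales, marker="o")           # simple trend line with markers per month
plt.title("Sales trend, 2023 (placeholder data)")
plt.xlabel("Month")
plt.ylabel("Sales (thousands of units)")
plt.grid(True)
plt.show()
```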


Extended Use Cases

  • Data Visualization: Beyond basic plots, the interpreter can generate complex visualizations like heatmaps, scatter plots, or time series graphs based on your data. This is particularly useful for quick data analysis without leaving the chat interface.
  • Algorithm Testing: If you’re experimenting with machine learning models, you can have the interpreter run simulations and even compare different model performances side-by-side.
  • Debugging Assistance: When dealing with intricate code bugs, the model’s step-by-step execution helps pinpoint issues that might be hard to spot in a large codebase.

Custom Visual and Code Tools: Claude Artifacts and Cursor Composer

Modern LLMs have evolved to be more than text generators; they’re now creative studios. With Claude Artifacts, you can build custom mini-apps or generate interactive diagrams. For instance, imagine needing a flowchart for a complex project. With a few clear prompts, Claude Artifacts can produce a diagram that visually organizes your ideas. As Karpathy noted,
“Claude Artifacts doesn’t just give you plain text it gives you interactive visuals that bring your concepts to life.”

Adam Smith flashcards generated with Claude Artifacts

Alongside this, Cursor: Composer serves as your real-time coding assistant. Whether you’re writing new code or debugging an error, Cursor: Composer can generate, edit, and even visualize code snippets. For example, when I was prototyping a new web application, I simply typed,
“Generate a responsive layout in React,”
and the tool not only produced the code but also highlighted how different components interacted. This seamless integration speeds up development while helping you understand the underlying logic step by step.

If you want to read more about Cursor AI, read this.

Audio Interactions and NotebookLM Podcast Generation

The audio features in modern LLMs significantly enhance user interaction. With standard Audio Input/Output, you can ask questions by speaking instead of typing. For instance, you might ask,
“Why is the sky blue?”
and receive both a text-based response and an audible explanation. Karpathy remarked,
“Voice input makes it feel like you’re conversing with a friend, and the model listens intently.”

Advanced Voice Mode takes it a step further by processing audio natively. Instead of converting speech into text first, the model tokenizes audio directly through spectrograms. This means it can capture the nuances in tone and intonation. Imagine asking,
“Tell me a joke in Yoda’s voice,”
and then hearing,
“Wise insights I shall share, hmmm funny, it is.”

Complementing these, NotebookLM offers an innovative twist by generating custom podcasts from uploaded documents. For example, after uploading a 30‑MB research paper on genomic analysis, you might ask,
“Can you generate a podcast summarizing the key points of this paper?”
Within minutes, NotebookLM synthesizes the content and produces a 30‑minute audio summary that you can listen to while commuting.


Visual Modalities: Image Input/OCR, Image Output, and Video Processing

Image Input with OCR allows you to transform photos and screenshots into searchable text. For example, I uploaded a nutrition label from a health supplement and asked,
“What are the key ingredients, and why are they included?”
The model successfully extracted the text and explained each component, complete with safety rankings.
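If you want to reproduce this kind of image-input workflow programmatically, here is a hedged sketch using the OpenAI Python SDK’s vision-capable chat endpoint; the file name and model name are illustrative, so check your provider’s current documentation:

```python
import base64
from openai import OpenAI   # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()

# Encode the photo (e.g. a nutrition label) so it can travel inside the request.
with open("nutrition_label.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",   # any vision-capable model; the exact name may differ for your account
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the key ingredients, and why are they included?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```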


Image Output tools like DALL·E and Ideogram let you generate custom visuals. You can prompt the model with requests such as,
“Generate an artistic depiction of today’s headlines in a cyberpunk style,”
and watch as the AI crafts an image that visually encapsulates the news. Karpathy pointed out,
“It’s fascinating how a caption for today’s news can be transformed into a stunning piece of art using these tools.”

Video Input takes visual processing even further. Using your camera, you can perform point-and-talk interactions. For example, if you point your phone at a book cover, you might ask,
“What’s the title of this book?”
and the model will analyze the visual snapshot to provide an accurate answer. Meanwhile, Video Output systems such as Sora or Veo 2 can turn text descriptions into dynamic video clips, enabling the creation of engaging video summaries or tutorials.

Personalization: Memory, Custom Instructions, and Custom GPTs

Personalization is the cornerstone of making interactions with LLMs truly your own. These features ensure that the AI not only responds to your queries but also adapts to your unique style and recurring needs.

Memory: Retaining Context Across Conversations

LLMs can store key details from past interactions in a memory bank that’s appended to future context windows. This means that over time, the model learns about your preferences and habits. For example, if you mention your favorite movie genres or specific research interests, future conversations will automatically reflect that knowledge.
“It’s like the model gradually gets to know you, a personalized conversation that evolves as you interact more,” Karpathy observed.

Custom Instructions: Shaping AI Behavior

Custom instructions let you define exactly how you want the model to respond. You can specify tone, verbosity, and even task-specific rules. Whether you need the model to explain complex topics in simple terms or adopt a particular style for translations, these instructions are injected into every conversation, ensuring consistency and a tailored experience.
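Under the hood, custom instructions behave much like a system message prepended to every conversation. Here is a minimal sketch with the OpenAI Python SDK, assuming an illustrative model name and example wording:

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

CUSTOM_INSTRUCTIONS = (
    "Explain things concisely, prefer bullet points, "
    "and assume the reader is comfortable with Python."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model name
    messages=[
        {"role": "system", "content": CUSTOM_INSTRUCTIONS},   # applied to every conversation
        {"role": "user", "content": "What does layer normalization do?"},
    ],
)
print(response.choices[0].message.content)
```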


Custom GPTs: Building Task-Specific Models

Custom GPTs allow you to create specialized versions of the model for recurring tasks. Imagine having a dedicated assistant for language learning that extracts vocabulary and creates flashcards, or a coding helper that consistently generates accurate code snippets. By providing a few examples through few-shot prompting, you build a custom model that saves time and delivers more precise results.
“Custom GPTs are like having your personal, task-specific assistant that’s tuned exactly to your needs,” Karpathy explained.
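As a rough illustration of the few-shot idea, here is a sketch of the kind of prompt a “vocabulary flashcard” custom GPT might be built on; the words and card format are invented for the example:

```python
# A couple of worked examples lock in the behavior you want from the custom assistant.
FLASHCARD_PROMPT = """You turn foreign-language words into flashcards.

Word: bonjour
Flashcard: bonjour -> hello (French greeting, used during the day)

Word: gracias
Flashcard: gracias -> thank you (Spanish expression of gratitude)

Word: {new_word}
Flashcard:"""

# Fill in a new word and send the resulting string as the prompt.
print(FLASHCARD_PROMPT.format(new_word="arigatou"))
```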


Lessons for Beginners: Maximizing Your LLM Experience

For those just starting out, Karpathy’s insights offer a clear pathway to harnessing the full potential of LLMs:

  • Understand Tokenization: Learn how your input is broken down into tokens, as this is the fundamental building block of model processing.
  • Keep It Concise: Manage your context window by starting fresh when switching topics; a crowded context can dilute the effectiveness of responses.
  • Experiment with Different Models: Use free tiers for simple tasks and consider upgrading to advanced models when you need higher accuracy or additional features.
  • Leverage External Tools: Don’t hesitate to integrate internet search, file uploads, and even a Python interpreter to extend the model’s capabilities.
  • Stay Updated: Follow provider updates, join community forums, and experiment with beta features to keep pace with the rapidly evolving ecosystem.

End Note

Andrej Karpathy’s video takes us deep into the inner workings of LLMs from the granular details of tokenization and transformer-based architecture to the expansive capabilities unlocked by tool integrations and multimodal interactions. These models compress vast amounts of knowledge into billions (or even trillions) of parameters, using sophisticated training techniques to predict the next token and generate human-like responses. By combining pre-training with targeted post-training, and by integrating external tools like internet search and Python interpreters, modern LLMs are evolving into versatile, intelligent partners that can both inform and inspire.

As Karpathy succinctly concludes:

“From compressed tokens to interactive chat bubbles, the inner workings of LLMs are a blend of elegant mathematical principles and massive-scale data compression. Each new advancement brings us closer to a future where AI is an integral, intuitive part of our daily lives.”

This comprehensive ecosystem, from personalization features to advanced research and multimodal integration, provides a robust platform for everyone, from beginners to experts.


If you wish to watch the video yourself, click here.

Key Points

Below are the key points with their timestamps for your reference:

  • 00:00:00 Intro into the growing LLM ecosystem
  • 00:02:54 ChatGPT interaction under the hood
  • 00:13:12 Basic LLM interactions examples
  • 00:18:03 Be aware of the model you’re using, pricing tiers
  • 00:22:54 Thinking models and when to use them
  • 00:31:00 Tool use: internet search
  • 00:42:04 Tool use: deep research
  • 00:50:57 File uploads, adding documents to context
  • 00:59:00 Tool use: python interpreter, messiness of the ecosystem
  • 01:04:35 ChatGPT Advanced Data Analysis, figures, plots
  • 01:09:00 Claude Artifacts, apps, diagrams
  • 01:14:02 Cursor: Composer, writing code
  • 01:22:28 Audio (Speech) Input/Output
  • 01:27:37 Advanced Voice Mode aka true audio inside the model
  • 01:37:09 NotebookLM, podcast generation
  • 01:40:20 Image input, OCR
  • 01:47:02 Image output, DALL-E, Ideogram, etc.
  • 01:49:14 Video input, point and talk on app
  • 01:52:23 Video output, Sora, Veo 2, etc etc.
  • 01:53:29 ChatGPT memory, custom instructions
  • 01:58:38 Custom GPTs
  • 02:06:30 Summary

GenAI Intern @ Analytics Vidhya | Final Year @ VIT Chennai
Passionate about AI and machine learning, I'm eager to dive into roles as an AI/ML Engineer or Data Scientist where I can make a real impact. With a knack for quick learning and a love for teamwork, I'm excited to bring innovative solutions and cutting-edge advancements to the table. My curiosity drives me to explore AI across various fields and take the initiative to delve into data engineering, ensuring I stay ahead and deliver impactful projects.
