In his latest video, “How I use LLMs,” Andrej Karpathy, the renowned AI expert, pulls back the curtain on the evolving world of LLMs. Serving as a follow-up to his earlier video “Deep Dive into LLMs” from the General Audience Playlist on his YouTube channel, this presentation explores how the initial text-based chat interface hosted by OpenAI sparked a revolution in AI interaction. Karpathy explains how the ecosystem has rapidly transformed from a simple text-based system into a rich, multi-modal experience, integrating advanced tools and functionalities. This article is inspired by his technical demonstrations, advanced tool integrations, and personal insights, offering readers an in-depth look at the future of AI.
Karpathy begins by mapping out the rapidly expanding ecosystem of LLMs. While ChatGPT remains the pioneering force, he highlights emerging competitors such as Gemini, Copilot, Claude, Grok, and international players like DeepSeek and Le Chat. Each model offers unique features, pricing tiers, and experiences.
“ChatGPT is like the original gangster of conversational AI, but the ecosystem has grown into a diverse playground of experimentation and specialization,” he explains.
Continuing with the video, Karpathy also shared some links where you can compare and analyze the performance of these models:
Using these two links, you can keep track of the models that are currently publicly available.
Let us now explore multi-modality in detail below:
When it comes to generating text, models like ChatGPT truly excel, especially in creative tasks such as writing haikus, poems, cover letters, resumes, and even email replies. As Karpathy puts it, our interactions with these models appear as lively “Chat Bubbles” that encapsulate a dynamic conversation between you and the AI.
Every time you input a query, the model dissects your text into smaller building blocks called tokens. You can explore this process yourself using tools like OpenAI’s Tokenizer or Tiktokenizer. These tokens form a sequential stream often referred to as the token sequence or Context Window which acts as the AI’s working memory.
Under the hood, additional tagging is incorporated into both the input and output sequences. This includes techniques like Part-of-Speech (POS) tagging and Named Entity Recognition (NER), similar to what you might find in the Penn Treebank. These tags help the model better understand the role and identity of each word.
Modern language models typically use Byte-Pair Encoding (BPE) to split words into subwords. For instance, the word “university” might be broken down into “uni”, “vers”, and “ity.” This process ensures that even rare or complex words are represented in a way that the model can process efficiently.
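You can see this for yourself with OpenAI’s tiktoken library, the same idea the Tokenizer and Tiktokenizer web tools expose. The sketch below is just an illustration; the exact splits depend on the learned BPE vocabulary of the encoding you pick.

```python
import tiktoken  # pip install tiktoken

# Encode a sentence with the BPE vocabulary used by recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("antidisestablishmentarianism is a long word")

# Map each integer id back to the subword piece it represents.
pieces = [enc.decode_single_token_bytes(i).decode("utf-8") for i in ids]
print(ids)     # a handful of integer token ids
print(pieces)  # the subword pieces those ids map back to
```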
Some important special tokens include markers such as <|im_start|> and <|im_end|>, which delimit the turns of a conversation, and <|endoftext|>, which marks document boundaries.
Karpathy illustrated this beautifully with a diagram [shown in the next section] of how a fresh chat begins with an empty token stream. Once you type your query, the model takes over, appending its own stream of tokens. This continuous flow, known as the Context Window, represents the working memory that guides the AI’s response.
“I like to think of the model as a one-terabyte zip file: it’s full of compressed knowledge from the internet, but it’s the human touch in post-training that gives it a soul,” he explains.
At the heart of LLMs lies the Transformer architecture. Key elements include token embeddings, positional information, multi-head self-attention, feed-forward layers, and residual connections with layer normalization, all stacked into deep layers that process the token sequence in parallel.
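To make the self-attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention. Learned query/key/value projections and multiple heads are omitted for brevity; this is an illustration, not a full Transformer layer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each token's query is scored against every key; the softmax-normalized
    # scores decide how much of each value flows into that token's output.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                 # 5 tokens, 16-dim embeddings
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)                             # (5, 16)
```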
To really grasp how these models generate text, it’s crucial to understand the two major phases of their training:
In this phase, the model processes vast amounts of data, from books and websites to code repositories and academic papers. Think of it as compressing the world’s knowledge into a “zip file” of parameters.
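The objective driving all of this pre-training is next-token prediction. The toy snippet below, with made-up token ids and random stand-in model outputs, only illustrates the shape of that cross-entropy loss.

```python
import numpy as np

# A toy "document" as a sequence of token ids from a 5-token vocabulary.
tokens = np.array([3, 1, 4, 1, 2])

# Stand-in model outputs: one probability distribution over the vocabulary
# for every position except the last (random here, purely for illustration).
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(tokens) - 1, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Next-token prediction: at position t the target is tokens[t + 1].
targets = tokens[1:]
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
print(f"next-token cross-entropy: {loss:.3f}")
```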
After pre-training, the model undergoes post-training (or supervised fine-tuning), where it learns to interact with humans.
Karpathy also pointed out that as our conversations with these models grow longer, it’s often beneficial to start a new chat when switching topics. This resets the context window, ensuring that the model’s responses remain accurate and efficient.
When choosing a model, it’s essential to consider the trade-offs between cost and performance.
An interesting personal tip comes from experimenting with multiple models. For example, when asking Gemini for a cool city recommendation, I got Zermatt as an answer, a suggestion I found quite appealing. Gemini’s interface includes a model selector in the top left, which allows you to upgrade to more advanced tiers for improved performance. The same applies to Grok: instead of relying on Grok 2, I prefer to use Grok 3 since it’s the most advanced version available. In fact, I often pay for several models and ask them the same question, treating them as my personal “LLM council.” This way, I can compare responses and decide which model best suits my needs, whether I’m planning a vacation or tackling a technical problem.
The key takeaway is to experiment with different providers and pricing tiers for the specific challenges you’re working on. By doing so, you can find the model that fits your workflow best and even leverage multiple models to get a well-rounded perspective.
When generating text, the model doesn’t simply choose the highest-probability token every time. Instead, it uses various decoding strategies, such as greedy decoding, temperature-scaled sampling, top-k sampling, and nucleus (top-p) sampling.
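As a rough illustration (the exact strategies and defaults vary by provider), here is a minimal sketch of temperature plus top-k sampling over a vector of logits:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50):
    # Lower temperature sharpens the distribution (more conservative picks),
    # higher temperature flattens it (more creative picks); top_k keeps only
    # the k most likely candidates before sampling.
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

vocab_logits = np.random.randn(50_000)  # pretend scores over a 50k-token vocabulary
next_id = sample_next_token(vocab_logits, temperature=0.7, top_k=40)
print(next_id)
```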
Modern LLMs don’t just generate text; they can also integrate external tools to boost their capabilities:
“When I read The Wealth of Nations, the model helps me understand the nuances by summarizing chapters and answering my clarifying questions. It’s like having a knowledgeable study partner,” he remarks.
“When a multiplication problem becomes too tricky to solve in your head, the model simply writes a Python script and runs it. It’s like having a junior data analyst at your fingertips,” Karpathy explains.
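The kind of script the model reaches for here is usually trivial, something like the snippet below, which it would write and execute instead of trying to multiply the numbers “in its head” (the numbers are just an example):

```python
# Two large numbers the model would rather not multiply token by token.
a = 123456789
b = 987654321
print(a * b)  # 121932631112635269
```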
Karpathy demonstrates that LLMs are evolving beyond text. He shows how images are generated by coupling a captioning system with a dedicated image-generation model (such as ideogram.ai) to create visuals on demand. This technique, he notes, “stitches up” two separate models so that the user experience remains seamless even when the underlying processes are distinct.
“The image output isn’t done fully in the model. It’s like a beautiful collaboration between text-to-image captioning and a separate image generator,” he remarks.
Additionally, Karpathy introduces video capabilities where the model “sees” via a camera feed. In one demonstration, he points the camera at everyday objects, such as a book cover and a detailed map, and the model correctly identifies and comments on each item. All of this is explained in more detail later.
Voice interaction is a major highlight of the video. Karpathy explains that on mobile devices, users can simply speak to the model, which then converts audio to text for processing. Beyond simple transcription, advanced modes allow the model to generate audio responses in various “personas” from Yoda’s wise cadence to a gruff pirate accent.
“Don’t type stuff out, use your voice. It’s super fast and sometimes even more fun when the AI speaks back to you in a characterful tone,” he said.
He further differentiates between “fake audio” (where voice is converted to text and back) and “true audio,” which tokenizes audio natively. True audio processing represents a leap forward by eliminating intermediary steps, making interactions more fluid and natural. All of this is explained in more detail later.
Karpathy shares several practical examples from calculating caffeine content in a beverage to interactive troubleshooting of code. These everyday use cases highlight how seamlessly integrated AI tools can enhance productivity and decision-making in daily life.
“I once asked ChatGPT about how much caffeine is in a shot of Americano. It quickly recalled that it’s roughly 63 milligrams, a simple yet powerful example of everyday AI assistance,” he explains.
Beyond everyday tasks, the integration of a Python interpreter transforms the AI into a competent data analyst. Whether it’s generating trend lines from financial data or debugging complex code, these capabilities offer tremendous value for both professionals and hobbyists.
“Imagine having a junior data analyst who not only writes code for you but also visualizes data trends in real time. That’s the power of integrated tool use,” Karpathy asserts.
One of the most fascinating advancements in modern LLMs is the emergence of “thinking models.” These models are designed to tackle complex problems by effectively “thinking out loud” much like a human solving a tough puzzle.
Karpathy explains that the development of LLMs involves multiple stages: pre-training on internet-scale text, supervised fine-tuning on human conversations, and, most recently, reinforcement learning.
The reinforcement learning stage is relatively recent, emerging only in the past couple of years and is viewed as a breakthrough. It’s the stage where the model learns to “think” before delivering an answer. Instead of rushing to the final token, a thinking model may generate a series of internal reasoning steps that guide it toward a more accurate solution.
DeepSeek was the first to publicly discuss this concept, presenting a paper on incentivizing reasoning capabilities in LLMs via reinforcement learning, a paper we explored in a previous video. This breakthrough in RL allows models to refine their internal reasoning, a process that was previously too difficult to hard-code by human labelers.
Here’s a concrete example from Karpathy’s own experience:
He was once stuck on a programming problem involving a gradient check failure in an optimization of a multi-layer perceptron. He copied and pasted the code and asked for advice. Initially, GPT-4o, the flagship, most powerful model from OpenAI, responded without thinking. It listed several potential issues and debugging tips, but none of these suggestions pinpointed the core problem. The model merely offered general advice rather than solving the issue.
He then switched to one of OpenAI’s thinking models available through the model dropdown. These models, which include variants labeled o1, o3-mini, o3-mini-high, and o1 pro (the latter being the most advanced, available to premium subscribers), are tuned with reinforcement learning. When he asked the same question, the thinking model took its time, emitting a detailed sequence of internal reasoning steps (summaries of its “thought process”). After about a minute of deliberation, it pinpointed that the parameters were mismatched during packing and unpacking, and the resulting fix was correct.
You can read more about the reasoning model o3 here.
He doesn’t rely on just one model. He often asks the same question across multiple models, treating them as his personal “LLM council.” For instance, while one model might solve a problem quickly with a standard response, another, more advanced thinking model may take a few extra minutes but deliver a highly accurate, well-reasoned answer. This approach is especially useful for tasks like complex math problems or intricate code debugging.
I’ve also experimented with other models.
Thinking models are most beneficial for challenging tasks:
As Karpathy puts it, things that are very simple might not benefit much from this extra deliberation, but problems that are genuinely deep and hard can benefit a lot.
For everyday queries like travel recommendations or quick fact-checks, a standard, non-thinking model might be preferable due to its faster response times. However, if accuracy is paramount and the problem is inherently complex, switching to a thinking model is well worth the extra wait.
Modern LLMs overcome static knowledge limitations by integrating with external tools:
Up to this point, our interaction with LLMs has drawn only on the “zip file” of pre-trained knowledge baked into the model’s parameters. However, real-world applications demand that these models access fresh, up-to-date information. That’s where internet search comes in.
While traditional LLM interactions rely solely on pre-trained knowledge (a “zip file” of static data), the integration of internet search transforms these models into dynamic information hubs. Instead of manually sifting through search results and dodging distracting ads, the model can now actively retrieve up-to-date information, integrate it into its working memory, and answer your queries accurately.
For instance, if you ask, “When are new episodes of White Lotus Season 3 coming out?” the model will detect that this information isn’t in its pre-trained data. It will then search the web, load the resulting articles into the context, and provide you with the latest schedule along with links for verification.
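Conceptually, the flow looks something like the sketch below: fetch a few pages, place their text into the context window, then ask the model. This is a simplified illustration, not the actual pipeline behind ChatGPT search or any other product; search_results and ask_llm are stand-ins for whatever search API and model call you use.

```python
import requests

def answer_with_search(question, search_results, ask_llm):
    # Pull the text of the top few results into the prompt ("context window"),
    # then let the model answer grounded in that freshly retrieved text.
    context_chunks = []
    for url in search_results[:3]:
        page_text = requests.get(url, timeout=10).text
        context_chunks.append(page_text[:4000])  # crude truncation to fit the context
    prompt = (
        "Answer the question using only the sources below.\n\n"
        + "\n\n---\n\n".join(context_chunks)
        + f"\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```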
Different models have varying levels of internet search integration.
I frequently use the internet search tool for various types of queries.
Deep research empowers LLMs to go beyond superficial answers by combining extensive internet searches with advanced reasoning. This process allows the model to gather, process, and synthesize information from a wide array of sources almost as if it were generating a custom research paper on any topic.
When you activate deep research (typically a feature available on higher-tier subscriptions, such as the $200/month tier), the model embarks on an extended process: it issues many searches, reads through the resulting sources, reasons over them for several minutes (sometimes tens of minutes), and finally compiles a detailed, cited report for you to review.
File uploads empower LLMs to extend their context by integrating external documents and multimedia files directly into their working memory. For example, if you’re curious about a recent paper from the Arc Institute on a language model trained on DNA, you can simply drag and drop the PDF (even one as large as 30 MB) into the model’s interface. Typically, the model converts the document into text tokens, often discarding non-text elements like images. Once in the token window, you can ask for a summary, pose detailed questions, or dive into specific sections of the document. This makes it possible to “read” a paper together with the AI and explore its content interactively.
“Uploading a document is like handing the AI your personal library. It can then sift through the information and help you understand the finer details, exactly what you need when tackling complex research papers,” Karpathy noted during his talk.
Consider the scenario where you’re reviewing a groundbreaking study on genomic sequence analysis. By uploading the PDF directly into the system, you can ask the model, “Can you summarize the methodology used in this study?” The model will convert the paper into tokens, process the key sections, and provide you with a coherent summary, complete with citations. This approach is not limited to academic papers; it also works with product manuals, legal documents, and even lengthy reports like blood test results.
For instance, I recently uploaded my 20‑page blood test report. The model transcribed the results, enabling me to ask, “What do these cholesterol levels indicate about my health?” This two-step process (first verifying the transcription accuracy, then asking detailed questions) ensures that the insights are as reliable as possible.
Modern LLMs now incorporate an integrated Python interpreter, transforming them into dynamic, interactive coding assistants. This feature enables the model to generate, execute, and even debug Python code in real time, acting as a “junior data analyst” right within your conversation.
“The Python interpreter integration is a game-changer. Instead of switching between a chat window and your IDE, you get your code, its output, and even visual plots all in one seamless experience,” Karpathy explained during a demonstration.
When you pose a complex problem, say, debugging a multi-layer perceptron where the gradient check is failing, the model can automatically produce Python code to diagnose the issue. For example, you might ask, “Can you help me debug this gradient check failure?” The model generates code that simulates the error scenario, executes it, and then returns detailed output, such as error messages and variable states, directly within the chat.
In another case, I needed to plot sales trends for a project. I simply requested, “Generate a plot of the sales data for 2023,” and the model wrote and executed the necessary Python script. The resulting graph was immediately displayed, complete with annotations and trends, saving me the hassle of manual coding.
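The script the interpreter runs behind the scenes is usually something as plain as the sketch below; the monthly figures here are made up purely for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures for 2023 (stand-in numbers, only to
# illustrate the kind of plotting script the model writes and executes).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [12, 14, 13, 17, 19, 22, 21, 24, 23, 26, 28, 31]

plt.figure(figsize=(8, 4))
plt.plot(months, sales, marker="o")
plt.title("Sales trend, 2023")
plt.xlabel("Month")
plt.ylabel("Sales (thousands of units)")
plt.tight_layout()
plt.show()
```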
Modern LLMs have evolved to be more than text generators; they’re now creative studios. With Claude Artifacts, you can build custom mini-apps or generate interactive diagrams. For instance, imagine needing a flowchart for a complex project. With a few clear prompts, Claude Artifacts can produce a diagram that visually organizes your ideas. As Karpathy noted,
“Claude Artifacts doesn’t just give you plain text; it gives you interactive visuals that bring your concepts to life.”
Alongside this, Cursor: Composer serves as your real-time coding assistant. Whether you’re writing new code or debugging an error, Cursor: Composer can generate, edit, and even visualize code snippets. For example, when I was prototyping a new web application, I simply typed,
“Generate a responsive layout in React,”
and the tool not only produced the code but also highlighted how different components interacted. This seamless integration speeds up development while helping you understand the underlying logic step by step.
If you want to read more about Cursor AI, read this.
The audio features in modern LLMs significantly enhance user interaction. With standard Audio Input/Output, you can ask questions by speaking instead of typing. For instance, you might ask,
“Why is the sky blue?”
and receive both a text-based response and an audible explanation. Karpathy remarked,
“Voice input makes it feel like you’re conversing with a friend, and the model listens intently.”
Advanced Voice Mode takes it a step further by processing audio natively. Instead of converting speech into text first, the model tokenizes audio directly through spectrograms. This means it can capture the nuances in tone and intonation. Imagine asking,
“Tell me a joke in Yoda’s voice,”
and then hearing,
“Wise insights I shall share, hmmm funny, it is.”
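To make the “true audio” idea a bit more concrete, here is a rough sketch of the time-frequency representation (a spectrogram) that native audio models can chop into tokens. This is a conceptual illustration only, not how any particular voice mode is actually implemented.

```python
import numpy as np
from scipy.signal import spectrogram

# One second of a 440 Hz tone as stand-in "speech" audio.
sample_rate = 16_000
t = np.linspace(0, 1, sample_rate, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)

# The spectrogram is a grid of frequency bins over time frames; native audio
# models work from representations like this rather than from a transcript.
freqs, times, spec = spectrogram(audio, fs=sample_rate)
print(spec.shape)  # (frequency bins, time frames)
```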
Complementing these, NotebookLM offers an innovative twist by generating custom podcasts from uploaded documents. For example, after uploading a 30‑MB research paper on genomic analysis, you might ask,
“Can you generate a podcast summarizing the key points of this paper?”
Within minutes, NotebookLM synthesizes the content and produces a 30‑minute audio summary that you can listen to while commuting.
Image Input with OCR allows you to transform photos and screenshots into searchable text. For example, I uploaded a nutrition label from a health supplement and asked,
“What are the key ingredients, and why are they included?”
The model successfully extracted the text and explained each component, complete with safety rankings.
Image Output tools like DALL·E and Ideogram let you generate custom visuals. You can prompt the model with requests such as,
“Generate an artistic depiction of today’s headlines in a cyberpunk style,”
and watch as the AI crafts an image that visually encapsulates the news. Karpathy pointed out,
“It’s fascinating how a caption for today’s news can be transformed into a stunning piece of art using these tools.”
Video Input takes visual processing even further. Using your camera, you can perform point-and-talk interactions. For example, if you point your phone at a book cover, you might ask,
“What’s the title of this book?”
and the model will analyze the visual snapshot to provide an accurate answer. Meanwhile, Video Output systems such as Sora or Veo 2 can turn text descriptions into dynamic video clips, enabling the creation of engaging video summaries or tutorials.
Personalization is the cornerstone of making interactions with LLMs truly your own. These features ensure that the AI not only responds to your queries but also adapts to your unique style and recurring needs.
LLMs can store key details from past interactions in a memory bank that’s appended to future context windows. This means that over time, the model learns about your preferences and habits. For example, if you mention your favorite movie genres or specific research interests, future conversations will automatically reflect that knowledge.
“It’s like the model gradually gets to know you, a personalized conversation that evolves as you interact more,” Karpathy observed.
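A minimal sketch of that memory mechanism, assuming nothing about how any specific product stores it, is simply “save facts, prepend them to future prompts”:

```python
# Toy "memory bank": facts saved from earlier chats are prepended to the
# context of every new conversation so the model starts out knowing them.
memory = []

def remember(fact):
    memory.append(fact)

def build_context(user_message):
    memory_block = "\n".join(f"- {fact}" for fact in memory)
    return f"Known about the user:\n{memory_block}\n\nUser: {user_message}"

remember("Prefers concise answers")
remember("Favorite movie genre: sci-fi")
print(build_context("Recommend a film for tonight."))
```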
Custom instructions let you define exactly how you want the model to respond. You can specify tone, verbosity, and even task-specific rules. Whether you need the model to explain complex topics in simple terms or adopt a particular style for translations, these instructions are injected into every conversation, ensuring consistency and a tailored experience.
Custom GPTs allow you to create specialized versions of the model for recurring tasks. Imagine having a dedicated assistant for language learning that extracts vocabulary and creates flashcards, or a coding helper that consistently generates accurate code snippets. By providing a few examples through few-shot prompting, you build a custom model that saves time and delivers more precise results.
“Custom GPTs are like having your personal, task-specific assistant that’s tuned exactly to your needs,” Karpathy explained.
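Here is a minimal sketch of the few-shot setup behind such a vocabulary-flashcard assistant. The instructions, the Korean example, and the message format are illustrative assumptions rather than the exact prompt from the video.

```python
# System instructions plus a worked example: every new request re-uses them,
# so the model answers fresh sentences in the same flashcard format.
instructions = (
    "You are a vocabulary extractor. Given a sentence, list each word "
    "with a short English gloss, one word per line."
)

examples = [
    {"role": "user", "content": "저는 커피를 좋아해요"},
    {"role": "assistant",
     "content": "저는 - I (topic)\n커피를 - coffee (object)\n좋아해요 - like"},
]

def build_messages(sentence):
    # The fixed instructions and examples ride along with every new sentence.
    return [{"role": "system", "content": instructions},
            *examples,
            {"role": "user", "content": sentence}]

print(build_messages("오늘 날씨가 좋아요"))
```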
For those just starting out, Karpathy’s insights offer a clear pathway to harnessing the full potential of LLMs.
Andrej Karpathy’s video takes us deep into the inner workings of LLMs, from the granular details of tokenization and transformer-based architecture to the expansive capabilities unlocked by tool integrations and multimodal interactions. These models compress vast amounts of knowledge into billions (or even trillions) of parameters, using sophisticated training techniques to predict the next token and generate human-like responses. By combining pre-training with targeted post-training, and by integrating external tools like internet search and Python interpreters, modern LLMs are evolving into versatile, intelligent partners that can both inform and inspire.
As Karpathy succinctly concludes:
“From compressed tokens to interactive chat bubbles, the inner workings of LLMs are a blend of elegant mathematical principles and massive-scale data compression. Each new advancement brings us closer to a future where AI is an integral, intuitive part of our daily lives.”
This comprehensive ecosystem, from personalization features to advanced research and multimodal integration, provides a robust platform for everyone, from beginners to experts.
If you wish to watch the video yourself, click here.
Below are the key points with their timestamps for your reference: