Analyse Data for Free Using Microsoft’s Latest AI Tool: Data Formulator

Riya Bansal. Last Updated: 31 May 2025

In today’s data-driven world, researchers and analysts need to extract insights from raw data quickly and present them in visual form. That’s exactly what Microsoft’s new AI tool, Data Formulator, helps with. It simplifies data visualization by turning raw data into clear charts and graphs, especially for those without much experience in data manipulation and visualization tools. In this article, we’ll dive deep into Microsoft’s Data Formulator tool and learn how to use it.

What is Data Formulator?

Data Formulator is an open-source application developed by Microsoft Research that uses LLMs to transform data and speed up data visualization. What differentiates Data Formulator from traditional chat-based AI tools is its hybrid interaction model: an intuitive user interface that combines natural language inputs with simple drag-and-drop interactions.

Source: Microsoft

At its core, the tool was designed to bridge the gap between having a visualization idea and actually creating it. Typical tools either force users to write complicated code or to choose from an endless list of menu-driven options. In contrast, Data Formulator lets users express their visualization intent directly, while the AI handles the heavy transformation work behind the scenes.

Key Features of Microsoft Data Formulator

Some of the key features of Data Formulator are:

  • Hybrid Interaction Model: It offers the best of both worlds: precision through direct manipulation (drag and drop) and flexibility through natural language prompts. Users can specify chart encodings directly and then clarify hard-to-express requirements via text.
  • AI-Powered Data Transformation: When users reference fields that do not exist in the dataset, the AI creates new calculated fields, aggregates the data, or applies filters to meet the visualization specification.
  • Multiple Data Source Support: Data Formulator supports a wide range of data sources, such as CSV files, databases (MySQL, DuckDB), and cloud services like Azure Data Explorer. External data loaders enable easy integration even with enterprise data sources.
  • Large Dataset Handling: Since version 0.2, Data Formulator has handled large datasets efficiently by loading data into a local DuckDB database and then fetching just enough of it for the visualization, drastically reducing waiting time.
  • Data Threading and Anchoring: The tool records all visualization attempts under ‘Data Threads’, allowing users to retrace their path during exploration. It can save intermediate datasets as anchoring points from which further analyses can branch, reducing confusion and improving efficiency.
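
The large-dataset strategy described above can be sketched in a few lines: load the data into a local database once, then query only the small aggregate each chart actually needs. This is a simplified illustration using Python’s standard-library sqlite3 as a stand-in for DuckDB (which Data Formulator actually uses); the table and column names are hypothetical.

```python
import sqlite3

# Hypothetical example: load a large sales table once into a local database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (city TEXT, total REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Yangon", 120.5), ("Mandalay", 80.0), ("Yangon", 60.0)],
)

# Instead of pulling every row into memory, query just the aggregate
# the visualization needs -- the idea behind DuckDB-backed fetching.
rows = con.execute(
    "SELECT city, SUM(total) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Mandalay', 80.0), ('Yangon', 180.5)]
```

The chart only ever sees the two aggregated rows, not the full table, which is what keeps waiting time low on large datasets.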

Architecture of Data Formulator

The modular architecture of Data Formulator provides flexibility and extensibility through the following layers:

  • Frontend Layer: The frontend, built with modern web technologies like TypeScript and React, lets users upload and preview datasets, add visual encodings via drag and drop, input natural language prompts, and view generated visualizations and code.
  • Backend Processing Engine: This Python-based part of the system loads and preprocesses the data and communicates with various LLM providers. It then generates the data transformation code and renders visualizations through the Altair/Vega-Lite libraries.
  • AI-Integration Layer: This layer handles LLM prompt engineering, response processing, code validation, and execution. It also provides error handling and debugging assistance, as well as context management for iterative conversations.
  • Data Management Layer: This layer connects the tool to multiple data sources and operates on a local database (DuckDB). It handles caching, data optimization, and the implementation of external data loaders.

Source: Microsoft

How Does Data Formulator Work?

Data Formulator blends user interactions with AI-powered data processing, following the process below:

Step 1: Intent Specification

Users select a chart type and drag data fields to visual properties (x-axis, y-axis, color, size, etc.). If the referenced fields do not exist in the original dataset, they are flagged as requiring data transformation.
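
The intent captured in this step can be thought of as a small specification object: a chart type plus field-to-channel encodings. The structure below is illustrative, not Data Formulator’s internal format, and the field names are hypothetical:

```python
# Hypothetical visualization intent: chart type plus field-to-channel encodings.
intent = {
    "chart_type": "bar",
    "encodings": {"x": "city", "y": "total_sales", "color": "product_line"},
}

# Fields the user requested that are absent from the dataset get flagged
# as requiring an AI-generated transformation.
dataset_columns = {"city", "total", "product_line"}
missing = [f for f in intent["encodings"].values() if f not in dataset_columns]
print(missing)  # ['total_sales'] -- this field must be derived by the AI
```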

Step 2: AI Interpretation

The system observes the user’s specifications of visual encodings along with any free-text natural language prompts. It tries to understand exactly what the user wants to visualize by analysing the data types and the relationship between the fields.

Step 3: Code Generation

Once the intent is interpreted, Data Formulator produces the data transformation code needed. In most cases it uses Python with Pandas or Polars to build the necessary derived fields, aggregations, and filtering operations.
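
The generated transformation code might resemble the following pandas sketch. The column names are hypothetical, and the actual generated code will vary by model and prompt:

```python
import pandas as pd

# Hypothetical raw data with unit price and quantity, but no revenue column.
df = pd.DataFrame({
    "product": ["A", "B", "A"],
    "unit_price": [10.0, 20.0, 10.0],
    "quantity": [3, 1, 2],
})

# Derived field: the user dragged a non-existent "revenue" field onto the
# y-axis, so the generated code computes it before aggregating.
df["revenue"] = df["unit_price"] * df["quantity"]
result = df.groupby("product", as_index=False)["revenue"].sum()
print(result)
```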

Step 4: Execution and Validation

The generated code is then executed on the dataset, with built-in error handling to find and fix common errors. If an error cannot be fixed automatically, the AI iteratively reworks the code.
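
Conceptually, Steps 3 and 4 form an execute-and-repair loop. The sketch below stubs out the LLM call; `fix_code` is a stand-in for a model invocation, not a real Data Formulator API:

```python
# Minimal sketch of an execute-and-repair loop for AI-generated code.
def fix_code(code: str, error: Exception) -> str:
    # Stand-in for an LLM call that rewrites broken code given the error.
    return code.replace("totel", "total")  # pretend the model fixes a typo

def run_with_retries(code: str, env: dict, max_attempts: int = 3):
    for attempt in range(max_attempts):
        try:
            exec(code, env)        # run the generated transformation
            return env["result"]   # transformed data the chart will use
        except Exception as err:
            code = fix_code(code, err)  # ask the "model" to repair it
    raise RuntimeError("could not repair generated code")

env = {"data": [{"total": 5}, {"total": 7}]}
broken = "result = sum(row['totel'] for row in data)"  # intentional typo
print(run_with_retries(broken, env))  # 12
```

The first execution raises a KeyError, the stubbed repair step corrects the field name, and the second attempt succeeds.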

Step 5: Visualization Creation

The system generates a visualization specification once the data has been properly transformed and proceeds to produce a final chart out of it.
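
Since Data Formulator renders charts with Altair/Vega-Lite, the visualization specification from this step is essentially a declarative JSON object. A minimal hand-written example of such a spec (the data and field names are hypothetical):

```python
import json

# A minimal Vega-Lite bar-chart spec of the kind Step 5 produces.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": [{"city": "Yangon", "total": 180.5},
                        {"city": "Mandalay", "total": 80.0}]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "city", "type": "nominal"},
        "y": {"field": "total", "type": "quantitative"},
    },
}

# Serialize the spec; a Vega-Lite renderer turns this JSON into the chart.
print(json.dumps(spec, indent=2)[:80])
```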

Step 6: Iterative Refinement

Users can provide feedback, ask follow-up questions, or change encodings iteratively to refine the visualization over time, thus creating a natural iterative workflow.

Source: Microsoft

Getting Started with Data Formulator

There are three ways to start using Data Formulator.

Method 1: Through Python Installation

One of the easiest ways to get started with Data Formulator is to install it via pip. For this:

  1. Install Data Formulator in a virtual environment:
pip install data_formulator
  2. Start the application using either of the following commands:
data_formulator

OR

python -m data_formulator
  3. You can also specify a custom port if required:
python -m data_formulator --port 8080

Method 2: Through GitHub Codespaces

The Data Formulator tool can be run in a completely zero-setup environment in GitHub Codespaces:

  1. Visit the Data Formulator Github repository.
  2. Click “Open in GitHub Codespaces”.
  3. Wait for this environment to initialize (~5 minutes).
  4. You can start using Data Formulator immediately.

Method 3: Through Developer Mode

Users who want full control over the development environment can follow these steps:

  1. Clone the repository: https://github.com/microsoft/data-formulator.
  2. Follow the instructions in DEVELOPMENT.md thoroughly for the setup.
  3. Set up your favourite development environment.
  4. Configure the AI model by choosing a policy for entering API keys for your preferred LLM.
  5. Upload your data in the form of a CSV file, or connect it to a data source.
  6. Start making visualisations from the user interface.

Hands-on Application of Data Formulator

Now, let’s try building a sales performance dashboard using Data Formulator. For this task, we’ll use GitHub Codespaces to launch a dedicated development environment.

Step 1: Open the Data Formulator GitHub repository and click the green “Open in GitHub Codespaces” button, which will create a separate workspace for you.

Step 2: Let the Codespace initialize, which usually takes around 2-5 minutes. Once the GitHub Codespace is created, it will look like this:

Step 3: In the terminal of the Codespace, run the following command:

python3 -m data_formulator

This will show an output like:

Starting server on port 3000
...
Open http://localhost:3000

Step 4: In the Codespaces toolbar, open the ‘Ports’ tab and follow the forwarded port link. This will open the interface in a separate browser window.

Step 5: Here, you can select your preferred key type and model name, and set the secret key for creating the dashboard.

Step 6: Upload the dataset. For our example, I am uploading supermarket_sales.csv data for analysis.

Step 7: For the basic visualization, you can choose a bar chart out of all the options, and then assign the x-axis and y-axis values. For our analysis, I have assigned the branch to the x-axis and the total to the y-axis. Here’s the chart Data Formulator created for me.

Step 8: For a different AI-powered calculation, you can choose other fields on the x-axis and y-axis. Then add your prompt and formulate. For instance, here I’m going to type in the prompt box “Sum the total sales for each city” and click on “Formulate”.
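
Behind the scenes, a prompt like “Sum the total sales for each city” translates into a simple group-by aggregation. Below is a hedged sketch of the kind of pandas code the tool might generate; the column names assume a supermarket_sales.csv-style schema and may differ from the actual file:

```python
import pandas as pd

# Toy stand-in for supermarket_sales.csv; real column names may differ.
sales = pd.DataFrame({
    "City": ["Yangon", "Naypyitaw", "Yangon", "Mandalay"],
    "Total": [100.0, 50.0, 25.0, 75.0],
})

# "Sum the total sales for each city" -> a group-by aggregation.
city_totals = sales.groupby("City", as_index=False)["Total"].sum()
print(city_totals)
```

Data Formulator shows the generated code alongside the chart, so you can verify that the aggregation matches what you asked for in the prompt.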

Step 9: You can create various other types of charts and visualizations using the customized dashboard and come up with amazing analyses of your data.

Use Cases of Data Formulator

Microsoft’s Data Formulator is of great use across various domains as it enables AI-powered explorations and visualizations. Some of its most prominent use cases are:

  • Business Intelligence and Reporting: Data Formulator stands out for building executive dashboards and operational reports. Business analysts can instantly transform sales data, financial metrics, or operational KPIs into visualizations without any technical expertise.
  • Academic Research and Analysis: In the research context, Data Formulator assists in the investigation of complicated datasets and the generation of publication-ready visualizations. Because of its iterative nature, the tool supports exploratory data analysis workflows common in academic research.
  • Marketing Analytics: With Data Formulator, marketing professionals analyze campaign performances, customer segmentations, and conversion funnels. The presence of calculated fields makes it easy to compute the metrics. For example, customer lifetime value, retention rates, and campaign ROI can be computed without any convoluted formulas.
  • Financial Analysis: Financial analysts can build complex models for risk measurement, portfolio analysis, and performance tracking. It can handle large data sets and connect to real-time data. Therefore, it can be used in analyzing market data, trade patterns, and financial forecasts.

Advantages of Data Formulator

Data Formulator is aimed at maximizing data accessibility, speed, and intelligent data processing.

  • Democratization of Data Analysis: Data Formulator’s greatest strength is making advanced data visualization truly accessible to non-technical users. It eliminates the need for coding skills, letting users analyze data directly without relying on technical resources.
  • Rapid Prototyping and Iteration: The conversational interface lets users try various visualization approaches quickly. They can test ideas, refine a chart, and explore alternative views of their data, significantly reducing the time it takes to go from question to insight.
  • Intelligent Data Transformations: While ordinary tools expect users to prepare their data in advance, Data Formulator handles complex transformations and aggregations itself. It performs calculations from users’ instructions automatically, saving hours otherwise spent on manual data wrangling.
  • Transparency and Explainability: The system generates human-readable code for all transformations, so users can inspect the logic behind their visualizations, which builds trust and supports learning.
  • Cost-Effective Solution: Being an open-source tool, Data Formulator provides enterprise-grade capabilities at zero licensing cost. Organizations can also deploy the tool internally, keeping total control over their data and customizations.

Limitations of Data Formulator

While Data Formulator overcomes some significant challenges, it is not without constraints, namely:

  • AI Model Dependencies: The efficacy of Data Formulator depends on the capabilities of the underlying AI model. Complex analytical tasks may require expensive high-end models, and even the best models can fall short.
  • Limited Visualization Types: It supports standard chart types but lacks specialized visualizations such as network graphs, geospatial maps, and advanced statistical plots.
  • Workability on Large Datasets: While the DuckDB-based implementation improves performance on large datasets, it can still face bottlenecks during the initial data-loading phase on very large datasets, typically those measured in terabytes.
  • Ambiguity of Natural Language: Complicated analytical requests may be misinterpreted by the AI, leading to incorrect transformations. Users need to give clear, precise prompts, which can be difficult for those without technical skills.
  • Privacy and Security Considerations: Cloud-based AI models may transmit sensitive data to external services. Organizations with strict data governance policies may prefer to deploy local models or adopt additional security measures.

Conclusion

Microsoft’s Data Formulator marks a landmark in making data analysis and visualization more accessible. By merging AI with an intuitive user interface, the research group has developed a tool that bridges the gap between data complexity and analytical insight. By automating complicated data transformations through code generation, it caters to beginners and experts alike.

Data Formulator presents a compelling, cost-effective solution for organizations that want to do data analytics and visualization on their own. As AI evolves, tools like Data Formulator will further reduce the time between posing a question about data and receiving an answer in return.

Gen AI Intern at Analytics Vidhya 
Department of Computer Science, Vellore Institute of Technology, Vellore, India 

I am currently working as a Gen AI Intern at Analytics Vidhya, where I contribute to innovative AI-driven solutions that empower businesses to leverage data effectively. As a final-year Computer Science student at Vellore Institute of Technology, I bring a solid foundation in software development, data analytics, and machine learning to my role. 

Feel free to connect with me at [email protected] 
