Using LLMs with LlamaIndex
Learn how to effectively connect LlamaIndex with different LLM providers, structure prompts, and test model responses to ensure accurate retrieval and response generation.
LlamaIndex enables seamless interaction between external data sources and large language models (LLMs), enhancing their ability to retrieve and process information intelligently. Before building complex AI applications, it is essential to understand how to connect LlamaIndex to different LLM providers and structure prompts effectively to optimize LLM outputs.
In this lesson, we will cover the following key topics:
Connecting LlamaIndex to LLM providers: We’ll configure OpenAI, Hugging Face, and local models for use with LlamaIndex.
Writing and structuring effective prompts for LLMs: We’ll explore how to design prompts that improve response quality and accuracy.
Generating LLM responses: We’ll send prompts to the configured model and check that its responses are coherent and factually accurate.
By the end of this lesson, we will be able to configure an LLM of our choice and interact with it through LlamaIndex to generate meaningful responses.
Installing dependencies
Before we begin, let’s ensure we have LlamaIndex installed and the necessary libraries for LLM integration. The following command installs the required dependencies for integrating LlamaIndex with different types of large language models (LLMs), including OpenAI models, Groq’s hosted models, Hugging Face models via API, and local models from Hugging Face and LlamaCPP.
!pip install llama-index llama-index-llms-openai llama-index-llms-groq llama-index-llms-huggingface-api llama-index-llms-huggingface llama-index-llms-llama-cpp
llama-index: The core LlamaIndex framework. It integrates LLMs with structured and unstructured data, supports agentic workflows, and lets us build complex AI systems that combine multiple tools for different tasks.
Fun fact: LlamaIndex was originally called GPT-Index. It was rebranded to LlamaIndex to emphasize its support for multiple LLMs, beyond just GPT models!
llama-index-llms-openai: Provides an interface to connect with OpenAI’s hosted models, such as GPT-4 and GPT-3.5-turbo, using the OpenAI API.
llama-index-llms-groq: Allows integration with Groq’s high-speed inference API for running LLaMA 3 models efficiently. Groq is a cost-effective and high-performance alternative to traditional cloud-based LLMs.
llama-index-llms-huggingface-api: Provides an interface to Hugging Face’s hosted models. Instead of downloading and running models locally, this package lets us send requests to Hugging Face’s Inference API and receive responses.
llama-index-llms-huggingface: Allows us to download and run Hugging Face models locally. It integrates with Hugging Face’s transformers library, enabling the use of models without requiring an API key. This is particularly useful when running models on local GPUs or CPUs instead of relying on cloud-based API requests.
llama-index-llms-llama-cpp: Runs lightweight LLMs optimized for local execution. It integrates with llama.cpp, an efficient inference engine for running GGUF models locally. It is ideal for deploying LLMs in environments with limited compute power.
Fun fact: The "cpp" in llama.cpp stands for C++, the programming language in which it is written. llama.cpp was created to enable highly optimized and efficient inference of LLMs on CPUs, making it possible to run models even on low-end devices!
In this lesson, we’ll use OpenAI’s GPT, Groq-hosted LLaMA 3 models, Hugging Face-hosted Mistral models, and Llama-based models via Llama.cpp. However, LlamaIndex supports many more LLMs. The full list is available in their official documentation.
Connecting LlamaIndex to LLM providers
LlamaIndex supports multiple LLM providers, including OpenAI, Groq, Hugging Face, and local models. Let’s configure each provider.
Using OpenAI’s API with LlamaIndex
OpenAI’s GPT models are among the most powerful LLMs available. To connect LlamaIndex to OpenAI, we need to follow these steps:
Set up an OpenAI API key: We must create an OpenAI account and generate an API key from OpenAI's platform. The API key should be stored securely to avoid unauthorized access.
Configure OpenAI as our LLM provider: Once we have the API key, we can configure OpenAI as our LLM provider using the following lines of code.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o", api_key="YOUR_OPENAI_API_KEY")
Line 1: We import the OpenAI class from llama_index.llms.openai, which is specifically designed to work with OpenAI models.
Line 3: The OpenAI class provides an interface for API calls to OpenAI’s hosted models (e.g., GPT-4, GPT-3.5-turbo).
The model parameter specifies which OpenAI model to use, such as gpt-3.5-turbo, gpt-4, or gpt-4o.
The api_key="YOUR_OPENAI_API_KEY" parameter is where you must insert your actual OpenAI API key to authenticate and access the API.
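Because the API key should be stored securely rather than hard-coded, a common pattern is to read it from an environment variable. Below is a minimal sketch of this approach; the variable name OPENAI_API_KEY is just a convention we assume has been set beforehand (e.g., with export OPENAI_API_KEY="..."):

import os
from llama_index.llms.openai import OpenAI

# Read the key from the environment instead of embedding it in the source code
llm = OpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))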
Using Groq’s API with LlamaIndex
Groq provides hosted access to instruction-tuned Llama models (including Llama 3 and Llama 4) with exceptional speed and cost efficiency. Groq is a suitable alternative when response time and cost are critical considerations.
To connect Groq-hosted models with LlamaIndex, we need to follow these steps:
Set up a Groq API key: We need to create an account at console.groq.com and generate an API key. This key should be stored securely.
Configure Groq as the LLM provider: Once we have the API key, we can configure Groq as our LLM provider in LlamaIndex using the following lines of code:
from llama_index.llms.groq import Groq
llm = Groq(model="meta-llama/llama-4-scout-17b-16e-instruct", api_key="YOUR_GROQ_API_KEY")
Line 1: We import the Groq class from llama_index.llms.groq, which provides the interface to Groq’s hosted LLaMA models.
Line 2: The Groq class provides an interface for making API calls to Groq-hosted models.
The model parameter lets us select from the available Groq-supported models, such as "llama3-70b-8192", "llama3-8b-8192", or the newer Llama 4 model, meta-llama/llama-4-scout-17b-16e-instruct.
The api_key parameter is required to authenticate our request using the Groq API.
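As a quick sanity check, we can send a short test prompt to the configured Groq model using the complete() method (covered in detail later in this lesson). A minimal sketch, assuming the llm object from the snippet above:

# Send a short test prompt to confirm the API key and model name are valid
print(llm.complete("Say hello in one short sentence."))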
Using Hugging Face API with LlamaIndex
Hugging Face provides access to a wide range of open-source LLMs.
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

llm = HuggingFaceInferenceAPI(model="mistralai/Mistral-7B-Instruct-v0.3", token="HUGGING_FACE_TOKEN")
Line 1: We import the HuggingFaceInferenceAPI class from llama_index.llms.huggingface_api. This class is designed to interface with models hosted on Hugging Face’s Inference API, allowing us to use pretrained models without requiring local computation.
Line 3: The HuggingFaceInferenceAPI class enables interaction with Hugging Face-hosted models via an API call.
The model parameter specifies which model to use. In this case, "mistralai/Mistral-7B-Instruct-v0.3" refers to the Mistral 7B Instruct model, a high-quality open-weight LLM optimized for instruction-following tasks. (Open-weight means the model’s trained parameters, or “weights,” are publicly available, so developers can download and run it locally or fine-tune it for their own use. This contrasts with closed models like GPT-4, where the weights are not released and can only be accessed via API.)
The token parameter requires your Hugging Face API token for authentication, which grants access to Hugging Face’s hosted models.
Running LLMs locally using LlamaIndex
If we prefer running models locally for privacy, cost efficiency, or offline capabilities, LlamaIndex allows integration with locally hosted models. Running LLMs on local hardware ensures full control over data privacy, removes dependency on cloud-based APIs, and can be more cost-effective in the long run.
LlamaIndex supports multiple ways to run local LLMs, including Hugging Face transformers and llama.cpp.
Hugging Face transformers
Hugging Face provides access to a wide range of pretrained models, which can be loaded and used locally without relying on external APIs. LlamaIndex supports this integration through the HuggingFaceLLM class.
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(model_name="mistralai/Mistral-7B-Instruct-v0.3")
Line 1: We import the HuggingFaceLLM class from llama_index.llms.huggingface. This class allows us to load and use Hugging Face models locally without relying on an API.
Line 3: The HuggingFaceLLM class instantiates a local model.
The model_name parameter specifies which model to use. Here, "mistralai/Mistral-7B-Instruct-v0.3" refers to the Mistral 7B Instruct model, an open-weight LLM optimized for instruction-following tasks.
The model is downloaded and cached locally for efficient reuse, eliminating the need for cloud-based inference.
This approach is best suited when we have access to a GPU with at least 13–16 GB of VRAM, or enough CPU memory to handle large models in full precision. It gives us flexibility and high-quality responses but requires more memory.
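If we want more control over where and how the model runs, the HuggingFaceLLM class also accepts additional parameters. The sketch below is one possible configuration; the specific values (context window, token limit, device placement) are assumptions you should adapt to your hardware:

from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.3",
    context_window=4096,   # assumed prompt budget; adjust to the model's limits
    max_new_tokens=256,    # cap on generated tokens per response
    device_map="auto",     # place the model on a GPU if one is available
)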
llama.cpp
For running highly optimized local models, LlamaIndex provides support for LlamaCPP, which wraps the lightweight and efficient llama.cpp inference engine. This approach is useful for running models on devices with limited resources.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(model_url="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf")
Line 1: We import the LlamaCPP class from llama_index.llms.llama_cpp, which provides an interface for running GGUF-format models locally using the optimized llama.cpp implementation.
Line 3: We instantiate LlamaCPP to load a GGUF-format model for local inference.
The model_url parameter specifies the direct download URL for the model. In this case, qwen2.5-7b-instruct-q3_k_m.gguf is an instruction-tuned model available on Hugging Face.
If not already downloaded, the model is retrieved and loaded into memory, enabling efficient offline inference.
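The LlamaCPP class also exposes knobs for generation and hardware use. The following sketch is illustrative only; values such as the context window, token limit, and number of GPU-offloaded layers are assumptions to tune for your machine:

from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_url="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf",
    temperature=0.3,
    max_new_tokens=256,
    context_window=4096,
    # model_kwargs are forwarded to llama.cpp; n_gpu_layers offloads some layers to the GPU (assumed value)
    model_kwargs={"n_gpu_layers": 20},
    verbose=False,
)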
Quantized models like q3_k_m or q4_k_m significantly reduce memory usage. This allows us to run 7B models efficiently on systems with:
8–16 GB RAM (for CPU execution)
6–8 GB VRAM (for GPU acceleration)
💡 Tip: We recommend quantized GGUF models with llama.cpp when working in limited environments or testing on a personal machine. These models are smaller, faster, and surprisingly capable.
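Whichever provider we choose, we can make the configured model the default for the rest of our LlamaIndex code (query engines, agents, and so on) by registering it on the global Settings object:

from llama_index.core import Settings

# Components that need an LLM will now fall back to this one by default
Settings.llm = llm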
Understanding prompting in LlamaIndex
Now that we have learned how to connect an LLM in LlamaIndex, the next step is to structure prompts effectively. LlamaIndex provides mechanisms to define system prompts, incorporate retrieved data, and fine-tune response settings. This helps guide the model’s responses and improves contextual accuracy.
Defining system prompts
A system prompt sets the overall behavior of the LLM, ensuring responses align with a specific role or objective. In LlamaIndex, we can define a system prompt as follows:
llm.system_prompt = "You are an AI assistant that provides concise answers based on user's query."
By using a system prompt, we ensure that the model consistently follows the intended instructions throughout interactions.
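Depending on the provider class, the system prompt can usually also be supplied when the LLM object is constructed, which keeps the configuration in one place. A minimal sketch using the Groq class from earlier (treat the parameter placement as an assumption to verify for your provider):

from llama_index.llms.groq import Groq

llm = Groq(
    model="llama3-70b-8192",
    api_key="YOUR_GROQ_API_KEY",
    system_prompt="You are an AI assistant that provides concise answers based on user's query.",
)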
Generating responses
Once we define prompts, the next step is to send them to the LLM and receive a generated response. LlamaIndex provides two methods for this: one synchronous (.complete()) and one asynchronous (.acomplete()).
Synchronous method: .complete()
This is the standard way to request a response. When we use .complete(), the program waits until the LLM finishes responding before moving on to the next line of code.
user_prompt = "What is the capital of France?"response = llm.complete(user_prompt)print(response)
Asynchronous method: .acomplete()
The asynchronous version, .acomplete(), lets the rest of the program keep running while the LLM works on generating the response. This is useful when building apps or agents that need to handle multiple tasks at once or stay responsive.
import asyncio

async def get_response():
    response = await llm.acomplete("What is the capital of France?")
    print(response.text)

asyncio.run(get_response())
Note on asynchronous code:
In Python, we can run certain operations asynchronously, which means they don’t have to wait for one task to finish before starting another. This is especially useful when calling external APIs—like an LLM—where waiting on a response can slow things down.
To use async code in Python:
async def defines an asynchronous function.
await tells Python to pause and wait for something to finish (like an LLM call), without blocking the rest of the program.
asyncio.run() is used to start an async function from regular code.
This pattern is especially recommended for agentic systems and workflows, where multiple async tasks might run together.
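Because .acomplete() is asynchronous, we can also fire several prompts concurrently and wait for all of them together with asyncio.gather(). A minimal sketch, assuming llm is any of the models configured earlier:

import asyncio

async def ask(question):
    response = await llm.acomplete(question)
    return response.text

async def main():
    # Both requests are in flight at the same time instead of running one after the other
    answers = await asyncio.gather(
        ask("What is the capital of France?"),
        ask("What is the capital of Japan?"),
    )
    for answer in answers:
        print(answer)

asyncio.run(main())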
Fine-tuning response behavior
LlamaIndex enables us to fine-tune the model’s response generation by adjusting key generation parameters. Here’s how we configure them:
llm.temperature = 0.3  # Reduces randomness
llm.max_tokens = 512   # Limits response length
llm.top_p = 0.9        # Controls diversity of token selection
Temperature: A lower value (e.g., 0.3) makes responses more deterministic, while a higher value introduces more randomness.
Max tokens: Limits the length of responses to ensure conciseness.
Top-p (nucleus sampling): Controls how diverse the token selection is; lower values focus on high-probability words, producing more focused outputs.
By fine-tuning these parameters, we can balance creativity and precision based on our application’s needs.
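These settings can also be supplied when the LLM object is created, which avoids changing it afterward. A sketch using the Groq class from earlier; note that top_p is passed through additional_kwargs here, which is an assumption about how this provider class forwards extra sampling options:

from llama_index.llms.groq import Groq

llm = Groq(
    model="llama3-70b-8192",
    api_key="YOUR_GROQ_API_KEY",
    temperature=0.3,                   # reduces randomness
    max_tokens=512,                    # limits response length
    additional_kwargs={"top_p": 0.9},  # forwarded to the underlying API (assumed)
)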
Code execution
To enhance the learning experience, we have provided a playground where you can experiment with connecting to and running LLMs from different providers via API. In the playground below, we have used Groq by default, but you can try other providers by updating the first two lines of the code.
Note: You can experiment with API-based LLM integrations (e.g., OpenAI, Groq, Hugging Face API) directly on our platform. However, running local LLMs (e.g., LlamaCPP, Hugging Face local models) may take significantly longer, and our session time is limited. If you would like to run local models, we recommend executing the code on Google Colab or your own local machine.
from llama_index.llms.groq import Groq

llm = Groq(model="llama3-70b-8192", api_key="{{GROQ_API_KEY}}")

query = "What is the capital of France?"
response = llm.complete(query)
print(response)
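To switch providers in the playground, only the import and the llm line need to change; for example, the OpenAI configuration from earlier in this lesson could be dropped in instead (remember to supply your own API key):

# Example swap: use OpenAI instead of Groq
# from llama_index.llms.openai import OpenAI
# llm = OpenAI(model="gpt-4o", api_key="YOUR_OPENAI_API_KEY")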
Conclusion
In this lesson, we explored how to integrate LlamaIndex with various LLM providers, including OpenAI, Groq, Hugging Face, and local models using Llama.cpp. We covered installing dependencies, connecting each provider, structuring prompts, and fine-tuning responses for clarity and consistency.
Groq provides an excellent option for cost-effective, fast inference using LLaMA 3 models, making it a strong addition to the LLM landscape. With these tools in place, we can now confidently choose and switch between LLMs to suit the needs of our AI applications.