Using LLMs with LlamaIndex
Learn how to effectively connect LlamaIndex with different LLM providers, structure prompts, and test model responses to ensure accurate retrieval and response generation.
LlamaIndex enables seamless interaction between external data sources and large language models (LLMs), enhancing their ability to retrieve and process information intelligently. Before building complex AI applications, it is essential to understand how to connect LlamaIndex to different LLM providers and structure prompts effectively to optimize LLM outputs.
In this lesson, we will cover the following key topics:
Connecting LlamaIndex to LLM providers: We’ll configure OpenAI, Hugging Face, and local models for use with LlamaIndex.
Writing and structuring effective prompts for LLMs: We’ll explore how to design prompts that improve response quality and accuracy.
Generating LLM responses: We’ll send prompts to the configured model and check that its responses are coherent and factually accurate.
By the end of this lesson, we will be able to configure an LLM of our choice and interact with it through LlamaIndex to generate meaningful responses.
Installing dependencies
Before we begin, let’s ensure we have LlamaIndex installed and the necessary libraries for LLM integration. The following command installs the required dependencies for integrating LlamaIndex with different types of large language models (LLMs), including OpenAI models, Groq’s hosted models, Hugging Face models via API, and local models from Hugging Face and LlamaCPP.
!pip install llama-index llama-index-llms-openai llama-index-llms-groq llama-index-llms-huggingface-api llama-index-llms-huggingface llama-index-llms-llama-cpp
llama-index: The core LlamaIndex framework. It integrates LLMs with structured and unstructured data, supports agentic workflows, and lets us build complex AI systems that combine multiple tools for different tasks.
Fun fact: LlamaIndex was originally called GPT-Index. It was rebranded to LlamaIndex to emphasize its support for multiple LLMs, beyond just GPT models!
llama-index-llms-openai: Provides an interface to connect with OpenAI’s hosted models, such as GPT-4 and GPT-3.5-turbo, using the OpenAI API.
llama-index-llms-groq: Allows integration with Groq’s high-speed inference API for running LLaMA 3 models efficiently. Groq is a cost-effective and high-performance alternative to traditional cloud-based LLMs.
llama-index-llms-huggingface-api: Provides an interface to Hugging Face’s hosted models. Instead of downloading and running models locally, this package lets us send requests to Hugging Face’s Inference API and receive responses.
llama-index-llms-huggingface: Allows us to download and run Hugging Face models locally. It integrates with Hugging Face’s transformers library, enabling the use of models without requiring an API key. This is particularly useful when running models on local GPUs or CPUs instead of relying on cloud-based API requests.
llama-index-llms-llama-cpp: Runs lightweight LLMs optimized for local execution. It integrates with llama.cpp, an efficient inference engine for running GGUF models locally. It is ideal for deploying LLMs in environments with limited compute power.
Fun fact: The "cpp" in llama.cpp stands for C++, the programming language in which it is written. llama.cpp was created to enable highly optimized and efficient inference of LLMs on CPUs, making it possible to run models even on low-end devices!
In this lesson, we’ll use OpenAI’s GPT, Groq-hosted LLaMA 3 models, Hugging Face-hosted Mistral models, and Llama-based models via Llama.cpp. However, LlamaIndex supports many more LLMs. The full list is available in their official documentation.
Connecting LlamaIndex to LLM providers
LlamaIndex supports multiple LLM providers, including OpenAI, Groq, Hugging Face, and local models. Let’s configure each provider.
Using OpenAI’s API with LlamaIndex
OpenAI’s GPT models are among the most powerful LLMs available. To connect LlamaIndex to OpenAI, we need to follow these steps:
Set up an OpenAI API key: We must create an OpenAI account and generate an API key from OpenAI's platform. The API key should be stored securely to avoid unauthorized access.
Configure OpenAI as our LLM provider: Once we have the API key, we can configure OpenAI as our LLM provider using the following lines of code.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o", api_key="YOUR_OPENAI_API_KEY")
Line 1: We import the OpenAI class from llama_index.llms.openai, which is specifically designed to work with OpenAI models.
Line 3: The OpenAI class provides an interface for API calls to OpenAI’s hosted models (e.g., GPT-4, GPT-3.5-turbo).
The model parameter specifies which OpenAI model to use, such as gpt-3.5-turbo, gpt-4, or gpt-4o.
The api_key="YOUR_OPENAI_API_KEY" parameter is where you must insert your actual OpenAI API key to authenticate and access the API.
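Because the API key should be stored securely rather than hard-coded, a common pattern is to read it from an environment variable. Below is a minimal sketch of this approach; the variable name OPENAI_API_KEY is just a convention we assume has been set beforehand (e.g., with export OPENAI_API_KEY="..."):

import os
from llama_index.llms.openai import OpenAI

# Read the key from the environment instead of embedding it in the source code
llm = OpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))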
Using Groq’s API with LlamaIndex
Groq provides hosted access to instruction-tuned Llama models (including Llama 3 and Llama 4) with exceptional speed and cost efficiency. Groq is a suitable alternative when response time and cost are critical considerations.
To connect Groq-hosted models with LlamaIndex, we need to follow these steps:
Set up a Groq API key: We need to create an account at console.groq.com and generate an API key. This key should be stored securely.
Configure Groq as the LLM provider: Once we have the API key, we can configure Groq as our LLM provider in LlamaIndex using the following lines of code:
from llama_index.llms.groq import Groq
llm = Groq(model="meta-llama/llama-4-scout-17b-16e-instruct", api_key="YOUR_GROQ_API_KEY")
Line 1: We import the Groq class from llama_index.llms.groq, which provides the interface to Groq’s hosted LLaMA models.
Line 2: The Groq class provides an interface for making API calls to Groq-hosted models.
The model parameter lets us select from the available Groq-supported models, such as "llama3-70b-8192", "llama3-8b-8192", or the newer Llama 4 model, meta-llama/llama-4-scout-17b-16e-instruct.
The api_key parameter is required to authenticate our request using the Groq API.
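As a quick sanity check, we can send a short test prompt to the configured Groq model using the complete() method (covered in detail later in this lesson). A minimal sketch, assuming the llm object from the snippet above:

# Send a short test prompt to confirm the API key and model name are valid
print(llm.complete("Say hello in one short sentence."))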
Using Hugging Face API with LlamaIndex
Hugging Face provides access to a wide range of open-source LLMs.
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

llm = HuggingFaceInferenceAPI(model="mistralai/Mistral-7B-Instruct-v0.3", token="HUGGING_FACE_TOKEN")
Line 1: We import the HuggingFaceInferenceAPI class from llama_index.llms.huggingface_api. This class is designed to interface with models hosted on Hugging Face’s Inference API, allowing us to use pretrained models without requiring local computation.
Line 3: The HuggingFaceInferenceAPI class enables interaction with Hugging Face-hosted models via an API call.
The model parameter specifies which model to use. In this case, "mistralai/Mistral-7B-Instruct-v0.3" refers to the Mistral 7B Instruct model, a high-quality open-weight LLM optimized for instruction-following tasks. (Open-weight means the model’s trained parameters, or “weights,” are publicly available, so developers can download and run it locally or fine-tune it for their own use. This contrasts with closed models like GPT-4, where the weights are not released and can only be accessed via API.)
The token parameter requires your Hugging Face API token for authentication, which grants access to Hugging Face’s hosted models.
Running LLMs locally using LlamaIndex
If we prefer running models locally for privacy, cost efficiency, or offline capabilities, LlamaIndex allows integration with locally hosted models. Running LLMs on local hardware ensures full control over data privacy, removes dependency on cloud-based APIs, and can be more cost-effective in the long run.
LlamaIndex supports multiple ways to run local LLMs, including Hugging Face transformers and llama.cpp.
Hugging Face transformers
Hugging Face provides access to a wide range of pretrained models, which can be loaded and used locally without relying on external APIs. LlamaIndex supports this integration through the HuggingFaceLLM class.
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(model_name="mistralai/Mistral-7B-Instruct-v0.3")
Line 1: We import the HuggingFaceLLM class from llama_index.llms.huggingface. This class allows us to load and use Hugging Face models locally without relying on an API.
Line 3: The HuggingFaceLLM class instantiates a local model.
The model_name parameter specifies which model to use. Here, "mistralai/Mistral-7B-Instruct-v0.3" refers to the Mistral 7B Instruct model, an open-weight LLM optimized for instruction-following tasks.
The model is downloaded and cached locally for efficient reuse, eliminating the need for cloud-based inference.
This approach is best suited when we have access to a GPU with at least 13–16 GB of VRAM, or enough CPU memory to handle large models in full precision. It gives us flexibility and high-quality responses but requires more memory.
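If we want more control over where and how the model runs, the HuggingFaceLLM class also accepts additional parameters. The sketch below is one possible configuration; the specific values (context window, token limit, device placement) are assumptions you should adapt to your hardware:

from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.3",
    context_window=4096,   # assumed prompt budget; adjust to the model's limits
    max_new_tokens=256,    # cap on generated tokens per response
    device_map="auto",     # place the model on a GPU if one is available
)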
llama.cpp
For running highly optimized local models, LlamaIndex provides support for LlamaCPP, which wraps the lightweight and efficient llama.cpp inference engine. This approach is useful for running models on devices with limited resources.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(model_url="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf")
Line 1: We import the LlamaCPP class from llama_index.llms.llama_cpp, which provides an interface for running GGUF-format models locally using the optimized llama.cpp implementation.
Line 3: We instantiate LlamaCPP to load a GGUF-format model for local inference.
The model_url parameter specifies the direct download URL for the model. In this case, qwen2.5-7b-instruct-q3_k_m.gguf is an instruction-tuned model available on Hugging Face.
If not already downloaded, the model is retrieved and loaded into memory, enabling efficient offline inference.
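The LlamaCPP class also exposes knobs for generation and hardware use. The following sketch is illustrative only; values such as the context window, token limit, and number of GPU-offloaded layers are assumptions to tune for your machine:

from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_url="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf",
    temperature=0.3,
    max_new_tokens=256,
    context_window=4096,
    # model_kwargs are forwarded to llama.cpp; n_gpu_layers offloads some layers to the GPU (assumed value)
    model_kwargs={"n_gpu_layers": 20},
    verbose=False,
)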
Quantized models like q3_k_m or q4_k_m significantly reduce memory usage. This allows us to run 7B models efficiently on systems with:
8–16 GB RAM (for CPU execution)
6–8 GB VRAM (for GPU acceleration)
💡 Tip: We recommend quantized GGUF models with llama.cpp when working in limited environments or testing on a personal machine. These models are smaller, faster, and surprisingly capable.
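Whichever provider we choose, we can make the configured model the default for the rest of our LlamaIndex code (query engines, agents, and so on) by registering it on the global Settings object:

from llama_index.core import Settings

# Components that need an LLM will now fall back to this one by default
Settings.llm = llm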
Understanding prompting in LlamaIndex
Now that we have learned how to connect an LLM in LlamaIndex, the next step is to structure prompts effectively. LlamaIndex provides mechanisms to define system prompts, incorporate retrieved data, and fine-tune response settings. This helps guide the model’s responses and improves contextual accuracy.
Defining system prompts
A system prompt sets the overall behavior of the LLM, ensuring responses align with a specific role or objective. In LlamaIndex, we can define a system prompt as follows:
llm.system_prompt = "You are an AI assistant that provides concise answers based on user's query."
By using a system prompt, we ensure that the model consistently follows the intended instructions throughout interactions.
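Depending on the provider class, the system prompt can usually also be supplied when the LLM object is constructed, which keeps the configuration in one place. A minimal sketch using the Groq class from earlier (treat the parameter placement as an assumption to verify for your provider):

from llama_index.llms.groq import Groq

llm = Groq(
    model="llama3-70b-8192",
    api_key="YOUR_GROQ_API_KEY",
    system_prompt="You are an AI assistant that provides concise answers based on user's query.",
)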
Generating responses
Once we define prompts, the next step is to send them to the LLM and receive a generated response. LlamaIndex provides two methods for this: one synchronous (.complete()) and one asynchronous (.acomplete()).
Synchronous method: .complete()
This is the standard way to request a response. When we use .complete(), the program waits until the LLM finishes responding before moving on to the next line of code.
user_prompt = "What is the capital of France?"response = llm.complete(user_prompt)print(response)
Asynchronous method: .acomplete()
The asynchronous version, .acomplete(), lets the rest of the program keep running while the LLM works on generating the response. This is useful when building apps or agents that need to handle multiple tasks at once or stay responsive.
import asyncio

async def get_response():
    response = await llm.acomplete("What is the capital of France?")
    print(response.text)

asyncio.run(get_response())
Note on asynchronous code:
In Python, we can run certain operations asynchronously, which means they don’t have to wait for one task to finish before starting another. This is especially useful when calling external APIs—like an LLM—where waiting on a response can slow things down.
To use async code in Python:
async def defines an asynchronous function.
await tells Python to pause and wait for something to finish (like an LLM call), without blocking the rest of the program.
asyncio.run() is used to start an async function from regular code.
This pattern is especially recommended for agentic systems and workflows, where multiple async tasks might run together.
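Because .acomplete() is asynchronous, we can also fire several prompts concurrently and wait for all of them together with asyncio.gather(). A minimal sketch, assuming llm is any of the models configured earlier:

import asyncio

async def ask(question):
    response = await llm.acomplete(question)
    return response.text

async def main():
    # Both requests are in flight at the same time instead of running one after the other
    answers = await asyncio.gather(
        ask("What is the capital of France?"),
        ask("What is the capital of Japan?"),
    )
    for answer in answers:
        print(answer)

asyncio.run(main())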
Fine-tuning response behavior
LlamaIndex enables us to fine-tune the model’s response generation by adjusting key generation parameters. Here’s how we configure them:
llm.temperature = 0.3  # Reduces randomness
llm.max_tokens = 512   # Limits response length
llm.top_p = 0.9        # Controls diversity of token selection
Temperature: A lower value (e.g., 0.3) makes responses more deterministic, while a higher value introduces more randomness.
Max tokens: Limits the length of responses to ensure conciseness.
Top-p (nucleus sampling): Controls how diverse the token selection is; lower values focus on high-probability words, producing more focused outputs.
By fine-tuning these parameters, we can balance creativity and precision based on our application’s needs.
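These settings can also be supplied when the LLM object is created, which avoids changing it afterward. A sketch using the Groq class from earlier; note that top_p is passed through additional_kwargs here, which is an assumption about how this provider class forwards extra sampling options:

from llama_index.llms.groq import Groq

llm = Groq(
    model="llama3-70b-8192",
    api_key="YOUR_GROQ_API_KEY",
    temperature=0.3,                   # reduces randomness
    max_tokens=512,                    # limits response length
    additional_kwargs={"top_p": 0.9},  # forwarded to the underlying API (assumed)
)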
Code execution
To enhance the learning experience, we have provided a playground where you can experiment with connecting to and running LLMs from different providers via API. In the playground below, we have used Groq by default, but you can try other providers by updating the first two lines of the code.
Note: You can experiment with API-based LLM integrations (e.g., OpenAI, Groq, Hugging Face API) directly on our platform. However, running local LLMs (e.g., LlamaCPP, Hugging Face local models) may take significantly longer, and our session time is limited. If you would like to run local models, we recommend executing the code on Google Colab or your own local machine.
from llama_index.llms.groq import Groq

llm = Groq(model="llama3-70b-8192", api_key="{{GROQ_API_KEY}}")

query = "What is the capital of France?"
response = llm.complete(query)
print(response)
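To switch providers in the playground, only the import and the llm line need to change; for example, the OpenAI configuration from earlier in this lesson could be dropped in instead (remember to supply your own API key):

# Example swap: use OpenAI instead of Groq
# from llama_index.llms.openai import OpenAI
# llm = OpenAI(model="gpt-4o", api_key="YOUR_OPENAI_API_KEY")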
Conclusion
In this lesson, we explored how to integrate LlamaIndex with various LLM providers, including OpenAI, Groq, Hugging Face, and local models using Llama.cpp. We covered installing dependencies, connecting each provider, structuring prompts, and fine-tuning responses for clarity and consistency.
Groq provides an excellent option for cost-effective, fast inference using LLaMA 3 models, making it a strong addition to the LLM landscape. With these tools in place, we can now confidently choose and switch between LLMs to suit the needs of our AI applications.