polars

Discover how to work with polars, a blazingly fast alternative to pandas for data manipulation.

Introduction

In this lesson, we’ll learn about a powerful Python library called polars. It’s a data manipulation library implemented in Rust and designed to work efficiently with large datasets, offering a fast and flexible alternative to the classic pandas library. The polars library has also been shown to demonstrate significant speed improvements on pandas in data processing efficiency benchmarks, such as for Join operations.

The polars library is able to achieve these performance improvements due to the following reasons:

  • Rust implementation: polars is written in the Rust programming language, which is known for its speed and memory safety. This allows for optimizations that might not be as easily achievable in Python.

  • Lazy evaluation: polars leverages lazy evaluation to optimize the execution of a query plan. This means that operations aren’t immediately executed, but instead, a logical plan is constructed. The optimal physical plan is then executed once the user requests the data, leading to more efficient computations.

  • Multi-threading: polars often makes use of multi-threading for computations, taking full advantage of modern CPU architecture. This allows for parallel processing of the data, significantly speeding up computations, especially on multi-core systems.

  • Out-of-core: polars supports out-of-core data transformation with its streaming API, thereby allowing us to process our results without requiring all data to be in memory at the same time.

  • Vectorized Query Engine: polars uses Apache Arrow, a columnar data format, to process queries in a vectorized manner.

Besides delivering enhanced performance and scalability, polars also provides a DataFrame API similar to pandas, making it easy for pandas users to learn and utilize. In this lesson, we’ll work with an online retail transaction dataset, as shown below:

Get hands-on with 1200+ tech skills courses.