Writing a Python module using Rust

Feb. 19, 2023

Using Rust to implement a faster Python module, and other use cases.

PyO3 is a library for creating bindings between Rust and Python. It is used by many popular Python packages, such as orjson, tokenizers, polars, pydantic-core, ruff and many more. In this article, we will implement a hypothetical Python module that calculates directory sizes faster, and then expose a Rust syntax highlighting library to the Python world.

To build and package the Python module, we will use maturin to create the skeleton structure and default configuration.

Python module in Rust

Install CPython with the shared library enabled. --enable-shared is a configure option that builds Python as a shared library (libpython).

# on mac, using pyenv

$ env PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 3.11.0

For a reference on Python build flags with pyenv, see the pyenv project wiki.
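To confirm that the installed interpreter really was built as a shared library, we can ask the interpreter itself. This is a minimal sketch: the helper name python_is_shared is ours, and it relies on the Py_ENABLE_SHARED build variable, which may be missing (None) on some platforms.

```python
import sysconfig


def python_is_shared() -> bool:
    """Return True when this interpreter was configured with --enable-shared."""
    # Py_ENABLE_SHARED is 1 for shared builds; it can be 0 or None otherwise.
    return bool(sysconfig.get_config_var("Py_ENABLE_SHARED"))


if __name__ == "__main__":
    print("shared libpython:", python_is_shared())
    # LDLIBRARY names the libpython file produced by the build, when known.
    print("library file:", sysconfig.get_config_var("LDLIBRARY"))
```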

Install maturin using pip, then create a new project named tobira (扉).

$ python -mpip install --user -U maturin
$ maturin init -b pyo3 tobira
$ cd tobira

By default, maturin creates a Rust library project with crate type cdylib (a shared library) and pyo3 as a dependency. Here is the Cargo.toml file:

[package]
name = "tobira"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[lib]
name = "tobira"
crate-type = ["cdylib"]

[dependencies]
pyo3 = { version = "0.18.0", features = ["extension-module"] }

The lib.rs generated by maturin consists of a sum_as_string function, which adds two integers and returns the result as a String, and the tobira module definition:

use pyo3::prelude::*;

/// Formats the sum of two numbers as string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

/// A Python module implemented in Rust.
#[pymodule]
fn tobira(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}

The Rust attribute #[pyfunction] marks a function to be exposed to Python, and #[pymodule] defines the Python module.

To build and install the package with maturin develop, a virtual environment is required, so we set one up using the venv module.

$ python -mvenv venv
$ source ./venv/bin/activate
# maturin develop, to build and install in current virtual env
$ maturin develop
🔗 Found pyo3 bindings
🐍 Found CPython 3.11 at /Users/sakti/dev/tobira/venv/bin/python3
💻 Using `MACOSX_DEPLOYMENT_TARGET=11.0` for aarch64-apple-darwin by default
   Compiling target-lexicon v0.12.6
   Compiling autocfg v1.1.0
   Compiling proc-macro2 v1.0.51
...
    Finished dev [unoptimized + debuginfo] target(s) in 7.00s
📦 Built wheel for CPython 3.11 to /Users/sakti/dev/tobira/target/wheels/tobira-0.1.0-cp311-cp311-macosx_11_0_arm64.whl
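If maturin develop complains about a missing environment, a quick check from Python can tell whether the active interpreter belongs to a virtual environment. This is a small sketch with a helper name of our own (in_virtualenv); it only covers environments created with venv or virtualenv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```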

To test the generated module, run:

$ python -c "import tobira; print(tobira.sum_as_string(1, 1))"
2

Calculate number of bytes in a directory (recursive)

For our hypothetical use case, we will implement a program that calculates the size of a directory and its subdirectories. This differs from disk usage, which is based on hardware block size. Our program will calculate the size based on file metadata. For a complete implementation, please refer to the Rust Coreutils re-implementation.
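To make the difference between metadata size and disk usage concrete, here is a rough sketch that reports both numbers for a single file. The helper name apparent_and_allocated is ours, and st_blocks is only available on Unix-like systems, where it counts 512-byte units.

```python
import os
import tempfile


def apparent_and_allocated(path: str) -> tuple[int, int]:
    """Return (apparent size, allocated size) of one file, in bytes."""
    st = os.stat(path)
    # st_size is the length recorded in the file metadata.
    # st_blocks counts the 512-byte units actually allocated (Unix only).
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as fh:
        fh.write(b"x" * 10)
    try:
        print(apparent_and_allocated(fh.name))
    finally:
        os.remove(fh.name)
```

The first number is what our program sums up; the second is closer to what a disk usage tool reports.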

Python naive implementation

We use os.scandir, which is faster than os.listdir when file type and attribute information is also needed, and then use the humanize package to convert the size in bytes to a more readable format with a unit (KiB, MiB, GiB).

# scandirsize.py
import os
import sys
import humanize


def get_size(start_path: str = ".") -> int:
    total_size = 0

    with os.scandir(start_path) as it:
        for entry in it:
            if entry.is_file():
                total_size += entry.stat().st_size
            elif entry.is_dir():
                total_size += get_size(entry.path)

    return total_size


def format(size: int) -> str:
    return humanize.naturalsize(size, binary=True)

def get_directory_size(path: str = ".") -> str:
    result = get_size(path)
    return format(result)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("need path argument")
        sys.exit(1)

    result = get_directory_size(sys.argv[1])
    print(result)

Let's test the get_directory_size function on the Linux source code:

$ python scandirsize.py ~/dev/linux-6.1.11 
1.2 GiB

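Note: symbolic links get no special handling in this implementation. entry.is_file() and entry.is_dir() follow symbolic links by default, so a linked file is counted with the size of its target, and broken links are skipped.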
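To make the symlink handling explicit, one option is to pass follow_symlinks=False to the os.DirEntry methods, which follow links by default. The variant below is only a sketch of that idea (the name get_size_no_symlinks is ours); it skips symbolic links entirely.

```python
import os


def get_size_no_symlinks(start_path: str = ".") -> int:
    """Sum file sizes under start_path without following symbolic links."""
    total_size = 0
    with os.scandir(start_path) as it:
        for entry in it:
            # With follow_symlinks=False both checks return False for links,
            # so symlinked files and directories are skipped.
            if entry.is_file(follow_symlinks=False):
                total_size += entry.stat(follow_symlinks=False).st_size
            elif entry.is_dir(follow_symlinks=False):
                total_size += get_size_no_symlinks(entry.path)
    return total_size
```

Skipping linked directories also avoids endless recursion when a link points back to one of its parent directories.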

Rust implementation

For Rust, we use std::fs::read_dir, which returns an iterator over the directory entries. Then std::fs::Metadata::len() gives the size of each file in bytes as a u64.

// src/lib.rs
use std::io::Result;
use std::{fs::read_dir, path::Path};

use humansize::{format_size, BINARY};
use pyo3::exceptions::PyValueError;
...

fn get_size(path: &Path) -> Result<u64> {
    let mut total_size = 0;
    let path_metadata = path.symlink_metadata()?;

    if path_metadata.is_dir() {
        for entry in read_dir(&path)? {
            let entry = entry?;
            let entry_metadata = entry.metadata()?;

            if entry_metadata.is_dir() {
                total_size += get_size(&entry.path())?;
            } else {
                total_size += entry_metadata.len();
            }
        }
    } else {
        total_size = path_metadata.len();
    }

    Ok(total_size)
}

fn format(size: u64) -> String {
    format_size(size, BINARY)
}

#[pyfunction]
fn get_directory_size(path: String) -> PyResult<String> {
    let p = Path::new(&path);
    let result = get_size(p).map_err(|e| PyValueError::new_err(e.to_string()))?;
    Ok(format(result))
}

#[pymodule]
fn tobira(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    m.add_function(wrap_pyfunction!(get_directory_size, m)?)?;
    Ok(())
}

Don't forget to add the humansize crate, the Rust ecosystem's equivalent of humanize.

$ cargo add humansize
# build the package with release profile
$ maturin develop --release
# test run the package
$ python -c "import tobira; print(tobira.get_directory_size('/Users/sakti/dev/linux-6.1.11'))"
1.21 GiB
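The two outputs (1.2 GiB and 1.21 GiB) appear to differ only in the number of decimals printed. To show what binary formatting does, here is a simplified sketch with a function name of our own, format_binary. It is not the code of humanize or humansize, and its rounding may differ from both.

```python
def format_binary(size: int, decimals: int = 2) -> str:
    """Format a byte count with binary (1024-based) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(size)
    index = 0
    # Divide by 1024 until the value fits the current unit.
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    if index == 0:
        return f"{size} B"
    return f"{value:.{decimals}f} {units[index]}"


if __name__ == "__main__":
    print(format_binary(1536))              # 1.50 KiB
    print(format_binary(1024 ** 3, decimals=1))  # 1.0 GiB
```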

Running benchmark

First, we need a benchmark script to compare both implementations, so we create the calculate.py file:

# calculate.py

import argparse

import tobira  # rust module
import scandirsize  # python module

parser = argparse.ArgumentParser(description="Compare directory size module")
parser.add_argument("path")
parser.add_argument("-rust", "--rust-module", action="store_true")


if __name__ == "__main__":
    args = parser.parse_args()
    if args.rust_module:
        result = tobira.get_directory_size(args.path)
    else:
        result = scandirsize.get_directory_size(args.path)
    print(result)

Then we can execute python calculate.py ~/dev/linux-6.1.11 to run the pure Python module, and python calculate.py -rust ~/dev/linux-6.1.11 to run the Rust one.

Benchmark using hyperfine:

$ hyperfine --warmup 1 -r 20 "python calculate.py ~/dev/linux-6.1.11" "python calculate.py -rust ~/dev/linux-6.1.11" --export-json result.json
Benchmark 1: python calculate.py ~/dev/linux-6.1.11
  Time (mean ± σ):     358.6 ms ±   2.7 ms    [User: 94.4 ms, System: 263.1 ms]
  Range (min … max):   356.3 ms … 367.4 ms    20 runs

Benchmark 2: python calculate.py -rust ~/dev/linux-6.1.11
  Time (mean ± σ):     336.0 ms ±   1.1 ms    [User: 60.2 ms, System: 274.6 ms]
  Range (min … max):   334.8 ms … 339.0 ms    20 runs

Summary
  'python calculate.py -rust ~/dev/linux-6.1.11' ran
    1.07 ± 0.01 times faster than 'python calculate.py ~/dev/linux-6.1.11'

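So the Rust version is 1.07 ± 0.01 times faster overall. Most of the run time is system time spent in file system calls (about 263 ms and 275 ms), which both implementations have to pay. If we look only at the user-space time, it goes down from 94.4 ms to 60.2 ms.

We ran hyperfine with --export-json to write the raw data as JSON for further processing. Let's create a histogram: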
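The hyperfine numbers include interpreter startup and module imports, which both commands pay. To time only the function call, we could measure inside the process. This is a sketch using the standard library, with a helper name of our own (bench); it is not how the numbers above were produced.

```python
import statistics
import time
from typing import Callable


def bench(fn: Callable[[], object], repeat: int = 20, warmup: int = 1) -> tuple[float, float]:
    """Return (mean, standard deviation) of fn's wall-clock time in seconds."""
    for _ in range(warmup):
        fn()  # warm the file system cache, like hyperfine's --warmup
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # stdev needs at least two samples, so keep repeat >= 2.
    return statistics.mean(samples), statistics.stdev(samples)
```

With both modules importable, it could be called as bench(lambda: tobira.get_directory_size(path)) and bench(lambda: scandirsize.get_directory_size(path)).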

$ python plot_histogram.py -o histogram.png result.json
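The exported JSON can also be summarized without plotting. The sketch below assumes the file has a top-level "results" list whose entries carry "command", "mean", "user" and "system" fields in seconds; the helper names summarize and load_report are ours.

```python
import json


def summarize(report: dict) -> list[str]:
    """Build one summary line per benchmarked command (times in milliseconds)."""
    lines = []
    for run in report["results"]:
        lines.append(
            f"{run['command']}: mean {run['mean'] * 1000:.1f} ms, "
            f"user {run['user'] * 1000:.1f} ms, system {run['system'] * 1000:.1f} ms"
        )
    return lines


def load_report(path: str) -> dict:
    """Read a JSON report such as result.json from disk."""
    with open(path) as fh:
        return json.load(fh)
```

For example, print("\n".join(summarize(load_report("result.json")))) would print one line per command.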

Coreutils du implementation

uutils coreutils is an open source project that aims to rewrite the GNU coreutils in Rust as a cross-platform implementation.

One of its tools is du (display disk usage statistics, see man du). The Rust implementation is at https://github.com/uutils/coreutils/blob/main/src/uu/du/src/du.rs

Exposing tree-sitter-highlight library

Another use case is exposing a library that is available in the Rust ecosystem but not yet in Python. tree-sitter-highlight is a library for syntax highlighting built on Tree-sitter parsers.

First add required dependencies using cargo add.

$ cargo add tree-sitter-highlight tree-sitter-python
    Updating crates.io index
      Adding tree-sitter-highlight v0.20.1 to dependencies.
      Adding tree-sitter-python v0.20.2 to dependencies.

Following the project's README, we need to define the list of highlight names we want to recognize.

Let's create a highlight module, highlight.rs:

use tree_sitter_highlight::{HighlightConfiguration, Highlighter, HtmlRenderer};

const HIGHLIGHT_NAMES: &[&str; 18] = &[
    "attribute",
    "constant",
    "function.builtin",
    "function",
    "keyword",
    "operator",
    "property",
    "punctuation",
    "punctuation.bracket",
    "punctuation.delimiter",
    "string",
    "string.special",
    "tag",
    "type",
    "type.builtin",
    "variable",
    "variable.builtin",
    "variable.parameter",
];

pub fn code_highlight(code: String) -> Result<String, Box<dyn std::error::Error>> {
    let mut highlighter = Highlighter::new();

    let mut python_config = HighlightConfiguration::new(
        tree_sitter_python::language(),
        tree_sitter_python::HIGHLIGHT_QUERY,
        "",
        "",
    )?;

    python_config.configure(HIGHLIGHT_NAMES);
    let html_attrs: Vec<String> = HIGHLIGHT_NAMES
        .iter()
        .map(|s| format!("class=\"{}\"", s.replace('.', " ")))
        .collect();

    let highlights = highlighter.highlight(&python_config, code.as_bytes(), None, |_| None)?;

    let mut renderer = HtmlRenderer::new();
    renderer.render(highlights, code.as_bytes(), &|highlight| {
        html_attrs[highlight.0].as_bytes()
    })?;

    Ok(String::from_utf8_lossy(renderer.html.as_slice()).to_string())
}
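The html_attrs step turns each dotted highlight name into an HTML class attribute. The same mapping, written in Python for illustration only (the shortened name list is just an example):

```python
HIGHLIGHT_NAMES = ["function.builtin", "keyword", "string.special"]


def html_attrs(names: list[str]) -> list[str]:
    """Turn dotted highlight names into HTML class attributes."""
    # "function.builtin" becomes class="function builtin", i.e. two CSS classes.
    return [f'class="{name.replace(".", " ")}"' for name in names]


if __name__ == "__main__":
    for attr in html_attrs(HIGHLIGHT_NAMES):
        print(attr)
```

Splitting on the dot means a CSS rule for .function also applies to function.builtin, and a rule for .builtin can refine it.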

Now use it in lib.rs:

mod highlight;

...

#[pyfunction]
fn code_highlight(code: String) -> PyResult<String> {
    match highlight::code_highlight(code) {
        Ok(value) => Ok(value),
        Err(e) => Err(PyValueError::new_err(e.to_string())),
    }
}

#[pymodule]
fn tobira(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    m.add_function(wrap_pyfunction!(get_directory_size, m)?)?;
    m.add_function(wrap_pyfunction!(code_highlight, m)?)?;
    Ok(())
}

Let's test it on the calculate.py file:

$ maturin develop --release
$ python -c "import tobira; fh = open('calculate.py'); code = fh.read(); print(tobira.code_highlight(code))"
<span class="keyword">import</span> <span class="variable">argparse</span>

<span class="keyword">import</span> <span class="variable">tobira</span>  # rust module
<span class="keyword">import</span> <span class="variable">scandirsize</span>  # python module

<span class="variable">parser</span> <span class="operator">=</span> <span class="variable">argparse</span>.ArgumentParser(<span class="variable">description</span><span class="operator">=</span><span class="string">&quot;Compare directory size module&quot;</span>)
<span class="variable">parser</span>.<span class="function">add_argument</span>(<span class="string">&quot;path&quot;</span>)
<span class="variable">parser</span>.<span class="function">add_argument</span>(<span class="string">&quot;-rust&quot;</span>, <span class="string">&quot;--rust-module&quot;</span>, <span class="variable">action</span><span class="operator">=</span><span class="string">&quot;store_true&quot;</span>)


<span class="keyword">if</span> <span class="variable">__name__</span> <span class="operator">==</span> <span class="string">&quot;__main__&quot;</span>:
    <span class="variable">args</span> <span class="operator">=</span> <span class="variable">parser</span>.<span class="function">parse_args</span>()
    <span class="keyword">if</span> <span class="variable">args</span>.<span class="variable">rust_module</span>:
        <span class="variable">result</span> <span class="operator">=</span> <span class="variable">tobira</span>.<span class="function">get_directory_size</span>(<span class="variable">args</span>.<span class="variable">path</span>)
    <span class="keyword">else</span>:
        <span class="variable">result</span> <span class="operator">=</span> <span class="variable">scandirsize</span>.<span class="function">get_directory_size</span>(<span class="variable">args</span>.<span class="variable">path</span>)
    <span class="function builtin">print</span>(<span class="variable">result</span>)
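The output contains only span elements, so it still needs CSS to show any colour. Here is a sketch that wraps the spans in a standalone page; the function name render_page and the colour values are our own, hypothetical choices, while the class names come from HIGHLIGHT_NAMES.

```python
# Hypothetical colour choices; the keys match class names in the output.
STYLES = {
    "keyword": "#d73a49",
    "string": "#032f62",
    "function": "#6f42c1",
    "operator": "#005cc5",
}


def render_page(highlighted: str, title: str = "highlight") -> str:
    """Wrap highlighted HTML spans in a standalone page with CSS rules."""
    css = "\n".join(f".{name} {{ color: {color}; }}" for name, color in STYLES.items())
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{title}</title>\n'
        f"<style>\n{css}\n</style></head>\n"
        f"<body><pre>{highlighted}</pre></body></html>\n"
    )
```

The string returned by tobira.code_highlight(code) could be passed to render_page and written to an .html file.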

Conclusion

Python is a popular programming language known for its simplicity and ease of use, but sometimes it can be slow when dealing with computationally intensive tasks. Rust, on the other hand, is a modern systems programming language known for its speed, memory safety, and concurrency features. In this context, it can be beneficial to combine Python and Rust to leverage the strengths of both languages.

Overall, using Rust with PyO3 and Maturin can provide a powerful and efficient way to create Python modules that leverage the strengths of both languages.

All the source code is available at https://github.com/sakti/tobira.
