Writing a Python module using Rust
PyO3 is a library for creating bindings between Rust and Python. It is used by many popular Python packages, such as orjson, tokenizers, polars, pydantic-core, and ruff. In this article, we will implement a hypothetical Python module that calculates directory sizes faster, and then expose a Rust code highlighting library to the Python world.
To build and package the Python module, we will use maturin to create the skeleton structure and default configuration.
Python module in Rust
Install CPython built with --enable-shared, a configure option that builds the shared Python library libpython.
# on mac, using pyenv
$ env PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 3.11.0
See the pyenv project wiki for details on the Python build flags.
Install maturin using pip, then create a new project named tobira (扉).
$ python -mpip install --user -U maturin
$ maturin init -b pyo3 tobira
$ cd tobira
By default, maturin creates a Rust library project with crate type cdylib (a shared library) and adds pyo3 as a dependency. Here is the generated Cargo.toml file:
[package]
name = "tobira"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[lib]
name = "tobira"
crate-type = ["cdylib"]
[dependencies]
pyo3 = { version = "0.18.0", features = ["extension-module"] }
The lib.rs generated by maturin contains a sum_as_string function, which adds two integers and returns the result as a String, and the tobira module definition:
use pyo3::prelude::*;

/// Formats the sum of two numbers as string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

/// A Python module implemented in Rust.
#[pymodule]
fn tobira(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}
The #[pyfunction] attribute marks a Rust function to be exposed to Python, and #[pymodule] marks the function that defines the Python module.
To build and install the package, maturin develop requires a virtual environment, so we create one with the venv module.
$ python -mvenv venv
$ source ./venv/bin/activate
# maturin develop, to build and install in current virtual env
$ maturin develop
🔗 Found pyo3 bindings
🐍 Found CPython 3.11 at /Users/sakti/dev/tobira/venv/bin/python3
💻 Using `MACOSX_DEPLOYMENT_TARGET=11.0` for aarch64-apple-darwin by default
Compiling target-lexicon v0.12.6
Compiling autocfg v1.1.0
Compiling proc-macro2 v1.0.51
...
Finished dev [unoptimized + debuginfo] target(s) in 7.00s
📦 Built wheel for CPython 3.11 to /Users/sakti/dev/tobira/target/wheels/tobira-0.1.0-cp311-cp311-macosx_11_0_arm64.whl
To test the generated module, run:
$ python -c "import tobira; print(tobira.sum_as_string(1, 1))"
2
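The wheel filename in the build output above encodes which interpreter and platform the compiled module targets. As a rough sketch, assuming the standard wheel naming layout name-version[-build]-python-abi-platform, the fields can be pulled apart as follows; the helper name parse_wheel_filename is hypothetical and not part of maturin.

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its dash-separated tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # Five fields normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected number of fields: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


if __name__ == "__main__":
    info = parse_wheel_filename("tobira-0.1.0-cp311-cp311-macosx_11_0_arm64.whl")
    for key, value in info.items():
        print(f"{key}: {value}")
```

Here cp311 means the wheel is tied to CPython 3.11, so another interpreter version would need its own build.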
Calculate number of bytes in a directory (recursive)
For our hypothetical use case, we will implement a program that calculates the size of a directory and its subdirectories. This differs from disk usage, which is based on hardware block size. Our program will calculate the size based on file metadata. For a complete implementation, please refer to the Rust Coreutils re-implementation.
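The difference between the size recorded in file metadata and the space allocated on disk can be observed from Python. This is a minimal sketch, assuming a Unix-like system where os.stat exposes st_blocks (counted in 512-byte units); the helper name apparent_and_allocated is hypothetical.

```python
import os
import tempfile
from typing import Optional, Tuple


def apparent_and_allocated(path: str) -> Tuple[int, Optional[int]]:
    """Return (size from metadata, bytes allocated on disk or None)."""
    st = os.stat(path)
    # st_blocks is not available on every platform (for example Windows).
    blocks = getattr(st, "st_blocks", None)
    allocated = blocks * 512 if blocks is not None else None
    return st.st_size, allocated


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 10)  # a 10-byte file
        os.close(fd)
        apparent, allocated = apparent_and_allocated(path)
        print("metadata size:", apparent)
        print("allocated on disk:", allocated)
    finally:
        os.remove(path)
```

On many file systems a tiny file still occupies at least one full block, which is why a tool that reports disk usage and one that sums metadata sizes can disagree.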
Python naive implementation
We use os.scandir, which is faster than os.listdir here because each entry carries file type information, and the humanize package to convert the size in bytes into a more readable format with a unit (KiB, MiB, GiB).
# scandirsize.py
import os
import sys

import humanize


def get_size(start_path: str = ".") -> int:
    total_size = 0
    with os.scandir(start_path) as it:
        for entry in it:
            if entry.is_file():
                total_size += entry.stat().st_size
            elif entry.is_dir():
                total_size += get_size(entry.path)
    return total_size


def format(size: int) -> str:
    return humanize.naturalsize(size, binary=True)


def get_directory_size(path: str = ".") -> str:
    result = get_size(path)
    return format(result)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("need path argument")
        sys.exit(1)
    result = get_directory_size(sys.argv[1])
    print(result)
Let's test the get_directory_size function with the Linux source code:
$ python scandirsize.py ~/dev/linux-6.1.11
1.2 GiB
Note: entry.is_file(), entry.is_dir() and entry.stat() follow symbolic links by default, so a link to a file is counted with the size of its target.
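If symbolic links should be left out of the total, os.DirEntry accepts follow_symlinks=False on is_file(), is_dir() and stat(). Below is a minimal sketch of that variant; the function name get_size_no_follow is hypothetical and not part of the original script.

```python
import os


def get_size_no_follow(start_path: str = ".") -> int:
    """Sum the sizes of regular files under start_path, skipping symlinks."""
    total_size = 0
    with os.scandir(start_path) as it:
        for entry in it:
            # With follow_symlinks=False a symlink is neither a file nor a
            # directory, so it falls through both branches and is skipped.
            if entry.is_file(follow_symlinks=False):
                total_size += entry.stat(follow_symlinks=False).st_size
            elif entry.is_dir(follow_symlinks=False):
                total_size += get_size_no_follow(entry.path)
    return total_size
```

Skipping links to directories also avoids unbounded recursion when a link points back to one of its own parent directories.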
Rust implementation
In Rust, std::fs::read_dir returns an iterator over the entries of a directory, and std::fs::Metadata::len() returns the size of a file in bytes as a u64.
// src/lib.rs
use std::io::Result;
use std::{fs::read_dir, path::Path};

use humansize::{format_size, BINARY};
use pyo3::exceptions::PyValueError;
...

fn get_size(path: &Path) -> Result<u64> {
    let mut total_size = 0;
    let path_metadata = path.symlink_metadata()?;
    if path_metadata.is_dir() {
        for entry in read_dir(&path)? {
            let entry = entry?;
            let entry_metadata = entry.metadata()?;
            if entry_metadata.is_dir() {
                total_size += get_size(&entry.path())?;
            } else {
                total_size += entry_metadata.len();
            }
        }
    } else {
        total_size = path_metadata.len();
    }
    Ok(total_size)
}

fn format(size: u64) -> String {
    format_size(size, BINARY)
}

#[pyfunction]
fn get_directory_size(path: String) -> PyResult<String> {
    let p = Path::new(&path);
    let result = match get_size(&p) {
        Ok(value) => Ok(value),
        Err(e) => Err(PyValueError::new_err(e.to_string())),
    }?;
    Ok(format(result))
}

#[pymodule]
fn tobira(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    m.add_function(wrap_pyfunction!(get_directory_size, m)?)?;
    Ok(())
}
Don't forget to add the humansize crate, the equivalent of humanize in the Rust ecosystem.
$ cargo add humansize
# build the package with release profile
$ maturin develop --release
# test run the package
$ python -c "import tobira; print(tobira.get_directory_size('/Users/sakti/dev/linux-6.1.11'))"
1.21 GiB
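The Python script printed 1.2 GiB while the Rust module printed 1.21 GiB for the same tree. humanize.naturalsize uses one decimal place by default and the humansize output here shows two, so the difference is most likely display precision rather than a different byte count. The sketch below shows how one byte count yields both strings. format_binary is a hypothetical helper, not the algorithm of either library, and the byte count is an illustrative number, not the measured size.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]


def format_binary(size: int, precision: int = 2) -> str:
    """Format a byte count using binary (1024-based) units."""
    value = float(size)
    index = 0
    # Divide by 1024 until the value fits the current unit.
    while value >= 1024 and index < len(UNITS) - 1:
        value /= 1024
        index += 1
    if index == 0:
        return f"{size} B"
    return f"{value:.{precision}f} {UNITS[index]}"


if __name__ == "__main__":
    illustrative_size = 1_300_000_000  # made-up byte count
    print(format_binary(illustrative_size, precision=1))  # 1.2 GiB
    print(format_binary(illustrative_size, precision=2))  # 1.21 GiB
```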
Running benchmark
First, we need a benchmark script to compare both implementations, so we create the calculate.py file:
# calculate.py
import argparse

import tobira  # rust module
import scandirsize  # python module

parser = argparse.ArgumentParser(description="Compare directory size module")
parser.add_argument("path")
parser.add_argument("-rust", "--rust-module", action="store_true")

if __name__ == "__main__":
    args = parser.parse_args()
    if args.rust_module:
        result = tobira.get_directory_size(args.path)
    else:
        result = scandirsize.get_directory_size(args.path)
    print(result)
We can then execute python calculate.py ~/dev/linux-6.1.11 to run the pure Python module, and python calculate.py -rust ~/dev/linux-6.1.11 to run the Rust module.
We run the benchmark using hyperfine:
$ hyperfine --warmup 1 -r 20 "python calculate.py ~/dev/linux-6.1.11" "python calculate.py -rust ~/dev/linux-6.1.11" --export-json result.json
Benchmark 1: python calculate.py ~/dev/linux-6.1.11
Time (mean ± σ): 358.6 ms ± 2.7 ms [User: 94.4 ms, System: 263.1 ms]
Range (min … max): 356.3 ms … 367.4 ms 20 runs
Benchmark 2: python calculate.py -rust ~/dev/linux-6.1.11
Time (mean ± σ): 336.0 ms ± 1.1 ms [User: 60.2 ms, System: 274.6 ms]
Range (min … max): 334.8 ms … 339.0 ms 20 runs
Summary
'python calculate.py -rust ~/dev/linux-6.1.11' ran
1.07 ± 0.01 times faster than 'python calculate.py ~/dev/linux-6.1.11'
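Overall, the Rust module is only 1.07 ± 0.01 times faster. Looking at user-space time alone, however, it drops from 94.4 ms to 60.2 ms. System time (263.1 ms versus 274.6 ms) dominates both runs, so most of the wall-clock time is spent in the kernel, presumably servicing file system calls, rather than in Python or Rust code.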
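The ratios behind these numbers can be reproduced with a few lines of arithmetic. This is a small sketch that uses only the mean values printed by hyperfine; the helper names speedup and system_share are hypothetical.

```python
# Mean timings in milliseconds, copied from the hyperfine output.
python_run = {"wall": 358.6, "user": 94.4, "system": 263.1}
rust_run = {"wall": 336.0, "user": 60.2, "system": 274.6}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def system_share(run: dict) -> float:
    """Fraction of the mean wall-clock time spent in system (kernel) time."""
    return run["system"] / run["wall"]


print(f"wall-clock speedup: {speedup(python_run['wall'], rust_run['wall']):.2f}x")  # 1.07x
print(f"user-time speedup: {speedup(python_run['user'], rust_run['user']):.2f}x")  # 1.57x
print(f"system share, Python: {system_share(python_run):.0%}")  # 73%
print(f"system share, Rust: {system_share(rust_run):.0%}")  # 82%
```

The user-time ratio is the part the Rust code can influence; the system share is roughly the same work in both runs.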
Running hyperfine with --export-json writes the raw measurements to a JSON file for further processing. Let's create a histogram from it.
$ python plot_histogram.py -o histogram.png result.json
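The exported file can also be inspected directly. The sketch below assumes the hyperfine JSON layout with a top-level "results" list whose items carry "command" and a "times" list in seconds. The function name summarize and the sample data are made up for illustration and are not the measurements from this article.

```python
import json
import statistics


def summarize(export: dict) -> list:
    """Return (command, mean seconds, standard deviation) per benchmark."""
    rows = []
    for result in export["results"]:
        times = result["times"]
        mean = statistics.mean(times)
        # A standard deviation needs at least two samples.
        stdev = statistics.stdev(times) if len(times) > 1 else 0.0
        rows.append((result["command"], mean, stdev))
    return rows


if __name__ == "__main__":
    # Synthetic sample; with a real file use: export = json.load(open("result.json"))
    sample = json.loads(
        '{"results": [{"command": "cmd-a", "times": [0.10, 0.30]},'
        ' {"command": "cmd-b", "times": [0.20]}]}'
    )
    for command, mean, stdev in summarize(sample):
        print(f"{command}: {mean * 1000:.1f} ms ± {stdev * 1000:.1f} ms")
```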
Coreutils du implementation
uutils coreutils is an open source project that aims to rewrite GNU coreutils in Rust as a cross-platform implementation.
One of its tools is du (display disk usage statistics, see man du). The Rust implementation is available at https://github.com/uutils/coreutils/blob/main/src/uu/du/src/du.rs
Exposing the tree-sitter-highlight library
Another use case is exposing a library that exists in the Rust ecosystem but is not yet available in Python. tree-sitter-highlight is a library that performs syntax highlighting using Tree-sitter parsers.
First, add the required dependencies using cargo add.
$ cargo add tree-sitter-highlight tree-sitter-python
Updating crates.io index
Adding tree-sitter-highlight v0.20.1 to dependencies.
Adding tree-sitter-python v0.20.2 to dependencies.
According to the project's README, we need to define the list of highlight names we want to recognize.
Let's create a highlight module, highlight.rs:
use tree_sitter_highlight::{HighlightConfiguration, Highlighter, HtmlRenderer};

const HIGHLIGHT_NAMES: &[&str; 18] = &[
    "attribute",
    "constant",
    "function.builtin",
    "function",
    "keyword",
    "operator",
    "property",
    "punctuation",
    "punctuation.bracket",
    "punctuation.delimiter",
    "string",
    "string.special",
    "tag",
    "type",
    "type.builtin",
    "variable",
    "variable.builtin",
    "variable.parameter",
];

pub fn code_highlight(code: String) -> Result<String, Box<dyn std::error::Error>> {
    let mut highlighter = Highlighter::new();
    let mut python_config = HighlightConfiguration::new(
        tree_sitter_python::language(),
        tree_sitter_python::HIGHLIGHT_QUERY,
        "",
        "",
    )?;
    python_config.configure(HIGHLIGHT_NAMES);
    // One class attribute per highlight name, e.g. "function.builtin" -> class="function builtin"
    let html_attrs: Vec<String> = HIGHLIGHT_NAMES
        .iter()
        .map(|s| format!("class=\"{}\"", s.replace('.', " ")))
        .collect();
    let highlights = highlighter.highlight(&python_config, code.as_bytes(), None, |_| None)?;
    let mut renderer = HtmlRenderer::new();
    renderer.render(highlights, code.as_bytes(), &|highlight| {
        html_attrs[highlight.0].as_bytes()
    })?;
    Ok(String::from_utf8_lossy(renderer.html.as_slice()).to_string())
}
Now use it in lib.rs:
mod highlight;
...

#[pyfunction]
fn code_highlight(code: String) -> PyResult<String> {
    match highlight::code_highlight(code) {
        Ok(value) => Ok(value),
        Err(e) => Err(PyValueError::new_err(e.to_string())),
    }
}

#[pymodule]
fn tobira(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    m.add_function(wrap_pyfunction!(get_directory_size, m)?)?;
    m.add_function(wrap_pyfunction!(code_highlight, m)?)?;
    Ok(())
}
Let's test it on the calculate.py file:
$ maturin develop --release
$ python -c "import tobira; fh = open('calculate.py'); code = fh.read(); print(tobira.code_highlight(code))"
<span class="keyword">import</span> <span class="variable">argparse</span>
<span class="keyword">import</span> <span class="variable">tobira</span> # rust module
<span class="keyword">import</span> <span class="variable">scandirsize</span> # python module
<span class="variable">parser</span> <span class="operator">=</span> <span class="variable">argparse</span>.ArgumentParser(<span class="variable">description</span><span class="operator">=</span><span class="string">"Compare directory size module"</span>)
<span class="variable">parser</span>.<span class="function">add_argument</span>(<span class="string">"path"</span>)
<span class="variable">parser</span>.<span class="function">add_argument</span>(<span class="string">"-rust"</span>, <span class="string">"--rust-module"</span>, <span class="variable">action</span><span class="operator">=</span><span class="string">"store_true"</span>)
<span class="keyword">if</span> <span class="variable">__name__</span> <span class="operator">==</span> <span class="string">"__main__"</span>:
<span class="variable">args</span> <span class="operator">=</span> <span class="variable">parser</span>.<span class="function">parse_args</span>()
<span class="keyword">if</span> <span class="variable">args</span>.<span class="variable">rust_module</span>:
<span class="variable">result</span> <span class="operator">=</span> <span class="variable">tobira</span>.<span class="function">get_directory_size</span>(<span class="variable">args</span>.<span class="variable">path</span>)
<span class="keyword">else</span>:
<span class="variable">result</span> <span class="operator">=</span> <span class="variable">scandirsize</span>.<span class="function">get_directory_size</span>(<span class="variable">args</span>.<span class="variable">path</span>)
<span class="function builtin">print</span>(<span class="variable">result</span>)
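The output is an HTML fragment whose spans carry class names derived from the highlight names, with dots replaced by spaces, so function.builtin becomes class="function builtin". To view it with colors, the fragment can be wrapped in a page with a small stylesheet. This is a minimal sketch under assumptions: the helper names and the palette are hypothetical, and a real page would include the string returned by tobira.code_highlight instead of the sample fragment.

```python
def class_attr(name: str) -> str:
    """Mirror the attribute mapping used by the Rust module."""
    return 'class="{}"'.format(name.replace(".", " "))


def build_page(fragment: str, palette: dict) -> str:
    """Wrap a highlighted fragment in a minimal HTML page with CSS rules."""
    # A selector such as .function.builtin matches elements that carry
    # both classes, so the dotted highlight name works as a selector as-is.
    rules = "\n".join(
        f".{name} {{ color: {color}; }}" for name, color in palette.items()
    )
    return (
        "<!DOCTYPE html>\n<html><head><style>\n"
        f"{rules}\n"
        "</style></head>\n"
        f"<body><pre>{fragment}</pre></body></html>\n"
    )


if __name__ == "__main__":
    palette = {"keyword": "purple", "string": "green", "function.builtin": "teal"}
    sample = '<span class="keyword">import</span> <span class="variable">argparse</span>'
    print(class_attr("function.builtin"))
    print(build_page(sample, palette))
```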
Conclusion
Python is a popular programming language known for its simplicity and ease of use, but sometimes it can be slow when dealing with computationally intensive tasks. Rust, on the other hand, is a modern systems programming language known for its speed, memory safety, and concurrency features. In this context, it can be beneficial to combine Python and Rust to leverage the strengths of both languages.
Overall, using Rust with PyO3 and Maturin can provide a powerful and efficient way to create Python modules that leverage the strengths of both languages.
All the source code is available at https://github.com/sakti/tobira.