README#

Author:: Hai Vo
Authors:: Hai Vo
Date:: 7/21/23

Introduction#

This project is a python package to mimic r::stringr functionalities, the core functions are written in Rust and then export to Python. Note that I write this package mostly for personal use (convenience and speed) and learning purpose, so please use with care!

Any type of contribution are welcome!

How it works#

Using arrow format to store main input array.
Using pyo3 for python binding
Convert Python type (mostly List) to Rust type (mostly Vec) for the case not using arrow. This may cause some overhead, but it make the code more flexible. For example: many function not only vectorize over main array but also it arugments.

Installation#

This package is not on PyPi yet, so you need to compile from source.

First you need rust compiler:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then install this package as normal python package:

git clone https://github.com/vohai611/stringpy.git
pip3 install ./stringpy

Or you can download and install from prebuild wheels under github action artifact

Milestone#

v0.1.0#

☒ Implement basic function
☒ Add document
☒ Add test
☒ Add CI/CD
☒ Add example
☒ Add codecov
[] Release PyPi

v0.2.0#

[] Add benchmark
[] Vectorize on arguments

Documentation#

The documentation can be found at here

Usage example#

# setup
import stringpy as sp
import pandas as pd
import numpy as np
import random
import string

Combine string within group#

df = pd.DataFrame({'group': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
              'value': ['one', 'two', 'three', 'four',None, 'six', 'seven', 'eight', 'nine', 'ten']})

df2 = df.groupby('group').agg(lambda x: sp.str_c(x, collapse='->'))

df2

	value
group
a	one->three->->seven->nine
b	two->four->six->eight->ten

Split string#

sp.str_split(df2['value'], pattern='->')

<pyarrow.lib.ListArray object at 0x13490e7a0>
[
  [
    "one",
    "three",
    "",
    "seven",
    "nine"
  ],
  [
    "two",
    "four",
    "six",
    "eight",
    "ten"
  ]
]

Camel case to snake case#

a = sp.str_replace_all(['ThisIsSomeCamelCase', 'ObjectNotFound'],
                      pattern='([a-z])([A-Z])', replace= '$1 $2').to_pylist()
sp.str_replace_all(sp.str_to_lower(a), pattern = ' ', replace = '_')

<pyarrow.lib.StringArray object at 0x13490c3a0>
[
  "this_is_some_camel_case",
  "object_not_found"
]

Remove accent#

vietnam = ['Hà Nội', 'Hồ Chí Minh', 'Đà Nẵng', 'Hải Phòng', 'Cần Thơ', 'Biên Hòa', 'Nha Trang', 'BMT', 'Huế', 'Buôn Ma Thuột', 'Bắc Giang', 'Bắc Ninh', 'Bến Tre', 'Bình Dương', 'Bình Phước', 'Bình Thuận', 'Cà Mau', 'Cao Bằng', 'Đắk Lắk', 'Đắk Nông', 'Điện Biên', 'Đồng Nai', 'Đồng Tháp']

sp.str_remove_ascent(vietnam)

<pyarrow.lib.StringArray object at 0x134b44ee0>
[
  "Ha Noi",
  "Ho Chi Minh",
  "Da Nang",
  "Hai Phong",
  "Can Tho",
  "Bien Hoa",
  "Nha Trang",
  "BMT",
  "Hue",
  "Buon Ma Thuot",
  ...
  "Binh Duong",
  "Binh Phuoc",
  "Binh Thuan",
  "Ca Mau",
  "Cao Bang",
  "Dak Lak",
  "Dak Nong",
  "Dien Bien",
  "Dong Nai",
  "Dong Thap"
]

Random speed comparison#

Although this package is not aim to speed optimization, but in most case, it still get a decent speed up compare with pandas, thank to Rust!

Below are some of random comparison between stringpy and pandas:

letters = string.ascii_lowercase
a = [''.join(random.choice(letters) for i in range(10))  for i in range(600_000)]

a_sr = pd.Series(a)

Replace pattern#

%%time
a_sr.str.replace('\w', 'b', regex=True)

CPU times: user 443 ms, sys: 7.78 ms, total: 451 ms
Wall time: 452 ms

0         bbbbbbbbbb
1         bbbbbbbbbb
2         bbbbbbbbbb
3         bbbbbbbbbb
4         bbbbbbbbbb
             ...
599995    bbbbbbbbbb
599996    bbbbbbbbbb
599997    bbbbbbbbbb
599998    bbbbbbbbbb
599999    bbbbbbbbbb
Length: 600000, dtype: object

%%time
sp.str_replace_all(a, pattern='\w', replace= 'b')

CPU times: user 5.02 s, sys: 40.9 ms, total: 5.06 s
Wall time: 5.09 s

<pyarrow.lib.StringArray object at 0x134b45d80>
[
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  ...
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb",
  "bbbbbbbbbb"
]

Subset by index#

%%time
a_sr.str.slice(2,4)

CPU times: user 54.9 ms, sys: 4.1 ms, total: 59 ms
Wall time: 59.1 ms

0         vw
1         to
2         su
3         ik
4         eb
          ..
599995    hj
599996    wc
599997    pd
599998    ns
599999    kw
Length: 600000, dtype: object

%%time
sp.str_sub(a, start=2, end=4)

CPU times: user 272 ms, sys: 7.4 ms, total: 279 ms
Wall time: 279 ms

<pyarrow.lib.StringArray object at 0x134b45a80>
[
  "vw",
  "to",
  "su",
  "ik",
  "eb",
  "vn",
  "et",
  "ix",
  "sz",
  "de",
  ...
  "ag",
  "el",
  "mi",
  "yc",
  "me",
  "hj",
  "wc",
  "pd",
  "ns",
  "kw"
]

## Counting

::: {.cell execution_count=11}
``` {.python .cell-code}
%%time
a_sr.str.count('a')

CPU times: user 132 ms, sys: 3.22 ms, total: 136 ms
Wall time: 136 ms

0         0
1         1
2         0
3         0
4         0
         ..
599995    0
599996    0
599997    0
599998    0
599999    0
Length: 600000, dtype: int64

.. container:: cell

%%time
sp.str_count(a, pattern='a')

CPU times: user 428 ms, sys: 2.98 ms, total: 431 ms
Wall time: 432 ms

<pyarrow.lib.Int32Array object at 0x134b45b40>
[
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  ...
  1,
  0,
  0,
  2,
  0,
  0,
  0,
  0,
  0,
  0
]

Implement list#

part 1#

☒ str_count
☒ str_detect
☒ str_extract /str_extract_all
[] str_locate() str_locate_all()
☒ str_match() str_match_all()
☒ str_replace() str_replace_all()
☒ str_remove() str_remove_all()
☒ str_split()
[] str_split_1() str_split_fixed() str_split_i()
☒ str_starts() str_ends()
☒ str_subset()
☒ str_which()
☒ str_c(), str_combine()
[] str_flatten() str_flatten_comma()

part 2#

☒ str_dup()
☒ str_length() str_width()
☒ str_pad()
☒ str_sub()/ str_sub_all()
☒ str_trim() str_squish()
☒ str_trunc()
[] str_wrap()
☒ str_to_upper() str_to_lower() str_to_title() str_to_sentence()
☒ str_unique()
☒ str_remove_ascent()

Different type of i/o#

Python#

@export: one array in, one array out
@export2: multiple array in, one array out

Rust#

apply_utf8!()
apply_utf8_bool!()
apply_utf8_lst!()

vec in vec out

apply_utf8!()
@export

vec+ in vec out

apply_utf8!()
@export2

vec in vec out

apply_utf8_bool!()
@export

vec in vec<vec> out

apply_utf8_lst!()
@export