Regular Expressions
When semi-structured data comes in non-standard formats
Every now and then, we encounter semi-structured data that isn't in a commonly supported
format like JSON or XML. One example is Operational Technology (OT) data such as
Controller Area Network (CAN) data collected with the candump command from Linux's
can-utils package. The output looks something like this:
(1544106679.432583) can0 562#122DF953813CA2
(1544106679.631330) can0 1F0#4C96281B62ACBC20
(1544106679.830333) can0 687#C3F26E
(1544106680.029125) can0 412#06C136
(1544106680.228086) can0 67E#01C65274D19FAC6D
(1544106680.427065) can0 4FA#5F6502615110B55A
(1544106680.625958) can0 429#6FB48932FBCECE26
There are four distinct pieces of information in each line:
- (1544106679.432583) is the Unix epoch timestamp in seconds, with microsecond precision
- can0 is the network name
- 562 before the # is the destination ID
- 122DF953813CA2 after the # is the data for the destination
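To make the four fields concrete, here is a minimal sketch that decodes the values of the first sample line by hand, before any regex is involved. The variable names are illustrative, and the sketch assumes the ID and data are hexadecimal and the timestamp fraction always has six digits, as in the sample.

```python
from datetime import datetime, timezone

# Field values copied by hand from the first sample line.
raw_timestamp = "1544106679.432583"
raw_can_id = "562"
raw_data = "122DF953813CA2"

# Split on "." rather than converting through float, so the
# microsecond digits are preserved exactly.
# Assumption: the fraction always has six digits, as in the sample.
seconds, micros = raw_timestamp.split(".")
timestamp = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
timestamp = timestamp.replace(microsecond=int(micros))

# Assumption: the ID and the data are hexadecimal strings.
can_id = int(raw_can_id, 16)
payload = bytes.fromhex(raw_data)

print(timestamp.isoformat())
print(can_id)
print(len(payload), "bytes of data")
```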
Solution: Regular Expressions
Pythex.org, regexr.com, regex101.com and similar online tools provide ways to rapidly construct a Regular Expression (regex) to parse arbitrary strings using pre-determined rules.
Semi-structured to Structured
- Can you write a regex that will match a candump data sample?
- Can you name each matched part of the data as part of the regex?
- How can we try this in Python?
- Are you able to convert the output of the regex into a Python dictionary?
- Are you able to output the dictionary you just created to a CSV file?
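One possible way to work through these questions is sketched below, using only the standard library. The group names (timestamp, interface, can_id, data) and the helper parse_candump are hypothetical choices, and the pattern only covers the line format shown in the sample above, not every frame type candump can emit.

```python
import csv
import io
import re

# Named groups label each part of a line; the names are illustrative.
CANDUMP_LINE = re.compile(
    r"\((?P<timestamp>\d+\.\d+)\)\s+"   # (1544106679.432583)
    r"(?P<interface>\S+)\s+"            # can0
    r"(?P<can_id>[0-9A-Fa-f]+)#"        # 562#
    r"(?P<data>[0-9A-Fa-f]*)"           # 122DF953813CA2
)

sample = """\
(1544106679.432583) can0 562#122DF953813CA2
(1544106679.631330) can0 1F0#4C96281B62ACBC20
(1544106679.830333) can0 687#C3F26E
"""


def parse_candump(text):
    """Return one dictionary per line that matches the pattern."""
    rows = []
    for line in text.splitlines():
        match = CANDUMP_LINE.fullmatch(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


rows = parse_candump(sample)

# Write to an in-memory buffer so the sketch has no side effects;
# open("candump.csv", "w", newline="") would write a real file instead.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["timestamp", "interface", "can_id", "data"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Lines that do not match are skipped silently here; logging or counting them may be a better choice when the input is not trusted.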
Speeding up the slow native Python regex library
There are two options for somewhat increasing the speed of a regex in Python if you need to parse lots of non-standard semi-structured data.
- Google's re2: re2 and its Python wrapper google-re2 provide methods for compiling regular expressions. This will require your dev environment to include C compilers and build tools that most Python packages don't require.
- Intel's hyperscan: Hyperscan is a largely stale project from Intel that leverages some of their CPU-level optimizations for regular expressions to achieve top performance for the job. Unfortunately, it's not easily used in Python and only outperforms re2 when doing particularly complex or large-scale operations.
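As a rough sketch of the first option, the snippet below compiles the pattern once and reuses it, preferring re2 when it can be imported and falling back to the standard re module otherwise. It assumes that google-re2 exposes a module named re2 whose compile, match and groupdict calls mirror the standard library; check this against the version you install. The pattern avoids lookarounds and backreferences, which re2 does not support.

```python
try:
    # Assumption: google-re2 installs a module named "re2" with an
    # interface close to the standard library's "re".
    import re2 as regex_engine
except ImportError:
    import re as regex_engine  # standard library fallback

# Compile once, outside the loop, so the pattern is not rebuilt per line.
PATTERN = regex_engine.compile(
    r"\((?P<timestamp>\d+\.\d+)\)\s+(?P<interface>\S+)\s+"
    r"(?P<can_id>[0-9A-Fa-f]+)#(?P<data>[0-9A-Fa-f]*)"
)

lines = [
    "(1544106680.029125) can0 412#06C136",
    "(1544106680.228086) can0 67E#01C65274D19FAC6D",
]

parsed = []
for line in lines:
    match = PATTERN.match(line)
    if match:
        parsed.append(match.groupdict())

print("engine:", regex_engine.__name__)
print(parsed)
```

Whether the switch pays off depends on the pattern and the data volume, so it is worth timing both engines on a representative sample before adding the dependency.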