regex for parsing dimension measurement descriptions (Python)

binomial-torrent · 7/31/19

For rows whose work_height, work_width and work_depth values are missing but whose work_dimensions column contains a text description of those dimensions, I want to parse that description into the work_height, work_width and work_depth columns. From my exploration, the descriptions come in a few formats:

__ unit x __ unit x __ unit, e.g. 200 x 300 mm. This one should be easy.
__ unit x __ unit \newline __ unit x __ unit, e.g. 200 x 300 mm\n400 x 760 mm. I believe these are two different dimension settings for the same image. I want to create a new image item (row) for the second setting (or the third, and so on).
Written-out mixed fractions, e.g. 16 7/8 in (42.8 cm) or 16 7/8in (42.8cm). How should these be parsed? This is one of the hard ones. Since the unit column work_measurement_unit is generally mm, I presume the metric value is the one to parse (and even then I have to convert from cm to mm).
A measurement description followed by the mixed fraction and the other unit in parentheses, as above, e.g. Diameter: 19 3/7 in (72.5 cm).
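For illustration, here is a minimal sketch of how the four formats above could be handled with the standard re and fractions modules. It is not the package mentioned later in the thread. All names (parse_dimensions, parse_line, TO_MM and so on) are hypothetical, and it rests on several assumptions: only mm, cm and in occur as units, multiple settings are separated by real newline characters, a parenthesised value is preferred when present, and a value without any unit is taken to be in mm. Which parsed value belongs to height, width or depth is left open.

```python
import re
from fractions import Fraction

# Conversion factors to millimetres (assumption: only these units occur).
TO_MM = {"mm": Fraction(1), "cm": Fraction(10), "in": Fraction("25.4")}

# Optional leading label such as "Diameter:".
LABEL = re.compile(r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\s*:\s*")
# Parenthesised alternative value such as "(42.8 cm)".
PARENS = re.compile(r"\(([^)]*)\)")
# Separator between dimensions: "x", "X" or the multiplication sign.
SEPARATOR = re.compile(r"\s*[x×]\s*", re.IGNORECASE)
# One dimension: mixed fraction, bare fraction or decimal, then an optional unit.
PART = re.compile(
    r"(?P<num>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*(?P<unit>mm|cm|in)?",
    re.IGNORECASE,
)


def to_number(text):
    """Convert '16 7/8', '7/8' or '42.8' to an exact Fraction."""
    return sum(Fraction(token) for token in text.split())


def parse_line(line, default_unit="mm"):
    """Parse one line into {'label': ..., 'mm': [...]}; return None if it does not fit."""
    label = None
    found = LABEL.match(line)
    if found:
        label, line = found.group("label"), line[found.end():]
    # Prefer the value in parentheses (the metric one in the examples above).
    inner = PARENS.search(line)
    body = inner.group(1) if inner else line
    parts = [PART.fullmatch(piece.strip()) for piece in SEPARATOR.split(body.strip())]
    if any(part is None for part in parts):
        return None
    # "200 x 300 mm": a part without a unit inherits the last unit on the line.
    units = [part.group("unit").lower() for part in parts if part.group("unit")]
    fallback = units[-1] if units else default_unit
    values = []
    for part in parts:
        unit = (part.group("unit") or fallback).lower()
        values.append(float(to_number(part.group("num")) * TO_MM[unit]))
    return {"label": label, "mm": values}


def parse_dimensions(text, default_unit="mm"):
    """Parse a description; each non-empty line yields one result, unparseable lines are skipped."""
    results = []
    for line in text.splitlines():
        if line.strip():
            parsed = parse_line(line, default_unit)
            if parsed is not None:
                results.append(parsed)
    return results


print(parse_dimensions("200 x 300 mm\n400 x 760 mm"))
# [{'label': None, 'mm': [200.0, 300.0]}, {'label': None, 'mm': [400.0, 760.0]}]
print(parse_dimensions("Diameter: 19 3/7 in (72.5 cm)"))
# [{'label': 'Diameter', 'mm': [725.0]}]
print(parse_dimensions("16 7/8 in"))
# [{'label': None, 'mm': [428.625]}]
```

Returning one result per line keeps the "second setting becomes a new row" case explicit: a description with two lines gives two results, which could then be written out as two rows. Using Fraction for the arithmetic avoids floating-point noise when converting cm or inches to mm.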

To select the affected rows I used:

[CODE lang="python" title="code to get missing data"]dims = ['work_height', 'work_width', 'work_depth']
# Rows that have a description but whose three numeric dimensions are all the -1.0 sentinel
mask = df['work_dimensions'].notnull() & (df['work_dimensions'] != '-1') & (df[dims] == -1.0).all(axis=1)
df.loc[mask, ['work_dimensions', *dims, 'work_measurement_unit']][/CODE]

I'm not too familiar with regex in Python, or in general, so any help would be appreciated!

Daniel Duffy · 7/31/19

Regular expression - Wikipedia (en.wikipedia.org)

Then "Python in a Nutshell", chapter 9.

regex: alternative regular expression module, to replace re (pypi.org)

Daniel Duffy · 7/31/19

There are two kinds of developer; those that know regex (Perl?) and them that don't.
It's a special area indeed.

ExSan · 7/31/19

Daniel Duffy said:
There are two kinds of developer; those that know regex (Perl?) and them that don't.
It's a special area indeed.

There are two kinds of developer: those that know C/C++ and them that don't

Daniel Duffy · 7/31/19

The learning curve for regex can be steep.

binomial-torrent · 8/2/19

Daniel Duffy said:
The learning curve for regex can be steep.

I found a GitHub package to do my bidding.

For those interested
