regex for parsing dimension measurement descriptions (Python)

Essentially, for rows whose work_height, work_width, and work_depth values are missing but whose work_dimensions column contains a description of those dimensions, I want to parse that description into the work_height, work_width, and work_depth columns. Based on my exploration, the descriptions follow a few patterns:

  • __ unit x __ unit x __ unit, e.g. 200 x 300 mm. This one should be easy.
  • __ unit x __ unit \newline __ unit x __ unit, e.g. 200 x 300 mm\n400 x 760 mm. I believe these are two different dimension settings for the same image. I want to create a new image item (row) for the second setting (or the third, and so on).
  • Written-out mixed fractions, e.g. 16 7/8 in (42.8 cm) or 16 7/8in (42.8cm). How should these be parsed? This is one of the hard ones. Since the unit column work_measurement_unit is generally mm, I presume the metric value is the one to parse (and even then I have to convert from cm to mm).
  • A measurement description followed by the mixed fraction and the parenthesized metric value as above, e.g. Diameter: 19 3/7 in (72.5 cm).
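
One possible way to handle the four patterns listed above, sketched with only the standard library (re and fractions). The names TO_MM, parse_line, and parse_description are made up for this illustration, and the sketch rests on several assumptions that are not confirmed by the data: the only units are mm, cm, m, and in; the \n in the examples is a real newline character; and whenever parentheses are present they hold the metric equivalent, which is the value to keep. Everything is converted to millimetres.

```python
import re
from fractions import Fraction

# Conversion factors to millimetres. Assumption: only these units occur.
TO_MM = {"mm": Fraction(1), "cm": Fraction(10),
         "m": Fraction(1000), "in": Fraction("25.4")}

# A number is a mixed fraction ("16 7/8"), a fraction ("7/8"), or a decimal ("42.8").
NUMBER = r"\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?"
# One measurement with an optional unit, e.g. "200", "300 mm", "16 7/8in".
VALUE = re.compile(rf"({NUMBER})\s*(mm|cm|m|in)?", re.IGNORECASE)
# Optional leading label such as "Diameter:", followed by the rest of the line.
LABEL = re.compile(r"\s*(?:([^:\d]+):)?\s*(.*)")
# Text inside the first pair of parentheses.
PARENS = re.compile(r"\(([^)]*)\)")


def to_fraction(text):
    """Turn '16 7/8', '7/8', or '42.8' into an exact Fraction."""
    return sum(Fraction(part) for part in text.split())


def parse_line(line, default_unit="mm"):
    """Parse one description line into (label, [values in mm]), or None."""
    label, body = LABEL.match(line.strip()).groups()
    inner = PARENS.search(body)
    if inner:
        # Assumption: the parenthesised part is the metric equivalent.
        body = inner.group(1)
    parts = [VALUE.fullmatch(p.strip())
             for p in re.split(r"[x×]", body, flags=re.IGNORECASE)]
    if not all(parts):
        return None  # at least one piece is not "<number> [unit]"
    # A trailing unit ("200 x 300 mm") applies to the values before it.
    unit = (parts[-1].group(2) or default_unit).lower()
    values = [
        float(to_fraction(m.group(1)) * TO_MM[(m.group(2) or unit).lower()])
        for m in parts
    ]
    return (label.strip() if label else None), values


def parse_description(text, default_unit="mm"):
    """Parse every line of a description; one result per parsable line."""
    results = (parse_line(line, default_unit) for line in text.splitlines())
    return [r for r in results if r is not None]


print(parse_description("200 x 300 mm\n400 x 760 mm"))
print(parse_line("Diameter: 19 3/7 in (72.5 cm)"))
```

Each element returned by parse_description corresponds to one line of the description, so a two-line description yields two results that could become two rows. Which value goes into work_height, work_width, or work_depth is deliberately left open, because the order of the values in the descriptions is not stated; lines that do not fit any pattern come back as None so they can be inspected by hand.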
To select the rows with missing dimensions, I used:

# Rows that have a dimension description but whose height, width, and depth are all -1
dims = ['work_height', 'work_width', 'work_depth']
mask = (df['work_dimensions'] != '-1') & df['work_dimensions'].notnull() & (df[dims] == -1.0).all(axis=1)
df.loc[mask, ['work_dimensions'] + dims + ['work_measurement_unit']]
I'm not too familiar with regular expressions, in Python or in general, so any help would be appreciated!
 

Daniel Duffy

C++ author, trainer
There are two kinds of developer: those that know regex (Perl?) and them that don't.
It's a special area indeed.
 