OVO Tech Blog
Our journey navigating the technosphere

Tom Cammann
Author

Code Golfing Adventures - 'Eat Out To Help Out' One Liner

Bash and Python golfing is a silly hobby I have, but I always learn a lot from doing it. Either I get practice using a library or tool I rarely use or I get to discover new ways of dissecting a problem. I certainly did both on this golfing jaunt ⛳️

The Task

https://www.tax.service.gov.uk/eat-out-to-help-out/find-a-restaurant/

The UK government have a website listing the restaurants that are currently part of a discount scheme. I wanted an easy way to query these restaurants from the command line. My first hope was that this site had a JSON API I could call to pull out all this juicy info; alas no. I was left to contend with parsing HTML. This was the challenge I set myself, a one liner in bash to display the nearest restaurants based on this HTML content.

And so, I embarked on a frivolous learning adventure.

Show me the Code

To start at the end, here was the result of my golfing journey. I apologise about the rather clunky sed usage; this line and the rest are explained below.

curl -s https://www.tax.service.gov.uk/eat-out-to-help-out/find-a-restaurant/results\?lookup\=BS19ED \
| hxclean \
| hxselect -c -s '\n' .govuk-heading-m \
| recode -f html..ascii \
| sed -e '1d' -e 's/ $//' -e 's/\b\(\w\)\(\w\+\)/\u\1\L\2/g' \
| head -10
The one liner
Leisure Cafe For Shop Limited
Fi Real Caribbean Vegan Restaurant
Pepes Piri Piri
The Ill Repute
The Old Market Assembly
Piri Piri Corner
Real Habesha Restaurant
The Climbing Academy - The Mothership
The Phoenix
The Stag And Hounds
The result

CSS Selecting in Bash

CSS selectors are awesome and I wanted to use them through a unix tool interface, using STDIN and STDOUT. My previous approach to querying an HTML document with CSS selectors, Python and BeautifulSoup, felt a bit heavy handed, and I hoped there would be a way to achieve this using something akin to jq.

After some googling and evaluating other options such as pup, I discovered that the W3C had a set of tools, HTML-XML-utils, for manipulating HTML documents from the command line. These tools even had a brew formula and could be trivially installed: brew install html-xml-utils.

hxselect was the command we needed to extract text using CSS selectors; however, every page I tried to parse had issues.

End tag </head> doesn't match start tag <meta>

hxselect needs well-formed XML and very few websites produce fully compliant XHTML. After further googling and reading of the very slim W3C documentation, I discovered hxclean did the trick! It fills in missing end tags and fixes other common violations. It's probably not going to work for all cases but did the job here.

Now we can use hxselect and pass it a CSS selector.

URL="https://www.tax.service.gov.uk/eat-out-to-help-out/find-a-restaurant/"
curl -s "$URL" | hxclean | hxselect -c title
Find a restaurant that’s registered for the scheme - Eat Out to Help Out - GOV.UK

Awesome!

The restaurant names had the class name govuk-heading-m in the HTML doc, so .govuk-heading-m was the CSS selector we used in the one liner. The -c flag prints only the contents of the selected elements, and -s '\n' prints a newline after each match so that every name lands on its own line.
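To experiment with the selector without hitting the live site, the same hxselect call can be pointed at a small local document. This is only a sketch: the sample markup and the select_headings helper are hypothetical, and hxclean is skipped because the sample is already well-formed.

```shell
# Hypothetical sample markup; only the class name comes from the real page.
sample_html='<html><body>
<h1 class="govuk-heading-m">2 results found</h1>
<h3 class="govuk-heading-m">portwall tavern</h3>
<h3 class="govuk-heading-m">THE COLOSSEUM</h3>
</body></html>'

# Print the contents (-c) of every matching element,
# followed by a newline separator (-s '\n').
select_headings() {
  hxselect -c -s '\n' .govuk-heading-m
}

# Only run the demo when html-xml-utils is installed.
if command -v hxselect >/dev/null 2>&1; then
  printf '%s\n' "$sample_html" | select_headings
fi
```

Note that the results heading shares the class with the restaurant names, which is why the one liner later drops the first line of output.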

Dealing with HTML Entities

This all looked excellent until I spotted this:

McDonald&#x27;s Restaurant

Oh no! I forgot about HTML Entities.

An HTML entity is a piece of text ("string") that begins with an ampersand (&) and ends with a semicolon (;). Entities are frequently used to display reserved characters (which would otherwise be interpreted as HTML code)

https://developer.mozilla.org/en-US/docs/Glossary/Entity

I wanted to reach for Python but refrained and continued on with my fatuous challenge.

After some googling I discovered recode

The Recode library converts files between character sets and usages.

It can be installed with brew: brew install recode

We can now render the HTML Entities as human readable.

echo 'McDonald&#x27;s Restaurant' | recode html..ascii
McDonald's Restaurant

recode looks powerful and it's certainly a tool I'll try to remember in future when dealing with encodings.

echo ' ?¾' | recode ..dump
UCS2 Mne Description

0020 SP  space
003F ?   question mark
00BE 34  vulgar fraction three quarters
000A LF  line feed (lf)
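Where recode is not installed, a handful of entities can be decoded with sed alone. This is a deliberately minimal fallback, not a replacement for recode: the decode_common_entities name is hypothetical and only the few entities listed are handled.

```shell
# Decode a small, fixed set of HTML entities read from stdin.
# &amp; is handled last so that text like "&amp;lt;" is not decoded twice.
decode_common_entities() {
  sed -e "s/&#x27;/'/g" \
      -e "s/&#39;/'/g" \
      -e 's/&quot;/"/g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&amp;/\&/g'
}

echo 'McDonald&#x27;s Restaurant' | decode_common_entities
```

Anything outside this list passes through untouched, so recode remains the better choice for real pages.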

Title Case

The last problem I wanted to solve was the rather arbitrary capitalisation of some business names. This output looked unsightly and solving it seemed within reach, though solving this problem probably meant losing true "one liner" status.

portwall tavern
Pasture Restaurant Limited
Friska Victoria Street
Pasty Emporium
Le Vignoble Bristol
Totos By The River
THE COLOSSEUM
FREEDOG BRISTOL

After a short search, I reached for some tools I know well: sed and a liberal helping of regular expressions. I was least happy with this part of the one liner and would love to hear shorter and more elegant alternatives.

The full sed expression used in the one liner contains a few other useful edits on the output too.

sed -e '1d' -e 's/ $//' -e 's/\b\(\w\)\(\w\+\)/\u\1\L\2/g'

The CSS selector also matched the heading containing "100 results found", which is the first line of output and one we want to exclude. 1d deletes the first line: 1 is the line number and d is the delete command.

s/ $// trims a single unsightly trailing space from each line.
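These two edits can be tried in isolation on canned input. The sample lines and the clean_lines helper name below are made up for illustration.

```shell
# Drop the first line (the results heading) and strip one trailing space.
clean_lines() {
  sed -e '1d' -e 's/ $//'
}

# Hypothetical input: a heading, then names that each end in a space.
printf '%s\n' '100 results found' 'The Phoenix ' 'The Stag And Hounds ' \
  | clean_lines
```

If the input could end in several spaces or tabs, s/[[:space:]]*$// would be a more thorough variant of the second expression.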

Now to unpack this mess

s/\b\(\w\)\(\w\+\)/\u\1\L\2/g

The backslashes make this rather unapproachable. They are dropped when using other regex implementations, like Python's:

\b(\w)(\w+)

\b is a word boundary, the start or end of a word. After it we capture the first letter in one group and the subsequent letters in a second group. Because the second group needs at least one character, single-letter words are left untouched.

\u\1\L\2

The two capture groups are then reused in the replacement but modified with \u to upper case the following letter and \L to lower case the group.
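The same substitution can be written without most of the backslashes by switching sed to extended regular expressions. This sketch assumes GNU sed, since \b, \w, \u and \L are GNU extensions, and the title_case function name is hypothetical.

```shell
# Title-case each word of two or more characters read from stdin.
# -E selects extended regex syntax, so groups and + need no backslashes.
# \u uppercases the first letter; \L lowercases the rest of the word.
title_case() {
  sed -E 's/\b(\w)(\w+)/\u\1\L\2/g'
}

# A few of the unevenly capitalised names from the listing above.
printf '%s\n' 'portwall tavern' 'THE COLOSSEUM' 'FREEDOG BRISTOL' | title_case
```

With GNU sed this should print Portwall Tavern, The Colosseum and Freedog Bristol, one per line.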

If you're interested in playing with this regex a little more, here is a Regex101 editor loaded with this example.

All together now

Putting it together leads to a very long "one liner" of around 256 characters. A few more characters could easily be shaved off, at the expense of readability of course. I was pleased to discover some new tools that are well packaged and have simple unix-style interfaces, each doing one thing well. I will definitely be going back to hxselect.

A silly adventure with a pleasing journey and a mildly useful result.