Bash and Python golfing is a silly hobby I have, but I always learn a lot from doing it. Either I get practice using a library or tool I rarely use or I get to discover new ways of dissecting a problem. I certainly did both on this golfing jaunt ⛳️
The Task
https://www.tax.service.gov.uk/eat-out-to-help-out/find-a-restaurant/
The UK government have a website listing the restaurants that are currently part of a discount scheme. I wanted an easy way to query these restaurants from the command line. My first hope was that this site had a JSON API I could call to pull out all this juicy info; alas no. I was left to contend with parsing HTML. This was the challenge I set myself, a one liner in bash to display the nearest restaurants based on this HTML content.
And so, I embarked on a frivolous learning adventure.
Show me the Code
To start at the end, here is the result of my golfing journey. I apologise for the rather clunky sed usage; this line and the rest are explained below.
curl -s https://www.tax.service.gov.uk/eat-out-to-help-out/find-a-restaurant/results\?lookup\=BS19ED \
| hxclean \
| hxselect -c -s '\n' .govuk-heading-m \
| recode -f html..ascii \
| sed -e '1d' -e 's/ $//' -e 's/\b\(\w\)\(\w\+\)/\u\1\L\2/g' \
| head -10
Leisure Cafe For Shop Limited
Fi Real Caribbean Vegan Restaurant
Pepes Piri Piri
The Ill Repute
The Old Market Assembly
Piri Piri Corner
Real Habesha Restaurant
The Climbing Academy - The Mothership
The Phoenix
The Stag And Hounds
CSS Selecting in Bash
CSS selectors are awesome and I wanted to use them through a unix tool interface, using STDIN and STDOUT. The previous ways I had queried information from an HTML document using CSS selectors, in Python with BeautifulSoup, felt a bit heavy-handed, and I hoped there would be a way to achieve this with something akin to jq.
After some googling and evaluating other options such as pup, I discovered that the W3 Org had a set of tools, HTML-XML-utils, for manipulating HTML documents from the command line. These tools even have a brew formula and can be trivially installed: brew install html-xml-utils.
hxselect was the command we needed to extract text using CSS selectors; however, any page I tried to parse had issues.
End tag </head> doesn't match start tag <meta>
hxselect needs valid XML, and very few websites produce fully compliant XHTML. After further googling and reading of the very slim W3 documentation, I discovered hxclean did the trick! It fills in missing end tags and fixes other common violations. It's probably not going to work for all cases but did the job here.
Now we can use hxselect and pass it a CSS selector.
URL="https://www.tax.service.gov.uk/eat-out-to-help-out/find-a-restaurant/"
curl -s $URL | hxclean | hxselect -c title
Find a restaurant that’s registered for the scheme - Eat Out to Help Out - GOV.UK
Awesome!
The restaurant names have the class .govuk-heading-m in the HTML doc, so this is the CSS selector used in the one-liner. The -c flag prints only the contents of the selected elements, and -s '\n' sets the separator so that each match appears on its own line.
Dealing with HTML Entities
This all looked excellent until
McDonald&#39;s Restaurant
Oh no! I forgot about HTML Entities.
An HTML entity is a piece of text ("string") that begins with an ampersand (&) and ends with a semicolon (;). Entities are frequently used to display reserved characters (which would otherwise be interpreted as HTML code)
https://developer.mozilla.org/en-US/docs/Glossary/Entity
I wanted to reach for Python but refrained and continued on with my fatuous challenge.
After some googling I discovered recode
The Recode library converts files between character sets and usages.
Which can be brew installed: brew install recode
We can now render the HTML Entities as human readable.
echo 'McDonald&#39;s Restaurant' | recode html..ascii
McDonald's Restaurant
recode looks powerful and it's certainly a tool I'll try to remember in future when dealing with encodings.
echo ' ?¾' | recode ..dump
UCS2 Mne Description
0020 SP space
003F ? question mark
00BE 34 vulgar fraction three quarters
000A LF line feed (lf)
Title Case
The last problem I wanted to solve was the rather arbitrary capitalisation of some business names. The output looked unsightly and a fix seemed within reach, though it probably meant losing true "one liner" status.
portwall tavern
Pasture Restaurant Limited
Friska Victoria Street
Pasty Emporium
Le Vignoble Bristol
Totos By The River
THE COLOSSEUM
FREEDOG BRISTOL
After a short search, I reached for some tools I know well: sed and a liberal helping of regular expressions. I was least happy with this part of the one-liner and would love to hear shorter and more elegant alternatives.
The full sed expression used in the one-liner contains a few other useful edits on the output too.
sed -e '1d' -e 's/ $//' -e 's/\b\(\w\)\(\w\+\)/\u\1\L\2/g'
The CSS selector also matched the heading containing 100 results found, which is the first line of output; we want to exclude that. 1d deletes the first line: 1 is the line number and d is the delete command.
s/ $// trims any unsightly trailing whitespace.
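These first two edits can be checked in isolation on a small made-up sample, using names from the output above:

```shell
# Delete the "100 results found" header line, then trim the trailing space
printf '100 results found\nPepes Piri Piri \n' | sed -e '1d' -e 's/ $//'
# Pepes Piri Piri
```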
Now to unpack this mess
s/\b\(\w\)\(\w\+\)/\u\1\L\2/g
The backslashes make this rather unapproachable; they are dropped in other regex implementations, like Python's.
\b(\w)(\w+)
\b is a word boundary, the start or end of a word; then we capture the first letter in one group and the subsequent letters in a second group.
\u\1\L\2
The two capture groups are reused in the replacement, modified with \u to upper-case the following character and \L to lower-case the rest of the group.
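To see the whole substitution at work, here it is applied to a couple of the offending names from earlier (this needs GNU sed, since \u and \L are GNU extensions):

```shell
# Title-case each word: \u upper-cases the first captured letter,
# \L lower-cases the rest of the word
printf 'portwall tavern\nTHE COLOSSEUM\n' \
  | sed 's/\b\(\w\)\(\w\+\)/\u\1\L\2/g'
# Portwall Tavern
# The Colosseum
```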
If you're interested in playing with this regex a little more, here is a Regex101 editor loaded with this example.
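Since shorter or more elegant alternatives were invited: one sketch is doing the title-casing in plain awk instead. It isn't necessarily shorter, and it assumes words are space-separated, but toupper and tolower are POSIX awk so no GNU sed extensions are needed:

```shell
# For each whitespace-separated field, upper-case the first character
# and lower-case the remainder, then print the rebuilt line
printf 'portwall tavern\nFREEDOG BRISTOL\n' \
  | awk '{for(i=1;i<=NF;i++)$i=toupper(substr($i,1,1)) tolower(substr($i,2));print}'
# Portwall Tavern
# Freedog Bristol
```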
All together now
Putting it together leads to a very long "one liner" of around 256 characters. A few more characters could easily be shaved off, at the expense of readability of course. I was pleased to discover some new tools that are well packaged and have simple unix-style interfaces, doing one thing well. I'll definitely be going back to hxselect.
A silly adventure with a pleasing journey and a mildly useful result.