Bash and Python golfing is a silly hobby I have, but I always learn a lot from doing it. Either I get practice using a library or tool I rarely use or I get to discover new ways of dissecting a problem. I certainly did both on this golfing jaunt ⛳️
The UK government have a website listing the restaurants that are currently part of a discount scheme. I wanted an easy way to query these restaurants from the command line. My first hope was that this site had a JSON API I could call to pull out all this juicy info; alas, no. I was left to contend with parsing HTML. This was the challenge I set myself: a one liner in bash to display the nearest restaurants based on this HTML content.
And so, I embarked on a frivolous learning adventure.
Show me the Code
To start at the end, here was the result of my golfing journey. I apologise for the rather clunky sed usage; this line and the rest are explained below.
CSS Selecting in Bash
CSS selectors are awesome and I wanted to use them through a unix tool interface, using STDIN and STDOUT. The ways I had previously queried information from an HTML document using CSS selectors, in Python with BeautifulSoup, felt a bit heavy handed and I hoped there would be a way to achieve this using something akin to
After some googling and evaluating other options such as pup, I discovered that the W3C had a set of tools, HTML-XML-utils, for manipulating HTML documents from the command line. These tools even had a brew formula and could be trivially installed:
brew install html-xml-utils.
hxselect was the command we needed to extract text using CSS selectors; however, any page I tried to parse had issues:
End tag </head> doesn't match start tag <meta>
hxselect needs valid XML and very few websites produce fully compliant XHTML. After further googling and reading of the very slim W3C documentation, I discovered hxclean did the trick! It fills in missing end tags and fixes other common violations. It's probably not going to work for all cases but it did the job here.
Now we can use hxselect and pass it a CSS selector.
URL="https://www.tax.service.gov.uk/eat-out-to-help-out/find-a-restaurant/"
curl -s "$URL" | hxclean | hxselect -c title
Find a restaurant that’s registered for the scheme - Eat Out to Help Out - GOV.UK
The restaurant names had the class name .govuk-heading-m in the HTML doc, so this was the CSS selector we use in the one liner. The -c flag prints only the contents of the selected elements, and -s '\n' prints a newline after each match so that every name lands on its own line.
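As a rough sketch of the selection step, here is the same pipeline run against a small inline page instead of the live site. The sample HTML and the select_names helper are made up for illustration; only the hxclean and hxselect invocations come from the text above, and the demo is skipped when html-xml-utils is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical sample page. The unclosed <meta> mimics the kind of markup
# that hxselect rejects on its own.
html='<html><head><title>Sample</title><meta charset="utf-8"></head>
<body><h2 class="govuk-heading-m">2 results found</h2>
<h3 class="govuk-heading-m">portwall tavern</h3>
<h3 class="govuk-heading-m">THE COLOSSEUM</h3></body></html>'

# select_names is an illustrative helper name, not part of html-xml-utils.
select_names() {
  # hxclean repairs the markup; hxselect -c prints element contents only,
  # and -s '\n' writes a newline after each match.
  hxclean | hxselect -c -s '\n' '.govuk-heading-m'
}

# Only run the demo when both tools are available.
if command -v hxclean >/dev/null && command -v hxselect >/dev/null; then
  printf '%s\n' "$html" | select_names
fi
```

Reading HTML from STDIN keeps the helper composable: the curl call can be swapped for a saved file while experimenting with selectors.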
Dealing with HTML Entities
This all looked excellent until
Oh no! I forgot about HTML Entities.
An HTML entity is a piece of text ("string") that begins with an ampersand (&) and ends with a semicolon (;). Entities are frequently used to display reserved characters (which would otherwise be interpreted as HTML code)
I wanted to reach for Python but refrained and continued on with my fatuous challenge.
After some googling I discovered recode:
The Recode library converts files between character sets and usages.
It can be brew installed:
brew install recode
We can now render the HTML Entities as human readable.
echo 'McDonald&#39;s Restaurant' | recode html..ascii
recode looks powerful and it's certainly a tool I'll try to remember in future when dealing with encodings.
echo ' ?¾' | recode ..dump
UCS2   Mne   Description

0020   SP    space
003F   ?     question mark
00BE   34    vulgar fraction three quarters
000A   LF    line feed (lf)
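Back on the entity decoding step: where recode is not installed, a handful of common entities can be handled with plain sed. This is a swapped-in fallback rather than what the one liner uses. The decode_entities helper is a hypothetical name, it covers only five entities, and it assumes the apostrophe arrives as the numeric entity &#39;.

```shell
#!/usr/bin/env bash
# decode_entities is an illustrative helper, not a replacement for recode:
# it only knows the five entities listed here.
decode_entities() {
  # &amp; is decoded last so that text such as "&amp;lt;" is not decoded twice.
  sed -e "s/&#39;/'/g" \
      -e 's/&quot;/"/g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&amp;/\&/g'
}

echo 'McDonald&#39;s Restaurant' | decode_entities
echo 'Fish &amp; Chips' | decode_entities
```

For anything beyond these few cases, recode (or a proper HTML library) is the safer choice, since the full set of named and numeric entities is large.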
The last problem I wanted to solve was the rather arbitrary capitalisation of some business names. This output looked unsightly and solving it seemed within reach, though solving this problem probably meant losing true "one liner" status.
portwall tavern Pasture Restaurant Limited Friska Victoria Street Pasty Emporium Le Vignoble Bristol Totos By The River THE COLOSSEUM FREEDOG BRISTOL
After a short search, I reached for some tools I know well: sed and a liberal helping of regular expressions. I was least happy with this part of the one liner and would love to hear shorter and more elegant alternatives.
The sed expression used in the one liner contains a few other useful edits on the output too.
sed -e '1d' -e 's/ $//' -e 's/\b\(\w\)\(\w\+\)/\u\1\L\2/g'
The CSS selector matched the heading containing 100 results found, which is the first line of output; we want to exclude that.
1d deletes the first line: 1 is the line number and d is the delete command.
s/ $// trims any unsightly trailing whitespace.
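The first two expressions can be tried in isolation. The sketch below feeds them some made-up input (a results heading followed by two names with a trailing space); clean_lines is a hypothetical helper name used only for this illustration.

```shell
#!/usr/bin/env bash
# Made-up sample: a results heading, then two names that end in a space.
sample='100 results found
portwall tavern 
THE COLOSSEUM '

# clean_lines is an illustrative helper wrapping the two sed expressions.
clean_lines() {
  # 1d drops line 1; s/ $// removes a single trailing space from each
  # remaining line.
  sed -e '1d' -e 's/ $//'
}

printf '%s\n' "$sample" | clean_lines
```

Note that s/ $// removes only one trailing space; a pattern such as 's/ *$//' would be needed if lines could end in several.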
Now to unpack this mess
The backslashes make this rather unapproachable; many of them are dropped when using other regex implementations, like Python's.
\b is a word boundary, the start or end of a word, and then we are capturing the first letter in one group and subsequent letters in a second group.
The two capture groups are then reused in the replacement but modified with \u to upper case the following letter and \L to lower case the group.
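To see the capitalisation expression on its own, here is a small sketch that runs it over a few sample names. title_case is a hypothetical helper name, and the expression relies on GNU sed extensions (\b, \w, \+, \u and \L), so it assumes GNU sed rather than the BSD sed that ships with macOS.

```shell
#!/usr/bin/env bash
# title_case is an illustrative helper wrapping the expression from the post.
title_case() {
  # Group 1 is the first letter of a word, group 2 the rest of the word.
  # \u upper-cases the next character; \L lower-cases what follows it.
  sed -e 's/\b\(\w\)\(\w\+\)/\u\1\L\2/g'
}

printf '%s\n' 'portwall tavern' 'THE COLOSSEUM' 'FREEDOG BRISTOL' | title_case
```

Two quirks are worth knowing about. Because the second group needs at least one character, single-letter words are left untouched. And an apostrophe ends the word, so a name like McDonald's comes out as Mcdonald's.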
If you're interested in playing with this regex a little more, here is a Regex101 editor loaded with this example.
All together now
Putting it together leads to a very long "one liner" of around 256 characters. A few more characters could be easily shaved off at the expense of readability of course. I was pleased to discover some new tools that are well packaged and have simple unix-style interfaces, doing one thing well. I definitely will be going back to
A silly adventure with a pleasing journey and a mildly useful result.