How to extract Open House London building locations from the website

It's the annual OHL challenge because they won't post a Google Map

First we have to establish the index numbers of the current buildings, these are apparent in the link given for each e.g.

https://openhouselondon.open-city.org.uk/listings/3416

From the website we note there are 841 buildings (now 846, 12 withdrawn 17 added) although the index numbers can reach four figures.


Stage One

Let's use PHP to visit 0 to 10,000 while checking the 200 OK return code from HTTP headers. This does not unduly load the OHL server because we are only getting the headers not the whole page with get_headers($url);

Check HTTP headers with PHP

Nano screen shot

vcount.txt holds all the attempted reads while valid.txt lists the successful ones.

The results are held in valid841.txt or valid846.txt as of 23/9/18 (12 withdrawn 17 added)


Stage Two

Next to get the OHL building pages from the file list valid841.txt into a directory ready for the next stage where we find the locations and description. 841 files listed by their index numbers. That's a wget -i filename function. If you want to know what the index numbers are click the file link.

To get the updated OHL building pages for 23/9/18 from the file list valid846.txt into a directory ready for the next stage where we find the locations and description. 846 files listed by their index numbers. That's a wget -i filename function. If you want to know what the index numbers are click the file link.


Stage Three

Use grep and sed to find the latitude, longitude, and location and write to gmap.txt, this is a shell script.

Shell script using grep and sed to find lat, long, and name

Nano screen shot



Stage Four

Running the above program to produce gmap.txt then importing that into Excel and saving as a comma separated value (.csv) file which can be imported into a mapping application .

Download the base data from gmap846.csv and make your own Google or other map.

Running the above program to produce an updated (23/9/18, 12 withdrawn, 17 added) file gmapupd.txt then importing that into Excel and saving as a comma separated value (.csv) file which can be imported into Google as a layer on a Google Map.


Stage Five

There were some good things I found along the way to doing this. One is exec to open a new file for reading and writing, it could not be easier done that way. (see above).

The other was an occasion when I had inadvertently preceded the links file with row numbers. I had achieved this file here valid841.txt which if you click the link will see is just a list of URLs.

Having read the file with cat I then read it with cat -b valid841.txt > newfile.txt which numbers the lines then deleted the original file. Whoops. See newfile.txt

This left me with a problem. I needed the unnumbered links to use in the Google map but had to get rid of 841 index numbers in the left hand column without doing it by hand.

It was at this point I learned about awk and more specifically awk '{print $2}' valid841.txt > newfile.txt (see first example) which ignores the l/h column and prints only the r/h. Problem solved. Unix is wonderful.



Brought to you on a Low End Box courtesy of @otaneimi