Referencing a Webpage Quickly

03 Oct 2019 - Richard Horridge

2019-10-03 Thu 13:54

I have a fairly simple problem I need to solve - getting a reliable bibliography entry from a webpage. I want it to look like this, ideally:

@comment{Template for a bibliography entry for a website}
@comment{Each variable is in angle brackets}
  AUTHOR =   {<author>},
  JOURNAL =  {<website>},
  MONTH =  {<month_published>},
  NOTE =   {Online; posted <date_published>; accessed <date_accessed>},
  TITLE =  {<title>},
  URL =    {<page_url>},
  YEAR =   {<year_published>},

@comment{Example of a complete entry}
  AUTHOR =   {Ken Milberg},
  JOURNAL =  {IBM Developer},
  MONTH =  09,
  NOTE =   {Online; posted 2009-Sep-29; accessed 2019-Sep-05},
  TITLE =  {IBM and HP virtualization},
  URL =
  YEAR =   2009,

As always, simple problems end up being more complex than first thought.

I wrote the following Bash script, which extracts a date from the Wayback Machine version of the URL. The archived copy is usually refreshed each time the source website is updated. This isn't as accurate as pulling the date directly from the page (e.g. Published on 3rd October 2019), but it works on more pages, as many websites do not list a publish date.

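  # Assumes $1 is a Wayback Machine URL, e.g. https://web.archive.org/web/<page_url>
  # Follow redirects and keep the timestamp digits from the final snapshot URL.
  DATESTR="$(curl -Ls -o /dev/null -w '%{url_effective}' "$1" | \
sed -E 's/.*web\/([0-9]+)\/http.*/\1/')";

  # Assumption: only the leading YYYYMMDD digits are passed on to date.
  DATEINPUT="${DATESTR:0:8}";
  date --iso-8601 --date="${DATEINPUT}";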

However, this does not work if the page has never been archived by the Wayback Machine.

Another issue with the Wayback Machine is that it occasionally redirects to a URL containing save rather than a series of digits corresponding to a date. In that case the Machine is saving the URL at that moment, which indicates the page had not been saved previously. I'm not sure exactly why this happens, but in most cases my code works as expected.
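Both failure cases can be handled in one place. The following is a minimal Python sketch, not the code from my script: it assumes snapshot URLs carry a 14-digit YYYYMMDDhhmmss timestamp after /web/, and the name wayback_date is made up for illustration.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumption: snapshot URLs look like .../web/<14 digits>/<original url>,
# optionally with a short modifier such as "id_" after the digits.
SNAPSHOT_RE = re.compile(r"/web/(\d{14})[a-z_]*/")


def wayback_date(effective_url: str) -> Optional[date]:
    """Return the snapshot date encoded in a Wayback Machine URL, or None."""
    match = SNAPSHOT_RE.search(effective_url)
    if match is None:
        # Covers ".../save/..." redirects and pages that were never archived.
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S").date()


print(wayback_date("https://web.archive.org/web/20190905123456/https://example.com/page"))
# 2019-09-05
print(wayback_date("https://web.archive.org/save/https://example.com/page"))
# None
```

Returning None rather than raising lets the caller fall back to another source for the date.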

Some pages use a <time> tag to indicate when a page was published. When it is present, this is a more accurate source than the Wayback Machine, so it will be preferred.
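As a rough sketch of the <time> approach, the snippet below reads the datetime attribute of the first <time> element on a page. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The names TimeTagParser and published_date_from_html are illustrative, and it assumes the attribute starts with an ISO date (YYYY-MM-DD); a page may of course use <time> for something other than the publish date.

```python
from datetime import date, datetime
from html.parser import HTMLParser
from typing import Optional


class TimeTagParser(HTMLParser):
    """Record the datetime attribute of the first <time> element that has one."""

    def __init__(self) -> None:
        super().__init__()
        self.stamp: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "time" and self.stamp is None:
            self.stamp = dict(attrs).get("datetime")


def published_date_from_html(html: str) -> Optional[date]:
    parser = TimeTagParser()
    parser.feed(html)
    if not parser.stamp:
        return None
    try:
        # Assumption: the value begins with YYYY-MM-DD; anything else is skipped.
        return datetime.strptime(parser.stamp[:10], "%Y-%m-%d").date()
    except ValueError:
        return None


sample = '<article><time datetime="2019-10-03T13:54:00+01:00">3rd October 2019</time></article>'
print(published_date_from_html(sample))
# 2019-10-03
```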

Getting a page author is less simple - there is no standardised format for an author on a webpage. A useful WikiHow article lists several options; the only one currently active in my program involves splitting the page title on the | or - characters and taking the last part.
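The title-splitting idea might look something like the sketch below. The function name split_title_author is hypothetical, and requiring whitespace around the separator is my assumption here, made so that hyphenated words inside a title are left alone. Bear in mind that the part after the separator is often the site name rather than a person.

```python
import re
from typing import Tuple

# Assumption: a separator is "|" or "-" with whitespace on both sides.
SEPARATOR = re.compile(r"\s+[|-]\s+")


def split_title_author(page_title: str) -> Tuple[str, str]:
    """Split a page title at its last separator into (title, author)."""
    text = page_title.strip()
    matches = list(SEPARATOR.finditer(text))
    if not matches:
        # No separator found: the whole string is the title, author unknown.
        return text, ""
    last = matches[-1]
    return text[:last.start()].strip(), text[last.end():].strip()


print(split_title_author("IBM and HP virtualization - IBM Developer"))
# ('IBM and HP virtualization', 'IBM Developer')
```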

Getting the title of a page is, in theory, straightforward - it sits within <title> tags. Matching these can be done using regular expressions, although this becomes fragile if the tag spans multiple lines, at which point a proper HTML parser is required. In Python 3 the most commonly used library is BeautifulSoup and this is what I used.
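BeautifulSoup is a third-party package, so to keep the example self-contained the sketch below swaps in the standard library's html.parser instead. It is not the code from my program; the names TitleParser and page_title are illustrative. It shows how a parser copes with a title that spans several lines.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore later <title> tags, e.g. inside SVG

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> str:
        # Collapse newlines and repeated spaces into single spaces.
        return " ".join("".join(self._chunks).split())


def page_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


sample = "<html><head><title>\n  IBM and HP\n  virtualization\n</title></head></html>"
print(page_title(sample))
# IBM and HP virtualization
```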

Obtaining the JOURNAL field can be done by matching an HTML meta tag, such as the following examples:

<meta property="og:site_name" content="Docker" />
<meta name="rights" content="© Copyright IBM Corporation 2019" />

These both have potentially useful names for a JOURNAL field.
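A sketch of how these tags could be read, again with the standard library's html.parser standing in for BeautifulSoup. The names MetaParser and journal_from_html are illustrative, and trying og:site_name before rights is my assumption about which value is more useful; a rights string would still need trimming before it reads well as a JOURNAL.

```python
from html.parser import HTMLParser
from typing import Optional, Sequence


class MetaParser(HTMLParser):
    """Collect <meta> content values keyed by their property or name."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        fields = dict(attrs)
        key = fields.get("property") or fields.get("name")
        content = fields.get("content")
        if key and content:
            # Keep the first value seen for each key.
            self.meta.setdefault(key.lower(), content.strip())


def journal_from_html(
    html: str, keys: Sequence[str] = ("og:site_name", "rights")
) -> Optional[str]:
    """Return the first matching meta value, trying keys in order."""
    parser = MetaParser()
    parser.feed(html)
    for key in keys:
        if key in parser.meta:
            return parser.meta[key]
    return None


print(journal_from_html('<meta property="og:site_name" content="Docker" />'))
# Docker
```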

I rewrote my initial script in Python and the following excerpt produces a BibTeX entry as required:

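  # Excerpt: url, webpage_journal, title_author and published_date are set
  # earlier in the program, and "import datetime" is needed at the top.
  # Assumptions: title_author[0] holds the page title, the access date is
  # today, and saved_path (a hypothetical name) is the locally saved copy.
  print("""
URL     = {{{0}}},
JOURNAL = {{{1}}},
AUTHOR  = {{{2}}},
TITLE   = {{{3}}},
YEAR    = {{{4}}},
MONTH   = {{{5}}},
NOTE    = {{Online; posted {6}; accessed {7}}},
  @comment{{File at "{8}"}}

  """.format(url, webpage_journal.strip(), title_author[1].strip(),
             title_author[0].strip(),
             published_date.year, published_date.month,
             published_date.strftime("%Y-%b-%d"),
             datetime.date.today().strftime("%Y-%b-%d"),
             saved_path))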

Writing the library in Python makes it much more modular and extensible than a (simple?) Bash script. However, it wasn't as simple as I expected, because Python's file handling gave me various problems that Bash did not. In a future blog post, I will rewrite this program again in my preferred language, Common Lisp, and see how the development process compares.

The source code in all of these examples is available on my GitLab.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.