Referencing a Webpage Quickly
03 Oct 2019 - Richard Horridge
I have a fairly simple problem I need to solve - getting a reliable bibliography entry from a webpage. I want it to look like this, ideally:
@comment{Template for a bibliography entry for a website}
@comment{Each variable is in angle brackets}
@misc{<label>,
  AUTHOR = {<author>},
  JOURNAL = {<website>},
  MONTH = {<month_published>},
  NOTE = {Online; posted <year_published>-<month_published>-<day_published>; accessed <year_accessed>-<month_accessed>-<day_accessed>},
  TITLE = {<title>},
  URL = {<page_url>},
  YEAR = {<year_published>},
}

@comment{Example of a complete entry}
@misc{milberg09_ibm_hp,
  AUTHOR = {Ken Milberg},
  JOURNAL = {IBM Developer},
  MONTH = 09,
  NOTE = {Online; posted 2009-Sep-29; accessed 2019-Sep-05},
  TITLE = {IBM and HP virtualization},
  URL = {https://www.ibm.com/developerworks/aix/library/au-aixhpvirtualization/index.html},
  YEAR = 2009,
}
As always, simple problems end up being more complex than first thought.
I wrote the following Bash script, which extracts a date from the
Wayback Machine version of the URL; that date usually changes each
time the source website is updated. It isn't as accurate as pulling
the date directly from the website (e.g. "Published on 3rd October
2019") but is more reliable, as many websites do not list a publish
date.
DATESTR="$(curl -Ls -o /dev/null -w '%{url_effective}' https://web.archive.org/web/"$1" | \
    sed -E 's/.*web\/([0-9]+)\/http.*/\1/')"
DATEINPUT="${DATESTR:0:8}"
date --iso-8601 --date="${DATEINPUT}"
However, this does not work if the website has not been archived by the Wayback Machine.
Another issue with the Wayback Machine is that it occasionally redirects
to a URL containing "save" rather than a series of digits
corresponding to a date. In that case the Machine is saving the URL
on that date, which indicates the page had not been saved previously.
I'm not sure exactly why this happens, but in most cases my code works
as expected.
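The date-extraction step, including the "save" case, can be sketched in Python. This is only an illustration under stated assumptions, not the script's actual code: the function name wayback_date is hypothetical, and it assumes the redirect target has already been fetched (for instance with curl's url_effective, as in the Bash script). It returns None when the URL carries no timestamp.

```python
import re
from datetime import date, datetime
from typing import Optional

# Wayback Machine snapshot URLs have the form
# https://web.archive.org/web/<timestamp>/<original URL>,
# where the timestamp starts with YYYYMMDD.
SNAPSHOT = re.compile(r"/web/(\d{8})\d*/")


def wayback_date(effective_url: str) -> Optional[date]:
    """Return the snapshot date encoded in a Wayback Machine URL, or None."""
    match = SNAPSHOT.search(effective_url)
    if match is None:
        # For example a ".../save/..." redirect, or a page never archived.
        return None
    return datetime.strptime(match.group(1), "%Y%m%d").date()
```

Returning None rather than raising lets the caller fall back to another source for the date.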
Some pages use a <time> tag to indicate when they were published.
Where present, this is more accurate than the Wayback Machine date
and is preferred.
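A minimal sketch of reading a <time> tag follows. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra dependencies. The names TimeTagParser and first_time_tag are hypothetical, and it assumes the tag carries a datetime attribute; pages that put the date only in the tag's text are not handled.

```python
from html.parser import HTMLParser


class TimeTagParser(HTMLParser):
    """Collect the datetime attribute of every <time> tag."""

    def __init__(self):
        super().__init__()
        self.datetimes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                self.datetimes.append(value)


def first_time_tag(html):
    """Return the first <time datetime="..."> value in the page, or None."""
    parser = TimeTagParser()
    parser.feed(html)
    return parser.datetimes[0] if parser.datetimes else None
```

Taking the first tag is itself an assumption: a page may contain several <time> tags (comments, related posts), and the first is not guaranteed to be the publish date.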
Getting a page author is less simple - there is no standardised
format for an author on a webpage. A useful WikiHow article lists
several options; the only one currently active in my program
involves splitting the page title on the "|" or "-" characters and
taking the latter part.
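The title-splitting heuristic might look like the sketch below. The function name split_title_author is hypothetical, and requiring whitespace around the separator is an added assumption, so that hyphenated words such as "Wi-Fi" are left intact.

```python
import re

# A "|" or "-" with whitespace on both sides.
SEPARATOR = re.compile(r"\s+[|-]\s+")


def split_title_author(page_title):
    """Split a page title at its last separator into (title, author)."""
    text = page_title.strip()
    matches = list(SEPARATOR.finditer(text))
    if not matches:
        # No separator: treat the whole string as the title.
        return text, ""
    last = matches[-1]
    return text[:last.start()], text[last.end():]
```

For a title such as "IBM and HP virtualization - IBM Developer" the second element is really the site name rather than a person, which shows how rough this heuristic is.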
Getting the title of a page is, in theory, straightforward - it sits
within <title> tags. Matching these can be done using regular
expressions, although this becomes more complicated if the tag spans
multiple lines, at which point an HTML parser is required. In Python 3
a commonly used library for this is BeautifulSoup and this is what I
used.
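To illustrate why a parser copes with a <title> that spans several lines, here is a minimal sketch. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies; the names TitleParser and page_title are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore later <title> elements, e.g. in SVG

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)


def page_title(html):
    """Return the page title with newlines and repeated spaces collapsed."""
    parser = TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.chunks).split())
```

The parser tracks element boundaries itself, so line breaks inside the tag need no special handling; only the final whitespace clean-up is required.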
The JOURNAL field can be obtained by matching an HTML meta tag,
such as the following examples:
<meta property="og:site_name" content="Docker" />
<meta name="rights" content="© Copyright IBM Corporation 2019" />
The content attribute of either tag is a potentially useful value for
the JOURNAL field.
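A sketch of pulling such a value out of a page, again with the standard library's html.parser instead of BeautifulSoup. The names MetaParser and site_name are hypothetical, and trying og:site_name before rights is an assumed preference order, not something the original program is known to do.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collect <meta> tags as a {property-or-name: content} dictionary."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) also arrive here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content:
            # Keep the first occurrence of each key.
            self.meta.setdefault(key, content)


def site_name(html, keys=("og:site_name", "rights")):
    """Return the first available meta value among keys, or None."""
    parser = MetaParser()
    parser.feed(html)
    for key in keys:
        if key in parser.meta:
            return parser.meta[key]
    return None
```

A rights value such as the IBM example would still need trimming (the copyright symbol and the year) before it reads well as a JOURNAL field.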
I rewrote my initial script in Python, and the following excerpt produces a BibTeX entry as required:
print("""
@misc{{label,
  URL = {{{0}}},
  JOURNAL = {{{1}}},
  AUTHOR = {{{2}}},
  TITLE = {{{3}}},
  YEAR = {{{4}}},
  MONTH = {{{5}}},
  NOTE = {{Online; posted {6}; accessed {7}}},
}}
@comment{{File at "{8}"}}
""".format(url,
           webpage_journal.strip(),
           title_author[1].strip(),
           title_author[0].strip(),
           published_date.year,
           published_date.month,
           published_date.strftime("%Y-%b-%d"),
           datetime.now().strftime("%Y-%b-%d"),
           local_path))
Writing the library in Python makes it much more modular and extensible than a (simple?) Bash script. However, this wasn't as simple as I expected, owing to various problems with Python's file handling compared with Bash. In a future blog post, I will rewrite this program again in my preferred language, Common Lisp, and see how the development process compares.
The source code in all of these examples is available on my GitLab.