Extracting Netscape bookmarks
Parsing Netscape HTML bookmark data with the Python programming language, and the importance of XBEL.
Yesterday, Mr. Simone Canaletti reported an issue with the import of bookmarks from a Netscape HTML bookmark file which was generated by the annotation, bookmarks and knowledge management system Shaarli.
This issue was expected; the Irish gentlemen were aware of this concern, and have refused to prioritize it, because it involves not-well-formed (i.e. invalid) markup, and would require extensive tests.
So, I am the one to do this task.
Python modules
There are several Python modules which perform this task; all of them utilize the module BeautifulSoup, and all of them appear to ignore the presence of the HTML element <DD/>, which encloses the description of a bookmark.
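For reference, this is a representative excerpt of a Shaarli-style Netscape HTML bookmark file (the addresses, dates and tags are illustrative); notice that the elements <DT/> and <DD/> are never closed, which is what renders the file not-well-formed.

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><A HREF="https://example.org/" ADD_DATE="1700000000" PRIVATE="0" TAGS="research,xbel">An example bookmark</A>
<DD>A description of the bookmark.
<DT><A HREF="https://example.net/" ADD_DATE="1700000001" PRIVATE="0" TAGS="">A bookmark without a description</A>
</DL><p>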
Negligence of description input
The reason for neglecting the HTML element <DD/> is probably that it is no longer present in the bookmark exports of a couple of HTML browsers which are not worth naming.
The decision to eliminate the description field for bookmarks has caused harm to researchers and potential researchers and must be condemned.
Thankfully, the people who are responsible for Shaarli and other good researching platforms did not betray their people.
Ignoring XBEL
I do wonder whether the continued adoption of the malformed Netscape HTML bookmark file, instead of XBEL or XHTML, was part of a conspiracy to serve as an excuse to neglect the description input of built-in desktop bookmark managers.
I think so, because, as will further be revealed, due to the not-well-formed structure of the Netscape HTML bookmark file, the description field is the subject of the problem of this article, and this problem could be used by bad people as an excuse to remove the description input from the bookmark managers of HTML browsers.
If the HTML browsers with built-in bookmark managers which are made by "multi-million worth(less) organizations" utilized XBEL, just as Arora, buku, floccus, Galeon, KBookmarks, Konqueror, Midori, Otter Browser, SiteBar, Spurl, SyncPlaces, and other independent projects do, then this problem would not exist.
In fact, supporting XBEL is fault-proof; it is cheaper, easier and faster, and it requires less code, than supporting the Netscape HTML bookmark file.
If XBEL were denounced, it would have been obvious that there is a concealed motive to sabotage bookmarks and to consequently harm researchers; so, instead, XBEL has been ignored, and the Netscape HTML bookmark file has kept being utilized.
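To demonstrate how little code XBEL requires, here is a minimal sketch (the document contents are illustrative) which parses an XBEL document with the module lxml, including the description field which this article is concerned with; because XBEL is well-formed XML, no guessing and no recovery is involved.

from lxml import etree

# A minimal, illustrative XBEL document; note the explicit closing tags
xbel = """<?xml version="1.0" encoding="UTF-8"?>
<xbel version="1.0">
  <bookmark href="https://example.org/">
    <title>An example bookmark</title>
    <desc>A description of the bookmark.</desc>
  </bookmark>
</xbel>"""

# Parse the document and print each bookmark with its description
root = etree.fromstring(xbel.encode('utf-8'))
for bookmark in root.iter('bookmark'):
    print(bookmark.get('href'),
          bookmark.findtext('title'),
          bookmark.findtext('desc'))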
Conspiracy
There are reasons to suspect a conspiracy.
In addition to the ease of generating and parsing XBEL, another reason to suspect a conspiracy arises from analysing the claim which the people of Mozilla have made concerning the removal of syndication support from their products.
They have claimed that the built-in syndication feed renderer and live bookmarks were removed because these use old code and technology.
If so, then why is the Netscape HTML bookmark file, which is an old and obsolete technology, still in use today?
This is interesting, because a lawyer, who is not a trained coder, has managed to implement a multi-format (Atom, OPML, RDF, RSS, SMF) syndication renderer in both XSLT and ECMAScript (i.e. JS).
This is interesting also because Otter Browser, from Mr. Emdek, has extensive support for syndication.
HTML Parser : First attempt
Because the Netscape bookmark file appears to be parsed by the Dillo browser as any man would expect, I have decided to utilize an HTML parser, even though the Netscape HTML bookmark file is not-well-formed.
I have further decided to utilize the module lxml instead of BeautifulSoup, because lxml is already a dependency of Blasta, and it can also parse HTML.
This is the code.
from datetime import datetime
from lxml import etree
import time

def load_data_netscape(html: str) -> dict:
    # Parse HTML
    parser = etree.HTMLParser()
    tree = etree.fromstring(html, parser)
    # Initialize a list to hold bookmarks
    entries = []
    # Find all anchor tags within the DL element
    for dt_tag in tree.xpath('//dt'):
        a_tag = dt_tag.find('a')
        if a_tag is not None:
            title = a_tag.text or ""
            link = a_tag.get('href', "")
            published = a_tag.get('add_date', str(time.time()))
            updated = a_tag.get('last_modified', str(time.time()))
            tags = a_tag.get('tags', "").split(',')
            # Find the corresponding DD tag for summary
            sibling = dt_tag.getnext()
            summary = (sibling.text or "") if sibling is not None and sibling.tag == 'dd' else ""
            # Append bookmark dictionary to list
            entries.append({
                "title": title,
                "link": link,
                "summary": summary.strip(),
                "published": datetime.fromtimestamp(float(published)).isoformat(),
                "updated": datetime.fromtimestamp(float(updated)).isoformat(),
                "tags": tags
            })
    return {'entries': entries}
Due to the not-well-formed (i.e. invalid) structure of the Netscape HTML bookmark file, the problem was that, upon a lack of an element <DD/> for a given element <DT/>, the closest <DD/> element would be detected as if it were affiliated with the <DT/> element which lacks a description.
In other attempts, the <DT/> element which actually had an associated <DD/> element was skipped, and the description was allocated to the previous <DT/> element which lacked a <DD/> element.
This has caused an improper import with misplaced descriptions.
Regular expression
After over a couple of dozen failed attempts during the span of three hours of coding and writing of observations, including attempts to compile an XHTML file as the subject data to be parsed, I have resorted to RegEx.
I did not think that I would ever attempt to parse HTML with RegEx, after having read this article about XHTML.
Because the malpractice of generating malformed HTML documents, as if it were a normal thing, appears to be prevalent amongst "multi-million worth(less) organizations", I have exceptionally considered resorting to RegEx.
This is the code.
from datetime import datetime
import re
def load_data_from_html(html: str) -> dict:
    bookmarks = []
    lines = html.splitlines()
    for line in lines:
        line = line.strip()
        # Check for <DT> lines which hold the bookmark anchor.
        # The pattern expects the attribute order of Shaarli
        # (i.e. HREF, ADD_DATE, PRIVATE, TAGS).
        if line.startswith('<DT>'):
            match = re.match(
                r'<DT><A HREF="([^"]*)" ADD_DATE="([^"]*)" '
                r'PRIVATE="([^"]*)" TAGS="([^"]*)">(.*)</A>',
                line)
            if match:
                link, published, private, tags, title = match.groups()
                bookmarks.append({
                    "title": title,
                    "link": link,
                    "summary": "",
                    # ADD_DATE is reused for the "updated" field
                    "published": datetime.fromtimestamp(float(published)).isoformat(),
                    "updated": datetime.fromtimestamp(float(published)).isoformat(),
                    "tags": tags.split(',')
                })
        # Check for <DD> lines which hold the description of the most
        # recently appended bookmark
        elif line.startswith('<DD>') and bookmarks:
            bookmarks[-1]["summary"] = line[4:].strip()
    return {'entries': bookmarks}

- The function iterates line by line.
- Upon detection of element <DT/> (i.e. a line which starts with <DT>), a dict element (i.e. bookmark) is built in accordance with the RegEx rule.
- Upon detection of element <DD/> (i.e. a line which starts with <DD>), the textual content is set as a summary of the most recent dict element (i.e. bookmark).
However, this function works only with Shaarli, which means that it would fail when the attribute PRIVATE is missing, and upon the presence of unexpected attributes.
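A more tolerant sketch (an assumption of mine, not the code of Blasta) would capture the attributes of element <A/> as a whole, and then extract each attribute individually, so that the order and the presence of attributes may vary.

import re

def parse_anchor_line(line: str) -> dict:
    # Match the anchor and capture its attributes as a single group
    anchor = re.match(r'<DT><A\s+([^>]*)>(.*)</A>', line)
    if not anchor:
        return {}
    # Extract the attributes individually, regardless of their order
    attributes = dict(re.findall(r'(\w+)="([^"]*)"', anchor.group(1)))
    return {
        "title": anchor.group(2),
        "link": attributes.get('HREF', ""),
        "tags": attributes.get('TAGS', "").split(','),
        "private": attributes.get('PRIVATE', "0")
    }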
HTML Parser : Second attempt
By adopting the practice of line iteration, I have resorted to parsing single lines with the HTML parser of module lxml.
It feels idiotic to parse each line with an HTML parser, yet I do not think that there is any other viable choice.
It also makes me despise "multi-million worth(less) organizations", as they appear to sabotage instead of creating; and I suspect that most of their employees are idiots who are incapable of managing even the smallest and cheapest of the assets of my family.
This is the code.
from datetime import datetime
from lxml import etree
import time

def load_data_netscape(html: str) -> dict:
    bookmarks = []
    parser = etree.XMLParser(recover=True)
    lines = html.splitlines()
    for line in lines:
        line = line.strip()
        if line.startswith('<DT>'):
            # Parse the given line; the recovering parser closes the
            # unclosed <DT> element at the end of the line
            root = etree.fromstring(line, parser)
            a_tag = root.find('A') if root is not None else None
            if a_tag is not None:
                bookmarks.append({
                    "title": a_tag.text or "",
                    "link": a_tag.get('HREF', ""),
                    "summary": "",
                    "published": datetime.fromtimestamp(
                        float(a_tag.get('ADD_DATE', str(time.time())))).isoformat(),
                    "updated": datetime.fromtimestamp(
                        float(a_tag.get('LAST_MODIFIED', str(time.time())))).isoformat(),
                    "tags": a_tag.get('TAGS', "").split(',')
                })
        # Upon a line which starts with <DD>, remove the first 4
        # characters (i.e. <DD>) and keep the remainder as a summary
        elif line.startswith('<DD>') and bookmarks:
            bookmarks[-1]["summary"] = line[4:].strip()
    return {'entries': bookmarks}

- The function iterates line by line.
- Upon detection of element <DT/> (i.e. a line which starts with <DT>), a dict element (i.e. bookmark) is built by the HTML parser of module lxml.
- Upon detection of element <DD/> (i.e. a line which starts with <DD>), the textual content is set as a summary of the most recent dict element (i.e. bookmark).
Notice that the HTML parser of module lxml is utilized for tag <DT/>, and, upon detection of a line which starts with <DD>, the first 4 characters (i.e. <DD>) are removed.
It is important to understand that we assume that lines which begin with <DD> contain only the element <DD/>.
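As a usage example (the file name bookmarks.html is hypothetical), the function may be invoked as follows.

# Hypothetical input file, exported from Shaarli
with open('bookmarks.html', 'r', encoding='utf-8') as file:
    data = load_data_netscape(file.read())

for entry in data['entries']:
    print(entry['link'], '-', entry['summary'])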
Conclusion
The best manner to parse Netscape bookmark files is by iterating line by line and parsing each line with an HTML parser.
Upon detection of element <DD/> (i.e. description), the text should be associated with the most recently created bookmark.
Because there are no closing tags, the process of data extraction is mostly based on guessing.
Verdict
The Netscape HTML bookmark file, which is not-well-formed, is a bad format, and it must be deprecated.
The practice of utilizing the Netscape HTML bookmark file causes a waste of development time, and discourages seriousness of work.
I advise to utilize the standard XBEL as a proper means to import, exchange and export bookmarks.
Post script
Mr. Simone Canaletti, who is also known by the name roughnecks, is the man who dictates the roadmap of project Blasta, and reviews the general operation, quality and outcomes of Blasta.
While Blasta supports import of bookmarks from Netscape HTML bookmark files, it will not support export of bookmarks to Netscape HTML bookmark files.
Blasta will support XBEL, in addition to JSON and TOML.