Extracting Netscape bookmarks
Parsing Netscape HTML bookmark data with the Python programming language, and the importance of XBEL.
Yesterday, Mr. Simone Canaletti reported an issue with the import of bookmarks from a Netscape HTML bookmark file which was generated by the annotation, bookmarks and knowledge management system Shaarli.
This issue was expected; the Irish gentlemen were aware of this concern, and have refused to prioritize it, because it involves not-well-formed (i.e. invalid) markup, and would require extensive tests.
So, I am the one to do this task.
Python modules
There are several Python modules which perform this task; all of them utilize the module BeautifulSoup, and all of them appear to ignore the presence of the HTML element <DD/>, which encloses the description of a bookmark.
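For reference, this is a representative excerpt of a Shaarli-style Netscape HTML bookmark file (the addresses, dates and tags are illustrative); notice that the elements <DT/> and <DD/> are never closed, which is what renders the file not-well-formed.

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><A HREF="https://example.org/" ADD_DATE="1700000000" PRIVATE="0" TAGS="research,xbel">An example bookmark</A>
<DD>A description of the bookmark.
<DT><A HREF="https://example.net/" ADD_DATE="1700000001" PRIVATE="0" TAGS="">A bookmark without a description</A>
</DL><p>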
Negligence of description input
The reason for neglecting the HTML element <DD/> is probably that it is no longer present in the bookmark exports of a couple of HTML browsers which are not worth naming.
The decision to eliminate the description field for bookmarks has caused harm to researchers and potential researchers and must be condemned.
Thankfully, the people who are responsible for Shaarli and other good researching platforms did not betray their people.
Ignoring XBEL
I do wonder whether the continued adoption of the malformed Netscape HTML bookmark file, instead of XBEL or XHTML, was part of a conspiracy to serve as an excuse to neglect the description input of built-in desktop bookmark managers.
I think so, because, as will further be revealed, due to the not-well-formed structure of the Netscape HTML bookmark file, the description field is the subject of the problem of this article, and this problem could be used by bad people as an excuse to remove the description input from the bookmark managers of HTML browsers.
If the HTML browsers with built-in bookmark managers which are made by "multi-million worth(less) organizations" utilized XBEL, just as Arora, buku, floccus, Galeon, KBookmarks, Konqueror, Midori, Otter Browser, SiteBar, Spurl, SyncPlaces, and other independent projects do, then this problem would not exist.
In fact, supporting XBEL is fault-proof; it is cheaper, easier and faster, and it requires less code, than supporting the Netscape HTML bookmark file.
If XBEL were denounced, it would have been obvious that there is a concealed motive to sabotage bookmarks and to consequently harm researchers; so, instead, XBEL has been ignored, and the Netscape HTML bookmark file has kept being utilized.
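To demonstrate how little code XBEL requires, here is a minimal sketch (the document contents are illustrative) which parses an XBEL document with the module lxml, including the description field which this article is concerned with; because XBEL is well-formed XML, no guessing and no recovery is involved.

from lxml import etree

# A minimal, illustrative XBEL document; note the explicit closing tags
xbel = """<?xml version="1.0" encoding="UTF-8"?>
<xbel version="1.0">
  <bookmark href="https://example.org/">
    <title>An example bookmark</title>
    <desc>A description of the bookmark.</desc>
  </bookmark>
</xbel>"""

# Parse the document and print each bookmark with its description
root = etree.fromstring(xbel.encode('utf-8'))
for bookmark in root.iter('bookmark'):
    print(bookmark.get('href'),
          bookmark.findtext('title'),
          bookmark.findtext('desc'))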
Conspiracy
There are reasons to suspect a conspiracy.
In addition to the ease of generating and parsing XBEL, another reason to suspect a conspiracy arises from analysing the claim which the people of Mozilla have made concerning the removal of syndication support from their products.
They have claimed that the built-in syndication feed renderer and live bookmarks were removed because these use old code and technology.
If so, then why is the Netscape HTML bookmark file, which is an old and obsolete technology, still in use today?
This is interesting, because a lawyer, who is not a trained coder, has managed to implement a multi-format (Atom, OPML, RDF, RSS, SMF) syndication renderer in both XSLT and ECMAScript (i.e. JS).
This is interesting also because Otter Browser, from Mr. Emdek, has extensive support for syndication.
HTML Parser : First attempt
Because the Netscape bookmark file appears to be parsed by the Dillo browser as any man would expect, I have decided to utilize an HTML parser, even though the Netscape HTML bookmark file is not-well-formed.
I have further decided to utilize the module lxml instead of BeautifulSoup, because lxml is already a dependency of Blasta, and it can also parse HTML.
This is the code.
from datetime import datetime
from lxml import etree
import time

def load_data_netscape(html: str) -> dict:
    # Parse HTML
    parser = etree.HTMLParser()
    tree = etree.fromstring(html, parser)
    # Initialize a list to hold bookmarks
    entries = []
    # Find all anchor tags within the DL element
    for dt_tag in tree.xpath('//dt'):
        a_tag = dt_tag.find('a')
        if a_tag is not None:
            title = a_tag.text or ""
            link = a_tag.get('href', "")
            published = a_tag.get('add_date', str(time.time()))
            updated = a_tag.get('last_modified', str(time.time()))
            tags = a_tag.get('tags', "").split(',')
            # Find the corresponding DD tag for summary
            sibling = dt_tag.getnext()
            summary = (sibling.text or "") if sibling is not None and sibling.tag == 'dd' else ""
            # Append bookmark dictionary to list
            entries.append({
                "title": title,
                "link": link,
                "summary": summary.strip(),
                "published": datetime.fromtimestamp(float(published)).isoformat(),
                "updated": datetime.fromtimestamp(float(updated)).isoformat(),
                "tags": tags
            })
    return {'entries': entries}
Due to the not-well-formed (i.e. invalid) structure of the Netscape HTML bookmark file, the problem was that, upon a lack of an element <DD/> for a given element <DT/>, the closest <DD/> element would be detected as if it were affiliated with the <DT/> element which lacks a description.
In other attempts, the <DT/> element which actually had an associated <DD/> element was skipped, and the description was allocated to the previous <DT/> element which lacked a <DD/> element.
This has caused an improper import with misplaced descriptions.
Regular expression
After over a couple of dozen failed attempts during the span of three hours of coding and writing of observations, including attempts to compile an XHTML file as the subject data to be parsed, I have resorted to RegEx.
I did not think that I would ever attempt to parse HTML with RegEx, after having read this article about XHTML.
Because the malpractice of generating malformed HTML documents, as if it were a normal thing, appears to be prevalent amongst "multi-million worth(less) organizations", I have exceptionally considered resorting to RegEx.
This is the code.
from datetime import datetime
import re
def load_data_from_html(html: str) -> dict:
    bookmarks = []
    lines = html.splitlines()
    for line in lines:
        line = line.strip()
        # Check for <DT> lines which hold the bookmark anchor.
        # The pattern expects the attribute order of Shaarli
        # (i.e. HREF, ADD_DATE, PRIVATE, TAGS).
        if line.startswith('<DT>'):
            match = re.match(
                r'<DT><A HREF="([^"]*)" ADD_DATE="([^"]*)" '
                r'PRIVATE="([^"]*)" TAGS="([^"]*)">(.*)</A>',
                line)
            if match:
                link, published, private, tags, title = match.groups()
                bookmarks.append({
                    "title": title,
                    "link": link,
                    "summary": "",
                    # ADD_DATE is reused for the "updated" field
                    "published": datetime.fromtimestamp(float(published)).isoformat(),
                    "updated": datetime.fromtimestamp(float(published)).isoformat(),
                    "tags": tags.split(',')
                })
        # Check for <DD> lines which hold the description of the most
        # recently appended bookmark
        elif line.startswith('<DD>') and bookmarks:
            bookmarks[-1]["summary"] = line[4:].strip()
    return {'entries': bookmarks}

- The function iterates line by line.
- Upon detection of element <DT/> (i.e. a line which starts with <DT>), a dict element (i.e. bookmark) is built in accordance with the RegEx rule.
- Upon detection of element <DD/> (i.e. a line which starts with <DD>), the textual content is set as a summary of the most recent dict element (i.e. bookmark).
However, this function works only with Shaarli, which means that it would fail when the attribute PRIVATE is missing, and upon the presence of unexpected attributes.
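A more tolerant sketch (an assumption of mine, not the code of Blasta) would capture the attributes of element <A/> as a whole, and then extract each attribute individually, so that the order and the presence of attributes may vary.

import re

def parse_anchor_line(line: str) -> dict:
    # Match the anchor and capture its attributes as a single group
    anchor = re.match(r'<DT><A\s+([^>]*)>(.*)</A>', line)
    if not anchor:
        return {}
    # Extract the attributes individually, regardless of their order
    attributes = dict(re.findall(r'(\w+)="([^"]*)"', anchor.group(1)))
    return {
        "title": anchor.group(2),
        "link": attributes.get('HREF', ""),
        "tags": attributes.get('TAGS', "").split(','),
        "private": attributes.get('PRIVATE', "0")
    }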
HTML Parser : Second attempt
By adopting the practice of line iteration, I have resorted to parsing single lines with the HTML parser of module lxml.
It feels idiotic to parse each line with an HTML parser, yet I do not think that there is any other viable choice.
It also makes me despise "multi-million worth(less) organizations", as they appear to sabotage instead of creating; and I suspect that most of their employees are idiots who are incapable of managing even the smallest and cheapest of the assets of my family.
This is the code.
from datetime import datetime
from lxml import etree
import time

def load_data_netscape(html: str) -> dict:
    bookmarks = []
    parser = etree.XMLParser(recover=True)
    lines = html.splitlines()
    for line in lines:
        line = line.strip()
        if line.startswith('<DT>'):
            # Parse the given line; the recovering parser closes the
            # unclosed <DT> element at the end of the line
            root = etree.fromstring(line, parser)
            a_tag = root.find('A') if root is not None else None
            if a_tag is not None:
                bookmarks.append({
                    "title": a_tag.text or "",
                    "link": a_tag.get('HREF', ""),
                    "summary": "",
                    "published": datetime.fromtimestamp(
                        float(a_tag.get('ADD_DATE', str(time.time())))).isoformat(),
                    "updated": datetime.fromtimestamp(
                        float(a_tag.get('LAST_MODIFIED', str(time.time())))).isoformat(),
                    "tags": a_tag.get('TAGS', "").split(',')
                })
        # Upon a line which starts with <DD>, remove the first 4
        # characters (i.e. <DD>) and keep the remainder as a summary
        elif line.startswith('<DD>') and bookmarks:
            bookmarks[-1]["summary"] = line[4:].strip()
    return {'entries': bookmarks}

- The function iterates line by line.
- Upon detection of element <DT/> (i.e. a line which starts with <DT>), a dict element (i.e. bookmark) is built by the HTML parser of module lxml.
- Upon detection of element <DD/> (i.e. a line which starts with <DD>), the textual content is set as a summary of the most recent dict element (i.e. bookmark).
Notice that the HTML parser of module lxml is utilized for tag <DT/>, and, upon detection of a line which starts with <DD>, the first 4 characters (i.e. <DD>) are removed.
It is important to understand that we assume that lines which begin with <DD> contain only the element <DD/>.
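As a usage example (the file name bookmarks.html is hypothetical), the function may be invoked as follows.

# Hypothetical input file, exported from Shaarli
with open('bookmarks.html', 'r', encoding='utf-8') as file:
    data = load_data_netscape(file.read())

for entry in data['entries']:
    print(entry['link'], '-', entry['summary'])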
Conclusion
The best manner to parse Netscape bookmark files is by iterating line by line and parsing each line with an HTML parser.
Upon detection of element <DD/> (i.e. description), the text should be associated with the most recently created bookmark.
Because there are no closing tags, the process of data extraction is mostly based on guessing.
Verdict
The Netscape HTML bookmark file, which is not-well-formed, is a bad format, and it must be deprecated.
The practice of utilizing the Netscape HTML bookmark file causes a waste of development time, and discourages seriousness of work.
I advise to utilize the standard XBEL as a proper means to import, exchange and export bookmarks.
Post script
Mr. Simone Canaletti, who is also known by the name roughnecks, is the man who dictates the roadmap of project Blasta, and reviews the general operation, quality and outcomes of Blasta.
While Blasta supports import of bookmarks from Netscape HTML bookmark files, it will not support export of bookmarks to Netscape HTML bookmark files.
Blasta will support XBEL, in addition to JSON and TOML.