AuraSearch Features
Current State of Features
- Full Text Search of document metadata, with porter stemming.
- + and - operators, for a required term, or excluded term, respectively.
- Title extraction using first apparent heading, regardless of its level.
- Gemsub feed detection.
- Line counts.
- Indexed publication dates based on dates in filenames.
- File size information.
- Indexed Mp3, Ogg, and Flac file metadata (ID3, MP4, and Ogg/Flac).
- Aggregator based on search engine index.
- Wildcards: * and ?
- Crawler: Robots.txt is followed, including "Allow", "Disallow", and "Crawl-Delay" directives. The Slow Down gemini status code is also followed.
- Crawler: 2 second delay between crawling of pages on the same domain.
- Parses gemtext, spartan text, nex listings, scrolltext.
- Partial markdown parsing.
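As a rough illustration of two of the features listed above (title extraction from the first apparent heading, and publication dates taken from filenames), here is a minimal Python sketch. It is not AuraSearch's actual code. The function names, the accepted date patterns (YYYY-MM-DD and YYYYMMDD), and the choice to skip preformatted blocks are all assumptions made for the example.

```python
import re
from datetime import date
from typing import Optional

# Assumed filename date patterns: 2022-07-21 or 20220721.
FILENAME_DATE_RE = re.compile(r"(\d{4})-?(\d{2})-?(\d{2})")


def extract_title(gemtext: str) -> Optional[str]:
    """Return the text of the first heading line, whatever its level."""
    in_preformatted = False
    for line in gemtext.splitlines():
        # Lines starting with ``` toggle preformatted mode in gemtext;
        # headings inside such a block are ignored (an assumption).
        if line.startswith("```"):
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue
        if line.startswith("#"):
            text = line.lstrip("#").strip()
            if text:
                return text
    return None


def date_from_filename(path: str) -> Optional[date]:
    """Return the first valid date found in the last path segment."""
    name = path.rsplit("/", 1)[-1]
    match = FILENAME_DATE_RE.search(name)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real date.
        return None
```

With this sketch, a page whose first heading is a second-level heading still gets that heading as its title, and a path such as /gemlog/2022-07-21-fts.gmi yields the date 2022-07-21.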
Outdated Features
- AND, OR, NOT, parentheses grouping, and quotes
- Filters: "TITLE", "URL", "ALBUM", "ARTIST", "ALBUMARTIST", "COPYRIGHT", "CONTENTTYPE", "LANGUAGE", and "PUBLISHDATE". The syntax is "field: term". Field names must be in all capital letters.
- Fuzzy Searching by placing ~ after a search term
- Proximity Searching: if you want to search for two words that are within a distance of 10 words of each other, then query with "term_one term_two"~10
- Range Searching: For searching in ranges of numbers or dates. Can be combined with filters such as PUBLISHDATE. For example, to filter by a publication date range: PUBLISHDATE:[20220101 to 20231201]
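To make the idea behind proximity searching concrete, here is a small Python sketch. It uses a simplified notion of distance (the difference between word positions), which is an assumption for illustration and not necessarily how the underlying full-text engine scores a query like "term_one term_two"~10. The function name within_distance is hypothetical.

```python
import re


def within_distance(text: str, term_one: str, term_two: str, max_distance: int) -> bool:
    """Return True if both terms occur within max_distance words of each other."""
    words = re.findall(r"\w+", text.lower())
    positions_one = [i for i, word in enumerate(words) if word == term_one.lower()]
    positions_two = [i for i, word in enumerate(words) if word == term_two.lower()]
    # Compare every occurrence of the first term with every occurrence of the second.
    return any(
        abs(i - j) <= max_distance
        for i in positions_one
        for j in positions_two
    )
```

For the text "gemini is a small internet protocol", the terms "gemini" and "protocol" are 5 positions apart, so they match with a distance of 10 but not with a distance of 3.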
Features Coming Soon
- PDF and Djvu file metadata indexed
- Image file metadata indexed
- Plain text file full contents indexed
- Backlinks and searching of link text
- Page Metadata Lookup
- Full Markdown, Tinylog, and Twtxt parsing to get links, titles, and heading information.
History
AuraGem was a search engine that I started about 2 years ago under its original name, Ponix Search. It was originally designed as an experiment in how I could make search results better. The official announcement of the search engine happened on 2021-07-01:
Note that some of the information in the above posts has recently been updated to match the current URL and IP address of the crawler and gemini capsule.
One of the first priorities with AuraSearch was to have extraction of file metadata for as many files as possible. Audio files were one of the first to get this feature. PDFs and Djvu files were supposed to be next, and support was added for them on 2022-07-19, but the feature was buggy and never worked, unfortunately. As you can see in the below post, I chose to go with Keyword Extraction (which was later removed and replaced with simple mentions and tags extraction) instead of Full Text Searching on page contents. Part of this was to save space, and part of it was to respect copyright. However, I am rethinking this approach now that the Stats page can determine how large the text-only portion of geminispace is (no more than 5GB total).
In the above article, you can see that I start to play with the notion of different types of searches. I think this idea remains important today:
Another problem that the above process would not catch are names and proper nouns. These are often very important words that people would want to search for (e.g. Mathematics, C++, Celine Dion, FTS). I do not have an easy method for this atm.
The next update on 2022-07-21 added Full Text Searching of link and file metadata, which drastically improved the speed of searches. Yes, this came with stemming because my database's FTS uses Lucene++.
Not long after, I wrote an article about FTS, ranking systems, and some of the problems that search engines have to handle:
The most important portion of this article, however, is recognizing how people do searches:
This also introduces the argument that the ranking systems are really only important for underspecified queries (broad queries), so the emphasis on the problems with ranking algorithms is unwarranted. This argument hardly makes sense when the majority of searches that people make are broad. I would also argue that broad searches are most used for *discovering* pages, not for getting to a specific page. However, ranking based on popularity prioritizes what it thinks people would want, which is more suited for specific searches using broad queries, at the expense of discovery of broad topics. Broad discovery using broad topic queries and specific searches using proper-noun queries or very specific queries are both much better ways of dealing with searches without relying on popularity.
When making a search engine, one must balance the search results between discovery (broadness) and exact matches (exactness). Relevancy applies to both of these, but is more important for discovery. I continue to think that link analysis assumes that people want exact matches of pages while using broad queries. For example, if someone types in "search engine", a PageRank system would put the most popular search engine at the top along with popular articles about search engines, assuming that the person wanted that specific search engine, when it's more likely they wanted a collection of search engines. My approach, instead, is to return broad, relevant, discovery-based results for broad queries, and exact pages for exact queries.
Exact queries include words from titles, domain names, capsule names, and service names: mostly proper nouns, or a specific combination of words that matches the page information. Broad queries, however, use category names and common nouns.
When I type "Station", I want an exact match for Station itself. However, when I type "social network", I want search results that give a very broad set of capsules that are social networks. I believe that this is how most people would use search engines, especially if they do not rely much on filtering, and this is the exact methodology that I use for my article analyzing gemini's search engines: