2011-07-06 Regular Expression To Validate Id Attributes
I noticed that my blog no longer validated as XHTML 1.0 and started investigating. On the Diary page, you can click on the comment links of the various blog posts (such as *Comments on 2011-07-05 Google Plus*) and you’ll get the comments *inlined*. This uses a tiny piece of JavaScript (and some CSS):
function togglecomments(id) {
  var elem = document.getElementById(id);
  if (elem.className == "commentshown") {
    elem.className = "commenthidden";
  } else {
    elem.className = "commentshown";
  }
}
Thus, the HTML source already includes the comments in an appropriate div:
…
Links such as *Comments on 2011-07-05 Google Plus* simply call the JavaScript function defined above and pass the *id* of the div to toggle:
Comments on 2011-07-05 Google Plus
That’s why the id attribute is important. The trivial solution is to simply use the blog post title (“2011-07-05 Google Plus”), but soon enough you’ll notice that there are some interesting restrictions on the values of id attributes:
- the value must start with a colon, a letter, or an underscore
- the rest of the value may contain the above as well as dashes, periods, and digits
- spaces, brackets, braces, and parentheses are not allowed
Now—how exactly is this defined? See the definition of Name in the XML spec:
- `NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]`
- `NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]`
- `Name ::= NameStartChar (NameChar)*`
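If the code can work on code points instead of bytes, the three productions above translate almost literally into a regular expression. The following is only a sketch of that idea in JavaScript, using the `u` flag for `\u{...}` escapes; the names `NAME_START_CHAR`, `NAME_CHAR`, `XML_NAME` and `isXmlName` are made up for this example and are not part of Oddmuse.

```javascript
// Sketch: the XML 1.0 Name production written with code point escapes.
// The `u` flag is required so that \u{...} escapes and ranges beyond U+FFFF work.
const NAME_START_CHAR =
  ":A-Z_a-z" +
  "\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}" +
  "\\u{370}-\\u{37D}\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}" +
  "\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}\\u{3001}-\\u{D7FF}" +
  "\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";

// NameChar adds dash, period, digits, U+00B7 and two more ranges.
const NAME_CHAR =
  NAME_START_CHAR + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

// Name ::= NameStartChar (NameChar)*, anchored so the whole string must match.
const XML_NAME = new RegExp(`^[${NAME_START_CHAR}][${NAME_CHAR}]*$`, "u");

function isXmlName(s) {
  return XML_NAME.test(s);
}

console.log(isXmlName("2011-07-05 Google Plus")); // starts with a digit, contains spaces
console.log(isXmlName(":2011-07-05_Google_Plus"));
```

A raw blog post title fails this check because it starts with a digit and contains spaces, which is exactly the validation problem described above.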
Now, those are Unicode code points. But sadly, Oddmuse tries to be encoding *agnostic*, which means the regular expression has to match the encoded bytes rather than code points. (I will have to revisit this decision, soon!)
Here’s a simple beginning of a regular expression that would identify well-formed names if only ASCII were allowed: `/^[:_A-Za-z][-.:_A-Za-z0-9]*$/`
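To see what the ASCII-only expression accepts and rejects, here is a small sketch in JavaScript. It anchors the pattern at both ends with `^` and `$` so that the whole string has to match; the constant name `ASCII_NAME` and the sample strings are just illustrations.

```javascript
// ASCII-only approximation of an XML Name, anchored at both ends.
const ASCII_NAME = /^[:_A-Za-z][-.:_A-Za-z0-9]*$/;

// Sample id candidates (made up for this example).
const candidates = ["2011-07-05 Google Plus", "Google_Plus", ":2011-07-05", "a(b)"];

// Pair each candidate with the result of the check.
const results = candidates.map((s) => [s, ASCII_NAME.test(s)]);
for (const [s, ok] of results) {
  console.log(JSON.stringify(s), ok);
}
```

The raw title is rejected twice over: it starts with a digit and it contains spaces. Prefixing a colon and dropping the spaces would make it pass.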
Now to extend it using the information above:
| Unicode code points | UTF-8 encoding |
| --- | --- |
| [#xC0-#xD6] | c3 80 - c3 96 |
| [#xD8-#xF6] | c3 98 - c3 b6 |
| [#xF8-#x2FF] | c3 b8 - cb bf |
| [#x370-#x37D] | cd b0 - cd bd |
| [#x37F-#x1FFF] | cd bf - e1 bf bf |
| [#x200C-#x200D] | e2 80 8c - e2 80 8d |
| [#x2070-#x218F] | e2 81 b0 - e2 86 8f |
| [#x2C00-#x2FEF] | e2 b0 80 - e2 bf af |
| [#x3001-#xD7FF] | e3 80 81 - ed 9f bf |
| [#xF900-#xFDCF] | ef a4 80 - ef b7 8f |
| [#xFDF0-#xFFFD] | ef b7 b0 - ef bf bd |
| [#x10000-#xEFFFF] | f0 90 80 80 - f3 af bf bf |
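The byte sequences in this table don't have to be worked out by hand. As a sketch, the following JavaScript encodes both endpoints of each NameStartChar range with the standard `TextEncoder` and prints them as hex; the helper name `utf8Hex` is made up for this example.

```javascript
// Encode a single code point as UTF-8 and format the bytes as lowercase hex.
function utf8Hex(codePoint) {
  const bytes = new TextEncoder().encode(String.fromCodePoint(codePoint));
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// The non-ASCII NameStartChar ranges from the XML spec, as [first, last] pairs.
const ranges = [
  [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x2ff], [0x370, 0x37d],
  [0x37f, 0x1fff], [0x200c, 0x200d], [0x2070, 0x218f], [0x2c00, 0x2fef],
  [0x3001, 0xd7ff], [0xf900, 0xfdcf], [0xfdf0, 0xfffd], [0x10000, 0xeffff],
];

// One "first - last" row per range, in the same order as the table.
const rows = ranges.map(([lo, hi]) => `${utf8Hex(lo)} - ${utf8Hex(hi)}`);
rows.forEach((row) => console.log(row));
```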
I started writing the following regular expression:
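# UTF-8 sequences for the NameStartChar ranges up to U+200D (still incomplete).
# Continuation bytes are always in the range \x80-\xbf.
$regexp = "|\xc3[\x80-\x96\x98-\xb6\xb8-\xbf]|[\xc4-\xca][\x80-\xbf]|\xcb[\x80-\xbf]"
  . "|\xcd[\xb0-\xbd\xbf]|[\xce-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]"
  . "|\xe1[\x80-\xbf][\x80-\xbf]|\xe2\x80[\x8c\x8d]"
  if $HttpCharset eq 'UTF-8';
# Group the alternatives so that ^ applies to every branch, not just the first.
$id = ":$id" unless $id =~ /^(?:[:_A-Za-z]$regexp)/;
# Keep only the characters (or byte sequences) that are allowed in a name.
return join('', $id =~ m/([-.:_A-Za-z0-9]$regexp)/g);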
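The same byte-level idea can be sketched in JavaScript for comparison. This is not the Oddmuse code: it assumes the title is first turned into a "byte string" in which every character holds one UTF-8 byte, and it covers only the ranges up to U+200D, like the Perl fragment. All names (`CONT`, `MULTIBYTE`, `toByteString`, `fromByteString`, `sanitizeId`) are hypothetical.

```javascript
// Byte-level sketch: each character of a "byte string" holds one UTF-8 byte.
const CONT = "[\\x80-\\xbf]"; // any UTF-8 continuation byte

// UTF-8 sequences for the NameStartChar ranges U+00C0 through U+200D only.
const MULTIBYTE = [
  "\\xc3[\\x80-\\x96\\x98-\\xb6\\xb8-\\xbf]", // U+00C0-U+00FF minus U+00D7, U+00F7
  "[\\xc4-\\xcb]" + CONT,                      // U+0100-U+02FF
  "\\xcd[\\xb0-\\xbd\\xbf]",                   // U+0370-U+037D and U+037F
  "[\\xce-\\xdf]" + CONT,                      // U+0380-U+07FF
  "\\xe0[\\xa0-\\xbf]" + CONT,                 // U+0800-U+0FFF
  "\\xe1" + CONT + CONT,                       // U+1000-U+1FFF
  "\\xe2\\x80[\\x8c\\x8d]",                    // U+200C-U+200D
].join("|");

// The non-capturing group makes ^ apply to every alternative.
const START = new RegExp(`^(?:[:_A-Za-z]|${MULTIBYTE})`);
const CHARS = new RegExp(`[-.:_A-Za-z0-9]|${MULTIBYTE}`, "g");

// Text to byte string: one character per UTF-8 byte.
function toByteString(text) {
  return Array.from(new TextEncoder().encode(text), (b) =>
    String.fromCharCode(b)
  ).join("");
}

// Byte string back to text.
function fromByteString(bytes) {
  return new TextDecoder().decode(
    Uint8Array.from(bytes, (c) => c.charCodeAt(0))
  );
}

// Prefix a colon if the first character may not start a name,
// then drop everything that is not allowed inside a name.
function sanitizeId(title) {
  let id = toByteString(title);
  if (!START.test(id)) id = ":" + id;
  return fromByteString((id.match(CHARS) || []).join(""));
}

console.log(sanitizeId("2011-07-05 Google Plus"));
console.log(sanitizeId("Zürich (CH)"));
```

Because every alternative consumes a complete UTF-8 sequence and stray continuation bytes never match on their own, the filtered result is still valid UTF-8. Characters from the ranges that are not covered yet are simply dropped.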
Then I got tired and thought, “if anybody reports an error, I’ll add the rest…”
#Web #XML #Oddmuse
Comments
⁂
You do know RegEx match open tags except XHTML self-contained tags?
– Harald 2011-07-12 12:00 UTC
---
Yes, I’ve seen that one before. And do you know Oh Yes You Can Use Regexes to Parse HTML!?
In my case, though, it’s not about parsing HTML but about transforming wiki page titles into id values that I can then use in the generated HTML.
– Alex Schroeder 2011-07-12 12:15 UTC