Does the XML file for import to IE8’s InPrivate filter actually use regex?

September 6, 2009

The InPrivate Filter of IE8 has an install-time default of Off.
However, a registry edit can change its behavior to default to On when
you start IE8. That way, you don’t have to keep remembering to turn it
on when you load IE8.

Always enable InPrivate Filtering:
http://www.pcmag.com/article2/0,2817,2346892,00.asp
http://blogs.msdn.com/dmart/archive/…y-default.aspx

With this filter mode enabled, IE8 watches and records the content
providers that recur across the sites that you visit. Even if a common
content provider is found at the sites that you visit, it isn’t
eligible for blocking unless its occurrences exceed the threshold that
you configure for this filter (the default is 10 sites for that same
content provider). As you decrease this threshold, more content
providers will probably show up in the list (i.e., more become eligible
for blocking). However, that blocking is NOT a reasonable ad-blocker
since it merely looks for common content across multiple sites, not at
what the content is.
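
To make the threshold idea concrete, here is a rough Python sketch of
how I picture the counting. This is only my guess at the logic, not
IE8’s actual code: the function name, the (site, provider) pairs, and
the choice of ">=" for the comparison are all my assumptions.

```python
from collections import defaultdict

def blockable_providers(observations, threshold=10):
    """Return providers seen on at least `threshold` distinct sites.

    `observations` is an iterable of (visited_site, content_provider)
    pairs. Whether IE8 compares with > or >= is an assumption here.
    """
    sites_by_provider = defaultdict(set)
    for site, provider in observations:
        if provider != site:  # only third-party content counts
            sites_by_provider[provider].add(site)
    return {p for p, sites in sites_by_provider.items()
            if len(sites) >= threshold}

# Hypothetical browsing history: one provider on 10 sites, one on 1 site.
observations = [("site%d.example" % i, "ads.example") for i in range(10)]
observations.append(("site0.example", "cdn.example"))

default_result = blockable_providers(observations)
lowered_result = blockable_providers(observations, threshold=1)
```

With the default of 10 only the provider seen on ten sites qualifies;
lowering the threshold makes the other one eligible too, whether or not
either of them is actually serving ads.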

So I found where others had mentioned using the InPrivate Filter (after
configuring it to always start On) as an ad-block filter. One guy
converted the AdBlock list (AdBlock is a plug-in for Firefox) to an XML
file that you can import into IE8 (Internet Options -> Programs ->
Manage Add-ons -> InPrivate Filtering). Alas, Microsoft didn’t let
users directly enter the URL strings on which they want to filter. You
have to create an XML list that you then import. I’m not keen on the
huge list for the AdBlock filter and decided to prune it down while also
adding my prior list of URL strings that I was blocking (in the URL
filter in Avast’s Web Shield, but there is other software, like some
firewalls, that lets you block on URLs). His list is mentioned at:

http://www.dslreports.com/forum/r221…lock-plus-list

I started to wonder about the syntax of his CDATA strings. You can use
anything you want for the description but the URL string should
supposedly follow some regex syntax. Well, Microsoft doesn’t explain
much other than what I found at:

http://msdn.microsoft.com/en-us/libr…20(VS.85).aspx

I can understand why you need to escape the period character (if you’re
actually testing for a period character at that position rather than for
any 1 character at that position) but I don’t see why the forward
slashes in the path have to be escaped. So I have to wonder how valid
Microsoft’s implementation of regex is.

In regex, ".*\.adbrite\.com.*" would match on zero or more of any
characters followed by a period followed by "adbrite" followed by a
period followed by "com" and followed by zero or more of any characters.
You need to escape the period character to use it as that character
rather than its regex use of "1 of any character". Since the function
seems to look for substrings (I’m not sure of this), the .* at the
beginning and end of the URL string are probably not needed, so I could
use "\.adbrite\.com" to find that substring anywhere in the URL string.
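
Here is a small Python check of that difference. It uses Python’s re
module, which is NOT necessarily how IE8 evaluates these strings, and
the test URL is made up. re.search does a substring search and
re.fullmatch requires the pattern to span the whole string.

```python
import re

url = "http://ads.adbrite.com/banner?id=1"  # made-up URL for testing

wrapped = r".*\.adbrite\.com.*"  # with leading and trailing .*
bare = r"\.adbrite\.com"         # substring only

# Substring search: the surrounding .* makes no difference.
search_wrapped = re.search(wrapped, url) is not None
search_bare = re.search(bare, url) is not None

# Whole-string match: only the wrapped form can span the full URL.
full_wrapped = re.fullmatch(wrapped, url) is not None
full_bare = re.fullmatch(bare, url) is not None
```

So if IE8 does a substring search, the .* on each end is just noise. If
it requires the pattern to cover the whole URL, the .* is what makes
the rule work at all.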

You need to escape the backslash character, as in "\\" for one
backslash character, but backslashes are *not* used in URLs, and I
don’t see why you have to escape forward slash characters.
"http://www.intel.com/index.htm" has no backslash characters that need
to be escaped, and forward slashes don’t need to be escaped in regex.
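
For what it’s worth, Python’s re module (again, not necessarily what
IE8 uses) treats an escaped forward slash the same as a plain one, so
the extra backslash looks harmless but unnecessary there:

```python
import re

url = "http://www.intel.com/index.htm"

plain = re.compile(r"intel\.com/index")     # forward slash left alone
escaped = re.compile(r"intel\.com\/index")  # forward slash escaped

plain_hit = plain.search(url) is not None
escaped_hit = escaped.search(url) is not None

# "\\" in a regex means one literal backslash; the URL has none.
backslash_hit = re.search(r"\\", url) is not None
```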

I have to wonder if the author of the rules.xml file (converted from
AdBlock’s list) used legitimate regex syntax since none of the period
characters are escaped in the CDATA strings in his entries. They
probably work well enough since a period character at that position
qualifies as any 1 character at that position; however, "adbrite.com"
would also match on "adbritexcom".
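
A quick way to see the false match, sketched in Python (the host names
are made up, and re.escape is Python’s helper for escaping a literal
string, nothing to do with IE8):

```python
import re

loose = "adbrite.com"              # period means "any 1 character"
strict = re.escape("adbrite.com")  # period escaped: adbrite\.com

real = "http://www.adbrite.com/ads.js"
lookalike = "http://adbritexcom.example/"

loose_real = re.search(loose, real) is not None
loose_lookalike = re.search(loose, lookalike) is not None
strict_real = re.search(strict, real) is not None
strict_lookalike = re.search(strict, lookalike) is not None
```

The loose form catches both URLs; the escaped form catches only the
real one.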

Since Microsoft hasn’t been keen on embracing regular expressions, it is
likely that they don’t follow the PCRE syntax. Perhaps the forward
slashes do need to be escaped by backslashes but that’s not true in PCRE
(Perl Compatible Regular Expressions). By Microsoft’s article,
anony101’s converted AdBlock list might happen to work but its syntax is
invalid. It looks like anony101 used the old DOS wildcarding syntax
rather than a valid regex syntax.
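
If the entries really are DOS-style wildcards, converting them to regex
is mechanical. Below is a minimal Python sketch; the function name and
the sample pattern are mine (hypothetical), not taken from anony101’s
file, and it only handles "*" and "?".

```python
import re

def wildcard_to_regex(pattern):
    """Translate a DOS-style wildcard pattern (* and ?) to a regex string."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any run of characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal, escaped if special
    return "".join(parts)

converted = wildcard_to_regex("*.adbrite.com*")  # hypothetical entry
hit = re.search(converted, "http://ads.adbrite.com/x.js") is not None
miss = re.search(converted, "http://adbritexcom.example/") is not None
```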

Even Microsoft’s example of:

<wf:blockRegex> <![CDATA[ads.contoso.com\/.*]]> </wf:blockRegex>

is not a correct regular expression for that host unless its author
actually intended to match on ANY character where the first 2 period
characters show up. The above regex would also match on
"adsXcontosoYcom/" followed by any characters. And why is a backslash
shown in front of the forward slash? In regex, "\/" is just an escaped
forward slash, so it matches "/" and not a backslash, which would be
invalid in a URL anyway. The delimiter is the forward slash. You do
not go to http:\\www.intel.com\index.html. You go to
http://www.intel.com/index.html. There’s something goofy in how
Microsoft writes the syntax for URLs.

When I’ve looked at some other XML RSS feed files, they’ll specify the
CDATA as something like:

a.*\.contoso\.com.*

which is "a" followed by zero or more characters, a period, "contoso", a
period, "com", and followed by zero or more characters. The periods in
the URL are properly escaped. I’m not sure the .* is needed at the end,
or at the beginning as used in some regex strings that I’ve seen; that
is, what’s the difference between ".*host\.domain\.tld.*" and
"host\.domain\.tld"? Is a substring search performed to find it
anywhere in the URL? Or is there an assumption that the regex string is
anchored after the URL scheme (as if http:// or https:// were implied,
since I did read this filtering only works on those URL schemes)? In
regex, if I were to anchor the left side of the string, I’d use
"^a.*\.contoso\.com.*" if this string follows immediately after the
http:// protocol prefix and must span the entire searched string rather
than look for a substring (which is why the trailing .* would be
needed).
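
To show what I mean by anchoring after the scheme, here is a Python
sketch. Stripping http:// or https:// before matching is purely my
assumption about what IE8 might do, the helper name is made up, and the
URLs are invented for the test.

```python
import re

def strip_scheme(url):
    """Drop a leading http:// or https:// (assumed, not documented)."""
    return re.sub(r"^https?://", "", url)

anchored = r"^a.*\.contoso\.com.*"
anchored_no_tail = r"^a.*\.contoso\.com"
bare = r"a.*\.contoso\.com"

ad_url = strip_scheme("http://ads.contoso.com/banner.gif")
other_url = strip_scheme("http://www.example.com/a/x.contoso.com")

# Left anchor: only hosts that begin with "a" right after the scheme.
anchored_ad = re.search(anchored, ad_url) is not None
anchored_other = re.search(anchored, other_url) is not None

# No anchor: the pattern can be found anywhere, even in the path.
bare_other = re.search(bare, other_url) is not None

# The trailing .* only matters if the whole string must be spanned.
full_with_tail = re.fullmatch(anchored, ad_url) is not None
full_no_tail = re.fullmatch(anchored_no_tail, ad_url) is not None
```

Which of these behaviors IE8 actually has is exactly what I can’t tell
from the documentation.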

Is there better documentation on the XML RSS feed file (used for
subscriptions for the InPrivate Filter)? I’d like to know that what I’m
specifying to search on is what IE8 actually uses. I don’t see a
problem in the XML used in the RSS feed file that anony101 came up with
but it looks like he didn’t employ proper regex syntax for the CDATA
values (which are the URL substrings on which to block).