[aseek-devel] patch parse.cpp

From: Jens Thoms Toerring (no email)
Date: Tue Sep 02 2003 - 07:31:11 EDT


  I now (hopefully finally) found the reason why I came up with
the patch for parse.cpp, which Matt told me shouldn't be necessary.
And when I tried indexing without the patch it seemed to work -
until I now again found the server where it doesn't work...

  The problem is that both in CUrl::HTTPGetUrlAndStore() and
in ParseHtml() (and perhaps also in other places) the function
CUrl::ParseURL() is invoked on the URL in order to decide if the
URL is to be indexed. To do so it splits the string with the URL
into two parts at the first ':' in the string, and the first part
is treated as the protocol, the second as the address. This obviously
works well with URLs like "http://www.xxx.yyy.com/bla/index.html". But
it fails for example when you have a link in an HTML page like

<a href="/de:w/index.html">

because the URL is now split into "/de" and "w/index.html", which
of course doesn't make too much sense and results in an "Unsupported
protocol" error for the URL. A solution seems to be to check that
the second part really starts with two slashes before accepting
that the first part contains a protocol name.
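
  To make the idea a bit more concrete, here is a small stand-alone
sketch of the check described above. It is not the ASPseek code: the
function has_scheme() and its parameters are made up for illustration
only, and it uses std::string instead of splitstr(). It splits at the
first ':' and accepts the first part as a protocol only if the rest
starts with "//".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of ASPseek: returns true only if `url`
// looks like "<scheme>://...". The split happens at the first ':'.
bool has_scheme(const std::string& url, std::string& scheme, std::string& rest)
{
    const std::size_t colon = url.find(':');
    if (colon == std::string::npos || colon == 0)
        return false;               // no ':' at all, or nothing before it

    // Accept the first part as a protocol only if "//" follows the colon.
    if (url.compare(colon + 1, 2, "//") != 0)
        return false;               // e.g. "/de:w/index.html" ends up here

    scheme = url.substr(0, colon);  // "http"
    rest = url.substr(colon + 1);   // "//www.xxx.yyy.com/bla/index.html"
    return true;
}
```

With this, "http://www.xxx.yyy.com/bla/index.html" is accepted with "http"
as the protocol, while "/de:w/index.html" is treated as a URL without a
protocol. Note that such a strict check would also refuse forms like
"mailto:..." that have no "//" after the colon.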

                                          Regards, Jens

 Freie Universitaet Berlin     Jens Thoms Toerring
 Webteam                       Tel: 0049 30 838 56055
 Garystrasse 39                Fax: 0049 30 838 53738
 14195 Berlin                  e-mail: 
--- aspseek-orig/src/parse.cpp  2003-08-27 13:06:46.000000000 +0200
+++ aspseek-my/src/parse.cpp   2003-09-02 13:19:00.000000000 +0200
@@ -274,7 +317,8 @@
    m_path = new char[len]; m_path[0] = 0;
    m_filename = new char[len]; m_filename[0] = 0;
-   if (splitstr(s, m_schema, m_specific, ':', 0) != 2)
+   if (splitstr(s, m_schema, m_specific, ':', 0) != 2 ||
+       ( m_specific[ 0 ] != '/' || m_specific[ 1 ] != '/' ) )
        if (base)
