From: Jens Thoms Toerring (no email)
Date: Tue Sep 02 2003 - 07:31:11 EDT
Hi,
I now (hopefully finally) found the reason why I came up with
the patch for parse.cpp, which Matt told me shouldn't be necessary.
And when I tried indexing without the patch it seemed to work -
until I now again found the server where it doesn't work...
The problem is that in both in CUrl::HTTPGetUrlAndStore() and
in ParseHtml() (and perhaps also in other places) the function
CUrl::ParseHtml() is invoked on the URL in order to decide if the
URL is to be indexed. To do so it splits the string with the URL
into two parts at the first ':' in the string, and the first part
is treated as the protocol and the address. This works obviously
well with URLs like "http://www.xxx.yyy.com/bla/index.html". But
it fails for example when you have a link in an HTML page like
<a href="/de:w/index.html>
because the URL is now split into "/de" and "w/index.html", which
of course doesn't make too much sense and results in an "Unsupported
protocol" error for the URL. A solution seems to be to check that
the second part really starts with two slashes before accepting
that the first part to contain a protocol name.
Regards, Jens
--
Freie Universitaet Berlin Jens Thoms Toerring
Universitaetsbibliothek
Webteam Tel: 0049 30 838 56055
Garystrasse 39 Fax: 0049 30 838 53738
14195 Berlin e-mail:
--- aspseek-orig/src/parse.cpp 2003-08-27 13:06:46.000000000 +0200
+++ aspseek-my/src/parse.cpp 2003-09-02 13:19:00.000000000 +0200
@@ -274,7 +317,8 @@
m_path = new char[len]; m_path[0] = 0;
m_filename = new char[len]; m_filename[0] = 0;
- if (splitstr(s, m_schema, m_specific, ':', 0) != 2)
+ if (splitstr(s, m_schema, m_specific, ':', 0) != 2 ||
+ ( m_specific[ 0 ] != '/' && m_specific[ 1 ] == '/' ) )
{
if (base)
{
|
|
|