From: Jens Thoms Toerring (no email)
Date: Thu Sep 04 2003 - 09:12:49 EDT
Hi,
I found a problem with the handling of robots.txt files in the
FindRobots() function in parse.cpp. Unless I completely misunderstand
the standard for robot exclusion, an entry in robots.txt
for http://www.foo.bar like
Disallow: /foo.html
should forbid robots from indexing the URL
http://www.foo.bar/foo.html
but *not*
http://www.foo.bar/xxx/yyy/foo.html
Unfortunately, the second URL is currently excluded as well, because
FindRobots() uses strstr() and so matches a robots.txt entry anywhere in
the path instead of only at its start. Below is a patch to rectify the
problem. (I also added some debug output for the case where access is
denied due to robots.txt.)
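To illustrate the difference between the two checks, here is a minimal
standalone sketch. It is not ASPseek code: the helper names DeniedAnywhere()
and DeniedByPrefix() are made up for this example, and the rules are assumed
to be plain path prefixes taken from Disallow lines.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper mirroring the old check: a rule matches if it
// occurs anywhere inside the path (strstr).
static bool DeniedAnywhere(const std::string& path,
                           const std::vector<std::string>& rules)
{
    for (std::vector<std::string>::const_iterator r = rules.begin();
         r != rules.end(); ++r)
    {
        if (std::strstr(path.c_str(), r->c_str()))
            return true;
    }
    return false;
}

// Hypothetical helper mirroring the patched check: a rule matches only
// if the path starts with it (strncmp over the rule's length).
static bool DeniedByPrefix(const std::string& path,
                           const std::vector<std::string>& rules)
{
    for (std::vector<std::string>::const_iterator r = rules.begin();
         r != rules.end(); ++r)
    {
        if (!std::strncmp(path.c_str(), r->c_str(), r->length()))
            return true;
    }
    return false;
}
```

With the single rule "/foo.html", both helpers deny "/foo.html", but only
DeniedAnywhere() also denies "/xxx/yyy/foo.html".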
Regards, Jens
--
Freie Universitaet Berlin Jens Thoms Toerring
Universitaetsbibliothek
Webteam Tel: 0049 30 838 56055
Garystrasse 39 Fax: 0049 30 838 53738
14195 Berlin e-mail:
--- aspseek-orig/src/parse.cpp 2003-08-27 13:06:46.000000000 +0200
+++ aspseek-my/src/parse.cpp 2003-09-04 15:03:21.000000000 +0200
@@ -96,8 +96,10 @@
     sprintf(fpath, "%s%s", path, name);
     for (CStringVector::iterator s = v.begin(); s != v.end(); s++)
     {
-        if (strstr(fpath, s->c_str()))
+        if ( ! strncmp( fpath, s->c_str( ),s->length( ) ) )
         {
+            logger.log( CAT_ALL, L_DEBUG, "Denying %s in %s (because of %s)\n",
+                        fpath, host, s->c_str( ) );
             return 1;
         }
     }