[aseek-devel] Another parse.cpp patch

From: Jens Thoms Toerring (no email)
Date: Thu Sep 04 2003 - 09:12:49 EDT


Hi,

  I found a problem with the handling of robots.txt files in the
FindRobots() function in parse.cpp. If I don't completely misunderstand
the standard for robot exclusion, an entry in the robots.txt
for http://www.foo.bar like

Disallow: /foo.html

should forbid robots from indexing the URL

http://www.foo.bar/foo.html

but *not*

http://www.foo.bar/xxx/yyy/foo.html

Unfortunately, the latter URL also gets blocked at the moment, because
in FindRobots() the path is compared against the robots.txt entries
with strstr(), which matches an entry anywhere in the path and not
just at its start. Below is a patch to rectify the problem. (I also
added a bit of debug output for the case where access is denied due
to robots.txt.)
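
To illustrate the difference outside of FindRobots(), here is a minimal
standalone sketch (blocked_by_rule() is made up for this example and is
not part of parse.cpp): strstr() matches a Disallow entry anywhere in
the path, while strncmp() matches it only at the start, which is what
the exclusion standard asks for.

#include <cstdio>
#include <cstring>

/* Hypothetical helper: returns 1 if 'path' is blocked by the
   Disallow entry 'rule', using prefix semantics. */
static int blocked_by_rule(const char *path, const char *rule)
{
	return strncmp(path, rule, strlen(rule)) == 0;
}

int main()
{
	const char *rule = "/foo.html";

	/* Prints 1: the entry matches at the start of the path. */
	printf("%d\n", blocked_by_rule("/foo.html", rule));

	/* Prints 0: with strstr() this path would wrongly be blocked too. */
	printf("%d\n", blocked_by_rule("/xxx/yyy/foo.html", rule));
	return 0;
}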
                                 Regards, Jens

-- 
 Freie Universitaet Berlin     Jens Thoms Toerring
 Universitaetsbibliothek
 Webteam                       Tel: 0049 30 838 56055
 Garystrasse 39                Fax: 0049 30 838 53738
 14195 Berlin                  e-mail: 
--- aspseek-orig/src/parse.cpp	2003-08-27 13:06:46.000000000 +0200
+++ aspseek-my/src/parse.cpp	2003-09-04 15:03:21.000000000 +0200
@@ -96,8 +96,10 @@
 	sprintf(fpath, "%s%s", path, name);
 	for (CStringVector::iterator s = v.begin(); s != v.end(); s++)
 	{
-		if (strstr(fpath, s->c_str()))
+		if (!strncmp(fpath, s->c_str(), s->length()))
 		{
+			logger.log(CAT_ALL, L_DEBUG, "Denying %s in %s (because of %s)\n",
+					fpath, host, s->c_str());
 			return 1;
 		}
 	}






