Re: [aseek-users] Why aren't I indexing .pdf files?

From: Kir Kolyshkin (no email)
Date: Tue Jan 14 2003 - 11:50:09 EST


KEVIN ZEMBOWER wrote:
> I'm trying to get my first installation of aspseek working. It seems to index HTML documents fine, but now I'm trying to expand into .pdf documents.
>
> My aspseek.conf file looks like this:
> aspseek@www:~$ grep -v '^[[:space:]]*$' etc/aspseek.conf |grep -v "^#"
> Include db.conf
> Include ucharset.conf
> Include stopwords.conf
> Converter application/pdf text/html /usr/local/bin/pdftohtml -i -noframes -stdout $in > $out
> DeleteNoServer no
> Server http://www.jhuccp.org/
> DeltaBufferSize 64
> Disallow /cgi-bin/ \.cgi /nph
> Disallow \.tif$ \.au$ \.mov$ \.jpe$ \.cur$ \.qt$
> Disallow \.b$ \.sh$ \.md5$ \.rpm$
> Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
> Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
> Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$ \.xpm$ \.xbm$
> Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
> Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
> Disallow \.vrml$ \.wrl$ \.png$
> Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
> Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
> Disallow \.rtf$ \.cdf$ \.ps$
> Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
> Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
> Disallow \.o$ \.a$ \.la$ \.so$ \.so\.[0-9]$
> Disallow \.pat$ \.pm$ \.m4$ \.am$
> Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
> Disallow [^:]//
> Disallow mmc/.*\.php
> Disallow PHPTEST
> aspseek@www:~$
>
> I've got links to .pdf files in my .shtml files which seem to be indexed fine:
> aspseek@www:~$ find /var/www/main/htdocs/ -iname "*.*htm*" -o -iname "*.stm"|xargs fgrep .pdf |head
> //var/www/main/htdocs/popreporter/2002/08-19.shtml: | PDF</p>

The first thing I notice is that the document is named J52.pdf here, while
it is available as j52.pdf from your server. Notice the case!
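
To illustrate why the case matters: URL paths are compared case-sensitively, so
a link to J52.pdf and a file served as j52.pdf are two different URLs as far as
a crawler is concerned. Below is a minimal Python sketch for spotting such
case-only mismatches. The helper name differs_only_by_case is made up for this
example and is not part of ASPseek.

```python
def differs_only_by_case(linked: str, served: str) -> bool:
    """Return True when the names differ as written but match once case is ignored."""
    return linked != served and linked.casefold() == served.casefold()


# The two file names mentioned above.
print(differs_only_by_case("J52.pdf", "j52.pdf"))  # True: same name, different case
print(differs_only_by_case("j52.pdf", "j52.pdf"))  # False: identical names
```

Running a check like this over the link targets and the names actually present
on the server would flag every link that is broken only because of letter case.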

> <snip>
>
> There are 14 rows in the urlword table which end in '.pdf':
> mysql> select * from urlword where url like '%pdf';
> +--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> | url_id | site_id | deleted | url | next_index_time | status | crc | last_modified | etag | last_index_time | referrer | tag | hops | redir | origin |
> +--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> | 5244 | 1 | 0 | http://www.jhuccp.org/pr/j52/j52.pdf | 1043164839 | 200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" | 1042496187 | 2794 | 0 | 5 | 0 | 0 |
> <snip>
> 14 rows in set (0.06 sec)
>
> The "200" in the status column indicates that it was found.
>
> For this first .pdf document, I computed the urlwords table name as 'urlwords12' (5244 mod 16)

That is the right answer, although ASPseek uses 'urlid & 15', which gives
the same result but is much more efficient ;)
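
For anyone who wants to convince themselves that the two computations agree,
here is a small Python sketch. The function names are invented for this
example, and it only illustrates the arithmetic for non-negative IDs; it is not
ASPseek's own code.

```python
def table_by_mod(url_id: int) -> int:
    """Table number as computed in the question: remainder after dividing by 16."""
    return url_id % 16


def table_by_mask(url_id: int) -> int:
    """Table number via a bit mask: keep only the lowest four bits."""
    return url_id & 15


# The url_id from the row quoted above.
print(table_by_mod(5244), table_by_mask(5244))  # 12 12

# Because 16 is a power of two, the two agree for every non-negative ID.
assert all(table_by_mod(n) == table_by_mask(n) for n in range(100000))
```

The mask works because 15 is 1111 in binary, so 'urlid & 15' keeps exactly the
four low bits that form the remainder modulo 16.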

> , but there's no entry in that table for this url_id:
> mysql> select * from urlwords12 where url_id="5244";
> Empty set (0.00 sec)
>
> This leads me to believe that .pdf documents are being checked, but not indexed.
>
> When I run this document, http://www.jhuccp.org/pr/j52/j52.pdf, through pdftohtml, I get HTML output, so pdftohtml seems to be working okay.
>
> Can anyone suggest any other diagnostics that could help me solve this problem? Any thoughts or comments?
>
> Thank you all in advance for your help.

Hmm...

Try index -T http://www.jhuccp.org/pr/j52/j52.pdf and see what happens.
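
One more thing that may be worth a look: the crc shown for url_id 5244 in the
row quoted above is d41d8cd98f00b204e9800998ecf8427e, which is the MD5 digest
of zero bytes of input. Assuming that column holds an MD5 of the document
content (a guess based on its 32-hex-digit form, not something confirmed in
this thread), it would hint that the indexer ended up with an empty document
after conversion. The digest itself is easy to verify with a short Python
sketch:

```python
import hashlib

# crc value copied from the urlword row for url_id 5244.
CRC_FROM_ROW = "d41d8cd98f00b204e9800998ecf8427e"

# MD5 digest of an empty byte string.
empty_digest = hashlib.md5(b"").hexdigest()

print(empty_digest == CRC_FROM_ROW)  # True
```

If the assumption holds, it would be worth checking what the Converter line
actually writes to its output when it is run by the indexer.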

-- 
== kir_at_asplinux.ru == 7551596_at_ICQ == 6722750_at_sms.beemail.ru ==
Dream like you'll live forever...Love like you've never been hurt...
Work like you don't need the money...and Dance like nobody is watching!
        -- Satchel Paige






