Re: [aseek-users] Time frame thoughts

From: Karen Barnes (no email)
Date: Tue Oct 29 2002 - 18:53:54 EST

Hi Mike,

> >That will tell you how to stop a running index. From another shell:
> >../index -E
>That's what I wasn't sure about, from another shell. Now I know, thanks.
> >That will safely terminate an already running index. This will NOT
> >update your search engine with the newly indexed sites. To do that you have
> >to do this:
> >
> >index -D
>What comes to mind is my wondering if I should kill the process now and
>then, run index -D, then restart the indexing again? Do you think that would
>have any benefit over running a huge process all the way through?

I don't know if it has any benefits, and it probably doesn't, but at least the
indexing you have already done will be available for searching without having
to wait for the whole run to fully complete.
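Putting those pieces together, one stop-and-commit cycle would look something
like this (a sketch only; it assumes you run from ASPseek's sbin directory and
that -N 80 matches your usual thread count):

```shell
./index -E     # safely stop the running indexer
./index -D     # merge the fetched data into the searchable database
./index -N 80  # restart the crawl where it left off
```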

You'll notice I index 1.5 million URLs a day, so I'm a little concerned that
you've been running the index for so long. I wonder if your settings in
aspseek.conf are correct. Are you leaving enough time between re-indexing
runs with the "Period" command? For example, if you have it set like this:

Period 14d

then you have set a reindex every 14 days, and if you run the indexer for 14
days non-stop, the process will start all over again and never finish.
When I run an initial crawl I set this to a very large number like this:

Period 1y

That prevents the indexer from reindexing already fetched URLs for one year.
Once my indexing is complete I might do something like this:

Period 1m14d

one month and 14 days, and run the indexer based on the URLs that are already
in the index.

This allows me to add new URLs very quickly, have them ready for searching,
and let the re-indexing commence later.

I don't know how you are indexing, but in my case I don't want to index the
entire Web or follow every URL found on each page I index. I already had 3
million URLs that I wanted to index, so I created 15 different files, each
containing 200,000 URLs, and inserted one file every 4 hours. Example:

./index -i -f ./urls.txt
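For splitting a big list into fixed-size batch files, the standard split
utility works; a minimal sketch (the file names and the small demo list are my
own, not part of the original workflow):

```shell
# Stand-in URL list so the commands are runnable as-is; in the real
# workflow urls.txt would hold the full 3 million URLs.
printf 'http://example.com/page%d\n' 1 2 3 4 5 6 7 8 9 10 > urls.txt

# Split into fixed-size batches: 200000 lines per file in the real run,
# 4 per file here for the demo. Produces urls.batch.aa, urls.batch.ab, ...
split -l 4 urls.txt urls.batch.

ls urls.batch.*
```

Each resulting file can then be fed to the indexer in turn, e.g.
./index -i -f urls.batch.aa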

Then, to make sure the indexer doesn't hop to links found on those pages, I
change this in aspseek.conf:

Maxhops 0

Then I run the indexer:

./index -N 80 -R 64

and within about 3-4 hours the indexer will have indexed the 200,000 URLs, and
those will be ready for searching. Of course, doing things this way prevents
the indexer from creating the so-called "popularity" ranks, which are
calculated when it finds a link from site A to site B during the indexing
process. It would be nice if index had the ability to index a single URL and
then see how many documents in the index link to that page. That way you could
have the best of both worlds.
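The whole batch routine described above could be scripted; a rough sketch,
assuming batch files named urls.batch.* and the index binary in the current
directory:

```shell
# Insert one batch, crawl it, commit it, then wait before the next one.
# Maxhops 0 in aspseek.conf keeps the crawl limited to these URLs.
for f in urls.batch.*; do
    ./index -i -f "$f"   # load this batch of URLs into the database
    ./index -N 80 -R 64  # crawl the newly inserted URLs
    ./index -D           # make the batch searchable right away
    sleep 14400          # 4 hours between batches
done
```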

