Welcome to the IBM OmniFind Yahoo! Edition Forum
July 23, 2008, 08:50:48 PM *
Topic: Can I omit specific page content from the web crawler?
jschlatter
Newbie
*
Posts: 5


« on: December 07, 2007, 11:21:54 AM »

I currently have a section on my pages that pulls a random set of 'showcase' listings from my directory. Because this part of the page is a random pull, the web crawler treats the page as new content and re-crawls it each day. The actual 'content' does not change, and I have seen my indexes grow very quickly on the server because of this.

Is it possible for me to set certain meta-tags within a page that would be ignored? This would help me on two levels.
1. The index would not grow due to content that is not truly relevant.
2. The search results would not take the data in that meta-tag into consideration when returning results. This would greatly increase the accuracy of the results that I am currently getting.

I realize that this may add in too much complexity for the scope in which this product is meant to fit, but I felt I should ask anyway.

Any ideas on how I might address this?

jschlatter
Newbie
*
Posts: 5


« Reply #1 on: December 14, 2007, 05:10:02 PM »

With the lack of response, can I assume that this must have been a silly question? If someone would at least acknowledge that I am missing the boat it would be appreciated...
cheikkila
Newbie
*
Posts: 1


« Reply #2 on: April 30, 2008, 01:44:18 PM »

You're not off-base - I have the same question, but for a different reason. 

I want to exclude the navigation portion of my webpages from search results. For example, if I search for 'forms' I get every page on the site in the search results, because that word is in the site navigation on every single page. I still want the navigation to be crawled at least once, of course, but I don't want every page returned. I have used htdig until now, and it uses HTML comments to bracket sections of a page that shouldn't be indexed:
Code:
<!--htdig_noindex-->
Stuff that doesn't get indexed goes here
<!--/htdig_noindex-->
This works perfectly. 

Can OmniFind support this?
Thanks,
Christina
Kevin
Newbie
*
Posts: 1


« Reply #3 on: May 06, 2008, 07:05:13 PM »

I have the same need as Christina. Does OmniFind support the no index comment or something similar?

<!--BeginNoIndex-->
Anything not to be indexed
<!--EndNoIndex-->


Thanks,
Kevin
rberi
Newbie
*
Posts: 5


« Reply #4 on: May 07, 2008, 12:45:27 PM »

I have similar problems:

1. I have a dynamic page whose content mostly stays the same except for a timestamp. Because of this, the search crawler keeps adding the page to the search index, and my index has passed 300,000 documents ;-)

2. I cannot find a way to exclude header and footer content from indexing.

Can anyone on this forum offer some input?

=Rajesh=
thecoolone
Full Member
***
Posts: 104



« Reply #5 on: Today at 02:49:53 PM »

There have been a lot of requests for this feature in OYE. Unfortunately, the current version of OYE doesn't have any custom <noindex> tag to keep part of a page out of the index.

But there is a workaround: put the dynamic content in a PHP include and set the OYE crawler's user-agent to something specific. Then, in PHP, check the request's "user-agent" and skip displaying the content when it matches the crawler's "user-agent".
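
To make the idea concrete, here is a rough sketch of that user-agent check. It is written in Python only to show the logic; in PHP the same test would read the header from $_SERVER['HTTP_USER_AGENT']. The user-agent string and the function names are made up for illustration and are not part of OYE.

```python
# Hypothetical user-agent string: whatever distinctive name you give the
# OYE crawler in its settings goes here. This value is an assumption.
CRAWLER_USER_AGENT = "MySiteOmniFindCrawler"


def is_crawler(user_agent):
    """Return True when the request's User-Agent header names our crawler."""
    return CRAWLER_USER_AGENT.lower() in (user_agent or "").lower()


def render_page(user_agent, main_content, showcase_html):
    """Build the page body, leaving the random showcase block out for the crawler."""
    parts = [main_content]
    if not is_crawler(user_agent):
        # Regular visitors still see the dynamic block.
        parts.append(showcase_html)
    return "\n".join(parts)


if __name__ == "__main__":
    page = "<h1>Directory</h1>"
    showcase = "<div>Random showcase listings</div>"
    print(render_page("Mozilla/5.0", page, showcase))
    print(render_page("MySiteOmniFindCrawler/1.0", page, showcase))
```

The crawler then sees the same stable content on every visit, so the random block neither triggers a re-crawl nor ends up in the index, while visitors get the full page.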

Simple SSI includes cannot be hidden from the crawler this way, as Apache pre-processes the SSI request internally.

A similar thread can be found here http://omnifind.ibm.yahoo.net/forums/index.php/topic,514.msg3143.html#msg3143
IBM OmniFind Yahoo! Edition Forum | Powered by SMF 1.1.2.
© 2005, Simple Machines LLC. All Rights Reserved.