web crawler - Nutch regex-urlfilter syntax

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am running Nutch v. 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax correct for the file NUTCH_ROOT/conf/regex-urlfilter.txt.

The site I want to crawl has a URL similar to this:

http://www.example.com/foo.cfm

On that page there are numerous links that match the following pattern:

http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976

I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following:

+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$

Nutch matches the first pattern and crawls that page correctly, but it does not seem to pick up links that should match the second filter. How can I get Nutch to crawl URLs like the second one above?

I have tried the following with no luck:

+^http://www.example.com/foo.cfm/(.+)*$
+^http://www.example.com/foo.cfm/(.)*$
+^http://www.example.com/foo.cfm/.+$
+^http://www.example.com/foo.cfm/(.*)*$
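
As a sanity check outside Nutch, the four rules can be tested against the target link in isolation. This is only a sketch: it uses Python's re module as a stand-in for the Java regex engine Nutch runs on, and it assumes a rule is applied as an unanchored search after the leading + is dropped. The helper name rule_matches is hypothetical.

```python
import re

# Target link from the question and the four rules that were tried.
target = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976"
tried_rules = [
    r"+^http://www.example.com/foo.cfm/(.+)*$",
    r"+^http://www.example.com/foo.cfm/(.)*$",
    r"+^http://www.example.com/foo.cfm/.+$",
    r"+^http://www.example.com/foo.cfm/(.*)*$",
]


def rule_matches(rule: str, url: str) -> bool:
    """Drop the leading +/- sign and search for the pattern in the URL."""
    return re.search(rule[1:], url) is not None


# Map each rule to whether its pattern matches the target link.
results = {rule: rule_matches(rule, target) for rule in tried_rules}
for rule, matched in results.items():
    print(matched, rule)
```

If every rule matches here, the regex syntax itself is probably not what keeps the links out of the crawl.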

In my NUTCH_ROOT/urls/nutch I have:

http://www.example.com/foo.cfm/
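
One thing worth ruling out is rule order. Nutch reads regex-urlfilter.txt top to bottom, the first matching rule decides, and a URL that matches nothing is rejected. The sketch below simulates that in Python under one loud assumption: that the stock rule -[?*!@=] from the default file is still present above the custom rules. The question does not show the full file, so this may not apply. The names parse_rules, accepts, STOCK_FIRST and CUSTOM_FIRST are hypothetical.

```python
import re


def parse_rules(text):
    """Turn filter-file text into a list of (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# ASSUMPTION: the stock "-[?*!@=]" rule sits above the custom rules.
STOCK_FIRST = """
# skip URLs containing certain characters as probable queries
-[?*!@=]
+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$
"""

# Same rules with the custom ones moved to the top.
CUSTOM_FIRST = """
+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$
-[?*!@=]
"""

PAGE = "http://www.example.com/foo.cfm"
DEEP = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976"

for name, text in (("stock first", STOCK_FIRST), ("custom first", CUSTOM_FIRST)):
    rules = parse_rules(text)
    print(name, "| page:", accepts(rules, PAGE), "| deep link:", accepts(rules, DEEP))
```

Under that assumption the deep link is rejected because of the = in ID=6976 before the custom rules are ever consulted, which would explain why only the first page is crawled.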

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:28:46+0000

According to http://wiki.apache.org/nutch/FAQ#What_happens_if_I_inject_urls_several_times.3F you can't have multiple URLs (duplicates will be ignored). How about using only this rule:

+^http://www.example.com/foo.cfm/(.+)*$

This is meant to cover your first rule, +^http://www.example.com/foo.cfm$, as well, although as written it still requires the slash after foo.cfm. If the slash is causing problems, try:

+^http://www.example.com/foo.cfm//?(.+)*$

Here //? matches one / followed by an optional second /.
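
To see how the two patterns above treat the slash after foo.cfm, here is a small check in Python (again only a stand-in for Nutch's Java regex engine, assuming search semantics). The third pattern, with an optional (/.*)? group, is a hypothetical variant added for comparison and is not taken from the FAQ or from Nutch.

```python
import re

patterns = [
    r"^http://www.example.com/foo.cfm/(.+)*$",    # first suggestion
    r"^http://www.example.com/foo.cfm//?(.+)*$",  # second suggestion
    r"^http://www.example.com/foo.cfm(/.*)?$",    # hypothetical variant
]
urls = [
    "http://www.example.com/foo.cfm",
    "http://www.example.com/foo.cfm/",
    "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976",
]

# For each URL, record which of the three patterns match it.
table = {
    url: [re.search(p, url) is not None for p in patterns]
    for url in urls
}
for url, row in table.items():
    print(row, url)
```

Both suggested patterns need at least one slash after foo.cfm, so neither matches the bare page URL; the variant with the optional group matches all three forms.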
