java - Exclude Some URL from getting crawled -




I am writing a crawler, and there are some pages I do not want it to crawl (excluded links that should be skipped). I wrote exclusions for those pages, but something is wrong with my code: the URL http://www.host.com/technology/ is still being crawled despite the exclusion. I do not want any URL that starts with http://www.host.com/technology/ to be crawled.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class MyCrawler extends WebCrawler {

        // Skip static resources and binary files.
        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        List<String> exclusions;

        public MyCrawler() {
            exclusions = new ArrayList<String>();
            // Add exclusions here.
            // I do not want this URL to be crawled.
            exclusions.add("http://www.host.com/technology/");
        }

        public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            System.out.println(href);
            if (filters.matcher(href).matches()) {
                System.out.println("noooo");
                return false;
            }
            if (exclusions.contains(href)) { // why is this loop not working??
                System.out.println("yes2");
                return false;
            }
            if (href.startsWith("http://www.host.com/")) {
                System.out.println("yes1");
                return true;
            }
            System.out.println("no");
            return false;
        }

        public void visit(Page page) {
            int docid = page.getWebURL().getDocid();
            String url = page.getWebURL().getURL();
            String text = page.getText();
            List<WebURL> links = page.getURLs();
            int parentDocid = page.getWebURL().getParentDocid();

            System.out.println("=============");
            System.out.println("docid: " + docid);
            System.out.println("url: " + url);
            System.out.println("text length: " + text.length());
            System.out.println("number of links: " + links.size());
            System.out.println("docid of parent page: " + parentDocid);
            System.out.println("=============");
        }
    }

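exclusions.contains(href) only matches when href is exactly equal to one of the entries, so any longer URL under http://www.host.com/technology/ gets past that check. If you don't want to crawl any URL that starts with one of the exclusions, you'd have to do this: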

    for (String exclusion : exclusions) {
        if (href.startsWith(exclusion)) {
            return false;
        }
    }

Also, an if statement is not a loop.
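To try the prefix check outside the crawler, here is a minimal, self-contained sketch. The class name UrlExclusionFilter and its static shouldVisit(String) method are hypothetical helpers made up for illustration; they are not part of any crawler library, and the method takes a plain String rather than a WebURL so it can run on its own. The allowed prefix and the exclusion are the URLs from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone helper (not a crawler library class) that
// mirrors the decision logic of shouldVisit using plain strings.
final class UrlExclusionFilter {

    // Only URLs under this prefix are candidates for crawling.
    private static final String ALLOWED_PREFIX = "http://www.host.com/";

    // Any URL starting with one of these prefixes is skipped.
    private static final List<String> EXCLUSIONS =
            Arrays.asList("http://www.host.com/technology/");

    private UrlExclusionFilter() {
    }

    static boolean shouldVisit(String rawUrl) {
        // Lower-case first so the comparison is case-insensitive,
        // as in the question's code.
        String href = rawUrl.toLowerCase(Locale.ROOT);

        // Prefix match: reject the excluded page and everything below it.
        for (String exclusion : EXCLUSIONS) {
            if (href.startsWith(exclusion)) {
                return false;
            }
        }

        // Otherwise accept only URLs on the allowed host.
        return href.startsWith(ALLOWED_PREFIX);
    }
}
```

With this sketch, http://www.host.com/technology/gadgets.html is rejected while http://www.host.com/sports/ is accepted. Note that http://www.host.com/technology (no trailing slash) would still be accepted, because it does not start with the excluded prefix; drop the trailing slash from the exclusion if that page should be skipped too.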

java web-crawler
