Tuesday, October 8

Website Crawler with the Fork and Join Framework

Here are the classes involved in this exercise. The code can be copied and executed directly on Java 7, since the Fork and Join libraries are available only from Java 1.7 onwards.

Along with these classes you will need the HTMLParser jar file, which is used to retrieve the links present on the page behind a given URL.

Please download the htmlparser-1.6.jar file and include it in the classpath to execute the code below.
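For example, assuming the jar and the three source files below sit in the current directory (on Windows, use `;` instead of `:` as the classpath separator):

```shell
# compile the three classes against the HTMLParser jar
javac -cp htmlparser-1.6.jar WebsiteCrawler.java LinkTracker.java LinkSearcher.java

# run the crawler with the current directory and the jar on the classpath
java -cp .:htmlparser-1.6.jar WebsiteCrawler
```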


The WebsiteCrawler class initiates the logic. It creates a ForkJoinPool, which holds the worker threads that take up work via work stealing: the work is divided among these threads and executed in parallel, so overall processing finishes faster and multi-processor/multi-core hardware is utilized effectively.
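Before diving into the crawler, the divide-and-conquer behaviour of the pool can be seen in a minimal, self-contained sketch using only the JDK (the class name, threshold, and pool size here are illustrative, not part of the crawler): a RecursiveTask splits an array sum into halves, and forked halves become stealable work for idle pool threads.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal fork/join demo: sums an array by recursively forking halves.
public class ForkSum extends RecursiveTask<Long> {

    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int from, to; // half-open range [from, to)

    public ForkSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {           // small enough: sum directly
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ForkSum left = new ForkSum(data, from, mid);
        ForkSum right = new ForkSum(data, mid, to);
        left.fork();                            // left half becomes stealable work
        return right.compute() + left.join();   // compute right here, then join left
    }

    public static long parallelSum(long[] data) {
        ForkJoinPool pool = new ForkJoinPool(4); // 4 workers, like maxThreadCount below
        try {
            return pool.invoke(new ForkSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(parallelSum(data)); // 1 + 2 + ... + 10000 = 50005000
    }
}
```

The crawler below follows the same pattern, except that its unit of work produces no result, so it extends RecursiveAction instead of RecursiveTask.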

import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.concurrent.ForkJoinPool;

/**
 * @author Manoj
 */
public class WebsiteCrawler implements LinkTracker {

    // thread-safe set of links that have already been crawled
    private final Collection<String> linksCrawled =
            Collections.synchronizedSet(new HashSet<String>());
    private final String inputUrl;
    private final ForkJoinPool pool;

    public WebsiteCrawler(String inputUrl, int maxThreadCount) {
        this.inputUrl = inputUrl;
        this.pool = new ForkJoinPool(maxThreadCount);
    }

    private void init() {
        pool.invoke(new LinkSearcher(this.inputUrl, this));
    }

    public void addVisited(String s) {
        linksCrawled.add(s);
    }

    public boolean visited(String s) {
        return linksCrawled.contains(s);
    }

    public static void main(String[] args) throws Exception {
        new WebsiteCrawler("", 50).init();
    }
}



The LinkTracker interface provides the basic methods required by the link-search logic.
/**
 * @author Manoj
 */
public interface LinkTracker {

    boolean visited(String link);

    void addVisited(String link);
}

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.RecursiveAction;

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;


This is the class where the core recursive logic is executed. To divide, assign, and execute the work recursively, this class extends RecursiveAction and overrides its compute() method. compute() is invoked recursively and runs the logic for every link: after a visit, the visited link is added to the set, and all child URLs found on the current page are wrapped in new RecursiveActions, collected in a list, and executed by their own compute() invocations.

To understand the code further, please execute it in debug mode and walk through the flow.
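The pattern the class follows — check the visited set, collect child actions, then hand them to invokeAll() — can also be sketched in isolation, without any network or HTML parsing. This is a hypothetical in-memory "site" (the class name GraphWalker and the link map are made up for illustration), but the visited-set handling and invokeAll() call mirror the crawler exactly:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Sketch of the crawler's recursion on an in-memory link graph.
public class GraphWalker extends RecursiveAction {

    // hypothetical link graph standing in for real pages
    static final Map<String, List<String>> LINKS = new HashMap<String, List<String>>();
    static {
        LINKS.put("/", Arrays.asList("/a", "/b"));
        LINKS.put("/a", Arrays.asList("/b", "/c"));
        LINKS.put("/b", Arrays.asList("/c"));
        LINKS.put("/c", Collections.<String>emptyList());
    }

    static final Set<String> visited =
            Collections.synchronizedSet(new HashSet<String>());

    private final String node;

    GraphWalker(String node) {
        this.node = node;
    }

    @Override
    protected void compute() {
        // add() returns false if the node was already visited,
        // so check-and-mark happens in one atomic step
        if (!visited.add(node)) {
            return;
        }
        List<GraphWalker> actions = new ArrayList<GraphWalker>();
        for (String child : LINKS.get(node)) {
            if (!visited.contains(child)) {
                actions.add(new GraphWalker(child));
            }
        }
        invokeAll(actions); // fork all children and wait for them to finish
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(4);
        pool.invoke(new GraphWalker("/"));
        System.out.println(visited.size()); // all 4 nodes reached
    }
}
```

One design note: using the single atomic `visited.add(node)` as the check-and-mark step avoids the small race window that a separate contains-then-add pair would have when two threads reach the same link at once.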

public class LinkSearcher extends RecursiveAction {

    private final String url;
    private final LinkTracker tracker;

    public LinkSearcher(String url, LinkTracker tracker) {
        this.url = url;
        this.tracker = tracker;
    }

    @Override
    public void compute() {
        if (!tracker.visited(url)) {
            try {
                // mark this URL visited before crawling its children
                tracker.addVisited(url);

                List<LinkSearcher> actions = new ArrayList<LinkSearcher>();
                URL uriLink = new URL(url);
                Parser parser = new Parser(uriLink.openConnection());
                NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));

                for (int i = 0; i < list.size(); i++) {
                    LinkTag extracted = (LinkTag) list.elementAt(i);

                    if (!extracted.extractLink().isEmpty() && !tracker.visited(extracted.extractLink())) {
                        actions.add(new LinkSearcher(extracted.extractLink(), tracker));
                    }
                }

                // fork all child searches and wait for them to complete
                invokeAll(actions);
            } catch (Exception e) {
                // skip pages that cannot be fetched or parsed
            }
        }
    }
}