Java代写:CSIS460 Threaded Web Crawler

将一个单线程的Web Crawler改成多线程运行。

Web Crawler

Threaded Web Crawler

I have completed a simple single-threaded Web Crawler that uses this class from a 310 assignment. Your task is to

  1. Replace the non-recursive breadth-first graph traversal with a multi-threaded implementation using a fixed- size Java ThreadPool (i.e., instead of enqueing URLs you’d submit a new task to the pool that creates a new WebCrawler object that submits more tasks…) Note that it may require you to rip apart the HTMLLinks class and refactor the solution entirely.
  2. Modify the program so that it displays the 10 most commonly occurring URLs (local or remote) encountered, in descending order, together with a count of many times that URL appeared.

Implementation Notes

Consider what data structure might be useful in displaying the top 10 URLs without having to a) perform a brute-force linear search or 2), sort all the URLs. (hint: it can be done in O(n + klgn) time)
Here is a simple example of how one can use a ThreadPool to limit the number of active threads:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import java.util.concurrent.*;

public class ThreadPoolExample
{

private static final int MAX_THREADS = 10; // Try Different values!!!

public static void main(String args[])
{

ExecutorService service = Executors.newFixedThreadPool(MAX_THREADS);
for (int i=0; i<100000; i++) // Create lots of threads
{
service.submit(new Task(i));
}
}
}

final class Task implements Runnable
{

private int p_taskId;

public Task(int id)
{

p_taskId = id;
}

@Override
public void run()
{

System.out.println("Task ID : " + p_taskId +" performed by "
+ Thread.currentThread().getName());
}
}

Note that the run() method of a Runnable can not be parameterized. So we typically implement a parameterized constructor that is passed references to objects shared by various threads (i.e., a shared memory model of communication); obviously you need to use “thread-safe” objects and data structures (can you say “Vector” instead of “ArrayList”; “ConcurrentHashMap”, etc?). Consider what shared “global” data or data structures need to be accessible by each thread (and by “global” I mean passed to the thread via the constructor). You must be careful to avoid “race conditions” (e.g., testing and subsequently setting) that might result in inconsistency; correct programs must guarantee correctness, not depend on “luck.”

Also note that we never explicitly call run(); this is the method that will be invoked when a Thread is started (explicitly by calling start() or implicitly when the ThreadPool wraps one of its limited number of Thread objects around a Runnable object and starts the thread).

The main() above is somewhat deficient in that it never terminates. Obviously we want our main program to wait for all threads to complete before we can go on to display the most common URLs and then exit. One way to do that is to “shutdown” the thread pool and wait (indefinitely) for it to “drain”:

1
2
3
4
5
6
service.shutdown();
try
{
service.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
} catch (InterruptedException e) { }
...

If all tasks to be executed have been submitted before we call shutdown() this works perfectly. However, this wont work in our case as threads will attempt to submit new tasks at the same time the main thread is calling shutdown(), preventing new tasks from being added as shutdown “bars the door” to the service and then waits for the existing threads to complete. (Note, however, that this is the approach signal-handlers should take).

Another approach is to have the last task “turn out the lights”, as it were, by shutting down the thread pool; main() would still awaitTermination() but would rely on a thread shutting the service down after it determines there is no more work to be done. Can this work and, if so, how? I’m leaving this as an open-question for discussion in class; ASK!

Analysis

I want you to test non-threaded and threaded versions of your program.; execute each a number of times and average the runtimes. You’ll need to minimally modify the non-threaded version so that both versions report the top 10 most common links. Your threaded version might experiment with various sizes for your thread-pool.

In addition to turning in the program code for your threaded version, answer the following questions for the remaining 15 points:

  • Which version ran faster and why?
  • How did the thread-pool size affect the threaded version, if at all?
  • Do you think the results would be different if the crawler crawled beyond the initial URL’s host? Why or why not?

Submit your analysis and your multi-threaded WebCrawler.java program only (do not include HTMLLinks.java unless you’ve made changes). Do not include your analysis in the code…