Programming Forums

equinox Nov 11th, 2007 9:37 AM

Network Programming Help.
 
Hi all.

I'm a second-year software engineering student and we've been given an assignment in network programming. The assignment specification is to take a URL as a command line argument, download its source, examine all its links and finally print out a summary of how many broken links occurred.

I've only just started this and I have a problem. I'm unsure how I would find out whether a piece of HTML is actually a link or not. Our lecturer gave us a link to this example.

http://www.exampledepot.com/egs/java.../GetLinks.html

Now I've never done network programming before and this seems very hard for me to understand, as the writer didn't comment the code or list the packages he used. Could anyone give me a brief explanation of how I would test a page to find links?

I've been able to figure out how to read in a URL and print out its source without any problem, and I've included this code below.

Thanks :).

Code:

//THIS IS JUST A TEST!

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class WebProject
{

        public static void main(String [] args)
        {
                if(args.length != 1)
                {
                        System.out.println("Usage: java WebProject <url>");
                        return;
                }

                List<String> links = new ArrayList<String>();        //will be used to store any links

                //try-with-resources closes the reader and the writer even if reading fails
                try(BufferedReader in = new BufferedReader(new InputStreamReader(new URL(args[0]).openStream()));    //open up a "link" to the url taken from args
                    PrintWriter myFile = new PrintWriter("output.HTML"))  //create a PrintWriter to write the contents to a file, output.HTML
                {
                        String input;

                        while((input = in.readLine()) != null) //continue while there is still data to read
                        {
                                myFile.println(input);  //print out what has just been read
                        }
                }

                catch(MalformedURLException e)
                {
                        System.out.println("Error!! The supplied URL is not valid.");  //the URL could not be parsed at all
                        System.out.println(args[0]);
                }

                catch(IOException e)
                {
                        System.out.println("Error!! Data handling problem.");  //connecting, reading or writing failed
                }
        }

}


Jabo Nov 11th, 2007 6:12 PM

Re: Network Programming Help.
 
I would think a normal ping would work, as in if you ping the url, it either is tied to the server or it's not. But then, I don't know that much about networking.

DaWei Nov 11th, 2007 6:24 PM

Re: Network Programming Help.
 
You could connect to the server and still have a broken link. That is, the server might return a 404 page, for instance.

Your first step is to parse the page for all links. That's the emphasis of the link you posted. The next step is to follow all the links and see if you get a valid page returned (200 OK, for instance).
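
The first step above (parsing the page for links) could be sketched roughly as follows. This is only a regex-based sketch, not the approach used in the linked example, and a regex will miss or misread some real-world HTML. The class name LinkExtractor, the method extractLinks and the example.com addresses are made up for illustration. Relative links are resolved against the page's own URL so that they can be requested in the second step.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls href values out of <a> tags with a regex.
class LinkExtractor {
    // Group 1 is the opening quote, group 2 the href value up to the matching quote.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns absolute http/https links found in html, resolved against baseUrl.
    static List<String> extractLinks(String html, String baseUrl) {
        URL base;
        try {
            base = new URL(baseUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad base URL: " + baseUrl, e);
        }
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            if (href.isEmpty() || href.startsWith("#")) {
                continue; // in-page anchors are not separate pages
            }
            try {
                URL resolved = new URL(base, href); // handles relative links
                String protocol = resolved.getProtocol();
                if (protocol.equals("http") || protocol.equals("https")) {
                    links.add(resolved.toString());
                }
            } catch (MalformedURLException e) {
                // e.g. javascript: pseudo-links; skipped in this sketch
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about.html\">About</a> <a href='pics/index.html'>Pics</a>";
        for (String link : extractLinks(html, "http://example.com/dir/page.html")) {
            System.out.println(link);
        }
    }
}
```

Each string returned can then be handed to whatever check is used for the second step.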

ReggaetonKing Nov 11th, 2007 10:34 PM

Re: Network Programming Help.
 
HttpURLConnection has a method called getResponseCode(), which returns 200 if it's a valid page. Once you parse the links from the HTML source, you can put them into a list and iterate through the list of URLs to see if they are valid. Here is a method that you could use. My Java skills are a bit rusty so bear with me.
Code:

public boolean isValidUrl(String urlStr)
{
        try
        {
                java.net.URL url = new java.net.URL(urlStr);
                java.net.HttpURLConnection httpConn = (java.net.HttpURLConnection)url.openConnection();
                httpConn.connect();
                //200 (HTTP_OK) means a valid link; any other response code counts as broken
                return httpConn.getResponseCode() == java.net.HttpURLConnection.HTTP_OK;
        }
        catch(Exception e)
        {
                e.printStackTrace();
                return false; //could not even connect, so treat the link as broken
        }
}
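
The "iterate through the list" part might look something like the sketch below. The names BrokenLinkCounter, countBroken and respondsOk are invented for this example. The check is passed in as a Predicate so the counting loop can be tried without a network connection. The HEAD request and the 5 second timeouts are assumptions rather than anything the assignment asks for, and some servers answer HEAD differently from GET.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: counts how many URLs in a list fail a validity check.
class BrokenLinkCounter {
    // Counts the URLs for which isValid returns false.
    static int countBroken(List<String> urls, Predicate<String> isValid) {
        int broken = 0;
        for (String url : urls) {
            if (!isValid.test(url)) {
                broken++;
            }
        }
        return broken;
    }

    // One possible check: true only if the server answers 200 OK.
    static boolean respondsOk(String urlStr) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(urlStr).openConnection();
            conn.setRequestMethod("HEAD"); // assumption: headers are enough, skip the body
            conn.setConnectTimeout(5000);  // assumption: give up after 5 seconds
            conn.setReadTimeout(5000);
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException | ClassCastException e) {
            // malformed URL, non-HTTP URL, or connection failure
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of(args);
        int broken = countBroken(urls, BrokenLinkCounter::respondsOk);
        System.out.println(broken + " of " + urls.size() + " links were broken");
    }
}
```

Running it with a few addresses as command line arguments prints the summary line.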


equinox Nov 15th, 2007 1:25 PM

Re: Network Programming Help.
 
Thanks guys, I got that code in the link to work (after a few hours and about 5 pints of coffee!! :) ). I have all my methods worked out so putting the program together should be a breeze. Thanks :).

null_ptr0 Nov 21st, 2007 9:48 PM

Re: Network Programming Help.
 
Code:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class URLCrawler {
        public static void main(String[] argv) {
                if(argv.length != 1)
                        System.out.println("Arguments: <address (String)>");
                else
                        checkAddresses(parseAddresses(downloadSource(argv[0])));
        }

        // Downloads the page at the given address and returns its source.
        private static String downloadSource(String address) {
                byte[] read = new byte[0];
                try {
                        URL url = new URL(address);
                        URLConnection urlc = url.openConnection();
                        InputStream is = urlc.getInputStream();
                        // getContentLength() can be -1 and one read() call may stop early,
                        // so read to the end of the stream instead (readAllBytes needs Java 9+)
                        read = is.readAllBytes();
                        is.close();
                } catch(IOException ioex) {
                        ioex.printStackTrace();
                        System.exit(1);
                }
                return new String(read);
        }

        // Returns the href value of every <a> tag found in the html.
        private static String[] parseAddresses(String html) {
                // matches any <a ... href="..."> tag; group 1 is the address itself
                String regex = "<a\\s+[^>]*?href\\s*=\\s*['\"](.*?)['\"]";
                Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
                Matcher m = p.matcher(html);
                List<String> addresses = new ArrayList<String>();
                while(m.find())
                        addresses.add(m.group(1));
                return addresses.toArray(new String[0]);
        }

        // Checks every address and prints the summary.
        private static void checkAddresses(String[] urls) {
                int i = 0;
                for(String url : urls)
                        i = (isBroken(url) ?  i + 1 : i);
                System.out.printf("%d urls extracted were broken and %d were intact, out of %d urls%n", i, urls.length - i, urls.length);
        }

        // A link counts as broken unless the server answers 200 OK.
        // Relative and non-HTTP links are not resolved here, so they count as broken too.
        private static boolean isBroken(String url) {
                try {
                        URL u = new URL(url);
                        HttpURLConnection huc = (HttpURLConnection) u.openConnection();
                        return huc.getResponseCode() != 200;
                } catch(IOException | ClassCastException ex) {
                        ex.printStackTrace();
                }
                return true;
        }
}

That's what I programmed in 5 minutes.


All times are GMT -5.