HELP: parsing unicode web sites

  • 9 years ago
    I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts. I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html about using the LWP and HTTP libraries and the get($url) function, but the content I get back is garbled. When I call get($url) on a non-Unicode web page, the content comes back as clean ASCII. Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, and the page I get back is garbled. I have read about Encode but don't know how to use it. I need a Perl script that parses the page above and extracts the URL of the image in this pattern:
    <div class="movie"><img src="..." class="mp" />
    If anyone knows how to parse Unicode web pages like this, I'd be very grateful. Thank you.
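
Since the question asks how Encode is used, here is a minimal sketch of the usual decode-on-input, encode-on-output pattern. The charset `gb2312` is only an assumption about what such a page might declare (check the page's Content-Type header or meta tag), and `bytes_to_text` is a hypothetical helper name, not part of any library. The sample bytes are generated in the script rather than downloaded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes from a web page into a Perl
# character string, given the charset the page declares.
sub bytes_to_text {
    my ($bytes, $charset) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # silently substituting replacement characters.
    return decode($charset, $bytes, Encode::FB_CROAK);
}

# Stand-in for downloaded content: two Chinese characters
# (U+7535, U+5F71) encoded as GB2312 bytes. The charset is an assumption.
my $raw = encode('gb2312', "\x{7535}\x{5F71}");

my $text = bytes_to_text($raw, 'gb2312');

# Encode again on output so the terminal receives UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%d bytes in, %d characters out: %s\n",
    length($raw), length($text), $text;
```

Once the content is a decoded character string, regular expressions and length() operate on characters rather than bytes, which is usually what makes the "garbled" output go away.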
  • 9 years ago

    Thanks to those who helped. Here's my working script:

    #!/usr/bin/perl
    # tom365crawl2.pl
    # http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
    # http://perldoc.perl.org/Encode.html
    # http://juerd.nl/site.plp/perluniadvice
    # http://www.perlmonks.org/?node_id=620068

    use warnings;
    use strict;

    use File::stat;
    use Tie::File;

    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor; # Allows you to extract the links off of an HTML page.
    #use File::Slurp;

    use Encode;

    my $site1 = "http://www.tom365.com/"; # Full url like http://www.tom365.com/movie_2004/html/????.html
    my $delim1a = '<div class="movie"><img src="';
    my $delim1b = '" class="mp" />';
    my $folder1 = "movie_2004/html/";
    my $url1;
    my $start1 = 1000;
    my $end1 = 1000;
    my $contents1;
    my $image1;

    my $browser1 = LWP::UserAgent->new();
    $browser1->timeout(10);
    my $request1;
    my $response1;

    my $count;
    for ($count=$start1; $count<=$end1; $count++) {
      $url1 = $site1 . $folder1 . $count . ".html";
      printf "Downloading %s\n", $url1;

      # Method 1
      #$contents1 = get($url1);

      # Method 2
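      $request1 = HTTP::Request->new(GET => $url1);
      $response1 = $browser1->request($request1);
      if ($response1->is_error()) {
        printf "%s\n", $response1->status_line;
        next; # Skip pages that failed to download.
      }
      # decoded_content() decodes the body using the charset the response declares.
      $contents1 = $response1->decoded_content();
      next unless defined $contents1; # Decoding can fail and return undef.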

      #open(NEWFILE1, "> Debug.txt");
      #(print NEWFILE1 $contents1)    or die "Can't write to Debug.txt: $!";
      #close(NEWFILE1);

      #print $contents1;

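      # Non-greedy match, so the capture stops at the first closing delimiter.
      if ($contents1 =~ /<div class="movie"><img src="(.*?)" class="mp" \/>/) {
        $image1 = $1;
        printf "Downloading %s\n", $image1;
        # List-form system() bypasses the shell, so characters in the URL
        # cannot be interpreted as shell syntax.
        system('wget', '-q', '-O', "$count.jpg", $image1) == 0
          or warn "wget failed for $image1\n";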

        #if ($image1 =~ /\/([^\/]*)$/m) {
        #  printf "Renaming %s to $count.jpg\n", $1;
        #} else {
        #  printf "Could not rename %s to $count.jpg\n", $image1;
        #}
      } else {
        #open(NEWFILE1, "> $count.txt");
        #(print NEWFILE1 "Download failed.\n")    or die "Can't write to $image1: $!";
        #close(NEWFILE1);
      }
    }
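
The matching step in the script can be tried on its own without any network access. The sketch below is only an illustration: `extract_image_url` is a hypothetical helper name and the sample HTML is made up. It shows why a capture that cannot cross a double quote is safer than a greedy `(.*)` when the pattern occurs more than once on a line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the src of the first image that matches the
# <div class="movie"><img src="..." class="mp" /> pattern, or undef.
sub extract_image_url {
    my ($html) = @_;
    my $before = '<div class="movie"><img src="';
    my $after  = '" class="mp" />';
    # \Q...\E quotes the delimiters literally; [^"]* cannot run past the
    # closing quote of the src attribute.
    return $html =~ /\Q$before\E([^"]*)\Q$after\E/ ? $1 : undef;
}

# Made-up sample with two matching images on a single line.
my $sample = '<div class="movie"><img src="http://example.com/a.jpg" class="mp" />'
           . '<div class="movie"><img src="http://example.com/b.jpg" class="mp" />';

my $url = extract_image_url($sample);
print defined $url ? "Found $url\n" : "No image found\n";
```

With a greedy `(.*)` the same sample would capture everything from the first URL through the second one, because the match extends to the last closing delimiter on the line.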

