HELP: parsing unicode web sites

  • 10 years ago
    I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts. I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html about using LWP, HTTP, and the get($url) function, but the content returned is always garbled. When I call get($url) on a non-Unicode page, the content comes back as clean ASCII. Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, and the page I get back is garbled. I have read about Encode but don't know how to use it. I need a Perl script that parses the page above and extracts the URL of the image matching this pattern:
    If anyone knows how to parse Unicode web pages like this, I'd be very grateful. Thank you.
  • 10 years ago

    Thanks to those who helped. Here's my working script:

    #!/usr/bin/perl
    # tom365crawl2.pl
    # http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
    # http://perldoc.perl.org/Encode.html
    # http://juerd.nl/site.plp/perluniadvice
    # http://www.perlmonks.org/?node_id=620068

    use warnings;
    use strict;

    use File::stat;
    use Tie::File;

    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor; # Allows you to extract the links off of an HTML page.
    #use File::Slurp;

    use Encode;

    my $site1 = "http://www.tom365.com/"; # Full url like http://www.tom365.com/movie_2004/html/????.html
    my $delim1a = "\<div class=\"movie\"\>\<img src=\"";
    my $delim1b = "\" class=\"mp\" \/\>";
    my $folder1 = "movie_2004/html/";
    my $url1;
    my $start1 = 1000;
    my $end1 = 1000;
    my $contents1;
    my $image1;

    my $browser1 = LWP::UserAgent->new();
    $browser1->timeout(10);
    my $request1;
    my $response1;

    my $count;
    for ($count=$start1; $count<=$end1; $count++) {
      $url1 = $site1 . $folder1 . $count . ".html";
      printf "Downloading %s\n", $url1;

      # Method 1
      #$contents1 = get($url1);

      # Method 2
      $request1 = HTTP::Request->new(GET => $url1);
      $response1 = $browser1->request($request1);
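      if ($response1->is_error()) {
        printf "%s\n", $response1->status_line;
        next; # skip this page; an error response has no usable content
      }
      # decoded_content() undoes any Content-Encoding and decodes text to Perl characters
      $contents1 = $response1->decoded_content();
      next unless defined $contents1; # decoding can fail and return undef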

      #open(NEWFILE1, "> Debug.txt");
      #(print NEWFILE1 $contents1)    or die "Can't write to Debug.txt: $!";
      #close(NEWFILE1);

      #print $contents1;

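      # [^"]* stops at the first closing quote, so only the src value is captured
      if ($contents1 =~ /<div class="movie"><img src="([^"]*)" class="mp" \/>/) {
        $image1 = $1;
        printf "Downloading %s\n", $image1;
        # List-form system() bypasses the shell, so the URL is never parsed as shell syntax
        system('wget', '-q', '-O', "$count.jpg", $image1);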

        #if ($image1 =~ /\/([^\/]*)$/m) {
        #  printf "Renaming %s to $count.jpg\n", $1;
        #} else {
        #  printf "Could not rename %s to $count.jpg\n", $image1;
        #}
      } else {
        #open(NEWFILE1, "> $count.txt");
        #(print NEWFILE1 "Download failed.\n")    or die "Can't write to $image1: $!";
        #close(NEWFILE1);
      }
    }
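
    For anyone who lands here with the original question about Encode: below is a minimal sketch of decoding by hand, which is roughly what decoded_content() does for you in the script above. It needs no network access. The sample HTML, the /pic/1000.jpg path, and the variable names are made up for illustration, and the charset is an assumption (the thread never says which encoding the site uses; GB2312, which Encode calls euc-cn, is only a guess for a Chinese-language page).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample standing in for the raw bytes a server might send.
# "\xD6\xD0\xCE\xC4" is a two-character Chinese word encoded as GB2312 (EUC-CN).
my $raw_bytes = qq{<div class="movie"><img src="/pic/1000.jpg" class="mp" />}
              . qq{<p>\xD6\xD0\xCE\xC4</p>};

# Assumption: the page is GB2312. On a real page, check the Content-Type
# header or the <meta> charset declaration instead of guessing.
my $charset = 'euc-cn';

# decode() turns raw bytes into a Perl character string.
my $text = decode($charset, $raw_bytes);

# Pull out the image URL; [^"]* stops at the first closing quote.
my ($image_url) = $text =~ m{<div class="movie"><img src="([^"]*)" class="mp" />};

# Pull out the caption to show that it is now characters, not bytes.
my ($caption) = $text =~ m{<p>(.*?)</p>};

printf "image: %s\n", defined $image_url ? $image_url : '(not found)';
printf "caption length in characters: %d\n", length $caption;

# encode() turns characters back into bytes before they leave the program.
print encode('UTF-8', "caption: $caption\n");
```

    The general rule is to decode bytes as soon as they come in, work on character strings, and encode again only on output. If the output still looks garbled, the charset guess is the first thing to check.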

