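Web Scraping in Perl: Web::Scraper and Mojo::UserAgent

Web scraping is the technique of extracting the information you need from web pages. Perl has several capable, easy-to-use libraries for it.

Web Scraping Basics

The basic workflow of scraping:

HTTP request: fetch the web page
HTML parsing: parse the DOM structure
Data extraction: pull out the information you need
Data processing: transform and store the extracted data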
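
The four steps above can be sketched in a few lines. This is a minimal illustration, not a complete scraper: the HTML string, the `li.item` selector, and the JSON output format are all assumptions made for the example, and the HTTP request is replaced by a fixed string so that the sketch runs offline.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::JSON qw(encode_json);

# Step 1 (HTTP request) is replaced by a fixed string so the sketch runs
# offline; a real script would fetch this markup over HTTP.
my $html = <<'HTML';
<html><body>
  <h1>Sample news</h1>
  <ul>
    <li class="item"><a href="/a">First story</a></li>
    <li class="item"><a href="/b">Second story</a></li>
  </ul>
</body></html>
HTML

# Step 2: parse the markup into a DOM tree.
my $dom = Mojo::DOM->new($html);

# Step 3: extract the fields of interest with CSS selectors.
my @items = $dom->find('li.item > a')->map(sub {
    return { title => $_->text, url => $_->attr('href') };
})->each;

# Step 4: process the data; here it is simply serialized to JSON.
my $json = encode_json(\@items);
print "$json\n";
```

Keeping the steps separate like this makes it easy to swap one of them out later, for example storing rows in a database instead of printing JSON.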

Using Web::Scraper

Web::Scraper is an intuitive scraping library that lets you select elements with CSS selectors or XPath.

Basic usage

use strict;
use warnings;
use Web::Scraper;
use URI;

my $scraper = scraper {
    # Get the title
    process 'h1', 'title' => 'TEXT';

    # Collect every paragraph into an array
    process 'p', 'paragraphs[]' => 'TEXT';

    # Collect the URL and text of every link
    process 'a', 'links[]' => {
        url  => '@href',
        text => 'TEXT',
    };
};

my $result = $scraper->scrape(URI->new('https://example.com/'));

print "Title: $result->{title}\n";
# A key is absent when nothing matched, so fall back to an empty list
print "Paragraphs: ", scalar @{ $result->{paragraphs} || [] }, "\n";

for my $link (@{ $result->{links} || [] }) {
    print "Link: $link->{text} -> $link->{url}\n";
}

Making use of CSS selectors

use strict;
use warnings;
use Web::Scraper;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $scraper = scraper {
    # Class selector
    process '.article-title', 'title' => 'TEXT';

    # ID selector
    process '#content', 'content' => 'TEXT';

    # Attribute selector
    process 'a[rel="nofollow"]', 'nofollow_links[]' => '@href';

    # Child combinator (h2 elements directly inside div.post)
    process 'div.post > h2', 'post_titles[]' => 'TEXT';

    # Pseudo-class
    process 'li:first-child', 'first_item' => 'TEXT';

    # Nested scraper: one hash per div.article
    process 'div.article', 'articles[]' => scraper {
        process 'h2', 'title' => 'TEXT';
        process '.author', 'author' => 'TEXT';
        process '.date', 'date' => 'TEXT';
        process '.content', 'content' => 'TEXT';
    };
};

my $res = $ua->get('https://example.com/blog');
die 'Request failed: ' . $res->status_line . "\n" unless $res->is_success;

my $result = $scraper->scrape($res->decoded_content, $res->base);

for my $article (@{ $result->{articles} || [] }) {
    print "Title: $article->{title}\n";
    print "Author: $article->{author}\n";
    print "Date: $article->{date}\n";
    print "---\n";
}
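
Web::Scraper also accepts XPath expressions in place of CSS selectors. The sketch below assumes a small hypothetical HTML fragment and passes it as a scalar reference, so nothing is fetched over the network; the `div.post` structure is made up for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup standing in for a fetched page.
my $html = <<'HTML';
<html><body>
  <div class="post"><h2>First post</h2><a href="/posts/1">Read more</a></div>
  <div class="post"><h2>Second post</h2><a href="/posts/2">Read more</a></div>
</body></html>
HTML

my $scraper = scraper {
    # Expressions that start with "/" are treated as XPath, not CSS.
    process '//div[@class="post"]/h2', 'titles[]' => 'TEXT';
    process '//div[@class="post"]/a',  'urls[]'   => '@href';
};

# A scalar reference is scraped as HTML content instead of being fetched.
my $result = $scraper->scrape(\$html);

print "$_\n" for @{ $result->{titles} };
print "$_\n" for @{ $result->{urls} };
```

XPath is handy when the condition is awkward to express in CSS, such as matching on text content or walking up to a parent element.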

Extracting nested data

use strict;
use warnings;
use Web::Scraper;
use URI;

my $scraper = scraper {
    process 'div.product', 'products[]' => scraper {
        process 'h3.name', 'name' => 'TEXT';
        process 'span.price', 'price' => 'TEXT';
        process 'img', 'image' => '@src';

        # Nest one level deeper
        process 'ul.specs', 'specs' => scraper {
            process 'li', 'items[]' => 'TEXT';
        };

        # Rating information
        process 'div.rating', 'rating' => scraper {
            process 'span.stars', 'stars' => 'TEXT';
            process 'span.count', 'count' => 'TEXT';
        };
    };
};

my $result = $scraper->scrape(URI->new('https://shop.example.com/'));

for my $product (@{ $result->{products} || [] }) {
    print "Product: $product->{name}\n";
    print "Price: $product->{price}\n";
    print "Rating: $product->{rating}{stars} ($product->{rating}{count} reviews)\n";
    # A product without a spec list yields no items, so fall back to an empty list
    print "Specs: ", join(', ', @{ $product->{specs}{items} || [] }), "\n";
    print "\n";
}

Mojo::UserAgent + Mojo::DOM

Combining Mojo::UserAgent with Mojo::DOM gives you a modern, powerful scraping toolkit.

Basic usage

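use strict;
use warnings;
use Mojo::UserAgent;

my $ua  = Mojo::UserAgent->new;
my $tx  = $ua->get('https://example.com/');
my $dom = $tx->res->dom;

# Get the title (at() returns undef when nothing matches)
my $title_el = $dom->at('title');
my $title    = $title_el ? $title_el->text : '';
print "Title: $title\n";

# Get all links; all_text also includes text inside child elements
$dom->find('a')->each(sub {
    my $link = shift;
    printf "Link: %s -> %s\n",
        $link->all_text,
        $link->attr('href') // '';
});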
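
Two Mojo::DOM details are easy to trip over: `at` returns undef when no element matches, and `text` only returns text that sits directly inside the element. The sketch below shows both on a small made-up fragment; the markup is an assumption for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new(
    '<div><a href="/docs"><b>Docs</b></a><a href="/faq">FAQ</a></div>'
);

# at() returns undef when nothing matches, so guard before calling methods.
my $h1    = $dom->at('h1');
my $title = $h1 ? $h1->all_text : '(no h1 found)';
print "Title: $title\n";

$dom->find('a')->each(sub {
    my $link = shift;
    # text() only sees text nodes directly inside <a>;
    # all_text() also descends into child elements such as <b>.
    printf "text=[%s] all_text=[%s]\n", $link->text, $link->all_text;
});
```

For the first link, `text` is empty because the label lives inside `<b>`, while `all_text` returns `Docs`.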

Extracting data with CSS selectors

use strict;
use warnings;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get('https://news.ycombinator.com/')->res->dom;

# Get the Hacker News stories
$dom->find('tr.athing')->each(sub {
    my $row = shift;

    # Skip rows that do not contain a title link
    my $link = $row->at('.titleline > a') or return;

    my $title = $link->text;
    my $url   = $link->attr('href');

    print "Title: $title\n";
    print "URL: $url\n";
    print "---\n";
});
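
The `href` values taken from a page are often relative to the page URL. A minimal sketch of resolving them with Mojo::URL follows; the base URL and the two-link fragment are assumptions chosen for illustration, not the real page markup.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::URL;

# The base URL is the address the page was fetched from (assumed here).
my $base = Mojo::URL->new('https://news.ycombinator.com/news');

# Hypothetical fragment with one relative and one absolute link.
my $dom = Mojo::DOM->new(<<'HTML');
<span class="titleline"><a href="item?id=1">Ask HN: Example</a></span>
<span class="titleline"><a href="https://www.perl.org/">Perl</a></span>
HTML

my @urls = $dom->find('.titleline > a')->map(sub {
    # to_abs() leaves absolute URLs alone and resolves relative ones.
    Mojo::URL->new($_->attr('href'))->to_abs($base)->to_string;
})->each;

print "$_\n" for @urls;
```

Storing absolute URLs makes later steps, such as deduplication or following the links, independent of the page they came from.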

Handling JSON responses

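use strict;
use warnings;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# Fetch JSON; the response object decodes it for us
my $tx = $ua->get('https://api.github.com/users/miyagawa/repos');
my $repos = $tx->res->json;

# On errors such as rate limiting, the API returns a JSON object instead of an array
die "Unexpected response from the API\n" unless ref $repos eq 'ARRAY';

for my $repo (@$repos) {
    printf "%s: %s\n",
        $repo->{name},
        $repo->{description} // 'No description';
}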

Asynchronous scraping

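use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::IOLoop;

my $ua = Mojo::UserAgent->new;

my @urls = (
    'https://www.perl.org/',
    'https://metacpan.org/',
    'https://perldoc.perl.org/',
);

my $count = 0;
for my $url (@urls) {
    $ua->get($url => sub {
        my ($ua, $tx) = @_;

        # Guard against failed requests and pages without a <title>;
        # an exception here would leave the loop running forever
        my $title_el = $tx->res->dom->at('title');
        my $title    = $title_el ? $title_el->text : '(no title)';
        print "$url: $title\n";

        # Stop the event loop once every request has finished
        $count++;
        Mojo::IOLoop->stop if $count == @urls;
    });
}

Mojo::IOLoop->start unless Mojo::IOLoop->is_running;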

Practical example: scraping a news site

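use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::URL;
use DBI;

my $ua = Mojo::UserAgent->new;

# Connect to the database
my $dbh = DBI->connect('dbi:SQLite:dbname=news.db', '', '', {
    RaiseError => 1,
    AutoCommit => 1,
});

# Create the table
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY,
        title TEXT,
        url TEXT UNIQUE,
        published_at TEXT,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
});

sub scrape_news {
    my $url = shift;

    my $dom = $ua->get($url)->res->dom;

    # Extract articles (adjust the selectors to the site's structure)
    $dom->find('article.post')->each(sub {
        my $article = shift;

        # Skip entries that lack a title or a link
        my $title_el = $article->at('h2.title') or return;
        my $link_el  = $article->at('a[href]')  or return;
        my $time_el  = $article->at('time');

        my $title = $title_el->text;
        my $date  = $time_el ? $time_el->attr('datetime') : undef;

        # Resolve relative links against the page URL
        my $link = Mojo::URL->new($link_el->attr('href'))
            ->to_abs(Mojo::URL->new($url))->to_string;

        # Save to the database; OR IGNORE skips rows that would violate
        # the UNIQUE constraint on url, while other errors still raise
        my $rows = $dbh->do(
            'INSERT OR IGNORE INTO articles (title, url, published_at) VALUES (?, ?, ?)',
            undef,
            $title,
            $link,
            $date
        );

        if ($rows > 0) {
            print "Saved: $title\n";
        }
        else {
            print "Skipped: $title (already exists)\n";
        }
    });
}

# Run the scraper
scrape_news('https://example.com/news');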

Handling pagination

use strict;
use warnings;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
my $base_url = 'https://example.com/articles';

my @all_articles;

for my $page (1..10) {
    my $url = "$base_url?page=$page";
    print "Scraping page $page...\n";

    my $dom = $ua->get($url)->res->dom;

    my @articles = $dom->find('article')->map(sub {
        my $article = shift;
        # at() returns undef when the element is missing
        my $heading = $article->at('h2');
        my $link    = $article->at('a');
        return {
            title => $heading ? $heading->text : '',
            url   => $link ? $link->attr('href') : undef,
        };
    })->each;

    last unless @articles;  # stop when a page has no articles

    push @all_articles, @articles;

    # Wait between requests to avoid loading the server
    sleep 2;
}

print "Total articles: ", scalar @all_articles, "\n";
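
When the number of pages is not known in advance, an alternative is to follow the page's "next" link until it disappears. The sketch below assumes the site marks that link with `rel="next"`; the `%pages` hash and the `fetch_dom` helper are hypothetical stand-ins for real HTTP requests so that the loop can run offline.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical pages keyed by URL; they stand in for HTTP responses.
my %pages = (
    '/articles?page=1' =>
        '<article><h2>A</h2></article><a rel="next" href="/articles?page=2">Next</a>',
    '/articles?page=2' =>
        '<article><h2>B</h2></article><article><h2>C</h2></article>',
);

# Stand-in for $ua->get($url)->res->dom.
sub fetch_dom {
    my $url = shift;
    return Mojo::DOM->new($pages{$url} // '');
}

my @titles;
my %seen;
my $url = '/articles?page=1';

# The %seen check prevents an endless loop if pages link to each other.
while (defined $url && !$seen{$url}++) {
    my $dom = fetch_dom($url);
    push @titles, $dom->find('article h2')->map('text')->each;

    # Follow the "next" link if the page has one; otherwise stop.
    my $next = $dom->at('a[rel="next"]');
    $url = $next ? $next->attr('href') : undef;
}

print join(', ', @titles), "\n";
```

With a real site, `fetch_dom` would perform the request and the loop would also pause between pages.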

Login authentication

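use strict;
use warnings;
use Mojo::UserAgent;

# Follow the redirect that many sites send after a successful login
my $ua = Mojo::UserAgent->new(max_redirects => 5);

# Submit the login form
my $login_url = 'https://example.com/login';
my $tx = $ua->post($login_url => form => {
    username => 'myuser',
    password => 'mypassword',
});
die "Login request failed\n" unless $tx->res->is_success;

# The user agent keeps the session cookie in its cookie jar,
# so pages that require authentication are now accessible
my $protected = $ua->get('https://example.com/members')->res->dom;
my $heading   = $protected->at('h1');
print "Protected content: ", ($heading ? $heading->text : '(no h1 found)'), "\n";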
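
Some login forms also carry hidden fields, such as a CSRF token, that must be sent back with the credentials. The sketch below is one way to collect them from the form markup; the form id, the `csrf_token` field name, and its value are all hypothetical.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical login page markup; the field names are assumptions.
my $login_page = Mojo::DOM->new(<<'HTML');
<form id="login" action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
HTML

# Copy every hidden field so the server-issued token is sent back.
my %form;
$login_page->find('form#login input[type="hidden"]')->each(sub {
    my $input = shift;
    $form{ $input->attr('name') } = $input->attr('value') // '';
});

# Add the credentials on top of the hidden fields.
@form{qw(username password)} = ('myuser', 'mypassword');

print "$_=$form{$_}\n" for sort keys %form;
```

The resulting hash can then be passed as the `form` argument of `$ua->post`. In a real script the login page would first be fetched with the same user agent, so that the token and the session cookie belong together.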

Respecting robots.txt

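use strict;
use warnings;
use Mojo::UserAgent;
use WWW::RobotRules;

my $bot_name = 'MyBot/1.0';

my $ua = Mojo::UserAgent->new;
# Send the same name that the rules are matched against
$ua->transactor->name($bot_name);
my $rules = WWW::RobotRules->new($bot_name);

# Fetch robots.txt
my $robots_url = 'https://example.com/robots.txt';
my $robots_res = $ua->get($robots_url)->res;

# Parse the body only on success, so an error page is never parsed as rules
$rules->parse($robots_url, $robots_res->is_success ? $robots_res->body : '');

# Check whether scraping is allowed
my $target_url = 'https://example.com/products';
if ($rules->allowed($target_url)) {
    print "Allowed to scrape $target_url\n";
    my $dom = $ua->get($target_url)->res->dom;
    # Scraping logic goes here
} else {
    print "Not allowed to scrape $target_url\n";
}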
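
To see how the rules are evaluated without touching the network, the same module can parse a robots.txt body held in a string. The rules and paths below are made up for illustration.

```perl
use strict;
use warnings;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('MyBot/1.0');

# Hypothetical robots.txt body; normally this is the fetched file.
my $robots_txt = <<'TXT';
User-agent: *
Disallow: /private/
Disallow: /search
TXT

$rules->parse('https://example.com/robots.txt', $robots_txt);

for my $path ('/products', '/private/report', '/search?q=perl') {
    my $url = "https://example.com$path";
    printf "%-40s %s\n", $url, $rules->allowed($url) ? 'allowed' : 'blocked';
}
```

Disallow rules are matched as path prefixes, so `/search?q=perl` is blocked by `Disallow: /search` while `/products` stays allowed.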

Error handling

use strict;
use warnings;
use Mojo::UserAgent;
use Try::Tiny;

my $ua = Mojo::UserAgent->new(
    max_redirects   => 5,
    request_timeout => 10,
);

sub scrape_with_retry {
    my ($url, $max_retries) = @_;
    $max_retries //= 3;

    for my $attempt (1..$max_retries) {
        # A return inside try only leaves the try block,
        # so capture the block's value and return it afterwards
        my $dom = try {
            my $tx = $ua->get($url);

            if (my $err = $tx->error) {
                die "HTTP error: $err->{code} $err->{message}\n" if $err->{code};
                die "Connection error: $err->{message}\n";
            }

            $tx->res->dom;
        } catch {
            warn "Attempt $attempt failed: $_";
            if ($attempt < $max_retries) {
                sleep 2 ** $attempt;  # exponential backoff
            }
            undef;
        };

        return $dom if defined $dom;
    }

    die "Failed to scrape $url after $max_retries attempts\n";
}

my $dom = scrape_with_retry('https://example.com/');
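
One Try::Tiny detail matters for retry helpers like this: the `try` block is an anonymous sub, so a `return` inside it leaves only that block, not the enclosing function. The two hypothetical subs below, `leaky` and `fixed`, show the difference.

```perl
use strict;
use warnings;
use Try::Tiny;

# "return" inside try only leaves the try block, so the sub keeps running.
sub leaky {
    try { return 'value from try' } catch { warn $_ };
    return 'fell through';
}

# Capturing the value of try and returning it afterwards behaves as intended.
sub fixed {
    my $value = try { 'value from try' } catch { warn $_; undef };
    return $value // 'fell through';
}

print leaky(), "\n";    # prints "fell through"
print fixed(), "\n";    # prints "value from try"
```

In a retry loop, the `leaky` pattern means the function never returns its result and every attempt is made even after a success.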

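Scraping etiquette

Respect robots.txt: check whether scraping is allowed
Request interval: leave a reasonable gap between requests so you do not overload the server
User-Agent: set a bot name and contact information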

use strict;
use warnings;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
$ua->transactor->name('MyBot/1.0 (contact@example.com)');

# Leave a gap between requests
sub polite_get {
    my $url = shift;
    sleep 2;  # wait 2 seconds
    return $ua->get($url);
}
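
A fixed `sleep` before every request also waits when the previous request was long ago. One possible refinement is to sleep only for the time still missing from the interval. The sketch below uses Time::HiRes from the core distribution; the names `seconds_to_wait` and `throttle` and the 2-second interval are assumptions for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my $MIN_INTERVAL = 2;    # seconds between requests (assumed value)
my $last_request = 0;    # epoch time of the previous request

# Return how long to wait so that requests stay $MIN_INTERVAL apart.
sub seconds_to_wait {
    my ($now) = @_;
    my $elapsed = $now - $last_request;
    return $elapsed >= $MIN_INTERVAL ? 0 : $MIN_INTERVAL - $elapsed;
}

# Sleep only as long as needed, then record the request time.
sub throttle {
    my $wait = seconds_to_wait(time());
    sleep($wait) if $wait > 0;
    $last_request = time();
}

# Usage: call throttle() right before each request.
throttle();    # the first call returns immediately
printf "Wait before the next request: %.1f s\n", seconds_to_wait(time());
```

Separating the calculation from the sleeping keeps the timing logic easy to check without actually waiting.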

Caching: do not fetch the same page repeatedly
Limit concurrency: cap the number of simultaneous requests

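use Mojo::UserAgent;

# Retain at most 4 idle keep-alive connections. Note that max_connections
# does not cap the number of requests in flight; that limit has to be
# enforced by the code that issues the requests.
my $ua = Mojo::UserAgent->new(max_connections => 4);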
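
One way to cap the number of requests in flight is `Mojo::Promise->map` with its `concurrency` option, available in recent Mojolicious releases. The sketch below is an illustration under assumptions: a tiny in-process Mojolicious::Lite app with a made-up `/page/:n` route stands in for a real site so that the example runs offline. With a real site, `@urls` would hold absolute URLs and the app lines would disappear.

```perl
use strict;
use warnings;
use Mojolicious::Lite;
use Mojo::UserAgent;
use Mojo::Promise;

# Tiny in-process app standing in for a real site (hypothetical route).
get '/page/:n' => sub {
    my $c = shift;
    $c->render(text => '<title>Page ' . $c->param('n') . '</title>');
};
app->log->level('fatal');    # keep request logging quiet

my $ua = Mojo::UserAgent->new;
$ua->server->app(app);       # relative URLs are served by the app above

my @urls = map {"/page/$_"} 1 .. 6;
my @titles;

# At most 2 requests are in flight at any time.
Mojo::Promise->map({concurrency => 2}, sub { $ua->get_p($_) }, @urls)
    ->then(sub {
        # Each result is an array reference holding the transaction.
        @titles = map { $_->[0]->res->dom->at('title')->text } @_;
    })
    ->catch(sub { warn "Request failed: @_\n" })
    ->wait;

print "$_\n" for @titles;
```

The results arrive in the same order as the input URLs, which keeps the output predictable even though the requests overlap.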

Handling dynamic content

Pages rendered by JavaScript require Selenium or another headless browser:

# Uses Selenium::Remote::Driver
use strict;
use warnings;
use Selenium::Remote::Driver;
use Mojo::DOM;

my $driver = Selenium::Remote::Driver->new(
    browser_name => 'chrome',
    platform => 'ANY',
);

$driver->get('https://example.com/spa');

# Get the page content after JavaScript has run
my $source = $driver->get_page_source;
my $dom = Mojo::DOM->new($source);

# at() returns undef when the element is missing
my $heading = $dom->at('h1');
my $title   = $heading ? $heading->text : '';
print "Title: $title\n";

$driver->quit;

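Summary

Web::Scraper: intuitive scraping with CSS selectors
Mojo::UserAgent + Mojo::DOM: modern and powerful, with asynchronous support
Etiquette: robots.txt, request intervals, User-Agent settings
Error handling: retries and timeouts
Authentication: cookie-based login
Dynamic content: browser automation with Selenium

Web scraping is useful, but check the site's terms of service and take care not to overload the server. Prefer an official API whenever one is available.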