PHP modules for scraping with XPath

● PHP modules for scraping with XPath

stil/xpath-selector ( https://packagist.org/packages/stil/xpath-selector )
vdb/php-spider ( https://packagist.org/packages/vdb/php-spider )
querypath/querypath ( https://packagist.org/packages/querypath/querypath )

● Scraping with php-spider

composer require vdb/php-spider

<?php

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use VDB\Spider\StatsHandler;

require_once "./vendor/autoload.php";
$spider = new Spider('https://google.co.jp/');
// $spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div"));
$spider->getDiscovererSet()->maxDepth = 3;
$spider->getQueueManager()->maxQueueSize = 10;
// $statsHandler = new StatsHandler();
// $spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
// $spider->getDispatcher()->addSubscriber($statsHandler);
$spider->crawl();
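foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    // text() throws when nothing matches, so check that a div exists first
    $divs = $resource->getCrawler()->filterXPath('//div');
    if (count($divs) > 0) {
        echo "\n" . $divs->text() . "\n";
    }
}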

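● Installing querypath and getting a website's title and description

・1. Install querypath

Move to the directory where you want to install it (/codeigniter/application/ for CodeIgniter) and run the following from a terminal:

composer require querypath/querypath

This installs the package.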

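・2. Load querypath

If you are using CodeIgniter, set the following in config/config.php:

$config['composer_autoload'] = TRUE;

The library is then loaded automatically.

If you are using Laravel, it is loaded automatically without any extra setup.

If you are not using a framework, load it yourself: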

require_once "vendor/autoload.php";

・3. Fetch web page information with querypath (example: print the title and description of the Yahoo top page)

require_once "vendor/autoload.php";
$url = 'http://yahoo.co.jp/';
$html = file_get_contents($url);
// Parse once with the HTML5 parser and reuse the resulting object
$qp = html5qp($html);
print $qp->top()->find('title')->text();
print $qp->top()->find('meta[name=description]')->attr("content");

No.1055

01/23 12:22

Xpath
CodeIgniter

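Fixing unclosed HTML tags in PHP: PECL::Tidy

Tidy is a handy tool that repairs HTML, for example by closing tags that were left open.

When you scrape with XPath, elements cannot be selected reliably unless the HTML is well formed, so clean the markup up beforehand.

Because Tidy is a PECL extension, installing it requires server administrator privileges.

Install it with yum:

yum install php-tidy

Restart Apache:

apachectl graceful

The actual code looks like this: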

$html = '<html><body><p>Unclosed tag test</body></html>';
// Stop early when the tidy extension is not loaded
if (! extension_loaded('tidy')) {
    die('tidy is not available on this server');
}
$config = array('indent' => false,
                'output-xhtml' => true,
                'wrap' => 200);
$tidy = tidy_parse_string($html, $config, 'UTF8');
// Apply the configured clean-up and repair operations
tidy_clean_repair($tidy);
$h = tidy_get_html($tidy);
$html = $h->value;

This short piece of code is all it takes, but note that tidying a large source file uses a lot of memory and CPU time.

The options that can be set in $config are listed here:

http://tidy.sourceforge.net/docs/quickref.html

No.758

07/13 10:10

Xpath

XSLT transformation of XML in PHP

When you want to generate an HTML file from an XML file, XSLT is sometimes the quickest way.

A sample:

<?php
$xml = new DOMDocument();
$xml->load('test.xml');
$xsl = new DOMDocument();
$xsl->load('sample01.xsl');
$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);
echo $processor->transformToXml($xml);
?>

XSLT syntax

http://vosegus.org/guideline/xslt.html

Example: output ZZZZZZZZZ only when the node's text contains hoge

<xsl:if test="contains(./text() , 'hoge')">
ZZZZZZZZZ
</xsl:if>

Example: set a default value

<xsl:param name="contents">default value</xsl:param>
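
The two XSLT fragments above can be tried together in one small script. This is a minimal sketch, assuming the xsl extension is enabled; the inline documents, the items/item element names and the transform helper are made up for illustration and stand in for test.xml and sample01.xsl.

```php
<?php
// Hypothetical inline documents used instead of files on disk.
$xmlSource = <<<'XML'
<items>
  <item>hoge one</item>
  <item>fuga two</item>
</items>
XML;

$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Default value, used unless setParameter() overrides it -->
  <xsl:param name="contents">default</xsl:param>
  <xsl:template match="/items">
    <xsl:for-each select="item">
      <!-- Only items whose text contains "hoge" produce output -->
      <xsl:if test="contains(./text(), 'hoge')">
        <xsl:value-of select="concat(., ':', $contents, ';')"/>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Runs the stylesheet against the XML, optionally overriding top-level params.
$transform = static function (string $xmlSource, string $xslSource, array $params = []): string {
    $xml = new DOMDocument();
    $xml->loadXML($xmlSource);
    $xsl = new DOMDocument();
    $xsl->loadXML($xslSource);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($xsl);
    foreach ($params as $name => $value) {
        $processor->setParameter('', $name, $value);
    }
    return (string) $processor->transformToXml($xml);
};

$withDefault  = $transform($xmlSource, $xslSource);
$withOverride = $transform($xmlSource, $xslSource, ['contents' => 'custom']);

echo $withDefault, "\n", $withOverride, "\n";
```

The first call relies on the default declared by xsl:param, while the second passes a value from PHP, which is the usual reason for declaring a parameter with a default in the first place.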

No.741

04/08 10:00

Xpath
XML

Scraping (fetching web pages) with XPath in PHP, and XPath syntax examples

■ 1. First, install php-xml

yum install php-xml

■ 2. PHP code that scrapes a real site and extracts an element with XPath

This fetches a web page from test.server.com and extracts the <div id="myid"> element.

$url = 'http://test.server.com';
// In the author's experience PEAR HTTP_Client is faster than file_get_contents, but it uses more memory
require_once 'HTTP/Client.php';
$client = new HTTP_Client();
$client->get($url);
$response = $client->currentResponse();
// Parse the body as HTML, suppressing warnings caused by malformed markup
$dom = new DOMDocument();
@$dom->loadHTML($response['body']);
$xml = simplexml_import_dom($dom);
$t = $xml->xpath('id("myid")');
if (! $t) { die('xpath error'); }
print_r($t);
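
The same lookup can also be done with PHP's built-in DOMXPath class, without PEAR or SimpleXML. The following is a sketch of that alternative, not the code above: the HTML string is a made-up stand-in for a downloaded page, and the attribute test //*[@id="myid"] is used instead of id("myid").

```php
<?php
// Hypothetical HTML standing in for the body of a fetched page.
$html = '<html><body><div id="myid">target</div><div id="other">skip</div></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select every element whose id attribute equals "myid".
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[@id="myid"]');

$texts = [];
foreach ($nodes as $node) {
    $texts[] = $node->textContent;
}
print_r($texts);
```

To use it on a real page, replace $html with the downloaded body; the rest stays the same.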

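■ 3. XPath syntax examples

All elements

//*  or  /descendant::*

All div elements

//div  or  /descendant::div

The title of the HTML page

//html/head/title

All li or div elements

//*[name()='li' or name()='div' ]

div elements whose class attribute is 'hoge' (exact match)

//div[@class='hoge']  or  /descendant::div[@class='hoge']

div elements whose class attribute is 'hoge fuga' (the first form is an exact match; the second matches any class value that contains both strings)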

//div[@class='hoge fuga']
//div[contains(@class ,'hoge') and contains(@class ,'fuga')]
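
Neither form above matches a class as a whole word: the exact match fails when the order or number of classes changes, and contains() also matches 'hogehoge'. A common workaround, which is not in the original list, is to pad the attribute with spaces and search for ' hoge '. The sketch below compares the three forms on made-up HTML; the $select helper is hypothetical.

```php
<?php
// Hypothetical sample markup with three differently ordered class values.
$html = '<div class="hoge fuga">a</div>'
      . '<div class="fuga hoge big">b</div>'
      . '<div class="hogehoge fuga">c</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// Returns the text content of every node matched by $expr.
$select = static function (DOMXPath $xp, string $expr): array {
    $out = [];
    foreach ($xp->query($expr) as $node) {
        $out[] = $node->textContent;
    }
    return $out;
};

// Exact string comparison on the whole attribute value.
$exact = $select($xp, "//div[@class='hoge fuga']");
// Substring test: also matches class="hogehoge fuga".
$substring = $select($xp, "//div[contains(@class,'hoge') and contains(@class,'fuga')]");
// Whole-word test: pad with spaces so only the token "hoge" matches.
$token = $select($xp, "//div[contains(concat(' ', normalize-space(@class), ' '), ' hoge ')]");

print_r([$exact, $substring, $token]);
```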

li elements whose class attribute contains 'list' (partial match)

//li[contains(@class,'list')]

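Get the text of the node

//div[@class='hoge']/text()

Get all text under the node (this selects the node itself; its string value is the concatenated text of all its descendants)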

//div[@class='hoge']/.
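
The difference between the two expressions shows up as soon as the div contains child elements. Here is a small sketch with PHP's DOMXPath on made-up HTML: text() returns only the text nodes that are direct children, while the string value of the node includes the text of nested elements.

```php
<?php
// Hypothetical markup: a div with text on both sides of a nested element.
$html = '<div class="hoge">top<b>bold</b>tail</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// Direct child text nodes only; the text inside <b> is not included.
$direct = [];
foreach ($xp->query("//div[@class='hoge']/text()") as $textNode) {
    $direct[] = $textNode->nodeValue;
}

// String value of the element: all descendant text concatenated.
$all = $xp->evaluate("string(//div[@class='hoge']/.)");

print_r($direct);
echo $all, "\n";
```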

Elements whose id attribute is 'hoge'. Writing id('hoge') is the fastest form, but in PHP it sometimes fails to match.

id('hoge')
//*[@id='hoge']
/descendant::*[@id='hoge']

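div elements whose text is 'hogehoge' (exact match), for example <div>hogehoge</div>

//div[text()='hogehoge']

div elements whose text contains 'fuga' (partial match)

//div[contains(text(), "fuga")]

tr elements that have a th whose text is 'fuga'

//table//tr[th[text()='fuga']]

Elements whose title attribute is 'hoge' and whose class attribute is not 'fuga' (@class!='fuga' matches only elements that have a class attribute)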

//*[@title='hoge' and @class!='fuga']
/descendant::*[@title='hoge' and @class!='fuga']
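
One detail worth checking here is that @class!='fuga' is false for elements with no class attribute at all, so they are excluded. If those should be included, not(@class='fuga') is the usual alternative; that variant is an addition, not part of the original list. A sketch on made-up HTML, with a hypothetical $select helper:

```php
<?php
// Hypothetical markup: one element per case (class fuga, other class, no class).
$html = '<a title="hoge" class="fuga">1</a>'
      . '<a title="hoge" class="piyo">2</a>'
      . '<a title="hoge">3</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// Returns the text content of every node matched by $expr.
$select = static function (DOMXPath $xp, string $expr): array {
    $out = [];
    foreach ($xp->query($expr) as $node) {
        $out[] = $node->textContent;
    }
    return $out;
};

// Requires a class attribute to exist and to differ from "fuga".
$notEqual = $select($xp, "//*[@title='hoge' and @class!='fuga']");
// Also accepts elements that have no class attribute.
$negated = $select($xp, "//*[@title='hoge' and not(@class='fuga')]");

print_r([$notEqual, $negated]);
```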

The third input element in a form element

//form/descendant::input[3]  or  /descendant::form/descendant::input[3]

p elements from the fifth onward

//p[position() >=5]

The parent element of a checked checkbox

//input[@checked='checked']/..  or  //input[@checked='checked']/parent::node()

The URL of the RSS feed

//link[@rel="alternate" and @type="application/rss+xml"]/@href

Elements whose src is 'images/test.gif'

//*[@src='images/test.gif' ]

img elements whose src contains the string .gif

//img[contains(@src, '.gif')]

Firefox XPath add-on (shows the XPath on right-click)

https://addons.mozilla.org/en-US/firefox/addon/xpath-checker/

XPath syntax

http://itref.fc2web.com/xml/xpath.html

XPath specification

http://www.w3.org/TR/xpath/

No.723

12/08 14:07

Xpath