Simple PHP class for parsing markup


In the Zend Client tutorial I touched upon a little bit of code I’ve written recently for retrieving poker player statistics from various backend affiliate sites.

It’s a simple class that wraps PHP’s various regex functions in a fluent interface. It’s a far cry from the fluent strings and arrays tutorial in complexity. It has been growing over the past few days and I think it deserves its own piece, so here we go: the class from top to bottom, followed by some use cases:

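class Parser{

	// Working data: a markup string or an array of rows. Public because the use cases below read it directly.
	public $content;

	public function rArr($cont){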
		 $this->content = $cont;
		 return $this;
	}
	
	public function setContent($cont){
		return $this->rArr($cont);
	}
	
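	public function getTable($content = '', $reg = "|<table[^>]*>(.*?)</table>|si"){
		preg_match($reg, (empty($content) ? $this->content : $content), $match);
		// Fall back to an empty string when nothing matches instead of reading an undefined offset.
		return $this->rArr(isset($match[1]) ? $match[1] : '');
	}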

The rArr method can be recognized both in name and functionality from the fluent arrays and strings tutorial and setContent is just a more friendly alias. We could’ve made rArr private for some extra clarity.

GetTable() is usually the main entry point; it is the method that filters out everything in the markup we don’t need. Note the default regex: we use a non-greedy expression (the ? after the *) with the modifiers s (the dot matches newlines too, so the match can span several lines) and i (case insensitive). This means we will grab everything between the first opening table tag, no matter which attributes it has, and the first closing table tag. If there is another table nested within this table we will stop at the nested table’s end tag and not get the whole table we meant to grab.
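To see the nested table problem in action, here is a small sketch. The markup is made up for the illustration; the pattern is the default one from getTable().

```php
<?php
// Made-up markup: an outer table with a second table nested in one of its cells.
$html = '<table id="outer"><tr><td>'
      . '<table id="inner"><tr><td>42</td></tr></table>'
      . '</td></tr></table>';

// Default pattern from getTable(): lazy quantifier, s = dot matches newlines, i = case insensitive.
preg_match('|<table[^>]*>(.*?)</table>|si', $html, $match);

// The capture starts after the OUTER opening tag but ends at the INNER closing tag.
$captured = $match[1];
echo $captured, "\n";
```

The captured string still contains the opening tag of the inner table but none of the outer table's remaining markup, so when tables are nested a more specific pattern (for instance one keyed on the table id, as in the first use case) is the safer choice.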

public function getRows($reg = '|<tr[^>]*>|i', $content = ''){
		$result = preg_split($reg, (empty($content) ? $this->content : $content));
		return $this->rArr($result);
	}
	
	public function getColumns($reg = "|<td[^>]*>(.*?)</td>|si"){
		$rarr = array();
		foreach($this->content as $player){
			preg_match_all($reg, $player, $matches);
			$rarr[] = $matches[1];
		}
		return $this->rArr($rarr);
	}

These two are typically chained directly after getTable(). We split on every opening tr tag, no matter which attributes it has, so $this->content will now be an array of strings, each presumably holding the td elements of one row. That’s why we can call getColumns() directly after getRows() to get the contents inside each td.
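A minimal sketch of what the two steps do to a table body, written as plain function calls instead of the fluent chain. The markup and the player names are invented for the example; the two patterns are the defaults from getRows() and getColumns().

```php
<?php
// Invented table body: one header row using th, two data rows using td.
$table = '<tr class="head"><th>Player</th><th>Rake</th></tr>'
       . '<tr><td>sussexbuffet</td><td>12.50</td></tr>'
       . '<tr class="odd"><td>anotherplayer</td><td>3.10</td></tr>';

// Step 1 (getRows): split on every opening tr tag.
// The first element is whatever came before the first <tr>, here an empty string.
$rows = preg_split('|<tr[^>]*>|i', $table);

// Step 2 (getColumns): collect the contents of every td in each row.
$columns = array();
foreach ($rows as $row) {
    preg_match_all('|<td[^>]*>(.*?)</td>|si', $row, $matches);
    $columns[] = $matches[1]; // empty array when the row has no td at all
}

print_r($columns);
```

Note that the leading fragment and the th based header row both end up as empty sub arrays, which is exactly the kind of leftover that shift() and clean() are there to remove.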

public function shift(){
		array_shift($this->content);
		return $this;
	}
	
	public function sum($bykey){
		$rarr = array();
		foreach($this->content as $player){
			$cur_val = $player[ $bykey ];
			if(!isset($rarr[ $cur_val ]))
				$rarr[ $cur_val ] = $player;
			else{
				foreach($player as $key => $value){
					if(is_numeric($value))
						$rarr[ $cur_val ][ $key ] += $value;
				}
			}
		}
		return $this->rArr($rarr);
	}

Shift() is a good one to get rid of table headers that will make up a first row containing nothing or bullshit.

Sum() will merge all sub arrays that have the same value in a certain key, for instance all sub arrays whose username key is sussexbuffet. All sub arrays matching that criterion are merged into one, and all their numerical values are added together. I use this one when a service returns several rows per player, because I need one row of totals per player.
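The grouping can be sketched as a standalone function. The name sumByKey and the sample rows are hypothetical; the logic follows sum() above.

```php
<?php
// Hypothetical standalone version of the grouping done by sum().
function sumByKey(array $rows, $byKey) {
    $result = array();
    foreach ($rows as $row) {
        $group = $row[$byKey];
        if (!isset($result[$group])) {
            // First row for this group: keep it as the starting point.
            $result[$group] = $row;
            continue;
        }
        // Later rows: add every numeric field onto the stored row.
        foreach ($row as $key => $value) {
            if (is_numeric($value)) {
                $result[$group][$key] += $value;
            }
        }
    }
    return $result;
}

// Invented sample data: sussexbuffet appears twice.
$rows = array(
    array('username' => 'sussexbuffet',  'hands' => 100, 'rake' => 12.5),
    array('username' => 'anotherplayer', 'hands' => 40,  'rake' => 3.0),
    array('username' => 'sussexbuffet',  'hands' => 50,  'rake' => 7.5),
);

$summed = sumByKey($rows, 'username');
print_r($summed);
```

One thing to keep in mind: the result is keyed by the grouping value, and non-numeric fields keep whatever the first row of the group contained.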

public function convertChar($from, $to, $avoidKeys = array()){
		foreach($this->content as &$player){
			foreach($player as $key => &$value){
				if(!in_array($key, $avoidKeys))
					$value = str_replace($from, $to, $value);
			}
		}
		return $this;
	}
	
	public function removeHtml($reg){
		$rarr = array();
		foreach($this->content as $player){
			$rarr[] = trim(preg_replace($reg, '', $player));
		}
		return $this->rArr($rarr);
	}

Here we have two cleaning up functions, one that will convert one character into another in all fields in all sub arrays (players). An optional array containing field names to avoid can be used to spare certain fields from the conversion.

The other one, removeHtml(), is incorrectly named when you think about it. It just uses a regular expression to remove something from every item in the content, which at this point in the chain is a row string. Note how preg_replace copes with arrays too: in the use cases further down we pass an array of patterns and it will loop through it, applying each pattern in turn to the string.
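Here is a small sketch of that behaviour outside the class. The two row strings are invented; the patterns are the ones used in the first use case.

```php
<?php
// Invented row strings, as they might look after getRows().
$rows = array(
    ' <td><span class="name">sussexbuffet</span></td> ',
    '<td><span>12.50</span></td>',
);

// An array of patterns: preg_replace applies each one in turn to the subject.
$patterns = array('|<span[^>]*>|', '|</span>|');

$cleaned = array();
foreach ($rows as $row) {
    // Same steps as removeHtml(): strip the matches, then trim the row.
    $cleaned[] = trim(preg_replace($patterns, '', $row));
}

print_r($cleaned);
```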

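public function getInputValue($name, $content){
		// Quote the name so characters such as $ in .NET control names are matched literally.
		$name = preg_quote($name, '|');
		if(!preg_match("|<input( [^>]*name=['\"]{$name}['\"][^>]*)>|", $content, $match))
			return null;
		return $this->getMatch("|value=['\"](.*?)['\"]|", $match[1]);
	}

	public function getMatch($reg, $content){
		// Return null instead of reading an undefined offset when nothing matches.
		return preg_match($reg, $content, $match) ? $match[1] : null;
	}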

The first method here is an important one; I use it to retrieve certain values out of .NET forms to be able to submit them properly. It’s a pain in the ass but this one makes life easier, and there is an example of how it can be done among the use cases further down. We perform two matches: the first one gets the whole attribute area of the input tag with a certain name, then we use that match to get the value we want.
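The two step match can be tried out on its own. This is only a sketch: the function name inputValue, the form markup and the control name ctl00$Main$code are all made up, and I quote the name with preg_quote so that a $ in a control name is matched literally.

```php
<?php
// Hypothetical standalone version of the two step lookup.
function inputValue($name, $html) {
    $quoted = preg_quote($name, '|');
    // Step 1: grab the attribute area of the input tag that carries this name.
    if (!preg_match("|<input( [^>]*name=['\"]{$quoted}['\"][^>]*)>|", $html, $tag)) {
        return null;
    }
    // Step 2: pull the value attribute out of that attribute area.
    if (!preg_match("|value=['\"](.*?)['\"]|", $tag[1], $value)) {
        return null;
    }
    return $value[1];
}

// Made-up form markup in the style of a .NET page.
$form = '<form>'
      . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTI2" />'
      . '<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgK1" />'
      . '<input type="text" name="ctl00$Main$code" value="7" />'
      . '</form>';

$viewState  = inputValue('__VIEWSTATE', $form);
$validation = inputValue('__EVENTVALIDATION', $form);
$code       = inputValue('ctl00$Main$code', $form);
$missing    = inputValue('doesNotExist', $form);

var_dump($viewState, $validation, $code, $missing);
```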

GetMatch() is just a simple wrapper around preg_match.

public function getTag($tag, $content, $end_tag = ''){
		$end_tag = empty($end_tag) ? $tag : $end_tag;
		return $this->getMatch("|<$tag>(.+?)</$end_tag>|si", $content);
	}
	
	public function getOptionValue($content, $label, $modifier = ''){
		return $this->getAttrValue($content, 'option', 'value', $label, $modifier);
	}
	
	public function getAttrValue($content, $tag, $attr_name, $tag_content = '[^<]*', $modifier = ''){
		return $this->getMatch("|<$tag [^>]*$attr_name=['\"](.*?)['\"][^>]*>$tag_content</$tag>|$modifier", $content);
	}

These are some convenience functions, of which getAttrValue() is the most complex one. As opposed to the getInputValue method, it gets the value of the attribute straight away with a single match.
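A quick sketch of the single match approach, using the same pattern as getAttrValue() in a hypothetical standalone function called attrValue. The select markup and the room names are invented.

```php
<?php
// Hypothetical standalone version of getAttrValue(), same pattern as above.
function attrValue($html, $tag, $attr, $tagContent = '[^<]*', $modifier = '') {
    $pattern = "|<{$tag} [^>]*{$attr}=['\"](.*?)['\"][^>]*>{$tagContent}</{$tag}>|{$modifier}";
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

// Invented markup, one option per line.
$select = implode("\n", array(
    '<select name="room">',
    '<option value="12">Alpha Poker</option>',
    '<option value="34">Beta Poker</option>',
    '</select>',
));

// Look up the value attribute of the option with a given label.
$beta  = attrValue($select, 'option', 'value', 'Beta Poker');
$alpha = attrValue($select, 'option', 'value', 'Alpha Poker');
$none  = attrValue($select, 'option', 'value', 'Gamma Poker');

var_dump($beta, $alpha, $none);
```

A caveat worth knowing about: the (.*?) group is not limited to the inside of the quotes, so when several options sit on the same line (or the s modifier is passed) a lookup for a later label can start matching at an earlier option and capture markup from in between. With one option per line and no s modifier, as in the sketch, the dot cannot cross the line break and the lookup behaves as intended.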

public function getContent(){
		return $this->content;
	}
	
	public function pop(){
		array_pop($this->content);
		return $this;
	}
	
	public function trim(){
		foreach($this->content as &$player){
			foreach($player as &$field)
				$field = trim($field);
		}
		return $this;
	}
	
	public function addTo($add, $to){
		foreach($this->content as &$player){
			$player[ $to ] += trim($player[ $add ]);
		}
		return $this;
	}

A simple getter and two wrappers, pop() around array_pop and trim() around PHP’s trim for every field. Then we have addTo(), which in every sub array adds the value stored under the key $add to the value stored under the key $to.

public function rmDollarTrim($value){
		return trim(str_replace('$', '', $value));
	}
	
	public function isNumeric($value){
		return is_numeric( str_replace(',', '.', $this->rmDollarTrim($value)) );
	}

	public function clean(){
		$rarr = array();
		foreach($this->content as $player){
			$ok = false;
			foreach($player as $field){
				if($this->isNumeric($field))
					$ok = true;
			}
			if($ok)
				$rarr[] = $player;
		}
		return $this->rArr($rarr);
	}
	
	public function keepWhere($keep_key, $keep_value){
		$rarr = array();
		foreach($this->content as $player){
			if(is_array($keep_value)){
				if(in_array($player[ $keep_key ], $keep_value))
					$rarr[] = $player;
			}else if($player[ $keep_key ] == $keep_value)
				$rarr[] = $player;
		}
		return $this->rArr($rarr);
	}

Since we know that all player rows contain at least one numerical value, clean() can easily throw away the junk sub arrays (empty rows, headers and the like) that might be left after the whole getTable()->getRows()->getColumns() chain has done its magic. It keeps only the sub arrays in which at least one field passes isNumeric().
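The filtering can be sketched with two standalone functions. The names looksNumeric and keepRowsWithNumbers and the sample rows are hypothetical; the checks mirror isNumeric() and clean().

```php
<?php
// Hypothetical standalone version of isNumeric(): strip $, trim, treat a comma as the decimal point.
function looksNumeric($value) {
    $stripped = trim(str_replace('$', '', $value));
    return is_numeric(str_replace(',', '.', $stripped));
}

// Hypothetical standalone version of clean(): keep rows with at least one numeric field.
function keepRowsWithNumbers(array $rows) {
    $kept = array();
    foreach ($rows as $row) {
        foreach ($row as $field) {
            if (looksNumeric($field)) {
                $kept[] = $row;
                break; // one numeric field is enough
            }
        }
    }
    return $kept;
}

// Invented leftovers from a getRows()/getColumns() pass.
$rows = array(
    array(),                            // fragment before the first row
    array('Player', 'Rake'),            // header cells
    array('sussexbuffet', ' $12,50 '),  // a real player row
    array('Totals', ''),                // footer without numbers
);

$players = keepRowsWithNumbers($rows);
print_r($players);
```

Because every comma is turned into a dot, an amount with both a thousands separator and a decimal point, such as 1,234.50, does not count as numeric in this check.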

Finally we have keepWhere() which is a good way of filtering out players belonging to unwanted poker rooms when a service returns a mix of players.
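As a sketch, the same filter as a hypothetical standalone function, with invented room and player names. Casting the wanted value to an array lets one code path cover both the single value and the list case; the comparison stays loose, as in the method above.

```php
<?php
// Hypothetical standalone version of keepWhere().
function keepWhere(array $rows, $key, $wanted) {
    $wanted = (array) $wanted; // a single value becomes a one element list
    $kept = array();
    foreach ($rows as $row) {
        if (in_array($row[$key], $wanted)) {
            $kept[] = $row;
        }
    }
    return $kept;
}

// Invented mix of players from three rooms; column 0 holds the room.
$rows = array(
    array('Alpha Poker', 'sussexbuffet', '12.50'),
    array('Beta Poker',  'anotherplayer', '3.10'),
    array('Gamma Poker', 'thirdplayer', '8.00'),
);

$oneRoom  = keepWhere($rows, 0, 'Beta Poker');
$twoRooms = keepWhere($rows, 0, array('Alpha Poker', 'Gamma Poker'));

print_r($oneRoom);
print_r($twoRooms);
```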

Let’s take a look at a few examples, starting with a .NET based service. Most examples use Zend Client to access pages:

$stats_form = $client->request('GET');
		
		$client->setParameterPost(array(
			'__VIEWSTATE' 				=> $this->getInputValue('__VIEWSTATE', $stats_form),
			'__EVENTVALIDATION' 		=> $this->getInputValue('__EVENTVALIDATION', $stats_form),
			'ctl00$ContentPlaceHolder1$affiliateCodesDDL' => 4565
		));
		
		$response = $client->request('POST');

		return $this->getTable($response->getBody(), '|<table[^>]*id="ctl00_ContentPlaceHolder1_subPlayersGridView"[^>]*>(.*?)</table>|s')
		->getRows()->removeHtml(array('|<span[^>]*>|', '|</span>|'))
		->getColumns()->convertChar('$', '')->clean()
		->trim()->keepWhere(0, array('affiliate', 4565))->getStats(array(1, 3));

Note the use of getInputValue to get the hidden control fields __VIEWSTATE and __EVENTVALIDATION. The getStats() call that ends the chain is not part of the class listing above.

while($ok){
			$post_arr['player'] = $player_num;
			$client->setParameterPost($post_arr);
			$response = $client->request('POST');
			$temp = $this->getTable($response->getBody(), "|<table width='100%' rules='all' border='0' style='border:solid 1px black;' cellspacing='1' cellpadding='3'>(.*?)</table>|s")
				->getRows()->removeHtml(array('|&nbsp;|'))
				->getColumns()->convertChar('$', '')->clean()
				->trim()->content;
			
			if(empty($temp) || in_array($temp[0], $players))
				$ok = false;
			else
				$players = array_merge($players, $temp);
				
			$player_num++;
		}

Here is an interesting one where we have to loop through a paginated result. One wonders why the hell the result is paginated at all, given that the affiliate in question might be interested in pulling all of the data. Anyway, we shall overcome, and we did.
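The stop condition of that loop can be tried without any HTTP traffic. In this sketch fetchPage is a made-up stand-in for the request plus parsing, and I assume a backend that keeps serving the last page when asked for a page that doesn't exist, which is the situation the in_array check guards against.

```php
<?php
// Made-up stand-in for "request the page and parse it into rows".
function fetchPage($page) {
    $pages = array(
        1 => array(array('a', '1'), array('b', '2')),
        2 => array(array('c', '3')),
    );
    // Assumption: out of range page numbers return the last page again.
    return isset($pages[$page]) ? $pages[$page] : $pages[2];
}

$players = array();
$page = 1;
while (true) {
    $rows = fetchPage($page);
    // Stop on an empty page, or when the first row has been seen before.
    if (empty($rows) || in_array($rows[0], $players)) {
        break;
    }
    $players = array_merge($players, $rows);
    $page++;
}

print_r($players);
```

Checking for an empty page before looking at the first row keeps the loop from reading an offset that isn't there.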

return $this->getRows('|<data[^>]*>|s', $client->getContent())
->getColumns("|<[^>]*>([^>]*)</[^>]*>|si")
->keepWhere($this->keep_key, $this->keep_value)
->sum(26)->getStats(array(26, 8));

This is an example of some homegrown broken XML; we easily handle that too.

And that’s that, grab the script here.
