HTML Strip of comments, whitespace and crap

As you develop your super-high-power web site, you may find that over the years pages become saturated with comments and formatting whitespace, added to make the source easier for you to read. Well, that padding can account for an extra 20% of transmission cost, even with gzip compression. So here are some PHP routines that I wrote for StudentsReview to strip the unnecessary content while respecting scripts, stylesheets, and necessary whitespace.

Basically, after we “protect” the script and style blocks, there are space-ignoring tags (like <tr> …spaces… <td>) whose surrounding whitespace can be cleaned out. As always, there’s room to eke out a little more, but this takes care of 98% of it.

In PHP:

// (c) Beracah Yankama for StudentsReview
// beracah mit edu  beracah@studentsreview.com
// www.StudentsReview.com

// Grab the buffer  (make sure to start capture with ob_start)
$buffer = ob_get_contents();

// Stop buffering
ob_end_clean();

// now clean the html of excess whitespace and comments.
// (can greatly reduce space/traffic usage)

// first protect our style/script <!--
$buffer = preg_replace("/(<(style|script)[^>]*>)\s*<!--/is","\\1 P<<!!--P",$buffer);
// what we really want to do is remove and protect all style & script, and reinsert at end.

$buffer = preg_replace('/<!--(.+?)-->/s',"",$buffer);   // this will strip all style/script in <!-- tags...

// restore the script.
$buffer = preg_replace("/P<<!!--P/is","<!-- ",$buffer);

$buffer = preg_replace("/[\n\r][\n\r]+/s","\n",$buffer);

// try to remove excess space between tags
$buffer = preg_replace('/>[\n\r\t\s]+</',"> <",$buffer);

// remove excess space within tag content
$buffer = preg_replace('/(<[^>]+>)[\n\r\t\s]+([^\n\r\t]+)[\n\r\t\s]+(<\/[^>]+>)/',"$1 $2 $3",$buffer);

// space ignoring tags
$spc_ign_tags = "body|div|form|head|html|input|link|meta|option|script|select|style|table|td|title|tr|textarea";
$buffer = preg_replace("/(<\/?($spc_ign_tags)[^>]*>)[\\n\\r\\t\\s]+(<\/?($spc_ign_tags)[^>]*>)/i","$1$3",$buffer);
// run the same replacement a second time: a tag consumed as the right-hand side of one
// match cannot also begin the next match, so a second pass catches those. (two passes are enough.)
$buffer = preg_replace("/(<\/?($spc_ign_tags)[^>]*>)[\\n\\r\\t\\s]+(<\/?($spc_ign_tags)[^>]*>)/i","$1$3",$buffer);

// send to user before we add variable persistence.
echo $buffer;
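
To make the steps above easier to try out, here is a minimal sketch that wraps the same regexes in a function and runs them on a tiny sample. The function name strip_html() and the sample markup are assumptions for illustration, not part of the StudentsReview code.

```php
<?php
// strip_html() is a hypothetical wrapper name; the regexes are copied from the listing above.
function strip_html(string $buffer): string
{
    // Protect the "<!--" that opens a commented-out script/style body.
    $buffer = preg_replace('/(<(style|script)[^>]*>)\s*<!--/is', '$1 P<<!!--P', $buffer);
    // Strip every remaining HTML comment, then restore the protected openers.
    $buffer = preg_replace('/<!--(.+?)-->/s', '', $buffer);
    $buffer = preg_replace('/P<<!!--P/', '<!-- ', $buffer);

    // Collapse runs of newlines, then whitespace between and around tags.
    $buffer = preg_replace('/[\n\r][\n\r]+/', "\n", $buffer);
    $buffer = preg_replace('/>[\n\r\t\s]+</', '> <', $buffer);
    $buffer = preg_replace('/(<[^>]+>)[\n\r\t\s]+([^\n\r\t]+)[\n\r\t\s]+(<\/[^>]+>)/', '$1 $2 $3', $buffer);

    // Drop whitespace between space-ignoring tags; two passes, as in the original.
    $tags = 'body|div|form|head|html|input|link|meta|option|script|select|style|table|td|title|tr|textarea';
    $pattern = "/(<\/?($tags)[^>]*>)[\\n\\r\\t\\s]+(<\/?($tags)[^>]*>)/i";
    $buffer = preg_replace($pattern, '$1$3', $buffer);
    $buffer = preg_replace($pattern, '$1$3', $buffer);

    return $buffer;
}

// Hypothetical sample page.
$sample = "<html>\n<body>\n  <!-- nav -->\n  <table>\n    <tr>\n      <td>Hi</td>\n    </tr>\n  </table>\n</body>\n</html>";
echo strip_html($sample), "\n";
// prints: <html><body><table><tr><td>Hi</td></tr></table></body></html>
```

Since ob_start() accepts an output callback, something like ob_start('strip_html') at the top of a page may be a tidier way to hook this in than grabbing the buffer by hand, though I have only sketched that idea here.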

Alternatively, you can use our Apache mod_perl output filter, Stripper.pm (listed below).

Anyway, there’s lots of code online about stripping whitespace from HTML, but little of it embeds knowledge of “whitespace respecting” renderable tags. For instance, a whole line of whitespace _____________ can be removed, but what about all those tabs and spaces between tags that never appear on screen?
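
As a rough illustration of that distinction, the sketch below applies the space-ignoring-tag regex to two made-up fragments: table markup, where the whitespace never renders, and inline markup, where the single space is visible and must stay. The fragments and the shortened tag list are assumptions chosen for the example.

```php
<?php
// Subset of the space-ignoring tag list, for illustration only.
$spc_ign_tags = 'table|tr|td';
$pattern = "/(<\/?($spc_ign_tags)[^>]*>)[\\n\\r\\t\\s]+(<\/?($spc_ign_tags)[^>]*>)/i";

// Hypothetical sample fragments.
$table  = "<tr>\n\t<td>a</td>\n\t<td>b</td>\n</tr>";   // whitespace is never rendered
$inline = "<b>bold</b> <i>italic</i>";                  // the space is rendered

$table_out  = preg_replace($pattern, '$1$3', $table);
$inline_out = preg_replace($pattern, '$1$3', $inline);

echo $table_out, "\n";   // <tr><td>a</td><td>b</td></tr>
echo $inline_out, "\n";  // <b>bold</b> <i>italic</i>
```

The inline fragment comes back untouched because neither <b> nor <i> is in the tag list, which is the whole point of keeping a list rather than stripping blindly.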

Well, I’ve been using this code (with a million visitors a month, over 30,000 pages) without incident and with good results. It strips about 12K off of my 64K pages (saving about 5GB on our 30GB page cache), and then after compression, saves 1-2K in network transmission (2-3 packets). It also only adds about 3-5ms per page to run (tested with ab, ApacheBench).
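
If you want to check figures like these on your own pages, a small helper along the following lines could report the raw saving and, where the zlib extension is present, the gzip-compressed sizes. The helper name size_report() is hypothetical, and the demo input is just filler sized to match the 64K and 12K numbers quoted above, not real page data.

```php
<?php
// size_report() is a hypothetical helper, not part of the original routines.
function size_report(string $before, string $after): array
{
    $report = [
        'raw_before' => strlen($before),
        'raw_after'  => strlen($after),
    ];
    // Percentage of raw bytes removed by stripping.
    $report['raw_saved_pct'] = $report['raw_before'] > 0
        ? round(100 * (1 - $report['raw_after'] / $report['raw_before']), 2)
        : 0.0;

    // gzencode() comes from the zlib extension; skip the compressed sizes if it is missing.
    if (function_exists('gzencode')) {
        $report['gz_before'] = strlen(gzencode($before));
        $report['gz_after']  = strlen(gzencode($after));
    }
    return $report;
}

// Filler strings sized like the figures in the text: a 64K page with 12K stripped.
$r = size_report(str_repeat('x', 64 * 1024), str_repeat('x', 52 * 1024));
echo $r['raw_saved_pct'], "%\n";   // 18.75%
```

In practice you would pass the page before and after stripping instead of filler; the compressed sizes depend entirely on the content, so no numbers are claimed for them here.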

package Stripper;
# use Apache::Constants qw(:common);

#use strict;
#use warnings;

use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Filter ();
use Apache2::Const -compile => qw(OK);

use constant BUFFER => 4096;

sub handler {
my $f = shift;
my $buffer = '';

#       $buffer = $f->content_type();

while( $f->read(my $buf, BUFFER) ) {

# collect everything available in this filter call; stripping happens below.
$buffer .= $buf;
}

# we really need to test headers for text/html, but this should protect pngs/other binary data
if (is_binary_data($buffer))
{
# $buffer = "BINARY" . $buffer;
$f->print($buffer);
return( Apache2::Const::OK );
}

# can't figure out how to exclude ad codes in Apache, so we'll do it here:
# just don't waste time on content that fits in one packet.
if (length($buffer) < 1000)
{
$f->print($buffer);
return( Apache2::Const::OK );
}

# don't operate on non-html; I think this is a problem because html has an output filter
# but other files just read the filesize and send
unless ( $buffer =~ m/<(html|iframe|ilayer|table|script)[^>]*>/is )
{
$f->print($buffer);
return( Apache2::Const::OK );
}

# convert script/style within this chunk.
# not perfect: a comment that crosses a chunk boundary is not handled.
# first protect our style/script <!--

$buffer =~ s/(<(style|script)[^>]*>)\s*<!--/$1 P<<!!--P/gis;
#  what we really want to do is remove and protect all style & script, and reinsert at end.

$buffer =~ s/<!--(.+?)-->//gs;   # this will strip all style/script in <!-- tags...

# restore the script.
$buffer =~ s/P<<!!--P/<!-- /gis;

# get rid of excess newlines...
$buffer =~ s/[\n\r][\n\r]+/\n/gs;

# try to remove excess space between tags
$buffer =~ s/>[\n\r\t\s]+</> </g;

# remove excess space within tag content
$buffer =~ s/(<[^>]+>)[\n\r\t\s]+([^\n\r\t]+)[\n\r\t\s]+(<\/[^>]+>)/$1 $2 $3/gs;

# clean out un-renderable whitespace
my $spc_ign_tags = q^body|div|form|head|html|input|link|meta|option|script|select|style|table|td|title|tr|textarea^;

#       $f->print($spc_ign_tags);

$buffer =~ s/(<\/?($spc_ign_tags)[^>]*>)[\n\r\t\s]+(<\/?($spc_ign_tags)[^>]*>)/$1$3/gis;
pos($buffer) = 0;
$buffer =~ s/(<\/?($spc_ign_tags)[^>]*>)[\n\r\t\s]+(<\/?($spc_ign_tags)[^>]*>)/$1$3/gis;

#       $buffer =~ s/<\/?($spc_ign_tags)[^>]*>/<...noway...>/gis;

$f->print($buffer);

#       $f->print("hey2\n");

return( Apache2::Const::OK );
}

sub is_binary_data {
local $_ = shift;
return 1 if not length;

(tr/ -~//c / length) >= .3;
}

1;
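
For anyone who wants the same binary-data guard in the PHP version, here is a minimal sketch of how the Perl is_binary_data() heuristic might be ported: it treats the input as binary when at least 30% of its bytes fall outside the printable ASCII range (space through tilde). The PHP function is an assumption for illustration; the original PHP routine has no such check.

```php
<?php
// Hypothetical PHP counterpart of the Perl is_binary_data() above.
function is_binary_data(string $data): bool
{
    $len = strlen($data);
    if ($len === 0) {
        return true;   // the Perl version also returns true for empty input
    }
    // Count bytes outside ' ' (0x20) .. '~' (0x7E); note that newlines and tabs count too.
    $non_printable = preg_match_all('/[^ -~]/', $data);
    return ($non_printable / $len) >= 0.3;
}

var_dump(is_binary_data('<html><body>hello</body></html>'));  // bool(false)
var_dump(is_binary_data("\x89PNG\r\n\x1a\n\x00\x00"));         // bool(true)
```

Like the Perl original, this is only a heuristic: a page that is mostly newlines and tabs, or mostly non-ASCII text, would be classified as binary and passed through unstripped.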

Add the following lines to your Apache mod_perl.conf file (perhaps in /etc/apache2/modules.d/75_mod_perl.conf):

PerlModule Stripper
<Files ~ "\.(php3?|s?html)" >
PerlOutputFilterHandler Stripper
</Files>