Introduction

Regular expressions are a powerful tool for processing incoming data. Any problem that demands searching or replacing text can be solved with this "language inside a language". And though regular expressions deliver their maximum effect in server-side languages, there is no reason to undervalue what this tool can do on the client side.

Basic concepts

A regular expression is a tool for processing strings and sequences of characters; it defines a text pattern.

A modifier adjusts how a regular expression is applied.

Metacharacters are special characters that serve as the commands of the regular expression language.

A regular expression is created like an ordinary variable, except that slashes are used instead of quotes, e.g.: var reg=/reg_expression/

Simple patterns are patterns that do not require any special characters.

Let us say our task is to change every letter "r" (both capital and lowercase) to the capital letter "R" in the phrase "Regular expressions".

We create the template var reg=/r/ and apply it with the replace method:

<script language="JavaScript">
var str="Regular expressions"
var reg=/r/
var result=str.replace(reg, "R")
document.write(result)
</script>

As a result we receive "RegulaR expressions": the change was made only to the first occurrence of the letter "r", because the search is case-sensitive.
But this result does not satisfy the conditions of our task. Here we need the modifiers "g" and "i", which may be used separately or together. These modifiers are placed after the closing slash of the regular expression. They have the following meanings:

The modifier "g" makes the search "global", that is to say, in our case the replacement happens for all occurrences of the letter "r". Now the template looks this way: var reg=/r/g. If we insert it into our code:

<script language="JavaScript">
var str="Regular expressions"
var reg=/r/g
var result=str.replace(reg, "R")
document.write(result)
</script>

So we receive "RegulaR expRessions".

The modifier "i" makes the search case-insensitive, so if we add this modifier to our template, var reg=/r/gi, after the script runs we receive the result required by our task: "RegulaR expRessions".

Special symbols (Metacharacters)

Metacharacters specify the type of characters to be matched, the context of the match in the text, and the number of characters of a given type. Metacharacters can therefore be divided into three groups:

  • Matching metacharacters.
  • Quantified metacharacters.
  • Position metacharacters.

    Matching metacharacters:

    \b word boundary: the pattern matches only at the beginning or end of a word.

    \B not a word boundary: the pattern matches only away from the beginning or end of a word.

    \d digit from 0 to 9.

    \D not a digit.

    \s a single whitespace character, such as a space.

    \S a single non-whitespace character, i.e. any character other than whitespace.

    \w a letter, digit, or underscore.

    \W any character other than a letter, digit, or underscore.

    . any character: letters, digits, punctuation, etc.

    [ ] a character set: the pattern matches any one of the characters placed inside the square brackets.

    [^ ] a negated character set: the pattern matches any character not placed inside the square brackets.
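As a quick sketch of character sets in action (the sample words are purely illustrative), one pattern can cover two alternative spellings:

```javascript
// [ae] matches a single "a" or "e", so one pattern covers both spellings
var str = "grey gray";
var reg = /gr[ae]y/g;
var result = str.match(reg); // finds both "grey" and "gray"
```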

    Quantified metacharacters:

    * Zero or more times.

    ? Zero or one time

    + One or more times.

    {n} exactly n times.

    {n,} n or more times.

    {n,m} at least n but not more than m times.
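The quantifiers above combine naturally with the character classes; a small sketch (the order-number format is invented for illustration) that demands an exact number of digits:

```javascript
// {2} and {4} demand exactly two and exactly four digits in a row
var reg = /\d{2}-\d{4}/;
var ok = reg.test("Order 12-2024");   // two digits, dash, four digits: matches
var bad = reg.test("Order 1-24");     // too few digits in a row: no match
```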

    Position metacharacters:

    ^ at the line beginning.

    $ at the end of line.
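A brief sketch of the position metacharacters (the sample strings are made up):

```javascript
// ^ anchors the match to the beginning of the string, $ to its end
var startsWithJava = /^Java/.test("JavaScript");   // true
var startsMidway   = /^Java/.test("I like Java");  // false: "Java" is not at the start
var endsWithScript = /Script$/.test("JavaScript"); // true
```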

    Some methods for working with patterns

    replace — we have already used this method at the beginning of the article; it searches for a match and replaces the found substring with a new one.

    test — this method checks whether the string contains a match for the pattern; it returns false if the comparison fails, otherwise true.

    For example:

    <script language="JavaScript">
    var str="JavaScript"
    var reg=/PHP/
    var result=reg.test(str)
    document.write(result)
    </script>

    As a result, false is displayed, as the string "JavaScript" does not match the pattern "PHP".

    The result of the test method may also be used to choose between any two values the programmer assigns to the true and false cases.

    For example:

    <script language="JavaScript">
    var str="JavaScript"
    var reg=/PHP/
    var result=reg.test(str) ? "Line coincided" : "Line did not coincide"
    document.write(result)
    </script>

    In this case the result displayed will be "Line did not coincide".

    exec — this method matches the string against the pattern. If the match fails, the value null is returned. Otherwise the result is an array of substrings corresponding to the given pattern. /*The first element of the array will be equal to the part of the initial string that matches the whole template*/

    for example:

    <script language="JavaScript">
    var reg=/(\d+)\.(\d+)\.(\d+)/
    var arr=reg.exec("I was born on 15.09.1980")
    document.write("Date of birth: ", arr[0], "<br>")
    document.write("Day of birth: ", arr[1], "<br>")
    document.write("Month of birth: ", arr[2], "<br>")
    document.write("Year of birth: ", arr[3], "<br>")
    </script>

    As a result we receive four lines:

    Date of birth: 15.09.1980
    Day of birth: 15
    Month of birth: 09
    Year of birth: 1980

    Conclusion

    This article describes far from all the abilities and advantages of regular expressions.
    For deeper study of this matter I'd recommend learning the RegExp object.
    Also I'd like to point out that the syntax of regular expressions is the same in JavaScript and PHP.
    For example, if you need to check the correctness of an e-mail input, the regular expression looks the same
    way in JavaScript and PHP: /[0-9a-z_]+@[0-9a-z_.]+\.[a-z]{2,3}/i.
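As a sanity check of such an e-mail pattern (the addresses below are invented, and the dot before the top-level domain is escaped here), one might run:

```javascript
// A simplified e-mail check: name, "@", domain, dot, 2-3 letter TLD
var email = /[0-9a-z_]+@[0-9a-z_.]+\.[a-z]{2,3}/i;
var good = email.test("john_doe@example.com"); // matches
var bad = email.test("not-an-email");          // no "@", fails
```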

General description

Regular expressions are patterns used to search for given character combinations in text strings (this search is referred to as matching against the pattern). There are two ways to assign a regular expression to a variable, namely:

  • Usage of the object initializer: var re = /pattern/switch?.
  • Usage of the RegExp constructor:
    var re = new RegExp("pattern"[,"switch"]?).
    // Here pattern is the regular expression and switch holds optional search options.

    Object initializers, e.g. var re = /ab+c/, are to be applied in cases when the value of the regular expression stays constant for the life of the script. Such regular expressions are compiled during script loading, so they execute faster.
    The constructor call, e.g. var re = new RegExp("ab+c"), is to be used in cases when the value of the variable is going to change. If you intend to use the regular expression several times, it makes sense to compile it with the compile method for more efficient matching.
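A brief sketch of why the constructor form is useful when the pattern is only known at run time (the strings here are illustrative):

```javascript
var word = "script";                  // imagine this value comes from user input
var re = new RegExp(word, "gi");      // built at run time; equivalent to /script/gi
var count = "JavaScript script".match(re).length; // finds "Script" and "script"
```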

    When creating a regular expression, take into account that putting it into quotes implies the need to use escape sequences, as in any other string constant.

    For example, the two following expressions are equivalent:

    var re = /\w+/g;
    var re = new RegExp("\\w+", "g"); // In the string literal, "\w" must be written "\\w"

    Note: a regular expression cannot be empty: two // characters in a row begin a comment. So in order to create an empty regular expression, use /.?/.

    Regular expressions are used by the 'exec' and 'test' methods of the RegExp object and by the 'match', 'replace', 'search' and 'split' methods of the String object. If we need just to check whether a given string contains a substring matching the pattern, the 'test' or 'search' methods are used. But if we need to extract the substring (or substrings) matching the pattern, we need to use the 'exec' or 'match' methods. The 'replace' method searches for a matching substring and changes it into another string, and the 'split' method allows breaking a string into several substrings, based on a regular expression or an ordinary text string. Detailed information about the use of regular expressions is given in the descriptions of the corresponding methods.
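To make this division of labor concrete, here is a short sketch using 'split', 'search', and 'match' on a made-up string:

```javascript
var s = "red,green,blue";
var parts = s.split(/,/);    // break the string on commas: ["red", "green", "blue"]
var pos = s.search(/green/); // index of the first match: 4
var found = s.match(/blue/); // array whose first element is the matched text
```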

    Syntax of regular expressions

    A regular expression may consist of ordinary characters; in this case it matches the given combination of characters in a string. For example, the expression /com/ matches a substring of each of the following strings: «comet», «income», «incompatible». However, the flexibility and power of regular expressions comes from the special characters listed in the following table.

    Special symbols in regular expressions:

    \ - For characters that are usually interpreted literally, it means that the next character is special. For example, /n/ matches the letter n, while /\n/ matches the line feed character.
    For characters that are usually interpreted as special, it means that the character is to be interpreted literally. For example, /^/ means the beginning of the string, while /\^/ simply matches the character ^. /\\/ matches a backslash.

    ^ - Matches the beginning of the string.

    $ - Matches the end of the string.

    * - Matches the preceding character repeated zero or more times.

    + - Matches the preceding character repeated one or more times.

    ? - Matches the preceding character repeated zero or one time.

    . - Matches any character except the newline character.

    (pattern) - Matches the string pattern and memorizes the found match.

    (?:pattern) - Matches the string 'pattern' but does not memorize the found match. It is used for grouping parts of a pattern, e.g. /ca(?:t|ttle)/ is a short way of writing the expression /cat|cattle/.

    (?=pattern) - A lookahead match: succeeds if the string matches 'pattern' at this position, without memorizing the match. For example, /Windows (?=95|98|NT|2000)/ matches "Windows " in "Windows 98", but not in "Windows 3.1". After a match, the search continues from the position right after the matched text; the lookahead text is not consumed.

    (?!pattern) - A negative lookahead match: succeeds if the string does not match 'pattern' at this position, without memorizing anything. For example, /Windows (?!95|98|NT|2000)/ matches "Windows " in "Windows 3.1", but not in "Windows 98". After a match, the search continues from the position right after the matched text; the lookahead text is not consumed.

    x|y - Matches either x or y.

    {n} - n is a nonnegative number. Matches exactly n occurrences of the preceding character.

    {n,} - n is a nonnegative number. Matches n or more occurrences of the preceding character. /x{1,}/ is equivalent to /x+/. /x{0,}/ is equivalent to /x*/.

    {n,m} - n and m are nonnegative numbers. Matches at least n but not more than m occurrences of the preceding character. /x{0,1}/ is equivalent to /x?/.

    [xyz] - Matches any one of the characters placed inside the square brackets.

    [^xyz] - Matches any character except the ones placed inside the square brackets.

    [a-z] - Matches any character in the indicated range.

    [^a-z] - Matches any character except those in the indicated range.

    \b - Matches a word boundary, i.e. the position between a word and a space or line break.

    \B - Matches any position other than a word boundary.

    \cX - Matches the control character Ctrl+X. E.g., /\cI/ is equivalent to /\t/.

    \d - Matches a digit. Equivalent to [0-9].

    \D - Matches a non-digit character. Equivalent to [^0-9].

    \f - Matches the form feed character (FF).

    \n - Matches the line feed character (LF).

    \r - Matches the carriage return character (CR).

    \s - Matches a whitespace character. Equivalent to /[ \f\n\r\t\v]/.

    \S - Matches any non-whitespace character. Equivalent to /[^ \f\n\r\t\v]/.

    \t - Matches the tab character (HT).

    \v - Matches the vertical tab character (VT).

    \w - Matches a Latin letter, digit, or underscore. Equivalent to /[A-Za-z0-9_]/.

    \W - Matches any character except a Latin letter, digit, or underscore. Equivalent to /[^A-Za-z0-9_]/.

    \n - n is a positive number. Matches the n-th memorized substring, counted by left parentheses. If the number of left parentheses is less than n, it is equivalent to \0n.

    \0n - n is an octal number less than 377. Matches the character with octal code n. E.g., /\011/ is equivalent to /\t/.

    \xn - n is a two-digit hex number. Matches the character with hex code n. E.g., /\x31/ matches /1/.

    \un - n is a four-digit hex number. Matches the Unicode character with hex code n. For example, /\u00A9/ is equivalent to /©/.
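The lookahead constructs from the table above can be tried out with a small sketch (the strings are invented for illustration):

```javascript
// (?=...) requires the following context without consuming it; (?!...) forbids it
var price = "Costs 100$".match(/\d+(?=\$)/);        // digits only when "$" follows
var modern = /Windows (?!3\.1)/.test("Windows 98"); // true: not followed by "3.1"
var old = /Windows (?!3\.1)/.test("Windows 3.1");   // false
```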

    Regular expressions are evaluated the same way as other JavaScript expressions, that is, taking operator precedence into account: operations with higher priority are performed first. If operations have equal priority, they are performed from left to right. In the following table the operations of regular expressions are listed in descending order of priority. Operations on the same line have equal priority.

    Operations:

    \
    () (?:) (?=) (?!) []
    * + ? . {n} {n,} {n,m}
    ^ $ \metacharacter
    |

    Search options:

    When creating a regular expression we can indicate additional search options:

    * i (ignore case). Do not distinguish between lowercase and capital letters.

    * g (global search). Search for all occurrences of the pattern.

    * m (multiline). Multi-line search.

    * Any combination of these three options, e.g. ig or gim.

    Let us give a few examples. Regular expressions distinguish lowercase and capital letters. So the following script

    var s = "Learning JavaScript language";
    var re = /JAVA/;
    var result = re.test(s) ? " corresponds to " : " mismatches ";
    document.write("Line \"" + s + "\"" + result + re + " sample");

    displays the following text on the screen:

    Line "Learning JavaScript language" mismatches /JAVA/ sample

    Now if we change the second line of the example to var re = /JAVA/i;, the following text appears on the screen:

    Line "Learning JavaScript language" corresponds to /JAVA/i sample

    Now let's analyze the global search option. Usually it is used with the 'replace' method to search for a pattern and change the found substring to a new one. The thing is, by default this method changes only the first found substring and returns the resulting string. Let's examine the following script:

    var s = "We write scripts in JavaScript, " +
    "but JavaScript is not a unique script language.";
    var re = /JavaScript/;
    document.write(s.replace(re, “VBScript”));

    It displays text which certainly does not match the desired result:

    We write scripts in VBScript, but JavaScript is not a unique script language.

    In order to change all occurrences of the string "JavaScript" to
    "VBScript", we need to change the regular expression to var
    re = /JavaScript/g;. The resulting line looks as follows:

    We write scripts in VBScript, but VBScript is not a unique script language.

    Finally, the multi-line search option allows matching the pattern against text spanning several lines joined by line break characters. By default, matching against the pattern stops when a line break character is found. This option removes that limitation and searches for the pattern throughout the whole input string. It also influences the interpretation of some special characters in regular expressions, namely:

    * Usually the ^ character matches only at the beginning of the string. When the multi-line search option is on, it also matches at any position right after a line break character.

    * Usually the $ character matches only at the end of the string. When the multi-line search option is on, it also matches at any position right before a line break character.
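A compact sketch of the effect of the m option on ^ (the text is illustrative):

```javascript
var text = "first line\nsecond line";
var withoutM = text.match(/^\w+/g);  // ^ matches only at the very beginning
var withM = text.match(/^\w+/gm);    // ^ also matches right after each line break
```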

    Memorizing found substrings

    If part of a regular expression is put in parentheses, the corresponding substring is memorized for further use. To access the memorized substrings, use the properties $1, …, $9 of the RegExp object or the elements of the array returned by the exec and match methods. In the latter case the number of found and memorized substrings is unlimited.

    For example, the following script uses the replace method to swap the words in a line. The properties $1 and $2 refer to the found text.

    var re = /(\w+)\s(\w+)/;
    var str = “Regular Expressions”;
    document.write(str.replace(re, "$2, $1"))

    This script will display the following text on the screen:

    Expressions, Regular
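The same memorized substrings are available through the array returned by exec, as this small sketch shows:

```javascript
var re = /(\w+)\s(\w+)/;
var m = re.exec("Regular Expressions");
// m[0] is the whole match; m[1] and m[2] are the memorized substrings
var swapped = m[2] + ", " + m[1];
```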

    Read Part 2

Sent by Brian Strassberg

One of the best things about Firefox and Thunderbird is that they have a well defined extension mechanism. If there’s some feature you feel is completely missing, you can go ahead and add it. It’s relatively easy to do — you don’t have to fiddle about with a C compiler because extensions are mostly written in a combination of XML and ECMAScript.

I’ve recently been getting up to speed with the mechanics of writing an extension for Firefox and Thunderbird. I thought it might be a good idea to share what I’m learning in my blog. Hopefully the information might be useful to others trying to learn how to write Mozilla extensions. In this blog, I’ll take a look at installing a really simple extension into Firefox that adds a “Hello World” menu item to the Tools menu.

Hello World item in the Firefox Tools Menu

It doesn’t do anything useful yet, but over subsequent blogs, I’ll introduce more advanced and useful techniques.

Creating contents.rdf

contents.rdf is a Resource Description Framework (RDF) file that describes the contents of an extension. RDF is an XML grammar that provides a data model that can be easily processed by an application. You don’t need to know much about RDF to write simple extensions, but if you’re interested, you can get more information at the W3C.

To get started, create a directory called content. As its name suggests, this will contain the main content of your extension. contents.rdf should live inside this directory. You should end up with a directory structure something like this:

c:\myextensions
+- helloworld
   +- content
      +- contents.rdf

Here’s the code for contents.rdf.

<?xml version="1.0"?>
<RDF:RDF xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:chrome="http://www.mozilla.org/rdf/chrome#">  

  <RDF:Seq about="urn:mozilla:package:root">
    <RDF:li resource="urn:mozilla:package:helloworld"/>
  </RDF:Seq> 

  <RDF:Description about="urn:mozilla:package:helloworld"
    chrome:displayName="Hello World"
    chrome:author="Brian Duff"
    chrome:authorURL="http://modev.dubh.org/helloworld"
    chrome:name="helloworld"
    chrome:extension="true"
    chrome:description="A simple demonstration firefox extension.">
  </RDF:Description>

  <RDF:Seq about="urn:mozilla:overlays">
    <RDF:li resource="chrome://browser/content/browser.xul"/>
  </RDF:Seq>

  <RDF:Seq about="chrome://browser/content/browser.xul">
    <RDF:li>chrome://helloworld/content/helloworld-Overlay.xul</RDF:li>
  </RDF:Seq>

</RDF:RDF>

The important parts of the file are the parts that would usually change for each extension. First, we provide a package for our extension. This just distinguishes it from other extensions. For our simple extension, we choose the package helloworld:

<RDF:Seq about="urn:mozilla:package:root">
  <RDF:li resource="urn:mozilla:package:helloworld"/>
</RDF:Seq>

Next, we provide a description of our extension:

<RDF:Description about="urn:mozilla:package:helloworld"
  chrome:displayName="Hello World"
  chrome:author="Brian Duff"
  chrome:authorURL="http://modev.dubh.org/helloworld"
  chrome:name="helloworld"
  chrome:extension="true"
  chrome:description="A simple demonstration firefox extension.">
</RDF:Description>

Next, we tell mozilla which parts of the product we want to extend. All of the user interface elements of Firefox and Thunderbird are described in a user interface definition language called XUL. Collectively, these user interface components are known as “chrome”. You can extend most of the user visible parts of the two products. In this case, we want to extend the main browser interface of Firefox, defined in chrome://browser/content/browser.xul.
Here’s how:

<RDF:Seq about="urn:mozilla:overlays">
  <RDF:li resource="chrome://browser/content/browser.xul"/>
</RDF:Seq>

Now we’ve described what we want to extend, we have to provide a XUL file that actually installs our custom user interface into the browser window. We will define this later in a separate XML file called helloworld-Overlay.xul. We must tell mozilla where this file is and what it extends:

<RDF:Seq about="chrome://browser/content/browser.xul">
  <RDF:li>chrome://helloworld/content/helloworld-Overlay.xul</RDF:li>
</RDF:Seq>

We’ve completed the first step of creating an extension. The next task is to describe the user interface elements we want to install into the main window.

Overlaying User Interface Elements

XUL is the XML User interface Language. XUL provides a mechanism called dynamic overlays that allows you to modify the user interface elements of a window or control without having to modify the original XUL files. This way, the definition of extension user interface is de-coupled from the XUL files used to define the main interface elements in Firefox and Thunderbird.

We start off by creating helloworld-Overlay.xul. This should live in the same directory as contents.rdf.

<?xml version="1.0"?> 

<overlay id="helloworldOverlay"
  xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"> 

  <menupopup id="menu_ToolsPopup">
    <menuitem label="Hello World" position="1" />
  </menupopup> 

</overlay>

This simple overlay installs a "Hello World" menu item at the top of the Tools menu. Our menu item is defined inside a menupopup element. In XUL, a menupopup represents a container of menu items, for example a popup menu or the drop-down of a main menu. Here, we put our item inside a menupopup with the id menu_ToolsPopup. This id, menu_ToolsPopup, is predefined in browser.xul and corresponds to the drop-down part of the Tools menu.
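As written, the menu item does nothing when selected. A minimal way to give it behavior is an oncommand handler; the alert call below is purely illustrative and not part of the original example:

```xml
<?xml version="1.0"?>
<overlay id="helloworldOverlay"
  xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul">

  <menupopup id="menu_ToolsPopup">
    <!-- oncommand runs the given script when the item is activated -->
    <menuitem label="Hello World" position="1"
              oncommand="alert('Hello World!');" />
  </menupopup>

</overlay>
```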

Creating an Install Manifest

Recent versions of Firefox and Thunderbird have a new extension manager which can be used to easily install and manage extensions. To tell this extension manager about your extension, you must write another RDF file called install.rdf. This file should be at the same level in the directory tree as the contents directory, i.e. you should have a structure like this:

c:\myextensions
+- helloworld
   +- install.rdf
   +- content
      +- contents.rdf
      +- helloworld-Overlay.xul

Here's the code for the install manifest:

<?xml version="1.0"?> 

<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:em="http://www.mozilla.org/2004/em-rdf#"> 

  <Description about="urn:mozilla:install-manifest"> 

    <em:name>Hello World</em:name>
    <em:id>{12a1584b-2123-473d-8752-e82e74e3cb1b}</em:id>
    <em:version>0.1</em:version> 

    <em:targetApplication>
      <Description>
        <em:id>{ec8030f7-c20a-464f-9b0e-13a3a9e97384}</em:id>
        <em:minVersion>0.9</em:minVersion>
        <em:maxVersion>1.0</em:maxVersion>
      </Description>
    </em:targetApplication>  

    <em:file>
      <Description about="urn:mozilla:extension:file:helloworld.jar">
        <em:package>content/</em:package>
      </Description>
    </em:file> 

  </Description> 

</RDF>

First, we provide a description of the extension. This information will be displayed in Firefox’s extension manager:

<em:name>Hello World</em:name>
<em:id>{12a1584b-2123-473d-8752-e82e74e3cb1b}</em:id>
<em:version>0.1</em:version>

The content of the em:id element is a globally unique identifier (GUID). This is used to distinguish your extension from every other extension. When writing your own extensions, you should always generate a new GUID for each distinct extension you write. Andy Hoskinson provides a GUID generator web service you can use for this.

Next, we describe which application we are extending:

<em:targetApplication>
  <Description>
    <em:id>{ec8030f7-c20a-464f-9b0e-13a3a9e97384}</em:id>
    <em:minVersion>0.9</em:minVersion>
    <em:maxVersion>1.0</em:maxVersion>
  </Description>
</em:targetApplication>

Each extensible application released by Mozilla has its own GUID: you must specify the correct one here for the application you are extending. In this example, we’re using the GUID for Firefox. We also describe the minimum and maximum versions of Firefox this extension will work with.

Finally, we tell the extension manager which files to install. Later, I’ll describe how we package the extension up so that it can be automatically installed into Firefox:

<em:file>
  <Description about="urn:mozilla:extension:file:helloworld.jar">
    <em:package>content/</em:package>
  </Description>
</em:file>

Packaging an Extension Installer

Our extension is now complete. But to make it easy for users to install the extension, we should package it up in such a way that Firefox can install it easily. To do this, we bundle our files up into an XPI (Cross platform installer) file. An XPI file is just a normal zip file with files organized in a special way. Our XPI file should contain the following structure:

helloworld.xpi
+- install.rdf
+- chrome/
   +- helloworld.jar

helloworld.jar is another zip file containing all the files we created in the content directory:

helloworld.jar
  +- content/
     +- contents.rdf
     +- helloworld-Overlay.xul

You can easily use this information to create helloworld.xpi using your favorite zip tool. Alternatively, you can use a build utility like Ant to assemble the files correctly. Here’s an example Ant buildfile I used:

<?xml version="1.0"?> 

<project name="helloworld" default="createxpi"> 

  <target name="createjar">
    <zip destfile="helloworld.jar" basedir="."
         includes="content/**" />
  </target> 

  <target name="createxpi" depends="createjar">
    <zip destfile="helloworld.xpi">
      <zipfileset dir="." includes="helloworld.jar"
                  prefix="chrome" />
           <zipfileset dir="." includes="install.rdf" />
         </zip>
  </target> 

</project>

If you create this buildfile named build.xml and place it at the same level as install.rdf, you will be able to just run ant in this directory to generate your .xpi. Once you have an .xpi file, it can be easily installed using File->Open… in Firefox.

Hello World item in the Firefox Extension manager

Download the source code (24KB ZIP file).

Links:

Top 10 Firefox Extensions to Improve your Productivity
Top Firefox 2 config tweaks
20 must-have Firefox extensions
How to Make Firefox Look Exactly Like Internet Explorer

http://google-code-updates.blogspot.com/

Though you may think of us as simply a company with a big search index, Google uses MySQL, the open source relational database, in some of the applications that we build that are not search related.

We think MySQL is a fantastic data storage solution, and as our projects push the requirements for the database in certain areas, we’ve made changes to enhance MySQL itself, mainly in the areas of high availability and manageability.

We would love for some of these changes to be merged into the official MySQL release, but until then we felt strongly that everyone should have access to them, so we have released the changes under the GPL license for the MySQL community to use and review.

What have we added and enhanced?

The high availability features include support for semi-synchronous replication, mirroring the binlog from a master to a slave, quickly promoting a slave to a master during failover, and keeping InnoDB and replication state on a slave consistent during crash recovery.

The manageability features include new SQL statements for monitoring resource usage by table and account. This includes the ability to count the number of rows fetched or changed per account or per table. It also includes the number of seconds of database time an account uses to execute SQL commands.

More details:

  • SemiSyncReplication - block commit on a master until at least one slave acknowledges receipt of all replication events.
  • MirroredBinlogs - maintain a copy of the master’s binlog on a slave
  • TransactionalReplication - make InnoDB and slave replication state consistent during crash recovery
  • UserTableMonitoring - monitor and report database activity per account and table
  • InnodbAsyncIo - support multiple background IO threads for InnoDB
  • FastMasterPromotion - promote a slave to a master without restart

The current patches are for version 4 of MySQL, with version 5 support coming shortly.

We look forward to hearing from the large MySQL community.

Original Post: Beware of XHTML

If you’re a web developer, you’ve probably heard about XHTML, the markup language developed in 1999 to implement HTML as an XML format. Most people who use and promote XHTML do so because they think it’s the newest and hottest thing, and they may have heard of some (usually false) benefits here and there. But there is a lot more to it than you may realize, and if you’re using it on your website, even if it validates, you are probably using it incorrectly.

I should make it clear that I hope XHTML has a bright future on the Web. That is precisely why I have written this article. The state of XHTML on the Web today is more broken than the state of HTML, and most people don’t realize because the major browsers aren’t even treating those pages like real XHTML. If you hope for XHTML to succeed on the Web, you should read this article carefully.

Some of the issues discussed in this article are complicated and technical. If you find it difficult to follow, I suggest at least taking a look at the myths of XHTML, examples of latent compatibility issues, and the list of standards-related XHTML sites that break when treated properly.

Some quotes from prominent people/vendors:

Microsoft (Internet Explorer):
“If we tried to support real XHTML in IE 7 we would have ended up using our existing HTML parser (which is focused on compatibility) and hacking in XML constructs. It is highly unlikely we could support XHTML well in this way”
Mozilla (Firefox):
“If you are using the usual HTML features […] serving valid HTML 4.01 as text/html ensures the widest browser and search engine support.”
Apple (Safari):
“On today’s web, the best thing to do is to make your document HTML4 all the way. Full XHTML processing is not an option, so the best choice is to stick consistently with HTML4.”
Håkon Wium Lie (from Opera, W3C):
“I don’t think XHTML is a realistic option for the masses. HTML5 is it.”
Anne van Kesteren (from Opera):
“I’m an advocate of using XHTML only in the correct way, which basically means you have to use HTML. Period.”
Ian Hickson (from Opera, Google, W3C):
“Authors intending their work for public consumption should stick to HTML 4.01”

 

Table of Contents

  1. What is XHTML?
  2. Myths of XHTML
  3. Benefits of XML
  4. Content type is everything
  5. HTML compatibility guidelines
  6. Internet Explorer incompatibility
  7. Content negotiation
  8. Null End Tags (NET)
  9. Firefox and other problems
  10. Conclusion
  11. List of standards-related sites that break as XHTML
  12. List of standards-related sites that stick with HTML
  13. Related sites
  14. See also

 

What is XHTML?

Up

XHTML is a markup language hoped to eventually (in the distant future) replace HTML on the Web. For the most part, an XHTML 1.0 document differs from an HTML 4.01 document only in the lexical and syntactic rules: HTML is written in its own unique subset of SGML, while XHTML is written in a different subset of SGML called XML. SGML subsets are differentiated by the sets of characters that delimit tags and other constructs, whether or not certain types of shorthand markup may be used (such as minimized attributes, omitted start/end tags, etc.), whether or not tag names or character entities are case sensitive, and so on.

The Document Type Definition (DTD, which is referenced by the doctype) then defines which elements, attributes, and character entities exist in the language and where the elements may be in the document. The DTDs of XHTML 1.0 and HTML 4.01 are nearly identical, meaning that, as far as things like elements and attributes go, XHTML 1.0 and HTML 4.01 are basically the same language. The only added benefit of XHTML is that it uses XML’s subset of SGML and shares the benefits XML has over HTML’s subset.

 

Myths of XHTML


There are many false benefits of XHTML promoted on the Web. Let’s clear up some of them at a glance (with details and other pitfalls provided later):

  • XHTML does not promote separation of content and presentation any more than HTML does. XHTML has all of the same elements and attributes (including presentational ones) that HTML has, and it doesn’t offer any additional CSS features. Semantic markup and separation of content and presentation is absolutely possible in HTML and is equally easy.
  • Most XHTML pages on the Web are not parsed as XML by today’s web browsers. The vast majority of XHTML pages on the Web cannot be parsed as XML. Even many valid XHTML pages cannot be parsed as XML. See the Validity and Well-Formedness article for details and examples.
  • HTML is not deprecated and is not being phased out at this time. In fact, the World Wide Web Consortium recently renewed the HTML working group which is working to develop HTML 5.
  • XHTML does not have good browser support. Most browsers simply treat XHTML pages as regular HTML (which presents a number of problems). Some major browsers like Firefox, Opera, and Safari may attempt to handle the page as proper XHTML, but usually only if you include a certain special HTTP header. However, when you do so, Internet Explorer and a number of other user agents will choke on it and won’t display a page at all. Even when handled as XHTML, the supporting browsers have a number of additional bugs.
  • Browsers do not parse valid XHTML dramatically faster than valid HTML, even when they’re parsing XHTML correctly. Although the browser can lose certain shorthand logic, it now has to use extra logic to confirm that the document is well-formed. Although XHTML, when parsed with an XML parser, may be somewhat faster to parse than typical HTML, the difference usually isn’t very significant. And either way, download speed is usually the bottleneck when it comes to document parsing, so users won’t notice any speed improvement.
  • XHTML is not extensible if you hope to support Internet Explorer or the number of other user agents that can’t parse XHTML as XML. They will handle the document as HTML and you will have no extensibility benefit.
  • XHTML source does not necessarily look much different from HTML source. If you prefer making sure all of your non-empty elements have close tags, you may use close tags in HTML, too. The only real markup differences between an HTML document and an XHTML document following the legacy compatibility guidelines are the doctype, html element, and the /> tag ends (which are just XML shorthand constructs like so many people claim to dislike about HTML).

 

Benefits of XML


XML has a number of improvements over HTML’s subset of SGML:

  • Although HTML’s subset allowed for a lot of shorthand markup and other flexibility, it proved too difficult to write a correct and fully-featured parser for it. As a result, most user agents, including all of today’s major web browsers, make many technically unsound assumptions about the lexical format of HTML documents and don’t support a number of shorthand features like Null End Tags (<tag/Content/), unclosed start/end tags (<tag<tag>), and empty tags (<>). XML was designed to eliminate these extra features and restrict documents to a tight set of rules that are more straight-forward for user agents to implement. In effect, XML defines the assumptions that user agents are allowed to make, while still resulting in a file that a theoretical fully-featured SGML user agent could parse once pointed to XML’s SGML declaration.

    It should be noted that an XML parser for the most part is not dramatically easier to write than the level of HTML support offered by most HTML parsers. Most of the features that would make HTML more difficult to write a parser for, such as custom SGML declarations, additional marked sections, and most of the shorthand constructs, have negligible use on the Web anyway and generally have poor or absent support in major web browsers. The most significant difference is XML’s lack of support for omitted start and end tags, which in theory could amount to complicated logic in HTML for elements not defined as empty. Even still, most browsers have those rules hard-coded rather than derived from the DTD, so this isn’t a major difference in difficulty either.
  • To minimize the occurrence of nasty surprises when parsing the document, XML user agents are told to not be flexible with error handling: if a user agent comes upon a problem in the XML document, it will simply give up trying to read it. Instead, the user will be presented with a simple parse error message instead of the webpage. This eliminates the compatibility issues with incorrectly-written markup and browser-specific error handling methods by requiring documents to be “well-formed”, while giving webpage authors immediate indication of the problem. This does, however, mean that a single minor issue like an unescaped ampersand (&) in a URL would cause the entire page to fail, and so most of today’s public web applications can’t safely be incorporated in a true XHTML page.

    While user agents are supposed to fail on any page that isn’t well-formed (in other words, one that doesn’t follow the generic XML grammar rules), they do not have to fail on a page that is well-formed but invalid. For example, although it is invalid to have a span element as an immediate child of the body element, most XML-supporting web browsers won’t provide indication of the error because the page is still well-formed — that is, the DTD is violated, but not the fundamental rules of XML itself. Some user agents may choose to be “validating” agents and will also fail on validity errors, but they aren’t common.

    Despite popular assumption, even if an XML page is perfectly valid, it still might not be well-formed.
  • Unlike HTML’s subset, which was specifically made for HTML, XML is a common subset used in many different languages. This means that a single simple parser can easily be written to support a number of different languages. It also paved the way for the Namespaces in XML standard which allows multiple documents in different XML formats to be combined in a single XML document, so that you can have, for example, an XHTML page that contains one or more SVG images that use MathML inside them.

 

Content type is everything


When your website sends a document to the visitor’s browser, it adds on a special content type header that lets the browser know what kind of document it’s dealing with. For example, a PNG image has the content type image/png and a CSS file has the content type text/css. HTML documents have the content type text/html. Web servers typically send this content type whenever the file extension is .html, and server-side scripting languages like PHP also typically send documents as text/html by default.

XHTML does not have the same content type as HTML. The proper content type for XHTML is application/xhtml+xml. Currently, many web servers don’t have this content type reserved for any file extension, so you would need to modify the server configuration files or use a server-side scripting language to send the header manually. Simply specifying the content type in a meta element will not work over HTTP.
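As a sketch of the server-side approach, the header could be sent with a small script. This example uses Python's standard library purely for illustration (the handler name and page content are my own, not from the article); the point is that the parsing mode is decided by the HTTP header, which no meta element can override:

```python
# Illustrative sketch: serving a page with the application/xhtml+xml
# content type using Python's standard http.server module.
from http.server import BaseHTTPRequestHandler

# A minimal well-formed XHTML document (hypothetical sample content).
XHTML_PAGE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Test</title></head>'
    '<body><p>Served as XML.</p></body></html>'
)

class XhtmlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = XHTML_PAGE.encode("utf-8")
        self.send_response(200)
        # The HTTP content type, not the doctype, decides the parsing mode.
        self.send_header("Content-Type",
                         "application/xhtml+xml; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Remember that any page served this way must be well-formed XML, or supporting browsers will show a parse error instead of the page.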

When a web browser sees the text/html content type, regardless of what the doctype says, it automatically assumes that it’s dealing with plain old HTML. Therefore, rather than using the XML parsing engine, it treats the document like tag soup, expecting HTML content. Because HTML 4.01 and simple XHTML 1.0 are often very similar, the browser can still understand the page fairly well. Most major browsers consider things like the self-closing portion of a tag (as in <br />) as a simple HTML error and strip it out, usually ending up with the HTML equivalent of what the author intended.
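The tag soup/XML split can be demonstrated with any pair of forgiving and strict parsers. Here is a small illustration using Python's standard library (the markup sample is my own): the HTML parser shrugs off an unclosed element, while an XML parser gives up entirely.

```python
# Sketch: the same sloppy markup through a forgiving HTML parser and a
# strict XML parser, mirroring the text/html vs. XML parsing split.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TagCollector(HTMLParser):
    """Collects tag names, tolerating tag-soup input like a browser."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
    def handle_startendtag(self, tag, attrs):  # e.g. <br />
        self.tags.append(tag)

soup = "<p>Hello<br />world"   # unclosed p element: fine as tag soup
collector = TagCollector()
collector.feed(soup)           # no error; tags are simply collected

try:
    ET.fromstring(soup)        # an XML parser refuses the same input
    well_formed = True
except ET.ParseError:
    well_formed = False
```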

However, when the document is treated like HTML, you get none of the benefits XHTML offers. The browser won’t understand other XML formats like MathML and SVG that are included in the document, and it won’t do the automatic validation that XML parsers do. In order for the document to be treated properly, the server would need to send the application/xhtml+xml content type.

The problems go deeper. Comment markers are sometimes handled differently depending on the content type, and when you enclose the contents of a script or style element with basic SGML-style comments, it will cause your script and style information to be completely ignored when the document is treated like XML. Also, any special markup characters used in the inline contents of a style or script element will be parsed as markup instead of being treated as character data like in HTML. To solve these problems, you must use an elaborate escape sequence described in the article Escaping Style and Script Data, and even then there are situations in which it won’t work.

Furthermore, the CSS and DOM specifications have special provisions for HTML that don’t apply to XHTML when it’s treated as XML, so your page may look and behave in unexpected ways. The most common problem is a white gap around your page if you have a background on the body, no background on the html element, and any kind of spacing between the elements, such as a margin, padding, or a body height under 100% (browsers typically have some combination of these by default). In scripting, tag names are returned differently and document.write() doesn’t work in XHTML treated as XML. Table structure in the DOM is different between the two parsing modes. These are only a select few of the many differences.

The following are some examples of differing behavior between XHTML treated as HTML and XHTML treated as XML. The anticipated results are based on the way Internet Explorer, Firefox, and Opera treat XHTML served as HTML. Some other browsers are known to behave differently. Also note that Internet Explorer doesn’t recognize the application/xhtml+xml content type (see below for an explanation), so it will not be able to view the examples in the second column.

Served as text/html | Served as application/xhtml+xml
Example 1 | Example 1
Example 2 | Example 2
Example 3 | Example 3
Example 4 | Example 4
Example 5 | Example 5
Example 6 | Example 6
Example 7 | Example 7
Example 8 | Example 8
Example 9 | Example 9
Example 10 | Example 10

 

HTML compatibility guidelines


When the XHTML 1.0 specification was first written, there were provisions that allowed an XHTML document to be sent as text/html as long as certain compatibility guidelines were followed. The idea was to ease migration to the new format without breaking old user agents. However, these provisions are now viewed by many as a mistake. The whole point of XHTML is to be an XML alternative to HTML, yet due to the allowance of XHTML documents to be sent as text/html, most so-called XHTML documents on the Web now would break if they were treated like XML (see the real-world examples below). Aware of the problem, the W3C had these provisions removed in the first revision of the XHTML specification. In XHTML 1.1 and onward, the W3C now clearly says that an XHTML document should not be sent as text/html. XHTML should be sent as application/xhtml+xml or one of the more elaborate XHTML content types.

 

Internet Explorer incompatibility


Internet Explorer does not support XHTML. Like other web browsers, when a document is sent as text/html, it treats the document as if it was a poorly constructed HTML document. However, when the document is sent as application/xhtml+xml, Internet Explorer won’t recognize it as a webpage; instead, it will simply present the user with a download dialog. This issue still exists in Internet Explorer 7.

Although all other major web browsers, including Firefox, Opera, Safari, and Konqueror, support XHTML, the lack of support in Internet Explorer, as well as in major search engines and web applications, makes its use strongly discouraged.

 

Content negotiation


Content negotiation is the idea of sending different content depending on what the user agent supports. Many sites attempt to send XHTML as application/xhtml+xml to those who support it, and either XHTML as text/html or real HTML to those who don’t.

There are two methods generally used to determine what the user agent supports, both based on the Accept HTTP header. Most often, sites use the incorrect method: they simply look for the string “application/xhtml+xml” in the header value. Some sites use the correct method: they actually parse the header value, supporting wildcards and ordering by q value.

Unfortunately, neither of these methods works reliably.

The first method doesn’t work because not all XHTML-supporting user agents actually have the text “application/xhtml+xml” in the Accept header. Safari and Konqueror are two such browsers. The application/xhtml+xml content type is implied by a wildcard value instead. Meanwhile, not all HTML-supporting user agents have “text/html” in the header. Internet Explorer, for example, doesn’t mention this content type. Like Safari and Konqueror, it implies this support by using a wildcard. Even among those user agents that support XHTML and mention application/xhtml+xml in the header, it may have a lower q value than text/html (or a matching wildcard), which implies that the user agent actually prefers text/html (in other words, its XHTML support may be experimental or broken).

The second method (the correct, 100% standards-compliant one) doesn’t work because most major browsers have inaccurate Accept headers:

  • Firefox 2 and below have application/xhtml+xml listed with a higher q value than text/html, even though Mozilla has posted an official recommendation on its site saying that websites should use text/html for these versions if they can, for reasons described below.
  • Internet Explorer doesn’t list either text/html or application/xhtml+xml in its Accept header. Instead, both content types are covered by a single wildcard value (which implies that every content type in existence is supported equally well, which is obviously untrue). So Internet Explorer is saying that it supports both text/html and application/xhtml+xml equally, even though it actually doesn’t support application/xhtml+xml at all. In the case that a user agent claims to support both equally, the site is supposed to use its own preference. A possible workaround is for the site to “prefer” sending text/html or, in a toss-up situation, only send application/xhtml+xml if it’s actually mentioned explicitly in the header. However…
  • Safari and Konqueror, which support XHTML, also give text/html and application/xhtml+xml the same q value (in fact, like Internet Explorer, they also claim to support everything in existence equally well). But they don’t mention application/xhtml+xml explicitly — it’s implied by a wildcard. So if you use the above workaround, Safari and Konqueror will receive text/html even though they really do support application/xhtml+xml.

As disappointing as it may be, content negotiation simply isn’t a reliable approach to this problem.
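To make the second method concrete, here is a minimal sketch of a q-value-aware check. The function name and the play-it-safe wildcard policy (only send XHTML when it is mentioned explicitly and preferred) are my own assumptions, not a standard algorithm:

```python
# Sketch of "correct" content negotiation: parse the Accept header,
# honor q values, and only prefer application/xhtml+xml when the
# client lists it explicitly with a higher q value than text/html.
def prefers_xhtml(accept_header):
    """Return True only if application/xhtml+xml is explicitly listed
    and outranks text/html (or a matching wildcard) by q value."""
    prefs = {}
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip()
        q = 1.0  # the default q value when none is given
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs[media_type] = q

    xhtml_q = prefs.get("application/xhtml+xml")
    if xhtml_q is None:
        return False  # only a wildcard match: play it safe, send HTML
    html_q = prefs.get("text/html", prefs.get("*/*", 0.0))
    return xhtml_q > html_q
```

Note that even this careful version mis-serves Firefox 2, whose Accept header ranks application/xhtml+xml above text/html despite Mozilla's own recommendation to the contrary.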

 

Null End Tags (NET)


In XHTML, all elements are required to be closed, either by an end tag or by adding a slash to the start tag to make it self-closing. Since giving empty elements like img or br an end tag would confuse browsers treating the page like HTML, self-closing tags tend to be promoted. However, XML self-closing tags directly conflict with a little-known and poorly supported HTML/SGML feature: Null End Tags.

A Null End Tag is a special shorthand form of a tag that allows you to save a few characters in the document. Instead of writing <title>My page</title>, you could simply write <title/My page/ to accomplish the same thing. Due to the rules of Null End Tags, a single slash in an empty element’s start tag would close the tag right then and there, meaning <br/ is a complete and valid tag in HTML. As a result, if you have <br/> or <br />, a browser supporting Null End Tags would see that as a br element immediately followed by a simple > character. Therefore, an XHTML page treated as HTML could be littered with unwanted > characters.
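To make the rule concrete, here is an illustrative toy tokenizer (not a real SGML parser; it only models the empty-element case described above, and the function name and regex are my own) showing what a NET-supporting user agent sees in XML-style self-closing tags:

```python
# Toy sketch: for empty elements, a slash inside a start tag closes the
# tag immediately under HTML's Null End Tag rules, so "<br/" is already
# a complete tag and anything after it is plain character data.
import re

def net_tokenize(markup):
    """Tokenize markup, treating '/' inside a start tag as an
    immediate close (the empty-element NET behavior)."""
    tokens = []
    i = 0
    while i < len(markup):
        if markup[i] == "<":
            # The tag ends at the first '>' OR the first '/'.
            m = re.match(r"<([a-zA-Z]+)[^>/]*(/|>)", markup[i:])
            if m:
                tokens.append(("tag", m.group(1)))
                i += m.end()
                continue
        j = markup.find("<", i + 1)
        if j == -1:
            j = len(markup)
        tokens.append(("text", markup[i:j]))
        i = j
    return tokens
```

Running this over XML-style markup shows the stray > the article describes: `<br/>` yields a br tag followed by a literal ">" text node.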

This problem is often overlooked because most popular browsers today are lacking support for Null End Tags, as well as some other SGML shorthand features. However, there are still some smaller user agents that properly support Null End Tags. One of the more well-known user agents that support it is the W3C validator. If you send it a page that uses XHTML self-closing tags, but force it to parse the page as HTML/SGML like most user agents do for text/html pages, you can see the results in the outline: immediately after each of the self-closing elements, there is an unwanted > character that will be displayed on the page itself.

(It should be noted that the W3C Validator is unusual in that it generally determines the parsing mode from the doctype, rather than from the content type as most other user agents do. Therefore, an HTML doctype was used in the above example just so the validator would attempt to parse the page using the HTML subset of SGML as all major browsers will for text/html pages regardless of the doctype. The Null End Tag rules are actually set in the SGML subset definition, not the DTD, so this example is accurate to what you should expect in a fully compliant SGML user agent even with an XHTML doctype.)

Technically, a restricted and altered form of Null End Tags exists in XML and is frequently used: the self-closing portion of the start tag. While Null End Tags are defined as / … / in HTML’s subset of SGML, they are specially defined as / … > in XML with the added restriction that it must close immediately after it is opened, meaning the element must have no content. This was designed to look similar to a regular start tag for web developers who are unfamiliar with typical Null End Tags. However, in the process it creates inherent incompatibility with HTML’s subset of SGML for all empty elements.

In summary, although this issue doesn’t show in most popular web browsers, a user agent that more fully supports SGML would see unwanted > characters all over XHTML pages that are sent with the text/html content type. If the goal of using XHTML is to help promote standards, then it’s quite counterproductive to cause unnecessary problems for user agents that more correctly comply with the SGML standard.

 

Firefox and other problems


Although Firefox supports the parsing of XHTML documents as XML when sent with the application/xhtml+xml content type, its performance in versions 2.0 and below is actually worse than with HTML. When parsing a page as HTML, Firefox will begin displaying the page while the content is being downloaded. This is called incremental rendering. However, when it’s parsing XML content, Firefox 2.0 and below will wait until the entire page is downloaded and checked for well-formedness before any of the content is displayed. This means that, although in theory XML is supposed to be faster to parse than HTML, in reality these versions of Firefox usually display HTML content to the user much faster than XHTML/XML content. Thankfully, this issue is expected to be resolved in Firefox 3.0.

However, there are also issues in other browsers, such as certain HTML-specific provisions in the CSS and DOM standards being mistakenly applied to XHTML content parsed as XML. For example, if there is a background set on the body element and none on the html element, Opera will apply the background to the html element as it would in HTML. So even when dealing exclusively with XHTML parsed as XML, you still run into a number of the same problems that you do when trying to serve XHTML either way.

All in all, true XHTML support in major user agents is still very weak. Because a key user agent — namely, Internet Explorer — has made no visible effort to support XHTML, other major user agents have continued to see it as a relatively low priority and so these bugs have lingered. HTML is recommended over XHTML by both Mozilla and Safari and is generally better supported than XHTML by all major browsers.

 

Conclusion


XHTML is a very good thing, and I certainly hope to see it gain widespread acceptance in the future. However, it simply isn’t widely supported in its proper form. XHTML is an XML format, and to force a web browser to treat it like HTML is going against the whole purpose of XHTML and also inevitably causes other complications. Assuming you don’t want to dramatically limit access to your information, XHTML can only be used incorrectly, be interpreted as invalid markup by most user agents, cause unwanted results in others, and offer no added benefit over HTML. HTML 4.01 Strict is still what most user agents and search engines are most accustomed to, and there’s absolutely nothing wrong with using it if you don’t need the added benefits of XML. HTML 4.01 is still a W3C Recommendation, and the W3C has even announced plans to further develop HTML alongside XHTML in the future.

 

List of standards-related sites that break as XHTML


The following are just a few of the countless sites that use an XHTML doctype but, as of this moment of writing, completely fail to load or otherwise work improperly when parsed as XML, thus missing the whole point of XHTML. The authors of most of these sites are quite prominent in the web standards community — many are involved in the Web Standards Project (WaSP) — yet they have still fallen victim to the pitfalls of current use of XHTML. In fact, I have found that nearly all XHTML websites owned by WaSP members have failures when parsed as XML.

You could consider this a “shame list” of sorts. These are the same people who are supposed to be teaching others how to use web standards properly, yet they have written markup that basically depends on browsers treating it incorrectly. But the main point of this list isn’t to pick on individuals; it’s to reinforce the fact that even so-called experts at web standards have trouble juggling the different ways XHTML will inevitably be handled on the Web. And what benefit does it bring? None of the following sites make use of anything XHTML offers over HTML.

You can test a page’s actual XHTML rendering in Firefox using the Force Content-type extension and setting the new content-type to application/xhtml+xml.

Accessify - WaSP Steering Committee, Accessibility Task Force
Displayed as generic XML, not interpreted as XHTML. The XML namespace was omitted.
all in the <head> - WaSP Steering Committee
Page doesn’t load. Not well-formed. (Note: this page is valid according to the XHTML DTD and XML’s subset of SGML, but XML has additional rules to define well-formed pages which this page breaks, observed in the Textpattern and the Technorati Link Count Widget post. A similar test case is available.)
And all that Malarkey - WaSP Accessibility Task Force
Page doesn’t load. Not well-formed.
CSS Zen Garden - WaSP
Top background doesn’t display. The page relies on HTML-specific background behavior. Numerous designs have errors with a similar cause.
dean.edwards.name/weblog/ - WaSP DOM Scripting Task Force, Microsoft Task Force
In browsers that support behavior binding (including Firefox), which is used for the dynamic syntax highlighting of the code snippets, most of the code boxes fail to load their contents, resulting in many empty boxes where code snippets should be.
dog or higher
Page doesn’t load. Not well-formed.
Elly Thompson’s Weblog
Page doesn’t load. Not well-formed.
g9g.org - WaSP Steering Committee
There is a thick white gap around the page. The page relies on HTML-specific background behavior.
holly marie - WaSP Steering Committee
Page doesn’t load. Not well-formed.
Jeffrey Veen - WaSP emeritus
Page doesn’t load. Not well-formed.
KuraFire - WaSP
Page doesn’t load. Not well-formed.
Meriblog
Background appears white instead of purple. The page relies on HTML-specific background behavior.
mezzoblue - WaSP
Displayed as generic XML, not interpreted as XHTML. The XML namespace was omitted. Also, individual post pages don’t load. Not well-formed.
microformats
Page doesn’t load. Not well-formed.
molly.com - WaSP Group Lead
Flickr script fails to initialize because the script contents are commented out.
Off the Top - WaSP Steering Committee
Page doesn’t load. Not well-formed.
unadorned.org - WaSP Steering Committee
Stylesheet doesn’t load because the import rule is commented out.
WordPress - WaSP
Page doesn’t load. Not well-formed.

 

List of standards-related sites that stick with HTML


The following are some significant sites relevant to web standards that continue to use HTML rather than XHTML.

  • 456 Berea Street
  • Anne van Kesteren
  • Bite Size Standards
  • David Baron’s Homepage
  • Hixie’s Natural Log
  • Jonathan Snook’s Blog
  • meyerweb.com
  • Mozilla
  • Web Devout
  • WebKit

This work is copyright © 2007 David Hammond and is licensed under a Creative Commons Attribution Share-Alike License. It may be copied, modified, and distributed freely as long as it attributes the original author and maintains the original license. See the license for details.


Why That List Sucks

  1. He’s swinging for the top of the trees

    The rule in any situation where you want to optimize some code is that you first profile it and then find the bottlenecks. Mr. Silverton, however, aims right for the tippy top of the trees. I’d say 60% of database optimization is properly understanding SQL and the basics of databases. You need to understand joins vs. subselects, column indices, how to normalize data, etc. The next 35% is understanding the performance characteristics of your database of choice. COUNT(*) in MySQL, for example, can either be almost-free or painfully slow depending on which storage engine you’re using. Other things to consider: under what conditions does your database invalidate caches, when does it sort on disk rather than in memory, when does it need to create temporary tables, etc. The final 5%, where few ever need venture, is where Mr. Silverton spends most of his time. Never once in my life have I used SQL_SMALL_RESULT.

  2. Good problems, bad solutions

    There are cases when Mr. Silverton does note a good problem. MySQL will indeed use a dynamic row format if it contains variable length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these datatypes, but rather to split off such fields into an associated table. The following schema represents this idea:

    CREATE TABLE posts (
        id int UNSIGNED NOT NULL AUTO_INCREMENT,
        author_id int UNSIGNED NOT NULL,
        created timestamp NOT NULL,
        PRIMARY KEY(id)
    );

    CREATE TABLE posts_data (
        post_id int UNSIGNED NOT NULL,
        body text,
        PRIMARY KEY(post_id)
    );
  3. That’s just…yeah

    Some of his suggestions are just mind-boggling, e.g., “remove unnecessary parentheses.” It really doesn’t matter whether you do SELECT * FROM posts WHERE (author_id = 5 AND published = 1) or SELECT * FROM posts WHERE author_id = 5 AND published = 1. None. Any decent DBMS is going to optimize these away. This level of detail is akin to wondering when writing a C program whether the post-increment or pre-increment operator is faster. Really, if that’s where you’re spending your energy, it’s a surprise you’ve written any code at all.

My list

Let’s see if I fare any better. I’m going to start from the most general.

  1. Benchmark, benchmark, benchmark!

    You’re going to need numbers if you want to make a good decision. What queries are the worst? Where are the bottlenecks? Under what circumstances am I generating bad queries? Benchmarking will let you simulate high-stress situations and, with the aid of profiling tools, expose the cracks in your database configuration. Tools of the trade include supersmack, ab, and SysBench. These tools either hit your database directly (e.g., supersmack) or simulate web traffic (e.g., ab).

  2. Profile, profile, profile!

    So, you’re able to generate high-stress situations, but now you need to find the cracks. This is what profiling is for. Profiling enables you to find the bottlenecks in your configuration, whether they be in memory, CPU, network, disk I/O, or, what is more likely, some combination of all of them.

    The very first thing you should do is turn on the MySQL slow query log and install mtop. This will give you access to information about the absolute worst offenders. Have a ten-second query ruining your web application? These guys will show you the query right off.

    After you’ve identified the slow queries you should learn about the MySQL internal tools, like EXPLAIN, SHOW STATUS, and SHOW PROCESSLIST. These will tell you what resources are being spent where, and what side effects your queries are having, e.g., whether your heinous triple-join subselect query is sorting in memory or on disk. Of course, you should also be using your usual array of command-line profiling tools like top, procinfo, vmstat, etc. to get more general system performance information.

  3. Tighten Up Your Schema

    Before you even start writing queries you have to design a schema. Remember that the memory requirements for a table are going to be around #entries * size of a row. Unless you expect every person on the planet to register 2.8 trillion times on your website you do not in fact need to make your user_id column a BIGINT. Likewise, if a text field will always be a fixed length (e.g., a US zipcode, which always has a canonical representation of the form “XXXXX-XXXX”) then a VARCHAR declaration just adds a superfluous byte for every row.

    Some people pooh-pooh database normalization, saying it produces unnecessarily complex schemas. However, proper normalization results in a minimization of redundant data. Fundamentally that means a smaller overall footprint at the cost of performance — the usual performance/memory tradeoff found everywhere in computer science. The best approach, IMO, is to normalize first and denormalize where performance demands it. Your schema will be more logical and you won’t be optimizing prematurely.
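The sizing argument above is simple arithmetic. A back-of-the-envelope sketch (the helper function is hypothetical; the byte sizes are MySQL's documented storage sizes for INT and BIGINT):

```python
# Sketch: the cost of an oversized column type grows linearly with the
# row count. MySQL stores INT in 4 bytes and BIGINT in 8 bytes.
INT_BYTES = 4
BIGINT_BYTES = 8

def column_overhead(rows, needed=INT_BYTES, declared=BIGINT_BYTES):
    """Extra bytes spent by declaring a wider type than the data needs."""
    return rows * (declared - needed)

# Ten million users: 40 MB wasted on the user_id column alone, before
# counting the copies kept in every index and referencing foreign key.
waste = column_overhead(10_000_000)
```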

  4. Partition Your Tables

    Often you have a table in which only a few columns are accessed frequently. On a blog, for example, one might display entry titles in many places (e.g., a list of recent posts) but only ever display teasers or the full post bodies once on a given page. Vertical partitioning helps:

    CREATE TABLE posts (
        id int UNSIGNED NOT NULL AUTO_INCREMENT,
        author_id int UNSIGNED NOT NULL,
        title varchar(128),
        created timestamp NOT NULL,
        PRIMARY KEY(id)
    );

    CREATE TABLE posts_data (
        post_id int UNSIGNED NOT NULL,
        teaser text,
        body text,
        PRIMARY KEY(post_id)
    );

    The above represents a situation where one is optimizing for reading. Frequently accessed data is kept in one table while infrequently accessed data is kept in another. Since the data is now partitioned, the infrequently accessed data takes up less memory. You can also optimize for writing: frequently changed data can be kept in one table, while infrequently changed data can be kept in another. This allows more efficient caching, since MySQL no longer needs to expire the cache for data which probably hasn’t changed.
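
    With the schema above, the common listing page touches only the narrow table, and the bulky text columns are joined in only for a single-post view:

```sql
-- List of recent posts: reads only the small, frequently accessed table.
SELECT id, title, created FROM posts ORDER BY created DESC LIMIT 10;

-- Full view of one post: join in the infrequently accessed text columns.
SELECT p.title, p.created, d.body
FROM posts p
INNER JOIN posts_data d ON (p.id = d.post_id)
WHERE p.id = 42;
```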

  5. Don’t Overuse Artificial Primary Keys

    Artificial primary keys are nice because they can make the schema less volatile. If we stored geography information in the US based on zip code, say, and the zip code system suddenly changed we’d be in a bit of trouble. On the other hand, many times there are perfectly fine natural keys. One example would be a join table for many-to-many relationships. What not to do:

    CREATE TABLE posts_tags (
        relation_id int UNSIGNED NOT NULL AUTO_INCREMENT,
        post_id int UNSIGNED NOT NULL,
        tag_id int UNSIGNED NOT NULL,
        PRIMARY KEY(relation_id),
        UNIQUE INDEX(post_id, tag_id)
    );

    Not only is the artificial key entirely redundant given the column constraints, but the number of post-tag relations is now limited by the maximum value of an integer. Instead one should do:

    CREATE TABLE posts_tags (
        post_id int UNSIGNED NOT NULL,
        tag_id int UNSIGNED NOT NULL,
        PRIMARY KEY(post_id, tag_id)
    );

  6. Learn Your Indices

    Often your choice of indices will make or break your database. For those who haven’t progressed this far in their database studies: if we issue the query SELECT * FROM users WHERE last_name = ‘Goldstein’ and last_name has no index, then your DBMS must scan every row of the table and compare it to the string ‘Goldstein’. An index, usually a B-tree (though there are other options), speeds up this lookup considerably.

    You should probably create indices for any field on which you are selecting, grouping, ordering, or joining. Obviously each index requires space proportional to the number of rows in your table, so too many indices winds up taking more memory. You also incur a performance hit on write operations, since every write now requires that the corresponding index be updated. There is a balance point which you can uncover by profiling your code. This varies from system to system and implementation to implementation.
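
    For the users query above, creating the index is a one-liner (first_name is a hypothetical second column added for the composite example):

```sql
-- Index the column used in the WHERE clause so the B-tree is consulted
-- instead of a full table scan.
CREATE INDEX idx_last_name ON users (last_name);

-- A composite index also serves queries that filter on its leftmost columns,
-- so this one covers both (last_name) and (last_name, first_name) lookups.
CREATE INDEX idx_last_first ON users (last_name, first_name);
```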

  7. SQL is Not C

    C is the canonical procedural programming language and the greatest pitfall for a programmer looking to show off his database-fu is that he fails to realize that SQL is not procedural (nor is it functional or object-oriented, for that matter). Rather than thinking in terms of data and operations on data one must think of sets of data and relationships among those sets. This usually crops up with the improper use of a subquery:

    SELECT a.id,
        (SELECT MAX(created)
        FROM posts
        WHERE author_id = a.id)
    AS latest_post
    FROM authors a

    Since this subquery is correlated, i.e., references a table in the outer query, one should convert the subquery to a join.

    SELECT a.id, MAX(p.created) AS latest_post
    FROM authors a
    INNER JOIN posts p
        ON (a.id = p.author_id)
    GROUP BY a.id

  8. Understand Your Engines

    MySQL has two primary storage engines: MyISAM and InnoDB. Each has its own performance characteristics and considerations. In the broadest sense MyISAM is good for read-heavy data and InnoDB is good for write-heavy data, though there are cases where the opposite is true. The biggest gotcha is how the two differ with respect to the COUNT function.

    MyISAM keeps an internal cache of table meta-data like the number of rows. This means that, generally, COUNT(*) incurs no additional cost for a well-structured query. InnoDB, however, has no such cache. For a concrete example, let’s say we’re trying to paginate a query. If you have a query SELECT * FROM users LIMIT 5,10, running SELECT COUNT(*) FROM users is essentially free with MyISAM but takes the same amount of time as the first query with InnoDB. MySQL has a SQL_CALC_FOUND_ROWS option which tells MySQL to calculate the total number of matching rows as it runs the query; the count can then be retrieved by executing SELECT FOUND_ROWS(). This is very MySQL-specific, but can be necessary in certain situations, particularly if you use InnoDB for its other features (e.g., row-level locking, transactions, foreign keys).
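
    The pagination pattern looks like this (users is the table from the example above):

```sql
-- Run the paginated query, asking MySQL to also count the full result set.
SELECT SQL_CALC_FOUND_ROWS * FROM users LIMIT 5, 10;

-- Retrieve the number of rows the query would have matched without LIMIT.
SELECT FOUND_ROWS();
```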

  9. MySQL-Specific Shortcuts

    MySQL provides many extensions to SQL which help performance in many common use scenarios. Among these are INSERT … SELECT, INSERT … ON DUPLICATE KEY UPDATE, and REPLACE.
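
    Brief sketches of each (the posts_archive and page_views tables are hypothetical):

```sql
-- INSERT ... SELECT: copy rows between tables in one statement, server-side.
INSERT INTO posts_archive SELECT * FROM posts WHERE created < '2007-01-01';

-- INSERT ... ON DUPLICATE KEY UPDATE: insert, or update the existing row if
-- the primary/unique key already exists (here, a page-view counter).
INSERT INTO page_views (page_id, views) VALUES (1, 1)
    ON DUPLICATE KEY UPDATE views = views + 1;

-- REPLACE: delete any row with the same key, then insert the new one.
REPLACE INTO page_views (page_id, views) VALUES (1, 1);
```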

    I rarely hesitate to use the above since they are so convenient and provide real performance benefits in many situations. MySQL has other keywords which are more dangerous, however, and should be used sparingly. These include INSERT DELAYED, which tells MySQL that it is not important to insert the data immediately (say, e.g., in a logging situation). The problem is that under high load the insert might be delayed indefinitely, causing the insert queue to balloon. You can also give MySQL index hints about which indices to use. MySQL gets it right most of the time, and when it doesn’t it is usually because of a bad schema or a poorly written query.

  10. And one for the road…

    Last, but not least, read Peter Zaitsev’s MySQL Performance Blog if you’re into the nitty-gritty of MySQL performance. He covers many of the finer aspects of database administration and performance.

Saved From http://20bits.com/
