author | Fred Drake <fdrake@acm.org> | 2000-10-24 02:34:45 (GMT) |
---|---|---|
committer | Fred Drake <fdrake@acm.org> | 2000-10-24 02:34:45 (GMT) |
commit | 669d36f02c6bae1fff38c767ee62a3c12fde43ff (patch) | |
tree | 7dde206803def7a7ef92ab319727b63987d0a4dc | |
parent | f61eac425a1061654150c4687c94bc71c0f6b7a2 (diff) | |
Paul Prescod <paul@prescod.net>:
Documentation for the xml.dom.minidom module & Python DOM API.
FLD: I have revised the markup in some places and added a few minor
details to Paul's text, but that's it. Given the substantial
structural differences with the bulk of the presentation, I will be
making additional revisions over the next few days.
-rw-r--r-- | Doc/lib/xmldom.tex | 614 |
1 files changed, 614 insertions, 0 deletions
diff --git a/Doc/lib/xmldom.tex b/Doc/lib/xmldom.tex new file mode 100644 index 0000000..c2945a4 --- /dev/null +++ b/Doc/lib/xmldom.tex @@ -0,0 +1,614 @@ +\section{\module{xml.dom.minidom} --- + The Document Object Model} + +\declaremodule{standard}{xml.dom.minidom} +\modulesynopsis{Lightweight Document Object Model (DOM) implementation.} +\moduleauthor{Paul Prescod}{paul@prescod.net} +\sectionauthor{Paul Prescod}{paul@prescod.net} +\sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de} + +\versionadded{2.0} + +The \module{xml.dom.minidom} module provides a lightweight implementation of +the W3C Document Object Model. The DOM is a cross-language API from +the World Wide Web Consortium (W3C) for accessing and modifying XML documents. A +DOM implementation allows you to convert an XML document into a tree-like +structure, or to build such a structure from scratch. It then gives +access to the structure through a set of objects which provide +well-known interfaces. Minidom is intended to be simpler than the full +DOM and also significantly smaller. + +The DOM is extremely useful for random-access applications. SAX only +allows you a view of one bit of the document at a time. If you are +looking at one SAX element, you have no access to another. If you are +looking at a text node, you have no access to a containing +element. When you write a SAX application, you need to keep track of +your program's position in the document somewhere in your own +code. SAX does not do it for you. Also, if you need to look ahead in +the XML document, you are just out of luck. + +Some applications are simply impossible in an event-driven model with +no access to a tree. Of course you could build some sort of tree +yourself in SAX events, but the DOM allows you to avoid writing that +code. The DOM is a standard tree representation for XML data. + +%What if your needs are somewhere between SAX and the DOM? 
Perhaps you cannot +%afford to load the entire tree in memory but you find the SAX model +%somewhat cumbersome and low-level. There is also an experimental module +%called pulldom that allows you to build trees of only the parts of a +%document that you need structured access to. It also has features that allow +%you to find your way around the DOM. +% See http://www.prescod.net/python/pulldom + +DOM applications typically start by parsing some XML into a DOM. This +is done through the parse functions: + +\begin{verbatim} +from xml.dom.minidom import parse, parseString + +dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name + +datasource = open('c:\\temp\\mydata.xml') +dom2 = parse(datasource) # parse an open file + +dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>') +\end{verbatim} + +The \function{parse()} function can take either a filename or an open file object. + +\begin{funcdesc}{parse}{filename_or_file\optional{, parser}} + Return a \class{Document} from the given input. \var{filename_or_file} + may be either a file name, or a file-like object. \var{parser}, if + given, must be a SAX2 parser object. This function will change the + document handler of the parser and activate namespace support; other + parser configuration (like setting an entity resolver) must have been + done in advance. +\end{funcdesc} + +If you have XML in a string, you can use the \function{parseString()} function +instead: + +\begin{funcdesc}{parseString}{string\optional{, parser}} + Return a \class{Document} that represents the \var{string}. This + function creates a \class{StringIO} object for the string and passes + that on to \function{parse}. +\end{funcdesc} + +Both functions return a document object representing the content of +the document. + +You can also create a document node merely by instantiating a +document object. You can then add child nodes to it to populate +the DOM. 
+ +\begin{verbatim} +from xml.dom.minidom import Document + +newdoc = Document() +newel = newdoc.createElement("some_tag") +newdoc.appendChild(newel) +\end{verbatim} + +Once you have a DOM document object, you can access the parts of your +XML document through its properties and methods. These properties are +defined in the DOM specification. The main property of the document +object is the documentElement property. It gives you the main element +in the XML document: the one that holds all others. Here is an +example program: + +\begin{verbatim} +dom3 = parseString("<myxml>Some data</myxml>") +assert dom3.documentElement.tagName == "myxml" +\end{verbatim} + +When you are finished with a DOM, you should clean it up. This is +necessary because some versions of Python do not support garbage +collection of objects that refer to each other in a cycle. Until this +restriction is removed from all versions of Python, it is safest to +write your code as if cycles would not be cleaned up. + +The way to clean up a DOM is to call its \method{unlink()} method: + +\begin{verbatim} +dom1.unlink() +dom2.unlink() +dom3.unlink() +\end{verbatim} + +\method{unlink()} is a \module{minidom}-specific extension to the DOM +API. After calling \method{unlink()}, a DOM is basically useless. + +\begin{seealso} + \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification} + {This is the canonical specification for the level of the + DOM supported by \module{xml.dom.minidom}.} + \seetitle[http://pyxml.sourceforge.net]{PyXML}{Users that require a + full-featured implementation of DOM should use the PyXML + package.} +\end{seealso} + + +\subsection{DOM objects \label{dom-objects}} + +The definitive documentation for the DOM is the DOM specification from +the W3C. This section lists the properties and methods supported by +\refmodule{xml.dom.minidom}. + +\begin{classdesc}{Node}{} +All of the components of an XML document are subclasses of +\class{Node}. 
+ +\begin{memberdesc}{nodeType} +An integer representing the node type. Symbolic constants for the +types are on the \class{Node} object: +\constant{ELEMENT_NODE}, \constant{ATTRIBUTE_NODE}, +\constant{TEXT_NODE}, \constant{CDATA_SECTION_NODE}, +\constant{ENTITY_NODE}, \constant{PROCESSING_INSTRUCTION_NODE}, +\constant{COMMENT_NODE}, \constant{DOCUMENT_NODE}, +\constant{DOCUMENT_TYPE_NODE}, \constant{NOTATION_NODE}. +\end{memberdesc} + +\begin{memberdesc}{parentNode} +The parent of the current node. \code{None} for the document node. +\end{memberdesc} + +\begin{memberdesc}{attributes} +An \class{AttributeList} of attribute objects. Only +elements have attributes; for other node types this is \code{None}. +\end{memberdesc} + +\begin{memberdesc}{previousSibling} +The node that immediately precedes this one with the same parent. For +instance, this may be the element whose end-tag comes just before the +\var{self} element's start-tag. Of course, XML documents are made +up of more than just elements so the previous sibling could be text, a +comment, or something else. +\end{memberdesc} + +\begin{memberdesc}{nextSibling} +The node that immediately follows this one with the same parent. See +also \member{previousSibling}. +\end{memberdesc} + +\begin{memberdesc}{childNodes} +A list of nodes contained within this node. +\end{memberdesc} + +\begin{memberdesc}{firstChild} +Equivalent to \code{childNodes[0]}. +\end{memberdesc} + +\begin{memberdesc}{lastChild} +Equivalent to \code{childNodes[-1]}. +\end{memberdesc} + +\begin{memberdesc}{nodeName} +Has a different meaning for each node type. See the DOM specification +for details. You can always get the information you would get here +from another property such as the \member{tagName} property for +elements or the \member{name} property for attributes. +\end{memberdesc} + +\begin{memberdesc}{nodeValue} +Has a different meaning for each node type. See the DOM specification +for details. 
The situation is similar to that with \member{nodeName}. +\end{memberdesc} + +\begin{methoddesc}{unlink}{} +Break internal references within the DOM so that it will be garbage +collected on versions of Python without cyclic GC. +\end{methoddesc} + +\begin{methoddesc}{writexml}{writer} +Write XML to the writer object. The writer should have a +\method{write()} method which matches that of the file object +interface. +\end{methoddesc} + +\begin{methoddesc}{toxml}{} +Return the XML string that the DOM represents. +\end{methoddesc} + +\begin{methoddesc}{hasChildNodes}{} +Returns true if the node has any child nodes. +\end{methoddesc} + +\begin{methoddesc}{insertBefore}{newChild, refChild} +Insert a new child node before an existing child. It must be the case +that \var{refChild} is a child of this node; if not, +\exception{ValueError} is raised. +\end{methoddesc} + +\begin{methoddesc}{replaceChild}{newChild, oldChild} +Replace an existing node with a new node. It must be the case that +\var{oldChild} is a child of this node; if not, +\exception{ValueError} is raised. +\end{methoddesc} + +\begin{methoddesc}{removeChild}{oldChild} +Remove a child node. \var{oldChild} must be a child of this node; if +not, \exception{ValueError} is raised. +\end{methoddesc} + +\begin{methoddesc}{appendChild}{newChild} +Add a new child node to the end of this node's list of children. +\end{methoddesc} + +\begin{methoddesc}{cloneNode}{deep} +Clone this node. \var{deep} means to clone all children also. Deep cloning +is not implemented in Python 2 so the \var{deep} parameter should always be +0 for now. +\end{methoddesc} + +\end{classdesc} + + +\begin{classdesc}{Document}{} +Represents an entire XML document, including its constituent elements, +attributes, processing instructions, comments, etc. Remember that it +inherits properties from \class{Node}. + +\begin{memberdesc}{documentElement} +The one and only root element of the document. +\end{memberdesc} + +\begin{methoddesc}{createElement}{tagName} +Create a new element. 
The element is not inserted into the document +when it is created. You need to explicitly insert it with one of the +other methods such as \method{insertBefore()} or +\method{appendChild()}. +\end{methoddesc} + +\begin{methoddesc}{createTextNode}{data} +Create a text node containing the data passed as a parameter. As with +the other creation methods, this one does not insert the node into the +tree. +\end{methoddesc} + +\begin{methoddesc}{createComment}{data} +Create a comment node containing the data passed as a parameter. As +with the other creation methods, this one does not insert the node +into the tree. +\end{methoddesc} + +\begin{methoddesc}{createProcessingInstruction}{target, data} +Create a processing instruction node containing the \var{target} and +\var{data} passed as parameters. As with the other creation methods, +this one does not insert the node into the tree. +\end{methoddesc} + +\begin{methoddesc}{createAttribute}{name} +Create an attribute node. This method does not associate the +attribute node with any particular element. You must use +\method{setAttributeNode()} on the appropriate \class{Element} object +to use the newly created attribute instance. +\end{methoddesc} + +\begin{methoddesc}{createElementNS}{namespaceURI, tagName} +Create a new element with a namespace. The \var{tagName} may have a +prefix. The element is not inserted into the document when it is +created. You need to explicitly insert it with one of the other +methods such as \method{insertBefore()} or \method{appendChild()}. +\end{methoddesc} + + +\begin{methoddesc}{createAttributeNS}{namespaceURI, qualifiedName} +Create an attribute node with a namespace. The \var{qualifiedName} may have +a prefix. This method does not associate the attribute node with any +particular element. You must use \method{setAttributeNode()} on the +appropriate \class{Element} object to use the newly created attribute +instance. 
+\end{methoddesc} + +\begin{methoddesc}{getElementsByTagName}{tagName} +Search for all descendants (direct children, children's children, +etc.) with a particular element type name. +\end{methoddesc} + +\begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName} +Search for all descendants (direct children, children's children, +etc.) with a particular namespace URI and localname. The localname is +the part of the qualified name after the prefix. +\end{methoddesc} + +\end{classdesc} + + +\begin{classdesc}{Element}{} +\begin{memberdesc}{tagName} +The element type name. In a namespace-using document it may have +colons in it. +\end{memberdesc} + +\begin{memberdesc}{localName} +The part of the \member{tagName} following the colon if there is one, +else the entire \member{tagName}. +\end{memberdesc} + +\begin{memberdesc}{prefix} +The part of the \member{tagName} preceding the colon if there is one, +else the empty string. +\end{memberdesc} + +\begin{memberdesc}{namespaceURI} +The namespace associated with the \member{tagName}. +\end{memberdesc} + +\begin{methoddesc}{getAttribute}{attname} +Return an attribute value as a string. +\end{methoddesc} + +\begin{methoddesc}{setAttribute}{attname, value} +Set an attribute value from a string. +\end{methoddesc} + +\begin{methoddesc}{removeAttribute}{attname} +Remove an attribute by name. +\end{methoddesc} + +\begin{methoddesc}{getAttributeNS}{namespaceURI, localName} +Return an attribute value as a string, given a \var{namespaceURI} and +\var{localName}. Note that a localname is the part of a prefixed +attribute name after the colon (if there is one). +\end{methoddesc} + +\begin{methoddesc}{setAttributeNS}{namespaceURI, qname, value} +Set an attribute value from a string, given a \var{namespaceURI} and a +\var{qname}. Note that a qname is the whole attribute name. This +differs from \method{getAttributeNS()}, which takes a \var{localName}. +\end{methoddesc} + +\begin{methoddesc}{removeAttributeNS}{namespaceURI, localName} +Remove an attribute by name. 
Note that it uses a \var{localName}, not a +qname. +\end{methoddesc} + +\begin{methoddesc}{getElementsByTagName}{tagName} +Same as the equivalent method in the \class{Document} class. +\end{methoddesc} + +\begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName} +Same as the equivalent method in the \class{Document} class. +\end{methoddesc} + +\end{classdesc} + + +\begin{classdesc}{Attribute}{} + +\begin{memberdesc}{name} +The attribute name. In a namespace-using document it may have colons +in it. +\end{memberdesc} + +\begin{memberdesc}{localName} +The part of the name following the colon if there is one, else the +entire name. +\end{memberdesc} + +\begin{memberdesc}{prefix} +The part of the name preceding the colon if there is one, else the +empty string. +\end{memberdesc} + +\begin{memberdesc}{namespaceURI} +The namespace associated with the attribute name. +\end{memberdesc} + +\end{classdesc} + + +\begin{classdesc}{AttributeList}{} + +\begin{memberdesc}{length} +The length of the attribute list. +\end{memberdesc} + +\begin{methoddesc}{item}{index} +Return the attribute at a particular index. The order you get the +attributes in is arbitrary but will be consistent for the life of a +DOM. Each item is an attribute node. Get its value with the +\member{value} attribute. +\end{methoddesc} + +There are also experimental methods that give this class more +dictionary-like behavior. You can use them or you can use the +standardized \method{getAttribute*()}-family methods. + +\end{classdesc} + + +\begin{classdesc}{Comment}{} +Represents a comment in the XML document. + +\begin{memberdesc}{data} +The content of the comment. +\end{memberdesc} +\end{classdesc} + + +\begin{classdesc}{Text}{} +Represents text in the XML document. + +\begin{memberdesc}{data} +The content of the text node. +\end{memberdesc} +\end{classdesc} + + +\begin{classdesc}{ProcessingInstruction}{} +Represents a processing instruction in the XML document. 
+ +\begin{memberdesc}{target} +The content of the processing instruction up to the first whitespace +character. +\end{memberdesc} + +\begin{memberdesc}{data} +The content of the processing instruction following the first +whitespace character. +\end{memberdesc} +\end{classdesc} + +Note that DOM attributes may also be manipulated as nodes instead of as +simple strings. It is fairly rare that you must do this, however, so this +usage is not yet documented here. + + +\begin{seealso} + \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification} + {This is the canonical specification for the level of the + DOM supported by \module{xml.dom.minidom}.} +\end{seealso} + + +\subsection{DOM Example \label{dom-example}} + +This example program is a fairly realistic example of a simple +program. In this particular case, we do not take much advantage +of the flexibility of the DOM. + +\begin{verbatim} +from xml.dom.minidom import parse, parseString + +document=""" +<slideshow> +<title>Demo slideshow</title> +<slide><title>Slide title</title> +<point>This is a demo</point> +<point>Of a program for processing slides</point> +</slide> + +<slide><title>Another demo slide</title> +<point>It is important</point> +<point>To have more than</point> +<point>one slide</point> +</slide> +</slideshow> +""" + +dom = parseString(document) + +space=" " +def getText(nodelist): + rc="" + for node in nodelist: + if node.nodeType==node.TEXT_NODE: + rc=rc+node.data + return rc + +def handleSlideshow(slideshow): + print "<html>" + handleSlideshowTitle(slideshow.getElementsByTagName("title")[0]) + slides = slideshow.getElementsByTagName("slide") + handleToc(slides) + handleSlides(slides) + print "</html>" + +def handleSlides(slides): + for slide in slides: + handleSlide(slide) + +def handleSlide(slide): + handleSlideTitle(slide.getElementsByTagName("title")[0]) + handlePoints(slide.getElementsByTagName("point")) + +def handleSlideshowTitle(title): + print 
"<title>%s</title>"%getText(title.childNodes) + +def handleSlideTitle(title): + print "<h2>%s</h2>"%getText(title.childNodes) + +def handlePoints(points): + print "<ul>" + for point in points: + handlePoint(point) + print "</ul>" + +def handlePoint(point): + print "<li>%s</li>"%getText(point.childNodes) + +def handleToc(slides): + for slide in slides: + title = slide.getElementsByTagName("title")[0] + print "<p>%s</p>"%getText(title.childNodes) + +handleSlideshow(dom) +\end{verbatim} + +\subsection{minidom and the DOM standard \label{minidom-and-dom}} + +Minidom is basically a DOM 1.0-compatible DOM with some DOM 2 features +(primarily namespace features). + +Usage of the other DOM interfaces in Python is straight-forward. The +following mapping rules apply: + +\begin{itemize} + +\item Interfaces are accessed through instance objects. Applications +should +not instantiate the classes themselves; they should use the creator +functions. Derived interfaces support all operations (and attributes) +from the base interfaces, plus any new operations. + +\item Operations are used as methods. Since the DOM uses only +\code{in} +parameters, the arguments are passed in normal order (from left to +right). +There are no optional arguments. \code{void} operations return +\code{None}. + +\item IDL attributes map to instance attributes. For compatibility +with +the OMG IDL language mapping for Python, an attribute \code{foo} can +also be accessed through accessor functions \code{_get_foo} and +\code{_set_foo}. \code{readonly} attributes must not be changed. + +\item The types \code{short int},\code{unsigned int},\code{unsigned +long long}, +and \code{boolean} all map to Python integer objects. + +\item The type \code{DOMString} maps to Python strings. \code{minidom} +supports either byte or Unicode strings, but will normally produce +Unicode +strings. Attributes of type \code{DOMString} may also be \code{None}. 
+ +\item \code{const} declarations map to variables in their respective +scope +(e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE}); they +must +not be changed. + +\item \code{DOMException} is currently not supported in +\module{minidom}. Instead, minidom raises standard Python exceptions +such as \exception{TypeError} and \exception{AttributeError}. + +\end{itemize} + +The following interfaces have no equivalent in minidom: + +\begin{itemize} + +\item DOMTimeStamp + +\item DocumentType + +\item DOMImplementation + +\item CharacterData + +\item CDATASection + +\item Notation + +\item Entity + +\item EntityReference + +\item DocumentFragment + +\end{itemize} + +Most of these reflect information in the XML document that is not of +general utility to most DOM users.
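For readers who want to try the API described in this patch, the following is a minimal sketch rather than part of the commit. It assumes a current Python 3 interpreter (the patch itself documents Python 2.0), and the sample document, the `get_text` helper, and the variable names are hypothetical illustrations. It parses a string, walks the tree with `getElementsByTagName()`, adds a node with the creation methods, serializes with `toxml()`, and finally calls `unlink()`.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, loosely modeled on the slideshow example.
SOURCE = (
    "<slideshow><title>Demo slideshow</title>"
    "<slide><title>Slide title</title><point>This is a demo</point></slide>"
    "</slideshow>"
)


def get_text(nodelist):
    """Concatenate the data of the text nodes found directly in nodelist."""
    return "".join(
        node.data for node in nodelist if node.nodeType == node.TEXT_NODE
    )


dom = parseString(SOURCE)
root = dom.documentElement

# getElementsByTagName searches all descendants, so both titles are found.
titles = [get_text(t.childNodes) for t in root.getElementsByTagName("title")]

# Creation methods do not insert the node; appendChild does that explicitly.
extra = dom.createElement("point")
extra.appendChild(dom.createTextNode("Added later"))
slide = root.getElementsByTagName("slide")[0]
slide.appendChild(extra)

point_count = len(slide.getElementsByTagName("point"))
serialized = root.toxml()

# unlink() breaks internal references; the DOM should not be used afterwards.
dom.unlink()
```

The results are captured in plain variables before `unlink()` is called, since the patch text notes that a DOM is basically useless after unlinking.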