path: root/Tools/webchecker/webchecker.py
Commit message | Author | Age | Files | Lines
* Added more link attributes based on additional information from Chris | Fred Drake | 2001-04-05 | 1 | -1/+13
  McCafferty <christopher.mccafferty@csg.ch>, and a bit of experimentation with Navigator 4.7. HTML-as-deployed is evil!
* A number of improvements based on a discussion with Chris McCafferty | Fred Drake | 2001-04-04 | 1 | -2/+24
  <christopher.mccafferty@csg.ch>: Add javascript: and telnet: to the types of URLs we ignore. Add support for several additional URL-valued attributes on the BODY, FRAME, IFRAME, LINK, OBJECT, and SCRIPT elements.
* Fix suggested by Magnus Kessler: in class Page, it is possible for | Guido van Rossum | 2000-03-28 | 1 | -1/+4
  self.parser to be None; in that case don't dereference it in getnames().
* Integrated Sam Bayer's wcnew.py code. It seems silly to keep two | Guido van Rossum | 1999-11-17 | 1 | -46/+185
  files. Removed Sam's "SLB" change comments; otherwise this is the same as wcnew.py.
* Samuel L. Bayer: | Guido van Rossum | 1999-11-17 | 1 | -1/+6
  - forced new done origins to set errors if they're in self.bad (fixes bug where only the first of a number of errorful references to a link is reported under some circumstances)
  - suppressed adding duplicates to self.todo list (cleans up printout in wcgui details)
* Some changes (maybe not enough?) to make it work on Windows with local | Guido van Rossum | 1999-04-26 | 1 | -3/+3
  file URLs.
* Added note() message to Page class -- this was used but didn't exist. | Guido van Rossum | 1998-08-06 | 1 | -0/+9
  (The alternative would be to call self.checker.note() but since self.checker might be None that's not quite right.)
* Instead of printing, use self.message() or self.note(). | Guido van Rossum | 1998-07-08 | 1 | -71/+62
* sort the urls in the todo list | Guido van Rossum | 1998-06-15 | 1 | -1/+3
* Use a try-except so that the pickle file is written even when we die | Guido van Rossum | 1998-04-27 | 1 | -14/+18
  because of an unexpected exception.
* Give in to tabnanny | Guido van Rossum | 1998-04-06 | 1 | -379/+379
* Major overhaul. Don't use global variables (e.g. verbose); use | Guido van Rossum | 1998-02-21 | 1 | -130/+191
  instance variables. Make all global functions methods, for easy overriding. Restructure getpage() for easy overriding. Add save_pickle() method and load_pickle() global function to make it easier for other programs to emulate the toplevel interface.
* Several changes: | Guido van Rossum | 1997-10-06 | 1 | -6/+24
  - Change the code that looks for robots.txt to always look in /, even if the "root" path is somewhere deep down below.
  - Add link processing in <AREA> tags.
  - Change safeclose() to avoid crashing when the file has no geturl() method.
* Avoid the fancy handler for error 401 (request authentication). | Guido van Rossum | 1997-05-07 | 1 | -4/+7
* Restructured Checker class to get rid of 'ext' table. | Guido van Rossum | 1997-02-02 | 1 | -115/+72
  Links are now either in 'todo' or 'done', and ext links are handled more like local links except that no further links are gathered (and sometimes they aren't checked, e.g. for mailto and news URLs).
  The -x option reverses its meaning: it disables checking of ext links (they are moved to 'done' without checking).
  A new 'errors' table collects pages with bad links as we go -- redundant, but useful for the GUI version which needs to report this as we go.
  Some new methods, including reset(). New checkpoint format.
  Adapted the GUI to the changes in the Checker class. Added Quit and "Start over" buttons, and a checkbox to disable checking external links. The details window now also shows bad links emanating from the selected page.
  Miscellaneous small changes.
* Process <img> and <frame> tags. Don't bother skipping second href. | Guido van Rossum | 1997-02-01 | 1 | -3/+12
* Spin off checking of external page in a subroutine. | Guido van Rossum | 1997-01-31 | 1 | -17/+20
  Increase MAXPAGE to 150K. Add back printing of __doc__ for usage message.
* Many misc changes. | Guido van Rossum | 1997-01-31 | 1 | -95/+142
  - Faster HTML parser derived from SGMLparser (Fred Gansevles).
  - All manipulations of todo, done, ext, bad are done via methods, so a derived class can override. Also moved the 'done' marking to dopage(), so run() is much simpler.
  - Added a method status() which returns a string containing the summary counts; added a "total" count.
  - Drop the guessing of the file type before opening the document -- we still need to check those links for validity!
  - Added a subroutine to close a connection which first slurps up the remaining data when it's an ftp URL -- apparently closing an ftp connection without reading till the end makes it hang.
  - Added -n option to skip running (only useful with -R).
  - The Checker object now has an instance variable which is set to 1 when it is changed. This is not pickled.
* Set proper User-agent header (Python-webchecker/<version>). | Guido van Rossum | 1997-01-30 | 1 | -14/+21
  When -x is combined with -q, still do the checking, but don't print the error in this phase -- they are reported by report_errors().
* Some refinements of the external-link checking code: insert the errors | Guido van Rossum | 1997-01-30 | 1 | -9/+22
  in the 'bad' dictionary (sanitize them so they are picklable; the sanitation code is now a subroutine); don't check mailto: URLs; omit colon in Error message.
* Added -x option to check external links. Slooooow! | Guido van Rossum | 1997-01-30 | 1 | -10/+32
* Catch I/O errors when parsing robots.txt file. | Guido van Rossum | 1997-01-30 | 1 | -5/+13
  Add version number, printed at startup in non-quiet mode.
* Added robots.txt support, using Skip Montanaro's parser. | Guido van Rossum | 1997-01-30 | 1 | -3/+38
  Fixed occasional inclusion of unpicklable objects (Message in errors). Changed indent of a few messages.
* web tree checker | Guido van Rossum | 1997-01-30 | 1 | -0/+488
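
The 1997-10-06 entry in the log above says robots.txt is always looked up in /, even when the "root" path is deep in the site. The idea can be sketched as follows. This is a minimal illustration in modern Python 3, not the code from webchecker.py; the function name robots_url and the example URL are hypothetical.

```python
from urllib.parse import urljoin


def robots_url(root):
    """Return the robots.txt URL at the server root for a starting URL.

    Hypothetical helper for illustration; not taken from webchecker.py.
    """
    # Joining with an absolute path ("/robots.txt") replaces the whole
    # path of `root`, so a deep starting point still maps to the top level.
    return urljoin(root, "/robots.txt")


if __name__ == "__main__":
    print(robots_url("http://www.example.com/docs/deep/index.html"))
```

Joining against an absolute path avoids having to strip path components by hand, which is the easy place to get this wrong when the starting URL has several directory levels.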