path: root/Tools/webchecker
Commit message | Author | Date | Files | Lines (-removed/+added)
* Queue renaming reversal part 3: move module into place and change imports and other references. Closes #2925.
  Georg Brandl | 2008-05-25 | 1 file | -2/+2
* Added stub for the Queue module to be renamed in 3.0. Use the 3.0 module name to avoid spurious warnings.
  Alexandre Vassalotti | 2008-05-11 | 1 file | -2/+2
* Patch #2167 from calvin: Remove unused imports
  Christian Heimes | 2008-02-23 | 2 files | -3/+0
* Use sys.exc_info()
  Andrew M. Kuchling | 2006-07-26 | 1 file | -1/+2
* Normalized a few cases of whitespace in function declarations.

  Found them using::

      find . -name '*.py' | while read i ; do grep 'def[^(]*( ' $i /dev/null ; done
      find . -name '*.py' | while read i ; do grep ' ):' $i /dev/null ; done

  (I was doing this all over my own code anyway, because I'd been using spaces in all defs, so I thought I'd make a run on the Python code as well. If you need to do such fixes in your own code, you can use xx-rename or parenregu.el within emacs.)
  Martin Blais | 2006-06-06 | 1 file | -2/+2
* Whitespace normalization, via reindent.py.
  Tim Peters | 2004-07-18 | 2 files | -9/+9
* [Patch #918212] Support XHTML's 'id' attribute, which can be on any element.
  Andrew M. Kuchling | 2004-03-21 | 2 files | -6/+31
* SF bug 753592, websucker bug

  Pass the proper variable when the user supplies a directory. Will backport.
  Neal Norwitz | 2003-07-01 | 1 file | -1/+1
* When bad HTML is encountered, ignore the page rather than failing with a traceback.
  Mark Hammond | 2003-02-27 | 1 file | -1/+9
* Handle the Content-Type header a little more appropriately: if it contains options, drop them to get the major/minor content type. Modified from the supplied patch to support more whitespace variation. Closes SF patch #613605.
  Fred Drake | 2002-11-12 | 1 file | -0/+3
* Apply diff2.txt from SF patch http://www.python.org/sf/572113 (with one small bugfix in bgen/bgen/scantools.py)

  This replaces string module functions with string methods for the stuff in the Tools directory. Several uses of string.letters etc. are still remaining.
  Walter Dörwald | 2002-09-11 | 5 files | -27/+22
* Apply diff.txt from SF patch http://www.python.org/sf/561478

  This uses cgi.parse_header() in Checker.checkforhtml(), so that webchecker recognises the mime type text/html even if options are specified.
  Walter Dörwald | 2002-06-06 | 1 file | -1/+2
* [Bug #512799] urllib.splittype() returns a 2-tuple. (Reported by seb bacon)
  Andrew M. Kuchling | 2002-03-08 | 1 file | -1/+1
* Fix SF bug #482171: webchecker dies on file: URLs w/o robots.txt

  The cause seems to be that when a file URL doesn't exist, urllib.urlopen() raises OSError instead of IOError. Simply add this to the except clause. Not elegant, but effective. :-)
  Guido van Rossum | 2001-12-11 | 1 file | -2/+2
* Only catch NameError and TypeError when attempting to subclass an exception (for compatibility with old versions of Python).
  Fred Drake | 2001-05-11 | 1 file | -1/+1
* Added more link attributes based on additional information from Chris McCafferty <christopher.mccafferty@csg.ch>, and a bit of experimentation with Navigator 4.7. HTML-as-deployed is evil!
  Fred Drake | 2001-04-05 | 1 file | -1/+13
* A number of improvements based on a discussion with Chris McCafferty <christopher.mccafferty@csg.ch>:

  Add javascript: and telnet: to the types of URLs we ignore.

  Add support for several additional URL-valued attributes on the BODY, FRAME, IFRAME, LINK, OBJECT, and SCRIPT elements.
  Fred Drake | 2001-04-04 | 1 file | -2/+24
* Patch inspired by Just van Rossum: on the Mac, in savefilename(), make the path to save a relative path by prefixing it with os.sep (':'). Also fix an indent inconsistency in the same function.
  Guido van Rossum | 2000-04-25 | 1 file | -1/+3
* Moved robotparser.py to the Lib directory. If you do a "cvs update" in the Lib directory, it will pop up there.
  Guido van Rossum | 2000-03-29 | 1 file | -97/+0
* Fix suggested by Magnus Kessler: in class Page, it is possible for self.parser to be None; in that case don't dereference it in getnames().
  Guido van Rossum | 2000-03-28 | 1 file | -1/+4
* Skip Montanaro:

  The robotparser.py module currently lives in Tools/webchecker. In preparation for its migration to Lib, I made the following changes:

    * renamed the test() function _test
    * corrected the URLs in _test() so they refer to actual documents
    * added an "if __name__ == '__main__'" catcher to invoke _test() when run as a main program
    * added doc strings for the two main methods, parse and can_fetch
    * replaced usage of regsub and regex with corresponding re code
  Guido van Rossum | 2000-03-27 | 1 file | -17/+17
* Complete the integration of Sam Bayer's fixes.
  Guido van Rossum | 1999-11-17 | 2 files | -912/+10
* Changed from importing wcnew back to webchecker.
  Guido van Rossum | 1999-11-17 | 2 files | -6/+2
* Integrated Sam Bayer's wcnew.py code. It seems silly to keep two files. Removed Sam's "SLB" change comments; otherwise this is the same as wcnew.py.
  Guido van Rossum | 1999-11-17 | 1 file | -46/+185
* # *NOT* by Sam Bayer: reindented to use 4 spaces like the rest here,
  # and removed trailing whitespace.
  Guido van Rossum | 1999-11-17 | 1 file | -204/+203
* Samuel L. Bayer:

    - same trick with "import wcnew; webchecker = wcnew" as above
    - updated readhtml() method to handle pair representation; used new name suppression infrastructure from wcnew.py to suppress processing name anchors

  [And untabified --GvR]
  Guido van Rossum | 1999-11-17 | 1 file | -4/+12
* Samuel L. Bayer:

    - added -t and -a arguments
    - added "import wcnew; webchecker = wcnew" in place of "import webchecker" (I assume that if you're happy with the changes, you'll just replace webchecker.py with wcnew.py, but if I were to do that, the diffs would be incomprehensible)
    - fixed buggy -v argument (I think you got out of sync with the way verbosity was handled in webchecker vs. wcgui between 1.5 and 1.5.2)
    - made -v actually do something by adding a call to c.setflags() (probably the same problem as above)
    - updated references to URLs to accommodate wcnew.py's pair representation; added appropriate calls to format_url() to handle display; added argument to ListPanel() initialization to provide access to format_url()

  [And untabified --GvR]
  Guido van Rossum | 1999-11-17 | 1 file | -17/+46
* Samuel L. Bayer:

    - same fixes from webchecker.py
    - incorporated small diff between current webchecker.py and 1.5.2
    - fixed bug where "extra roots" added with the -t argument were being checked as real roots, not just as possible continuations
    - added -a argument to suppress checking of name anchors

  [And untabified --GvR]
  Guido van Rossum | 1999-11-17 | 1 file | -154/+178
* Samuel L. Bayer:

    - forced new done origins to set errors if they're in self.bad (fixes bug where only the first of a number of errorful references to a link is reported under some circumstances)
    - suppressed adding duplicates to self.todo list (cleans up printout in wcgui details)
  Guido van Rossum | 1999-11-17 | 1 file | -1/+6
* Some changes (maybe not enough?) to make it work on Windows with local file URLs.
  Guido van Rossum | 1999-04-26 | 1 file | -3/+3
* Added Samuel Bayer's new webchecker.

  Unfortunately his code breaks wcgui.py in a way that's not easy to fix. I expect that this is a temporary situation -- eventually Sam's changes will be merged back in. (The changes add a -t option to specify exceptions to the -x option, and explicit checking for #foo style fragment ids.)
  Guido van Rossum | 1999-03-24 | 1 file | -0/+884
* Recover from failed saves; when a file turns out to be a directory, create a directory and move the original file to index.html.
  Guido van Rossum | 1999-01-03 | 1 file | -5/+17
* Added note() message to Page class -- this was used but didn't exist. (The alternative would be to call self.checker.note() but since self.checker might be None that's not quite right.)
  Guido van Rossum | 1998-08-06 | 1 file | -0/+9
* Rewrite to support multiple suckers, each with their own thread.
  Guido van Rossum | 1998-07-08 | 1 file | -102/+140
* Instead of printing, use self.message() or self.note().
  Guido van Rossum | 1998-07-08 | 2 files | -72/+63
* # This is a new module I wrote over the weekend. Again, you missed the
  # checkin email because my PC doesn't have the "Mail" command.

  Add threading (now that it works). Also some small adaptations to Unix again.
  Guido van Rossum | 1998-06-15 | 1 file | -16/+37
* Primitive GUI for websucker.
  Guido van Rossum | 1998-06-15 | 1 file | -0/+185
* Fix the way a trailing / is changed to /index.html so that it doesn't depend on the value of os.sep. (I.e. ported to Windows :-)
  Guido van Rossum | 1998-06-15 | 1 file | -2/+3
* sort the urls in the todo list
  Guido van Rossum | 1998-06-15 | 1 file | -1/+3
* Use a try-except so that the pickle file is written even when we die because of an unexpected exception.
  Guido van Rossum | 1998-04-27 | 1 file | -14/+18
* Give in to tabnanny
  Guido van Rossum | 1998-04-06 | 6 files | -1041/+850
* Use a better way to bind the checkext instance variable to a check button widget, not involving a __getattr__() method but a callback on the widget.
  Guido van Rossum | 1998-03-05 | 1 file | -9/+8
* Adapt to new webchecker structure. Due to better structure of getpage(), much less duplicate code is needed -- we only need to override readhtml().
  Guido van Rossum | 1998-02-21 | 1 file | -59/+33
* Major overhaul. Don't use global variable (e.g. verbose); use instance variables. Make all global functions methods, for easy overriding. Restructure getpage() for easy overriding. Add save_pickle() method and load_pickle() global function to make it easier for other programs to emulate the toplevel interface.
  Guido van Rossum | 1998-02-21 | 1 file | -130/+191
* Map .shtml to text/html.
  Guido van Rossum | 1997-10-07 | 1 file | -0/+1
* A variant on webchecker that creates a mirror copy of a remote site.
  Guido van Rossum | 1997-10-06 | 1 file | -0/+131
* Several changes:

    - Change the code that looks for robots.txt to always look in /, even if the "root" path is somewhere deep down below.
    - Add link processing in <AREA> tags.
    - Change safeclose() to avoid crashing when the file has no geturl() method.
  Guido van Rossum | 1997-10-06 | 1 file | -6/+24
* Tiny script to play with it on a Mac.
  Guido van Rossum | 1997-05-28 | 1 file | -0/+7
* Scroll to top of info window when done.
  Guido van Rossum | 1997-05-09 | 1 file | -0/+1
* Avoid the fancy handler for error 401 (request authentication).
  Guido van Rossum | 1997-05-07 | 1 file | -4/+7
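
Illustration for the 2002-11-12 and 2002-06-06 entries above, which describe dropping options from a Content-Type header so that only the major/minor type remains. The following is a minimal sketch of that idea, not the code from either commit: the function name `major_minor_content_type` is hypothetical, and it uses plain string methods rather than `cgi.parse_header()`.

```python
def major_minor_content_type(header_value):
    """Return 'type/subtype' from a Content-Type value, dropping any options.

    Hypothetical helper for illustration; not taken from webchecker.py.
    """
    # Everything after the first ';' is an option such as charset=...
    ctype = header_value.split(";", 1)[0]
    # Remove all whitespace and normalize case to tolerate header variations.
    return "".join(ctype.split()).lower()


if __name__ == "__main__":
    print(major_minor_content_type("text/html; charset=ISO-8859-1"))
```

With a helper like this, a checker could compare the result against `"text/html"` regardless of whether the server appended a charset or other parameters.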