| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
| |
This uses cgi.parse_header() in Checker.checkforhtml(), so that
webchecker recognises the mime type text/html even if options
are specified.
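A minimal sketch of the idea, not the webchecker code itself: a Content-Type value such as `text/html; charset=iso-8859-1` only matches `text/html` once its options are split off. Because the `cgi` module was removed from the standard library in Python 3.13, this sketch swaps in `email.message.Message` for `cgi.parse_header()`; the helper name `is_html` is hypothetical.

```python
from email.message import Message


def is_html(content_type_header):
    """Return True if the header names text/html, ignoring any options."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() drops parameters such as charset and lowercases the type
    return msg.get_content_type() == "text/html"
```

With this, `is_html("text/html; charset=iso-8859-1")` and `is_html("text/html")` are treated the same way.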
|
| |
|
|
|
|
|
|
| |
The cause seems to be that when a file URL doesn't exist,
urllib.urlopen() raises OSError instead of IOError. Simply add this
to the except clause. Not elegant, but effective. :-)
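For illustration only, here is how the same situation looks in modern Python 3, where `IOError` is an alias of `OSError` and `urllib.error.URLError` is a subclass of it, so a single except clause covers the missing-file case. The helper name `open_or_none` is an assumption, not a function from webchecker.

```python
import urllib.request


def open_or_none(url):
    """Return an open response for url, or None if it cannot be opened."""
    try:
        return urllib.request.urlopen(url)
    except OSError:
        # Covers IOError (an alias of OSError) and URLError (a subclass)
        return None
```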
|
|
|
|
| |
exception (for compatibility with old versions of Python).
|
|
|
|
|
|
|
| |
McCafferty <christopher.mccafferty@csg.ch>, and a bit of experimentation
with Navigator 4.7.
HTML-as-deployed is evil!
|
|
|
|
|
|
|
|
|
| |
<christopher.mccafferty@csg.ch>:
Add javascript: and telnet: to the types of URLs we ignore.
Add support for several additional URL-valued attributes on the BODY,
FRAME, IFRAME, LINK, OBJECT, and SCRIPT elements.
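A rough sketch of scheme-based filtering along these lines, written for Python 3. The set name `IGNORED_SCHEMES` and the function `should_check` are hypothetical, and the list of schemes is only what this log mentions (mailto, news, javascript, telnet), not necessarily the checker's full list.

```python
from urllib.parse import urlsplit

# Schemes named in this log as not worth checking (assumed list, not exhaustive)
IGNORED_SCHEMES = {"mailto", "news", "javascript", "telnet"}


def should_check(url):
    """Return False for URLs whose scheme is in the ignore list."""
    return urlsplit(url).scheme.lower() not in IGNORED_SCHEMES
```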
|
|
|
|
|
| |
the path to save a relative path by prefixing it with os.sep (':').
Also fix an indent inconsistency in the same function.
|
|
|
|
| |
If you do a "cvs update" in the Lib directory, it will pop up there.
|
|
|
|
|
| |
self.parser to be None; in that case don't dereference it in
getnames().
|
|
|
|
|
|
|
|
|
|
|
|
| |
The robotparser.py module currently lives in Tools/webchecker. In
preparation for its migration to Lib, I made the following changes:
* renamed the test() function _test
* corrected the URLs in _test() so they refer to actual documents
* added an "if __name__ == '__main__'" catcher to invoke _test()
when run as a main program
* added doc strings for the two main methods, parse and can_fetch
* replaced usage of regsub and regex with corresponding re code
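The two documented methods, parse and can_fetch, survive in today's standard library as `urllib.robotparser.RobotFileParser`. The sketch below feeds it made-up rules and example.com URLs; both are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, passed in as a list of lines
rules = ["User-agent: *", "Disallow: /private/"]

rp = RobotFileParser()
rp.parse(rules)

allowed = rp.can_fetch("webchecker", "http://www.example.com/index.html")
blocked = rp.can_fetch("webchecker", "http://www.example.com/private/data.html")
```

Here `allowed` is true because no rule matches the path, while `blocked` is false because the path falls under the disallowed prefix.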
|
| |
|
| |
|
|
|
|
|
| |
files. Removed Sam's "SLB" change comments; otherwise this is the
same as wcnew.py.
|
|
|
|
| |
# and removed trailing whitespace.
|
|
|
|
|
|
|
|
|
| |
- same trick with "import wcnew; webchecker = wcnew" as above
- updated readhtml() method to handle pair representation; used
new name suppression infrastructure from wcnew.py to suppress
processing name anchors
[And untabified --GvR]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- added -t and -a arguments
- added "import wcnew; webchecker = wcnew" in place of "import
webchecker" (I assume that if you're happy with the changes, you'll
just replace webchecker.py with wcnew.py, but if I were to do that,
the diffs would be incomprehensible)
- fixed buggy -v argument (I think you got out of sync with the
way verbosity was handled in webchecker vs. wcgui between 1.5 and
1.5.2)
- made -v actually do something by adding a call to c.setflags()
(probably the same problem as above)
- updated references to URLs to accommodate wcnew.py's pair
representation; added appropriate calls to format_url() to handle
display; added argument to ListPanel() initialization to provide
access to format_url()
[And untabified --GvR]
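One plausible reading of the "pair representation" is a (URL, fragment) tuple. The sketch below assumes that reading; `to_pair` is a hypothetical helper, and the body of `format_url` is a guess at what such a function might do, not the code in wcnew.py.

```python
from urllib.parse import urldefrag


def to_pair(url):
    """Split a URL into a (url-without-fragment, fragment) pair."""
    base, fragment = urldefrag(url)
    return (base, fragment)


def format_url(pair):
    """Join a pair back into a displayable URL (assumed behaviour)."""
    base, fragment = pair
    return base + "#" + fragment if fragment else base
```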
|
|
|
|
|
|
|
|
|
|
| |
- same fixes from webchecker.py
- incorporated small diff between current webchecker.py and 1.5.2
- fixed bug where "extra roots" added with the -t argument were being
checked as real roots, not just as possible continuations
- added -a argument to suppress checking of name anchors
[And untabified --GvR]
|
|
|
|
|
|
|
|
| |
- forced new done origins to set errors if they're in self.bad (fixes
bug where only the first of a number of errorful references to a
link is reported under some circumstances)
- suppressed adding duplicates to self.todo list (cleans up printout
in wcgui details)
|
|
|
|
| |
file URLs.
|
|
|
|
|
|
|
|
| |
Unfortunately his code breaks wcgui.py in a way that's not easy
to fix. I expect that this is a temporary situation --
eventually Sam's changes will be merged back in.
(The changes add a -t option to specify exceptions to the -x
option, and explicit checking for #foo style fragment ids.)
|
|
|
|
| |
create a directory and move the original file to the index.html.
|
|
|
|
|
| |
(The alternative would be to call self.checker.note() but since
self.checker might be None that's not quite right.)
|
| |
|
| |
|
|
|
|
|
|
|
| |
# checkin email because my PC doesn't have the "Mail" command.
Add threading (now that it works). Also some small adaptations to
Unix again.
|
| |
|
|
|
|
| |
doesn't depend on the value of os.sep. (I.e. ported to Windows :-)
|
| |
|
|
|
|
| |
because of an unexpected exception.
|
| |
|
|
|
|
|
| |
button widget, not involving a __getattr__() method but a callback on
the widget.
|
|
|
|
|
| |
getpage(), much less duplicate code is needed -- we only need to
override readhtml().
|
|
|
|
|
|
|
| |
instance variables. Make all global functions methods, for easy
overriding. Restructure getpage() for easy overriding. Add
save_pickle() method and load_pickle() global function to make it
easier for other programs to emulate the toplevel interface.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
- Change the code that looks for robots.txt to always look in /, even
if the "root" path is somewhere deep down below.
- Add link processing in <AREA> tags.
- Change safeclose() to avoid crashing when the file has no geturl()
method.
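The first item above, always looking for robots.txt in /, can be sketched with `urljoin`, since a path starting with a slash replaces the whole path of the root URL. The function name `robots_url` is hypothetical.

```python
from urllib.parse import urljoin


def robots_url(root):
    """Return the server-level robots.txt URL for any root URL."""
    # A leading "/" makes urljoin discard the path part of root
    return urljoin(root, "/robots.txt")
```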
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Links are now either in 'todo' or 'done', and ext links
are handled more like local links except that no further
links are gathered (and sometimes they aren't checked,
e.g. for mailto and news URLs). The -x option reverses
its meaning: it disables checking of ext links (they are
moved to 'done' without checking). A new 'errors' table
collects pages with bad links as we go -- redundant,
but useful for the GUI version which needs to report
this as we go. Some new methods, including reset().
New checkpoint format.
Adapted the GUI to the changes in the Checker class.
Added Quit and "Start over" buttons, and a checkbox
to disable checking external links. The details
window now also shows bad links emanating from the
selected page. Miscellaneous small changes.
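The todo/done/errors bookkeeping described in this message could look roughly like the sketch below. It is a simplified model under stated assumptions: the class name `TodoDoneChecker`, the method names, and the `external` flag are all hypothetical, and the real Checker class does considerably more.

```python
class TodoDoneChecker:
    def __init__(self, check_external=True):
        self.check_external = check_external
        self.todo = {}    # url -> list of pages that link to it, still to check
        self.done = {}    # url -> list of pages that link to it, already handled
        self.errors = {}  # page -> list of (bad link, message) found on that page

    def newlink(self, url, origin, external=False):
        """Record a link from page origin to url in exactly one table."""
        if url in self.done:
            self.done[url].append(origin)
        elif url in self.todo:
            self.todo[url].append(origin)
        elif external and not self.check_external:
            # Checking disabled: external links go to 'done' unchecked
            self.done[url] = [origin]
        else:
            self.todo[url] = [origin]

    def markdone(self, url, error=None):
        """Move url from 'todo' to 'done', recording an error per origin."""
        origins = self.todo.pop(url)
        self.done[url] = origins
        if error is not None:
            for page in origins:
                self.errors.setdefault(page, []).append((url, error))
```

Keeping `errors` keyed by the referring page is redundant with the other tables, but it lets a front end show bad links for a page as soon as they are found.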
|
|
|
|
| |
If the user selects an item in 'To check', start checking there.
|
| |
|
| |
|
| |
|
|
|
|
|
| |
Increase MAXPAGE to 150K.
Add back printing of __doc__ for usage message.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Faster HTML parser derived from SGMLparser (Fred Gansevles).
- All manipulations of todo, done, ext, bad are done via methods, so a
derived class can override. Also moved the 'done' marking to
dopage(), so run() is much simpler.
- Added a method status() which returns a string containing the
summary counts; added a "total" count.
- Drop the guessing of the file type before opening the document -- we
still need to check those links for validity!
- Added a subroutine to close a connection which first slurps up the
remaining data when it's an ftp URL -- apparently closing an ftp
connection without reading till the end makes it hang.
- Added -n option to skip running (only useful with -R).
- The Checker object now has an instance variable which is set to 1
when it is changed. This is not pickled.
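The last item, a change flag that is left out of the pickle, can be sketched with `__getstate__` and `__setstate__`. The class name `FlaggedChecker`, the attribute name `changed`, and the helper `roundtrip` are assumptions made for this example.

```python
import pickle


class FlaggedChecker:
    def __init__(self):
        self.todo = {}
        self.changed = 0  # set to 1 whenever the object is modified

    def newtodolink(self, url, origin):
        self.todo.setdefault(url, []).append(origin)
        self.changed = 1

    def __getstate__(self):
        # Copy the instance dict and leave the change flag out of the pickle
        state = self.__dict__.copy()
        del state["changed"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.changed = 0  # a freshly loaded object counts as unchanged


def roundtrip(checker):
    """Pickle and unpickle a checker, as a checkpoint save and load would."""
    return pickle.loads(pickle.dumps(checker))
```

A caller can then skip writing a checkpoint when `changed` is still 0.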
|
|
|
|
|
| |
When -x is combined with -q, still do the checking, but don't print
the errors in this phase -- they are reported by report_errors().
|
|
|
|
|
|
| |
in the 'bad' dictionary (sanitize them so they are picklable; the
sanitizing code is now a subroutine); don't check mailto: URLs; omit
colon in Error message.
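One way to make stored errors picklable is to reduce exception objects to plain strings before saving them. This is only a sketch of that general idea; the function name `sanitize` and the exact output format are assumptions, not the subroutine from this commit.

```python
def sanitize(msg):
    """Turn an exception into a plain string; pass other values through."""
    if isinstance(msg, BaseException):
        # Keep the exception type and text, drop the object itself
        return "%s: %s" % (type(msg).__name__, msg)
    return msg
```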
|
| |
|
|
|
|
| |
Add version number, printed at startup in non-quiet mode.
|