Sirobot - a web fetch tool similar to wget
sirobot.pl [options] <URL> [[options] <URL>...]
Sirobot is a web fetch tool. It's implemented in Perl 5 and runs from a command line.
Sirobot takes URLs as arguments and can download them as well as, recursively, the images and links contained in those HTML files.
The main advantage over other tools like GNU wget is the ability to fetch several files concurrently, which effectively speeds up your downloads.
Call Sirobot (the executable is called sirobot.pl) with at least one URL (see URL) as an argument, or specify a file to read URLs from (option --file <file>, see OPTIONS). If it can't find any URLs, Sirobot displays a short usage notice and quits.
There are various possibilities to influence Sirobot's behaviour such as how deep it should crawl into a WWW tree.
Sirobot tries to figure out which proxy to use. Therefore it looks for the environment variables $http_proxy and $ftp_proxy. You can always set the proxy configuration manually (see --proxy and --ftpproxy).
Frequently used options may be put into ~/.sirobotrc. This file is processed upon startup, before any command line option is read. It is handled like the --file command (see below), so the syntax is the same as described there.
See also EXAMPLES for a rather useful example.
(If you are familiar with the usage of URLs you may skip this section)
A correct URL may look like this:
 http://server/path/to/index.html   # Standard URL
 http://server/file?query           # Standard URL with query
 http://server/file#frag            # Standard URL with fragment

If you need to access a webserver at a port other than the commonly used port 80 (the default), try this (the example accesses port 1234):
http://server:1234/
Some pages are protected by passwords. Sirobot can access these pages, too, but it needs a username and password from you. The following example uses ``honestguy'' as the username and ``secret'' as the password:
http://honestguy:secret@server/
It works the same for FTP.
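For illustration (the server name and path are placeholders), an FTP URL with credentials follows the same pattern:
 ftp://honestguy:secret@server/pub/file.txt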
Note: If you get a strange message about a missing method while using password authentication try updating your libwww-perl and/or URI libraries. See INSTALL for where to get them.
(See EXAMPLES for how to use them)
Sirobot's behaviour can be influenced in a lot of different ways to better fit your needs.
You can see a short summary of available options by simply running
 sirobot.pl --help       (displays a summary of frequently used options)
 sirobot.pl --morehelp   (displays a summary of ALL available options)
 sirobot.pl --examples   (displays some examples of how to use Sirobot)
Please don't get confused by the large number of options; you surely do not need them
all :-)) If you don't know where to start, run
sirobot.pl --help
and check out the commands displayed there.
Many arguments like --depth, --samedomain or --exclude remain active for all remaining URLs unless another option overwrites them.
Some arguments take an additional value (e.g. --depth takes a number).
Note: the following notations are all equivalent and are internally converted to the first form.
 --depth 1
 --depth=1
 -d 1       (only available for short options)
 -d1        (only available for short options)
Print a help screen with the most important options along with a short explanation and quit.
See also --morehelp.
Print a list of all available options along with a short explanation and quit.
See also --help.
Print version + build date and quit.
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
Be incredibly verbose. Useful for debugging (who guessed that? ;-)). If you
want to debug the child processes, too, also add --nocurses to your command line.
See also --verbose, --silent and --quiet.
Don't show statistics when all downloads are done.
See also --stats.
Absolutely no output (not even on errors).
See also --verbose, --debug and --silent.
Print errors only.
See also --quiet, --verbose and --debug.
Show some statistics when all downloads are done (default).
See also --nostats.
Be a bit more verbose during operation and print statistics when done.
See also --quiet, --silent and --debug.
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
Use the curses library for the user interface (UI) if it is available (default). It will be used to improve readability of statistics etc. The drawback is slightly worse performance if you download a lot of small files, because of the many screen updates.
If curses cannot be used (e.g. if stdout is not a tty), the ``old'' interface will be used.
See also --nocurses.
Do not use the curses library. Everything will be printed out as-is. You may want to use this option to turn off warning messages in case you don't have the lib installed.
See also --curses.
Note: The following options are global and mutually exclusive which means only the last of the given options is active.
Continue the download if the file already exists. This is nearly the same as --tries (see there for limitations), except that --continue works even if the (incomplete) file was fetched with another tool.
See also --force and --noclobber.
If a file already exists on your hard disk, overwrite it without asking.
See also --continue, --newer and --noclobber.
Don't touch any existing files but skip this link (default).
See also --force, --newer and --continue.
Overwrite existing files only if newer. This feature utilizes the
modification time of the file and requires the Last-Modified HTTP header
set by the server; otherwise it behaves like --noclobber.
See also --force, --noclobber and --continue.
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
Note: These options also affect in which subdirectory the files are stored.
Upon recursive download, fetch all links, wherever they're pointing to. Use with care!
See also --samedir, --samedomain, --sameserver and --depth.
Upon recursive download, only fetch links pointing to the same directory as the specified URL or to any of its subdirectories. This is the default operation.
See also --sameserver, --samedomain, --anyserver and --depth.
Upon recursive download, only fetch those links pointing to the same domain as the specified URL.
See also --samedir, --sameserver, --anyserver and --depth.
Upon recursive download, only fetch those links pointing to the same server as the specified URL.
See also --samedir, --samedomain, --anyserver and --depth.
Note: The following options can be mixed and each option may overwrite the preceding one partially or completely.
Do not recursively download files that match a comma-separated list of regular expressions. By default, all files are allowed. Everything Perl provides as regular expressions can be used for <regexp>; it is converted directly into the Perl statement m/<regexp>/. Here are the main facts:
Letters and digits match as-is (case-sensitive matching!): ba matches bad and alban but not bla.
A period (``.'') matches any single character: h.llo matches hallo and hello.
An asterisk (``*'') matches any number of repetitions (including none) of the character in front of it: xa*ba matches xaba, xaaba, xaaaaaba and even xba.
A ^ at the beginning denotes the start of a line: ^here matches only if here appears at the beginning of a line. Therefore it never matches there.
A $ at the end denotes the end of a line: gif$ matches any file that ends in gif.
$, ., ^, brackets and some other characters must be escaped by a backslash (\), e.g. \$.
See man perlre for even more details, and see EXAMPLES. You may enter several --exclude options and mix them with --include. If you want to allow only particular files, try this combination:
 --exclude . --include <regexp>
which will disallow all files (a dot matches any string with at least one character) and re-allow files matching <regexp>. The default can be restored by adding --include . again.
Note: when entered as a shell command, the regexp should be quoted: --include '.*'.
See also --include.
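For illustration, a hedged example of the comma-separated form (the URL is taken from the examples below, the extensions are arbitrary) that skips GIF and JPEG files during a depth-2 crawl:
 sirobot.pl --exclude '\.gif$,\.jpg$' -d 2 http://www.sirlab.de/linux/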
Allow recursive downloading of files that match a comma-separated list of regular expressions. You may specify several --include options and mix them with --exclude. By default, all files are allowed. See --exclude for more information.
Note: Sirobot first reads the environment variables $http_proxy, $ftp_proxy and $no_proxy to figure out your system's default settings.
Note: These settings are global for all URLs to fetch. Commandline options override environment settings.
Use <FTPPROXYURL> for all FTP connections (``-'' will unset it). Sirobot can't access FTP sites directly but always needs a proxy that translates between HTTP and FTP for it (most proxies are able to do that).
See also --proxy and --noproxy.
A comma-separated list of domains which will be accessed without a proxy.
See also --proxy and --ftpproxy.
Use <PROXYURL> as a proxy for all HTTP requests (``-'' will unset it).
See also --ftpproxy and --noproxy.
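As a sketch (the proxy host and port are placeholders), this would route all HTTP requests through a proxy while accessing sirlab.de directly:
 sirobot.pl --proxy http://proxy.example:3128/ --noproxy sirlab.de http://www.sirlab.de/linux/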
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
Sirobot can be asked to convert all links in HTML files from absolute to relative. This is useful for sites that use a lot of absolute links (e.g. Slashdot), which you otherwise cannot view directly. Please note that the options --anyserver, --sameserver, --samedomain and --samedir affect the decision which links are actually converted and which are not, because they affect in which folder the files are actually stored.
See also --noconvert.
Turn the conversion feature off (default).
See also --convert.
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
Read additional options and URLs from the given file. <file> may contain multiple lines. Lines starting with # will be ignored.
Note: Although it is possible to have multiple arguments per line, using one line per argument is strongly recommended.
All arguments read from the file are processed as if they had been entered on the command line. That means the same syntax applies, but remember that you must not escape special shell characters or use quotes. This also implies that you can't have spaces as part of an argument, or empty arguments at all (really need that? Write me!)
See also EXAMPLES.
Turn off the --remove feature (default).
See also --remove.
This option only makes sense in combination with one or more URLs read from a file (see --file). After a URL has been downloaded successfully, it is deactivated in the file it came from. --remove is useful to better keep track of which files have already been fetched and which have not.
Deactivation of a link is done by prepending #[SIROBOT: done] to the line that contains the link.
In order to work correctly, there must be only one link per line (and only the link, no options on the same line; put them on a separate line before the link).
This flag is intended to be used in combination with --continue (which is not turned on by default) in order to continue large downloads whenever you are online, but it can be used without --continue, too.
Note: As mentioned earlier, Sirobot can only detect whether a file is complete if the server provides information about its content length.
See also --noremove, --file and EXAMPLES.
Write logging information to a file. This is very useful because you cannot redirect output to a file if you use --curses. In that case, everything printed to the upper part of the curses screen is also written to the file.
If you have curses turned off (e.g. by --nocurses), the output is the same as on the screen.
See also --nolog.
Turn logging off (default).
See also --log.
Turn on daemon mode. In this mode, Sirobot opens a named pipe (see --pipe) and does not exit when there are no more waiting jobs. You can write any arguments to the pipe and Sirobot will process them like those given by --file.
Note: The named pipe must be created before you run Sirobot (e.g. by the shell command mkfifo).
Note: Unfortunately, Sirobot blocks upon startup unless at least one line is written to the pipe (e.g. by echo >/tmp/sirobot). This is not Sirobot's fault.
See also --nodaemon, --pipe and EXAMPLES.
Turn off daemon mode (default).
See also --daemon and --pipe.
Set name of pipe used for daemon mode. Default is /tmp/sirobot.
See also --daemon and --nodaemon.
Sirobot can download images and links of HTML files as well. This option specifies how deep Sirobot should descend into them. Depth 0 means Sirobot only downloads the URLs specified on the command line.
Depth 1 tells Sirobot to download all included images but no further links. Depth 2 does the same as depth 1 PLUS it fetches all links on this page PLUS all images of the links. Depth 3 and beyond: I think you can guess it ;-)
To avoid downloading the whole internet, the use of --samedir, --sameserver and --samedomain as well as --exclude and --include is strongly recommended!
Set the value for the ``From:'' header in HTTP requests. By default, Sirobot guesses your email address using the environment variables $USER and $HOSTNAME. Please set your email address with this option in ~/.sirobotrc as shown in EXAMPLES.
Add a user-defined header to all HTTP requests. If <header> is a ``-'', the list of headers will be discarded. As an example, --header From=myname@home will be translated into a ``From: myname@home'' line in the HTTP request header. Useful for sites that need a correct Referer: header before they allow downloads.
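For instance, a hypothetical invocation (the URLs are placeholders) that supplies such a Referer header might look like this:
 sirobot.pl --header Referer=http://server/gallery/ http://server/gallery/pic1.jpg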
Specifies the number of downloads Sirobot should do concurrently. Default
is 5. This is a global setting.
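For example (the URL is taken from the examples below), to run up to ten downloads at once:
 sirobot.pl --jobs 10 -d 2 http://www.sirlab.de/linux/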
Ignore /robots.txt. Usually, HTTP servers supply a file named http://<servername>/robots.txt to inform automatic tools like Sirobot which files on the server must not be downloaded, in order to prevent unwanted behaviour and infinite recursion.
This option is NOT RECOMMENDED! USE WITH CARE!
Saves all downloaded files to <dir>. <dir> and its subdirectories will be created if necessary. By default, Sirobot saves files to the current directory.
Tells how often Sirobot should try to get each file. Default is 1, which means Sirobot doesn't try again in case of failure.
To be able to determine whether a download was incomplete, Sirobot needs some help from the server, so this feature might not work with all files! This also applies to --continue.
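For example, a sketch (the URL is a placeholder) that retries a large download up to three times and resumes partial files:
 sirobot.pl --tries 3 --continue http://server/pub/big.iso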
Execute this external program whenever an HTML page has been successfully downloaded. The following arguments will be appended: URL, filename and current depth.
Any output written to STDOUT will be discarded unless the lines start with ECHO. Error messages written to STDERR are currently not filtered and therefore go directly to Sirobot's STDERR, which may cause screen corruption when Curses are turned on.
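As a minimal sketch (the script name and its output are invented for illustration), a hook passed via --exec could look like this:

 #!/usr/bin/perl
 # log-fetch.pl: hypothetical --exec hook.
 # Sirobot appends URL, filename and current depth as arguments.
 use strict;
 use warnings;
 my ($url, $file, $depth) = @ARGV;
 # Only lines starting with ECHO show up in Sirobot's output;
 # everything else written to STDOUT is discarded.
 print "ECHO fetched $url (depth $depth) -> $file\n";

It would then be hooked in with something like: sirobot.pl --exec ./log-fetch.pl -d 2 http://www.sirlab.de/linux/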
If turned on (--dump), this feature causes Sirobot to NOT follow links and download them recursively, but to write all found links to a file <file> (or to STDOUT if <file> is ``-''). Duplicate links are automatically removed and dumped only once. Also, a -d<number> will be put in front of each link to represent the current depth setting.
This can be used by an external program to filter incoming links, or to run Sirobot in some kind of dry or test mode. In conjunction with the option --daemon, the external program can feed the filtered links back into Sirobot.
Here's a simple and senseless loopback demonstration (the named pipe
/tmp/sirobot must exist):
 sirobot.pl --dump --dumpfile /tmp/sirobot --pipe /tmp/sirobot \
   --daemon -d2 http://www.sirlab.de/linux/
Please note the following drawbacks:
The directory structure will not be correct unless you use --anyserver.
Dumping links to STDOUT will corrupt the screen when Curses are turned on (see --curses and --nocurses). You might also want to turn statistics and other messages off, too: --quiet, --nostats.
Other settings (prefix, include/exclude list, headers, ...) are not forwarded; whatever was set last will be the setting in effect.
Every dumped link is internally marked as successfully downloaded.
The dumpfile is opened and closed once for every link. On the one hand this means a loss of speed, on the other hand it allows you to get a snapshot while Sirobot runs.
Wait/don't wait for the user to press a key after all downloads are
complete and the statistics are shown. This option only affects Sirobot's
behaviour if Curses are turned on.
Allow/disallow use of cookies. Cookies, when received from the server and allowed by the user, are stored in the file <file> and sent back automatically.
By default, cookies are turned off. It is recommended to activate cookies only if the site you are downloading from refuses to transmit information unless cookies are enabled, and to start every session with a cookie file that is either empty or contains only the necessary persistent cookies.
Note: If multiple jobs store new cookies simultaneously, the cookie file might be corrupted. Similarly, new cookies might not be available to parallel jobs immediately. The recommended procedure is therefore to make sure that the pages accompanied by cookies are retrieved first. This can be accomplished by using the single cookie page as a starting page, or by getting the files accompanied by cookies in a first serial run (with --jobs 1) and making a second call with parallel jobs using the cookies stored in the first run.
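A sketch of that two-run procedure, assuming the cookie option takes the file name as its value and is written here as --cookies <file> (check --morehelp for the exact spelling; URLs and the cookie file name are placeholders):
 sirobot.pl --cookies jar.txt --jobs 1 --depth 0 http://server/login.html
 sirobot.pl --cookies jar.txt -d 2 http://server/members/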
Set the size of chunks to process. Files are downloaded and saved in chunks. Bigger values mean less overhead and therefore better performance, but also less accuracy of the progress bar in curses mode.
The default value is 4096 bytes (4 KB). Use bigger values for fast links and use the default value or less for slow ones.
sirobot.pl http://www.sirlab.de/linux/
Get the Sirobot homepage (index.html) and its images and store them in the current directory.
 sirobot.pl --prefix /tmp/fetched/ \
   http://www.sirlab.de/linux/
Same as above but save all downloaded files to /tmp/fetched/
sirobot.pl --depth 0 http://www.sirlab.de/linux/
Get index.html only (depth 0).
sirobot.pl --anyserver --depth 2 http://www.tscc/searcher.html
Get all links mentioned on this page, wherever they're pointing to, with a maximum depth of two.
sirobot.pl --exclude '\.gif$' http://www.linux.org/
Get homepage of linux.org but don't download URLs that end with ``.gif''.
 sirobot.pl --sameserver --depth 2 --exclude '.' \
   --include '\.html$' http://www.linux.org/
Get all pages recursively with a maximum depth of 2. Exclude all files and re-allow those that end with ``.html''. That effectively means only HTML files get fetched, but no images or other stuff.
sirobot.pl --file getthis.txt
Read getthis.txt and process its content as command line arguments. Imagine getthis.txt consists of the following lines:
 ### start of getthis.txt ###
 --depth 0
 http://xy.org/
 --prefix zzz
 http://zzz.net/
 ### end of getthis.txt ###
which is the same as if you invoke
sirobot.pl --depth 0 http://xy.org/ --prefix zzz http://zzz.net/
sirobot.pl --remove --continue --file getthis.txt
This is nearly the same as above, with one major difference: After http://xy.org/ and http://zzz.net/ are successfully downloaded, getthis.txt reads like this:
 ### start of getthis.txt ###
 --depth 0
 #[SIROBOT: done] http://xy.org/
 --prefix zzz
 #[SIROBOT: done] http://zzz.net/
 ### end of getthis.txt ###
What's that good for, you ask? Well, imagine your connection is terminated before the files are completely fetched (e.g. because you've hung up your modem, the link broke down etc.). Then you can issue exactly the same line when you're back online again. You don't need to keep track of which files are complete and which are not.
You may create a file ~/.sirobotrc which will be processed upon startup. It usually contains your preferred settings so you don't need to type them every time.
Here's what I have put into my ~/.sirobotrc:
 ### start of ~/.sirobotrc ###
 # Put your email address here:
 --from yourusername@somedomain
 # Exclude all nasty big files that might accidentally be fetched
 # during recursions. They still may be re-enabled if needed.
 --exclude \.(gz|bz2|tar|tgz|zip|lzh|lha)(\?.*)?$
 --exclude \.(mpg|mp3|wav|aif|au)(\?.*)?$
 --exclude \.(ps|pdf)(\?.*)?$
 ### end of ~/.sirobotrc ###
 mkfifo /tmp/sirobot
 sirobot.pl --daemon &
 echo >/tmp/sirobot
This creates the named pipe /tmp/sirobot (aka fifo) and puts Sirobot in daemon mode. Sirobot will block until you write something to the named pipe; that's what the last line is for.
Now you can send Sirobot additional commands if you write to the pipe:
 echo --depth 0 >/tmp/sirobot
 echo http://slashdot.org >/tmp/sirobot
 echo --prefix fm/ http://freshmeat.net >/tmp/sirobot
End daemon mode by writing --nodaemon to the pipe:
echo --nodaemon >/tmp/sirobot
Remember that the following options affect only URLs issued after them: --anyserver, --samedomain, --samedir, --sameserver, --depth, --prefix, --exclude, --include and --tries.
This means, you can get URL1 with depth 2 and URL2 with depth 1 and save them to different directories with one single call of Sirobot if you try the combination ``--prefix dir1/ --depth 2 URL1 --prefix dir2/ --depth 1 URL2''.
 sirobot.pl --anyserver -d 2 http://slashdot.org/ \
   --samedir http://freshmeat.net/
Get all links from Slashdot (depth 2) and those links from freshmeat.net that point to the same directory (depth 2, too!).
You still didn't get it? Let me know! See CONTACT for how to contact the author.
This piece of software comes with absolutely no warranty. The author cannot be made responsible for any failures, defects or other damages caused by this program. Use it at your own risk.
Sirobot is GPL.
Problems? Found a bug? Want new features?
Feel free to contact the author for any kind of reason except SPAM:
 Email: Settel <settel@sirlab.de>
 WWW:   http://www.sirlab.de/linux/contact.html
 IRC:   Settel, usually on #unika and #linuxger
See the following page for updates, changelogs etc:
http://www.sirlab.de/linux/