NAME

Sirobot - a web fetch tool similar to wget


SYNOPSIS

 sirobot.pl [options] <URL> [[options] <URL>...]


DESCRIPTION

Sirobot is a web fetch tool. It is implemented in Perl 5 and runs from the command line.

Sirobot takes URLs as arguments and downloads them; it can also recursively fetch the images and links contained in those HTML files.

The main advantage over other tools like GNU wget is the ability to fetch several files concurrently, which can speed up your downloads considerably.


USAGE

Call Sirobot (the executable is called sirobot.pl) with at least one URL (see URL) as an argument, or specify a file to read URLs from (option --file <file>, see OPTIONS). If it can't find any URLs, a short usage notice is displayed and Sirobot quits.

There are various ways to influence Sirobot's behaviour, such as how deep it should crawl into a WWW tree.

Sirobot tries to figure out which proxy to use. To do so, it looks for the environment variables $http_proxy and $ftp_proxy. You can always set the proxy configuration manually (see --proxy and --ftpproxy).
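For example, you could set these variables in the shell before starting Sirobot; the proxy host and port below are made-up placeholders, and the usual URL form for these variables is assumed:

 export http_proxy=http://proxy.example.com:8080/
 export ftp_proxy=http://proxy.example.com:8080/
 sirobot.pl http://www.sirlab.de/linux/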

Frequently used options may be put into ~/.sirobotrc. This file is processed upon startup, before any command line option is read. It is handled similarly to the --file option (see below), so the syntax is the same as described there.

See also EXAMPLES for a rather useful example.


URL

(If you are familiar with the usage of URLs you may skip this section)

A correct URL may look like this:

   http://server/path/to/index.html    # Standard URL
   http://server/file?query            # Standard URL with query
   http://server/file#frag             # Standard URL with fragment
   
If you need to access a webserver on a port other than the commonly used default port 80, try this (the example accesses port 1234):

   http://server:1234/

Some pages are protected by passwords. Sirobot can access these pages too, but it needs a username and a password from you. The following example uses ``honestguy'' as the username and ``secret'' as the password:

   http://honestguy:secret@server/

It works the same for FTP.
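For example, a password-protected FTP download might look like this (server and path are placeholders):

   ftp://honestguy:secret@server/pub/file.tar.gz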

Note: If you get a strange message about a missing method while using password authentication try updating your libwww-perl and/or URI libraries. See INSTALL for where to get them.


OPTIONS

(See EXAMPLES for how to use them)

Sirobot's behaviour can be influenced in a lot of different ways to better fit your needs.

You can see a short summary of available options by simply running

 sirobot.pl --help      (displays summary of frequently used options)
 sirobot.pl --morehelp  (displays summary of ALL available options)  
 sirobot.pl --examples  (displays some examples of how to use Sirobot)

Please don't be put off by the number of options; you surely do not need them all :-)) If you don't know where to start, run sirobot.pl --help and check out the options displayed there.

Many options like --depth, --samedomain or --exclude remain active for all remaining URLs unless another option overwrites them. Some options take an additional value (e.g. --depth takes a number).

Note: the following notations are all equivalent and are internally converted to the first form.

  --depth 1
  --depth=1
  -d 1          (only available for short options)
  -d1           (only available for short options)


Informative options

-h
--help

Print a help screen with the most important options along with a short explanation, then quit.

See also --morehelp

--morehelp

Print a list of all available options along with a short explanation, then quit.

See also --help

-V
--version

Print version and build date, then quit.


Control verbosity

Note: The following options are mutually exclusive; each one stays in effect until overwritten by another.

--debug

Be incredibly verbose. Useful for debugging (who guessed that? ;-)). If you want to debug the child processes too, also add --nocurses to your command line.

See also --verbose, --silent and --quiet.

--nostats

Don't show statistics when all downloads are done.

See also --stats.

--quiet

Absolutely no output (not even on errors).

See also --verbose, --debug and --silent.

--silent

Print errors only.

See also --quiet, --verbose and --debug.

--stats

Show some statistics when all downloads are done (default).

See also --nostats.

-v
--verbose

Be a bit more verbose during operation and print statistics when done.

See also --quiet, --silent and --debug.


Use curses library for user interface

Note: The following options are mutually exclusive; each one stays in effect until overwritten by another.

--curses

Use the curses library for the user interface (UI) if it is available (default). It is used to improve the readability of statistics etc. The drawback is slightly worse performance if you download a lot of small files, because of the many screen updates.

If curses cannot be used (e.g. if stdout is not a tty), the ``old'' interface will be used.

See also --nocurses.

--nocurses

Do not use the curses library. Everything will be printed out as-is. You may also want to use this option to turn off the warning message shown when the library is not installed.

See also --curses.


Control behaviour if files already exist.

Note: The following options are global and mutually exclusive; only the last one given is active.

-c
--continue

Continue the download if the file already exists. This is nearly the same as --tries (see there for limitations), except that --continue works even if the (incomplete) file was fetched with another tool.

See also --force and --noclobber.

-f
--force

If a file already exists on your hard disk, overwrite it without asking.

See also --continue, --newer and --noclobber.

--noclobber

Don't touch any existing files; skip such links instead (default).

See also --force, --newer and --continue.

-n
--newer

Overwrite existing files only if the remote copy is newer. This feature uses the file's modification time and requires the Last-Modified HTTP header to be set by the server; otherwise it behaves like --noclobber.

See also --force, --noclobber and --continue.


Limit from where files will be fetched

Note: The following options are mutually exclusive; each one stays in effect until overwritten by another.

Note: These options also affect in which subdirectory the files are stored.

--anyserver

Upon recursive download, fetch all links, wherever they point to. Use with care!

See also --samedir, --samedomain, --sameserver and --depth.

--samedir

Upon recursive download, only fetch links pointing to the same directory as the specified URL or to any of its subdirectories. This is the default operation.

See also --sameserver, --samedomain, --anyserver and --depth.

--samedomain

Upon recursive download, only fetch those links pointing to the same domain as the specified URL.

See also --samedir, --sameserver, --anyserver and --depth.

--sameserver

Upon recursive download, only fetch those links pointing to the same server as the specified URL.

See also --samedir, --samedomain, --anyserver and --depth.


Limit which files will be fetched

Note: The following options can be mixed; each option may overwrite the preceding one partially or completely.

--exclude <regexp>

During recursive download, do not fetch files that match a comma-separated list of regular expressions. By default, all files are allowed. Anything Perl accepts as a regular expression can be used for <regexp>; it is converted directly into the Perl statement m/<regexp>/.

See man perlre for more details, and see EXAMPLES. You may give several --exclude options and mix them with --include. If you want to allow only particular files, try this combination:

--exclude . --include <regexp>

which disallows all files (a dot matches any string with at least one character) and then re-allows files matching <regexp>. The default can be restored by adding --include '.'. Note: when entered on a shell command line, the regexp should be quoted, e.g. --include '.*'.
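For instance, to fetch only HTML pages and JPEG images during a recursive run, a combination along these lines could be used (the URL is a placeholder):

 sirobot.pl --exclude '.' --include '\.html$' --include '\.jpe?g$' \
     http://server/path/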

See also --include.

--include <regexp>

Allow recursive download of files that match a comma-separated list of regular expressions. You may give several --include options and mix them with --exclude. By default, all files are allowed. See --exclude for more information.


Manual proxy configuration

Note: Sirobot first reads the environment variables $http_proxy, $ftp_proxy and $no_proxy to figure out your system's default settings.

Note: These settings are global for all URLs to fetch. Command-line options override environment settings.

--ftpproxy <FTPPROXYURL>

Use <FTPPROXYURL> for all FTP connections (``-'' will unset it). Sirobot can't access FTP sites directly; it always needs a proxy that translates between HTTP and FTP for it (most proxies are able to do that).

See also --proxy and --noproxy.

--noproxy <DOMAIN>,<DOMAIN>,...

A comma separated list of domains which will be accessed without a proxy.

See also --proxy and --ftpproxy.

-P <PROXYURL>
--proxy <PROXYURL>

Use <PROXYURL> as a proxy for all HTTP requests (``-'' will unset it).

See also --ftpproxy and --noproxy.
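A quick sketch combining these options (the proxy address, the local domain and the FTP URL are made-up placeholders):

 sirobot.pl --proxy http://proxy.example.com:8080/ \
     --ftpproxy http://proxy.example.com:8080/ \
     --noproxy intranet.example \
     ftp://server/pub/file.tar.gz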


Convert pages

Note: The following options are mutually exclusive; each one stays in effect until overwritten by another.

--convert

Sirobot can be asked to convert all links in HTML files from absolute to relative. This is useful for sites that use a lot of absolute links (e.g. Slashdot), whose downloaded copies you otherwise cannot view directly. Please note that the options --anyserver, --sameserver, --samedomain and --samedir affect which links are actually converted, because they determine in which folders the files are stored.

See also --noconvert.

--noconvert

Turn conversion feature off (default).

See also --convert.


Read URLs and additional arguments from file

Note: The following options are mutually exclusive; each one stays in effect until overwritten by another.

-F <file>
--file <file>

Read additional options and URLs from the given file. <file> may contain multiple lines. Lines starting with # will be ignored.

Note: Although it is possible to have multiple arguments per line, using one line per argument is strongly recommended.

All arguments read from the file are processed as if they had been entered on the command line. The same syntax applies, but remember that you must not escape special shell characters or use quotes. This also implies you can't have spaces as part of an argument, or empty arguments at all (really need that? Write me!)

See also EXAMPLES.

--noremove

Turn off the --remove feature (default).

See also --remove.

--remove

This option only makes sense in combination with one or more URLs read from a file (see --file). After the URL has been downloaded successfully, it is deactivated in the file it came from. --remove is useful to better keep track of which files are already fetched and which are not.

Deactivation of a link is done by prepending a #[SIROBOT: done] to the line that contains the link.

For this to work correctly, there must be only one link per line (and only the link; do not put options on the same line, put them on a separate line before the link).

This flag is intended to be used in combination with --continue (which is not turned on by default) in order to resume large downloads whenever you are online, but it can be used without --continue, too.

Note: As mentioned earlier, Sirobot can only detect whether a file is complete if the server provides information about its content length.

See also --noremove, --file and EXAMPLES.


Log to file

--log <file>

Write logging information to a file. This is very useful because you cannot redirect output to a file while --curses is in use. In that case, everything printed to the upper part of the curses screen is also written to the file.

If you have curses turned off (e.g. by --nocurses), the logged output is the same as on the screen.
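For example (the name of the log file is arbitrary):

 sirobot.pl --log sirobot.log -d 2 http://www.sirlab.de/linux/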

See also --nolog.

--nolog

Turn logging off (default).

See also --log.


Daemon mode

--daemon

Turn on daemon mode. In this mode, Sirobot opens a named pipe (see --pipe) and does not exit when there are no more waiting jobs. You can write any arguments to the pipe and Sirobot will process them like those given by --file.

Note: The named pipe must be created before you run Sirobot (e.g. by the shell command mkfifo).

Note: Unfortunately, Sirobot blocks upon startup until at least one line is written to the pipe (e.g. by echo >/tmp/sirobot). This is not Sirobot's fault.

See also --nodaemon, --pipe and EXAMPLES.

--nodaemon

Turn off daemon mode (default).

See also --daemon and --pipe.

--pipe <file>

Set name of pipe used for daemon mode. Default is /tmp/sirobot.

See also --daemon and --nodaemon.


Various options

-d <n>
--depth <n>

Sirobot can download images and links of HTML files as well. This option specifies how deep Sirobot should descend into them. Depth 0 means Sirobot only downloads the URLs specified on the command line.

 Depth 1 tells Sirobot to download all included images but no
         further links.
 Depth 2 does the same as Depth 1 PLUS it fetches all links on this
         page PLUS all images of the links.
 Depth 3-... you can probably guess the rest ;-)

To avoid downloading the whole Internet, the use of --samedir, --sameserver and --samedomain as well as --exclude and --include is strongly recommended!

--from <email>

Set the value for the ``From:'' header in HTTP requests. By default, Sirobot guesses your email address from the environment variables $USER and $HOSTNAME. Please set your email address with this option in ~/.sirobotrc as shown in EXAMPLES.

-H <header>=<value>
--header <header>=<value>

Add a user-defined header to all HTTP requests. If <header> is ``-'', the list of headers will be discarded. As an example, --header From=myname@home will be translated into a ``From: myname@home'' line in the HTTP request header. Useful for sites that require a correct Referer: header before they allow downloads.
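For example, to send a Referer header to a site that refuses downloads without one (both URLs are placeholders):

 sirobot.pl --header Referer=http://server/index.html \
     http://server/download/file.zip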

-j <n>
--jobs <n>

Specifies the number of downloads Sirobot should do concurrently. Default is 5. This is a global setting.

--norobots

Ignore /robots.txt. HTTP servers usually supply a file named http://<servername>/robots.txt to inform automatic tools like Sirobot which files on the server must not be downloaded, in order to prevent unwanted behaviour and infinite recursion.

This option is NOT RECOMMENDED! USE WITH CARE!

-p <dir>
--prefix <dir>

Saves all downloaded files to <dir>. <dir> and its subdirectories will be created if necessary. By default, Sirobot saves files to the current directory.

-t <n>
--tries <n>

Tells Sirobot how many times to try to fetch each file. Default is 1, which means Sirobot doesn't try again in case of failure.

To be able to determine whether a download was incomplete, Sirobot needs some help from the server, so this feature might not work with all files! This also applies to --continue.
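For example, to have Sirobot try each file up to three times and also continue a partially fetched file that may already exist (the URL is a placeholder):

 sirobot.pl --tries 3 --continue http://server/pub/big-file.iso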

--exec <prg>

Execute this external program whenever an HTML page has been successfully downloaded. The following arguments are appended: URL, filename and current depth.

Any output written to STDOUT will be discarded unless the lines start with ECHO. Error messages written to STDERR are currently not filtered and therefore go directly to Sirobot's STDERR, which may cause screen corruption when curses is turned on.
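As an illustration, here is a minimal helper script for --exec; the script name is made up, and it simply reports each finished page through an ECHO line (remember that only STDOUT lines starting with ECHO are kept):

  #!/bin/sh
  # hypothetical --exec helper; Sirobot appends: URL FILENAME DEPTH
  url=$1
  file=$2
  depth=$3
  # only lines starting with ECHO survive Sirobot's STDOUT filter
  echo "ECHO fetched $url (depth $depth) -> $file"

It could then be hooked in like this:

  sirobot.pl --exec ./report.sh -d 2 http://www.sirlab.de/linux/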

--dump
--nodump
--dumpfile <file>

If turned on (--dump), this feature causes Sirobot NOT to follow links and download them recursively, but to write all found links to the file <file> (or to STDOUT if <file> is ``-''). Duplicate links are automatically removed and dumped only once. Also, a ``-d<number>'' will be put in front of each link to represent the current depth setting.

This can be used by an external program to filter incoming links or to run Sirobot in some kind of dry or test mode. In conjunction with option --daemon, the external program can feed the filtered links back into Sirobot. Here's a simple and senseless loopback demonstration (the named pipe /tmp/sirobot must exist):

  sirobot.pl --dump --dumpfile /tmp/sirobot --pipe /tmp/sirobot \
             --daemon -d2 http://www.sirlab.de/linux/

Please note the following drawbacks:



--wait
--nowait

Wait/don't wait for the user to press a key after all downloads are complete and the statistics are shown. This option only affects Sirobot's behaviour if Curses are turned on.

--cookies <file>
--nocookies

Allow/disallow use of cookies. Cookies, when received from the server and allowed by the user, are stored in the file <file> and sent back automatically.

By default, cookies are turned off. It is recommended to activate cookies only if the site you are downloading from refuses to transmit information unless cookies are enabled, and to start every session with a cookie file that is either empty or contains only the necessary persistent cookies.

Note: If multiple jobs store new cookies simultaneously, the cookies file might become corrupted. Similarly, new cookies might not be available to parallel jobs immediately. The recommended procedure is therefore to make sure that the pages accompanied by cookies are retrieved first. This can be accomplished by using the single cookie-setting page as the starting page, or by fetching the files accompanied by cookies in a first serial run (with --jobs 1) and making a second call with parallel jobs that uses the cookies stored during the first run.
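A sketch of that two-step procedure (the cookie file name and the URLs are placeholders):

  # first run: fetch the cookie-setting page serially
  sirobot.pl --cookies cookies.txt --jobs 1 -d 0 http://server/login.html

  # second run: download in parallel, reusing the stored cookies
  sirobot.pl --cookies cookies.txt -d 2 http://server/archive/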


Tuning options

--blocksize <val>

Set the size of the chunks to process. Files are downloaded and saved in chunks. Bigger values mean less overhead and therefore much better performance, but also a less accurate progress bar in curses mode.

The default value is 4096 bytes (4 KB). Use bigger values for fast links and the default value or less for slow ones.
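For example, on a fast link you might raise the chunk size to 16 KB (the value is only an illustration):

 sirobot.pl --blocksize 16384 http://www.sirlab.de/linux/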


EXAMPLES


Simple examples

get a single page

 sirobot.pl http://www.sirlab.de/linux/

Get the Sirobot homepage (index.html) and its images, and store them in the current directory.

save files to another directory

 sirobot.pl --prefix /tmp/fetched/ \
     http://www.sirlab.de/linux/

Same as above, but save all downloaded files to /tmp/fetched/.

don't fetch recursively

 sirobot.pl --depth 0 http://www.sirlab.de/linux/

Get index.html only (depth 0).

fetch recursively

 sirobot.pl --anyserver --depth 2 http://www.tscc/searcher.html

Get all links mentioned on this page, wherever they point to, with a maximum depth of two.

exclude files (simple)

 sirobot.pl --exclude '\.gif$' http://www.linux.org/

Get the homepage of linux.org but don't download URLs that end with ``.gif''.

exclude files (advanced)

 sirobot.pl --sameserver --depth 2 --exclude '.' \
     --include '\.html$' http://www.linux.org/

Get all pages recursively with a maximum depth of 2. Exclude all files, then re-allow those that end with ``.html''. That effectively means only HTML files get fetched, but no images or other files.


Read links from file

Read links from file

 sirobot.pl --file getthis.txt

Read getthis.txt and process its contents as command line arguments. Imagine getthis.txt consists of the following lines:

  ### start of getthis.txt ###
  --depth 0
  http://xy.org/
  --prefix zzz
  http://zzz.net/
  ###  end  of getthis.txt ###

which is the same as if you invoke

 sirobot.pl --depth 0 http://xy.org/ --prefix zzz http://zzz.net/


read links from file and remove them if done

 sirobot.pl --remove --continue --file getthis.txt

This is nearly the same as above, with one major difference: After http://xy.org/ and http://zzz.net/ are successfully downloaded, getthis.txt reads like this:

  ### start of getthis.txt ###
  --depth 0
  #[SIROBOT: done] http://xy.org/
  --prefix zzz
  #[SIROBOT: done] http://zzz.net/
  ###  end  of getthis.txt ###

What's that good for, you ask? Well, imagine your connection is terminated before the files are completely fetched (e.g. because you've hung up your modem, the link broke down, etc.). Then you can issue exactly the same command when you're back online. You don't need to keep track of which files are complete and which are not.


Your personal settings

You may create a file ~/.sirobotrc which will be processed upon startup. It usually contains your preferred settings so you don't need to type them every time.

Here's what I have put into my ~/.sirobotrc:

  ### start of ~/.sirobotrc ###
  # Put your email address here:
  --from yourusername@somedomain
  
  # Exclude all nasty big files that might accidentally be fetched
  # during recursions. They can still be re-enabled if needed.
  --exclude \.(gz|bz2|tar|tgz|zip|lzh|lha)(\?.*)?$
  --exclude \.(mpg|mp3|wav|aif|au)(\?.*)?$
  --exclude \.(ps|pdf)(\?.*)?$
  ###  end  of ~/.sirobotrc ###



Using daemon mode

 mkfifo /tmp/sirobot
 sirobot.pl --daemon &
 echo >/tmp/sirobot

This creates the named pipe /tmp/sirobot (aka fifo) and puts Sirobot into daemon mode. Sirobot will block until you write something to the named pipe; that's what the last line is for.

Now you can send Sirobot additional commands by writing to the pipe:

 echo --depth 0 >/tmp/sirobot
 echo http://slashdot.org >/tmp/sirobot
 echo --prefix fm/ http://freshmeat.net >/tmp/sirobot

End daemon mode by writing --nodaemon to the pipe:

 echo --nodaemon >/tmp/sirobot



Hints

Remember that the following options affect only URLs issued after them: --anyserver, --samedomain, --samedir, --sameserver, --depth, --prefix, --exclude, --include and --tries.

This means you can fetch URL1 with depth 2 and URL2 with depth 1 and save them to different directories in a single call of Sirobot, using the combination ``--prefix dir1/ --depth 2 URL1 --prefix dir2/ --depth 1 URL2''.

 sirobot.pl --anyserver -d 2 http://slashdot.org/ \
     --samedir http://freshmeat.net/

Get all links from Slashdot (depth 2) and those links from freshmeat.net that point to the same directory (depth 2, too!).

You still didn't get it? Let me know! See CONTACT for how to contact the author.


DISCLAIMER

This piece of software comes with absolutely no warranty. The author cannot be made responsible for any failures, defects or other damages caused by this program. Use it at your own risk.


COPYRIGHT

Sirobot is GPL.


CONTACT

Problems? Found a bug? Want new features?

Feel free to contact the author for any kind of reason except SPAM:

    Email: Settel <settel@sirlab.de>
      WWW: http://www.sirlab.de/linux/contact.html
      IRC: Settel, usually on #unika and #linuxger

See the following page for updates, changelogs etc:

      http://www.sirlab.de/linux/