NAME

map_site - generate sitemap.xml file for a website


SYNOPSIS

map_site [--exclude exclude_pattern] website_url [website_rootdir]

map_site --help


DESCRIPTION

The map_site command generates a sitemap.xml file, as used by common search-engine crawlers, for a website. The command is intended to be run on the web server. Starting at the website_rootdir, the map_site program walks the directory tree and examines each ordinary file. If a file appears to be an .html, .pdf, or other type of file commonly associated with websites, it is included in the sitemap.xml file. Files which appear to be graphics files (.jpg, .png, and so on) are not included.

--exclude exclude_pattern

The exclude_pattern is compared to each file path. If it matches, then that file is not included in the sitemap.xml file. The comparison is made against the entire path, not just the file name, so a match against a directory in the path, for example, will cause every file in that directory or in any subdirectory thereof to be excluded.

The --exclude switch may be repeated to exclude pathnames matching any of several patterns.
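
The matching rule described above can be illustrated with a short sketch. This is not the program's Perl code; it uses Python's re module as a stand-in, which behaves like Perl regular expressions for simple patterns such as these but is not identical in general. The function name is_excluded and the sample patterns and paths are hypothetical.

```python
import re


def is_excluded(path, exclude_patterns):
    """Return True if any pattern matches anywhere in the full path."""
    # re.search scans the whole string rather than anchoring at the start,
    # so a match on a directory component excludes every file beneath it.
    return any(re.search(pattern, path) for pattern in exclude_patterns)


# Hypothetical patterns, as if given by two --exclude switches.
patterns = ["pdir", r"\.bak$"]

print(is_excluded("docs/pdir/index.html", patterns))  # True
print(is_excluded("docs/notes.bak", patterns))        # True
print(is_excluded("docs/index.html", patterns))       # False
```

Because the pattern is unanchored, "pdir" would also exclude a file named "updir.html"; anchor the pattern (for example "/pdir/") if only a directory is meant.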

website_url

This specifies the URL of the website root. In the generated sitemap.xml file, the website_url is prepended to each path to create the entire URL for the page. The protocol ("http://") should be included.
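
As an illustration of how a page URL is formed from the website_url and a file path, here is a minimal Python sketch. It is an assumption about the general idea, not the program's actual Perl code: the function name page_url is hypothetical, and whether map_site normalises slashes or percent-encodes characters in the way shown is not documented here.

```python
from urllib.parse import quote


def page_url(website_url, relative_path):
    """Join the site root URL and a path relative to the document root."""
    # Assumption: keep exactly one slash at the join, whether or not
    # website_url was given with a trailing slash.
    # quote() percent-encodes characters such as spaces but leaves "/" alone.
    return website_url.rstrip("/") + "/" + quote(relative_path.lstrip("/"))


print(page_url("http://www.example.com/", "docs/index.html"))
# http://www.example.com/docs/index.html
print(page_url("http://www.example.com", "/my file.html"))
# http://www.example.com/my%20file.html
```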

website_rootdir

This is the path to the document root directory of the website. It is also where the sitemap.xml file is placed. The default is the current directory.


RETURN VALUE

0 - success

1 - error or failure


DIAGNOSTICS

Error messages are self-explanatory. Generally, the program quits upon encountering any error.


FILES

<stdout> - error and information messages

$website_rootdir/sitemap.xml - generated sitemap

$website_rootdir/sitemap.xml.$dateandtime - backup of previous sitemap.xml file

$website_rootdir/sitemap.xml.tmp - temporary working file
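
The three files listed above suggest a write-then-swap sequence. The Python sketch below shows one way such a sequence could work; the order of operations, the function name install_sitemap, and the timestamp format used for the backup name are all assumptions, not details taken from the program.

```python
import os
import tempfile
import time


def install_sitemap(rootdir, xml_text):
    """Write a new sitemap via a temporary file, backing up any old one.

    Returns the backup path, or None if there was no previous sitemap.
    """
    final = os.path.join(rootdir, "sitemap.xml")
    tmp = final + ".tmp"

    # Write the complete file under the temporary name first, so a failure
    # part-way through never leaves a truncated sitemap.xml behind.
    with open(tmp, "w", encoding="utf-8") as handle:
        handle.write(xml_text)

    backup = None
    if os.path.exists(final):
        # Assumed timestamp format; the real $dateandtime format may differ.
        backup = final + "." + time.strftime("%Y%m%d%H%M%S")
        os.rename(final, backup)

    # Move the finished file into place.
    os.replace(tmp, final)
    return backup


# Usage in a throwaway directory: the second call backs up the first file.
demo_root = tempfile.mkdtemp()
install_sitemap(demo_root, "<urlset/>")
install_sitemap(demo_root, "<urlset></urlset>")
print(sorted(os.listdir(demo_root)))
```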


ENVIRONMENT

The program is written as a Perl script for Linux. It has not been tried in other environments, though it might work there.


RESTRICTIONS

The program works only with static web pages. The URL of each page must correspond to the path of its file relative to the document root.


BUGS

The types of files (.pdf, .html, .xml, and so on) included in the sitemap are hard coded in the program. A future version may add options to modify the list, or a configuration file to specify the included file types.

WARNING: This program was written as an ad hoc solution to a specific problem. It was not tested or verified outside of the original application. It may have very serious bugs. Consider it "unfinished". A little work and testing may be in order. Use and proceed with caution!

No support is promised. However, if you have problems or questions, then please contact support@tatanka.com, and they may be addressed when and if resources are available.


NOTES

  1. The map_site command assumes static pages, and expects the URLs of the pages to mirror the directory structure beginning at the document root.

  2. In evaluating whether a file is to be included in the sitemap, the following procedure is followed:

    Step 1:

    Files that match any exclude_pattern are rejected. Paths that end in a tilde (U+007E TILDE, "~") are presumed to be backup files and are always excluded. Also excluded are files generated by the map_site program itself (such as sitemap.xml), and files whose names begin with a ".".

    Step 2:

    If the file name has a suffix, it is compared with a hard-coded list of allowable suffixes. If the suffix is found in the list, then the file is included in the sitemap. The list includes suffixes typically found on website filenames, such as .html, .xml, and .pdf.

    Step 3:

    If the file name has no suffix, then the file(1) command is used to determine the type of file. If the type is "text", then the file is included in the sitemap. Otherwise, it is excluded.

  3. Patterns are compared to path names using Perl regular expression syntax, not shell wildcard (glob) syntax. For instance, use \. to match a literal "."; in a Perl regular expression, an unescaped "." matches any single character (other than a newline).

  4. The generated sitemap includes the <loc> and <lastmod> tags; the <changefreq> and <priority> tags are not created. The <lastmod> tag value is derived from the last modification date and time of the file.

  5. You might run this on a development machine and then transfer the resulting sitemap.xml to the web server. This would probably work, but if the file timestamps on the server do not match those on the development machine, the <lastmod> values will be inaccurate, and crawlers might miss some updated files.
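
The three-step procedure in note 2 can be sketched in Python as follows. This is an illustration only, not the program's Perl code. The suffix list shown contains just the suffixes named in this document (the real hard-coded list is longer), the function name should_include is hypothetical, and the external file-type check of Step 3 is replaced by a caller-supplied function so that the sketch stays self-contained. Treating a suffix that is not in the list as "excluded" is an assumption consistent with the DESCRIPTION.

```python
import os
import re

# Only the suffixes this document names; the real list is hard coded
# in map_site and is longer.
ALLOWED_SUFFIXES = {".html", ".xml", ".pdf"}


def should_include(path, exclude_patterns, looks_like_text):
    """Decide whether a file belongs in the sitemap (Steps 1 to 3)."""
    name = os.path.basename(path)

    # Step 1: user exclusions, backup files, dot files, generated files.
    if any(re.search(pattern, path) for pattern in exclude_patterns):
        return False
    if path.endswith("~") or name.startswith("."):
        return False
    if name == "sitemap.xml" or name.startswith("sitemap.xml."):
        return False

    # Step 2: a file with a suffix is judged by the suffix list alone.
    suffix = os.path.splitext(name)[1]
    if suffix:
        return suffix in ALLOWED_SUFFIXES

    # Step 3: no suffix, so fall back to a file-type check.
    return looks_like_text(path)


def always_text(path):
    """Stand-in for the external file-type check; always answers 'text'."""
    return True


print(should_include("docs/index.html", [], always_text))        # True
print(should_include("images/logo.png", [], always_text))        # False
print(should_include("docs/index.html~", [], always_text))       # False
print(should_include("pdir/page.html", ["pdir"], always_text))   # False
print(should_include("README", [], always_text))                 # True
```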


EXAMPLES

The command

        map_site http://www.example.com/

will create a site map for the document root assumed to be at the current directory.

        map_site --exclude pdir http://www.example.com/ /usr/web/htdocs/

will create a site map for the document root /usr/web/htdocs, and will exclude any files containing "pdir" in their paths.


SEE ALSO

file(1), find(1)

http://www.sitemaps.org/

This document may also be installed as a man page; try "man map_site".


COPYRIGHT

Copyright 2012 Michael Marking <marking@tatanka.com>. This document is made available under the terms of the GNU Free Documentation License version 1.3.

The program itself is free software. It is provided WITHOUT ANY WARRANTY, without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. It is distributed under the terms of the GNU General Public License version 3.0.

This page is from map_site-0.0.2-prototype (2012.01.13).