Thursday, December 31, 2009

Ubuntu 9.04: Backing Up and Copying Webpages and Websites

As described in a previous post, I had been using rsync to make backups of various files.  This strategy was not working so well in the case of webpages and websites, or at least I wasn't finding much guidance that I could understand.  (Incidentally, I had also tried the Windows program HTTrack Website Copier, but had found it to be complicated and frustrating.  It seemed to want either to download the entire Internet or nothing at all.)

The immediate need driving this investigation was that I wanted to know how to back up a blog.  I used the blog on which I am posting this note as my test bed.

Eventually, I discovered that maybe what I needed to use was wget, not rsync.  The wget manual seemed thorough if a bit longwinded and complex, so I tried the Wikipedia entry.  That, and another source, gave me the parts of the command I used first:

wget -r -l1 -np -A.html -N -w5 [URL] --directory-prefix=/media/Partition1/BlogBackup1

The parts of this wget command have the following meanings:

  • -r means that wget should recurse, i.e., it should go through the starting folder and all folders beneath it (e.g., sub1, sub2, sub3 . . .)
  • -l1 (that's an L-one) means stay at level number one.  That is, don't download linked pages.
  • -np means "no parent" (i.e., stay at this level or below; don't go up to the parent directory)
  • -A.html means Accept only files with this extension (i.e., only .html files)
  • -N is short for Newer (i.e., only download files that are newer than what's already been downloaded).  In other words, it turns on timestamping
  • -w5 means wait five seconds between files.  This is because large downloads can overload the servers you are downloading from, in which case an irritated administrator may penalize you
  • The URL shown in this command is the URL of this very blog, plus the additional information needed to download all of my posts in one html file.  But it didn't work that way.  What I got, with this command, was each of the posts as a separate html file, which is what I preferred anyway
  • --directory-prefix indicates where I want to put the download.  If you don't use this option, everything will go into the folder from which wget is run.  I came across a couple of suggestions on what to do if your path has spaces in it, but I hadn't gotten that far yet
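Putting those options together, the command can be wrapped in a small script.  This is only a sketch: the blog address and backup folder here are placeholders, not my real ones.

```shell
#!/bin/sh
# Sketch of the backup command as a reusable script. BLOG_URL and DEST
# are placeholders -- substitute your own blog address and backup folder.
BLOG_URL="http://example.blogspot.com/"
DEST="/media/Partition1/BlogBackup1"

build_cmd() {
    # Print the wget invocation rather than running it, so the script
    # can be dry-run safely; pipe the output into sh to really download.
    printf 'wget -r -l1 -np -A.html -N -w5 %s --directory-prefix=%s\n' \
        "$BLOG_URL" "$DEST"
}

build_cmd
```

Printing the command first makes it easy to eyeball before running; `build_cmd | sh` would actually execute it.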

Incidentally, I also ran across another possibility that I didn't intend to use now, but that seemed potentially useful for the future.  Someone asked if there was a way to save each file with a unique name, so that every time you run the wget script, you get the current state of the webpage.  One answer involved using mktemp.  Also, it seemed potentially useful to know that I could download all of the .jpg files from a webpage by using something like this:  wget -e robots=off -r -l1 --no-parent -A.jpg [URL]
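The unique-name idea might be sketched like this.  This is my guess at the approach, using a timestamp in the filename (mktemp would work similarly); the URL and folder are placeholders.

```shell
#!/bin/sh
# Sketch of saving each run of a page under a unique, dated name, so
# successive runs never overwrite each other. URL and folder below are
# placeholders, not real locations.
snapshot_name() {
    # e.g. page-2009-12-31-234501.html
    printf 'page-%s.html' "$(date +%Y-%m-%d-%H%M%S)"
}

NAME=$(snapshot_name)
echo "$NAME"
# Real use (not run in this sketch) would fetch the page into that file:
#   wget -O "/media/Partition1/Snapshots/$NAME" "http://example.com/somepage"
```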

The first download was pretty good, but I had learned some more things in the meantime, and had some questions, so I decided to try again.  Here's the script I used for my second try:
wget -A.html --level=1 -N -np -p -r -w5 [URL] --directory-prefix=/media/Partition1/BlogBackup2

This time, I arranged the options (or at least the short ones) in alphabetical order.  The -p option indicated that images and style sheets would be downloaded too.  I wasn't sure I needed this -- the basic html pages looked pretty good in my download as they were -- but I thought it might be interesting to see how much larger that kind of download would be.  I used a shorter version of the source URL and I designated a different output directory.

I could have added -k (long form:  --convert-links) so that the links among the downloaded html pages would be modified to refer to the other downloaded pages, not to the webpages I had downloaded them from; but then I decided that the purpose of the download was to give me a backup, not a local copy with full functionality; that is, I wanted the links to work properly when posted as webpages online, not necessarily when backed up on my hard drive.  I used the long form for the "level" option, just to make things clearer.  Likewise, with a bit of learning, I decided against using the -e robots=off option.  There were probably a million other options I could have considered, in the long description of wget in the official manual, but these were the ones that others seemed to mention most.
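For reference, the local-copy variant I decided against would have looked something like this.  The URL and folder are hypothetical, and the command is echoed rather than run; the only substantive change is adding -k.

```shell
#!/bin/sh
# Hypothetical variant with --convert-links (-k): links in the saved
# pages are rewritten to point at the local copies, giving a browsable
# mirror rather than a restore-oriented backup. Echoed as a dry run.
CMD="wget -r -l1 -np -A.html -N -k -w5 http://example.blogspot.com/ --directory-prefix=/media/Partition1/BlogCopy"
echo "$CMD"
```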

The results of this second try were mixed.  For one thing, I was getting a lot of messages of this form:

2010-01-01 01:43:03 (137 KB/s) - `/[target directory]/index.html?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=MONTHLY-1196485200000&toggleopen=MONTHLY-1259643600000' saved [70188]

Removing /[target directory]/index.html?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=MONTHLY-1196485200000&toggleopen=MONTHLY-1259643600000 since it should be rejected.

I didn't know what this meant, or why I hadn't gotten these kinds of messages when I ran the first version of the command (above).  It didn't seem likely that the mere rearrangement of options on the wget command line would be responsible.  To find out, I put it out of its misery (i.e., I ran "pkill wget" in a separate Terminal session) and took a closer look.

Things got a little confused at this point.  Blame it on the late hour.  I thought, for a moment, that I had found the answer.  A quick glance at the first forum that came up in response to my search led me to recognize that, of course, my command was contradictory:  it told wget to download style sheets (-p), but it also said that only html files would be accepted (-A.html).  But then, unless I muddled it somehow, it appeared that, in fact, I had not included the -p option after all.  I tried re-running version 2 of the command (above), this time definitely excluding the -p option.  And no, that wasn't it; I still got those same funky messages (above) about removing index.html.  So the -p option was not the culprit.

I tried again.  This time, I reverted to using exactly the command I had used in the first try (above), changing only the output directory.  Oh, and somewhere in this process, I shortened the target URL.  This gave me none of those funky messages.  So it seemed that the order of options on the command line did matter, and that the order used in the first version (above) was superior to that in the second version.  To sum up, then, the command that worked best for me, for purposes of backing up my (blogspot) blog, was this:

wget -r -l1 -np -A.html -N -w5 [URL] --directory-prefix=/media/Partition1/BlogBackup1

Since there are other blog hosts out there, I wanted to see if exactly the same approach would work elsewhere.  I also had a WordPress blog.  I tried the first version of the wget command (above), changing only the source URL and target folder, as follows:

wget -r -l1 -np -A.html -N -w5 [URL] --directory-prefix=/media/Partition1/WordPressBackup

This did not work too well.  The script repeatedly produced messages saying "Last-modified header missing -- time-stamps turned off," so then wget would download the page again.  As far as I could tell from the pages I examined in a search, there was no way around this; apparently WordPress did not maintain time stamps.
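One way to confirm what a server actually sends, before blaming -N, is wget's own header dump: --server-response (-S) combined with --spider prints the response headers without saving the page.  A dry-run sketch, with a placeholder URL:

```shell
#!/bin/sh
# Dry-run sketch: print (not run) a command that asks wget to show the
# server's response headers without downloading anything, so you can
# see whether a Last-Modified header is present at all. Placeholder URL.
CMD="wget -S --spider http://example.wordpress.com/"
echo "$CMD"
# When run for real, look for a "Last-Modified:" line in the output;
# if it is absent, -N has nothing to compare timestamps against.
```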

The other problem was that it did not download all of the pages.  It would download only one index.html file for each month.  That index.html file would contain an actual post, which was good, but what about all the other posts from that month?  I modified the command to specify the year and month in the URL (e.g., .../2009/03).  This worked.  Now the index.html file at the top of the subtree (e.g., .../2009/03) would display all of the posts from that month, and beneath it (in e.g., .../2009/03/01) there were subfolders named for each post, each of which contained the index.html file displaying that particular post.  So at this rate, I would have to write a wget line for each month in which I had posted blog entries.  But then I found that removing the -A.html option solved the problem.  But if I ran it at the year level, it worked only for some months, and skipped others.  I tried what appeared to be the suggestion of running it twice at the year level (i.e., at .../2009/ with an ending slash), with --save-cookies=cookies.txt --load-cookies=cookies.txt --keep-session-cookies.  That didn't seem to make a difference.  So the best I could do with a WordPress blog, at this point, was to enter a separate wget command for each month, like this:

wget -r -l1 -np -N -A.html -w5 [URL] --directory-prefix=/media/Partition1/WordPressBackup

I added back the -A.html option, as shown, because it didn't seem to hurt anything; html pages were the only ones that had been downloaded anyway.
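Rather than typing a command per month by hand, the per-month commands could be generated with a loop.  This is a sketch; the years, base URL, and folder are hypothetical examples.

```shell
#!/bin/sh
# Generate one wget command per month, as described above. The output
# can be piped into sh to actually run the downloads. The base URL,
# years, and destination folder are examples only.
month_commands() {
    base_url=$1
    dest=$2
    for y in 2008 2009; do
        for m in 01 02 03 04 05 06 07 08 09 10 11 12; do
            printf 'wget -r -l1 -np -N -A.html -w5 %s%s/%s/ --directory-prefix=%s\n' \
                "$base_url" "$y" "$m" "$dest"
        done
    done
}

month_commands "http://example.wordpress.com/" "/media/Partition1/WordPressBackup"
```

Running `month_commands ... | sh` would execute the whole batch; printing first keeps it easy to inspect.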

Since these monthly commands would re-download everything, I would run the older ones only occasionally, to pick up the infrequent revision of an older post.  I created a bunch of these, for the past and also for some months into the future.  I put the historical ones in a script of their own, which I planned to run only occasionally, and I put the current and future ones into my daily backup script.
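The occasional-versus-daily split might be scheduled with cron along these lines.  This is a hypothetical crontab fragment; the script names and paths are placeholders, not the ones I actually used.

```shell
# Hypothetical crontab entries: a daily run for current months, and a
# monthly run (1st of each month) for the historical script.
# min hour day month weekday  command
0 3 * * *    /home/user/scripts/blogback-current.sh
0 4 1 * *    /home/user/scripts/blogback-history.sh
```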

But, ah, not so fast.  When I tried this on another, unrelated WordPress blog, it did not consistently download all posts for each month.  I also noticed that it duplicated some posts, in the sense that the higher-level (e.g., month-level) index.html file seemed to contain everything that would appear on the month-level webpage on WordPress.  So, for example, if you had your WordPress blog set up to show a maximum of three posts per page, this higher-level webpage would show all three of those.  The pages looked good; it was just that I was not sure how I would use this mix in an effective backup-and-restore operation.  This raised the question for my own blog:  if I ever did have to restore my blog, was I going to examine the HTML for each webpage manually, to re-post only those portions of text and code that belonged on a given blog page?

I decided to combine approaches.  First, since it was yearend, I made a special-case backup of all posts in each blog.  I did this by setting the blogs to display 999 posts on one page, and then printed that page as a yearend backup PDF.  Second, I noticed that rerunning these scripts seemed to catch additional posts on the subsequent passes.  So instead of separating the current and historical posts, I decided to stay with the original idea of running one command to download each WordPress post.  I would hope that this got most of them, and for any that fell through the cracks, I would refer to the most recent PDF-style copy of the posts.  The command I decided to use for this purpose was of this form:

wget -r -l1 -np -N -A.html -w5 [URL] --directory-prefix=/media/Backups/Blogs/WordPress

I had recently started one other blog.  This one was on LiveJournal.  I tried the following command with that:

wget -r -l1 -np -N -A.html -w5 [URL] --directory-prefix=/media/Backups/Blogs/LiveJournal

This was as far as I was able to get into this process at this point.