Thursday, June 21, 2012

How Google Started to Become a Problem

I guess I have assumed that almost everybody loves Google, and those who don't are the bad guys.  Microsoft, for example.  Maybe it takes a huge corporation to stand up to another huge corporation.  If so, Google is a champion for those who have disliked various things about how Microsoft got its start, what it did to increase its power, and what it has done with that power.

There comes a point, however, when the good guy turns bad.  Maybe it doesn't have to happen.  But power tends to corrupt.  And even when it doesn't actually corrupt, it tends to create an impression of corruption.  That impression, by itself, can make people more or less as miserable as actual corruption and abuse would.

Case in point.  I have been blogging for years, here in Blogger.  I wasn't necessarily eager to see Google acquire Blogger.  But they were welcome to do so, for my purposes, as long as they left me alone.  The deal was that I got to use their free blogging platform to put out various things that I wanted to write, and they got to use my work, my viewers, etc. to make money from advertising and whatnot.

Gifts can make people resentful when they stop.  I would be unhappy with Google if they pulled the plug on my blogging enterprise, even though they're not charging me for it.  I have spent years putting stuff here, linking one post to another and so forth.  It would take a lot of work -- work that I might never do -- to recover if they were suddenly to just shut it down or screw it up.  I would feel that, after all, Google does have competitors, notably WordPress.  If nothing else, I'd sooner pay for a hosted website than do all this work and then watch it get messed up.

What's sad is that I have been warned that they are quite capable of doing exactly that.  It has already happened.  Circa 2000, many people were using DejaNews as a convenient gateway to Usenet.  Usenet newsgroups contained tons of free, helpful information on a vast array of subjects -- especially but not only computer-related, like this blog.  Google acquired DejaNews.  Evidently they felt that all that information would interfere with their desire to sell advertising related to webpages.  For whatever reason, they basically destroyed Deja.  That was a shame for all those people who could have continued to use it to obtain useful information.  And it was irritating to me, because all the things I had put out there, thinking I would always be able to access them, were, as a practical matter, removed from access for me and most everyone else.

I was pretty unhappy with Google about that.  It was the first big chink in their claim that they would do no evil -- their corporate motto being, as widely reported, "Don't be evil."  They had obviously ruined something useful for the purpose of increasing profits.

That stuff would not be coming back to mind now if I weren't having an off day with Google today.  Here I am, working away on my blog, and suddenly it is no longer very functional in Internet Explorer.  I have a nice little desktop arrangement, with various browsers, but now Blogger has suddenly ceased to work properly when I try to post or edit.  Google lets me know that, instead, I should be using its own browser, Chrome, for this purpose.

That part happened several days ago.  So, OK, I have been trying to post in Chrome instead.  But I am finding that Chrome is not yet up to speed for this purpose.  Google was eager enough to move me over to its browser -- the statements and signals have been out there for some time -- but, lo, it develops that Chrome is inserting white backgrounds.  Whole chunks of my post are whited out.  Why?  I don't know.  Probably they don't know either.  I am having to go in and manually remove white highlighting that I didn't put there.  Why not just leave me alone, free to work on my blog in Internet Explorer, until Chrome gets its act together?

That seemed like a fair question, so I tried to present it to Google.  Problem is, their "Contact Us" webpage is a lie.  You cannot contact them through their webpage.  Or at least I cannot.  I tried today.  I tried once before, with a problem so obvious and banal that it pained me to have to bring it to their attention.  In that case, I gave up and wrote them a letter.  It seemed ironic, and yet telling, that I had to use the U.S. Post Office to communicate a simple thought to one of the world's largest software corporations.

Like most people, I don't like being lied to.  If you're not going to let me contact you, don't give me a "Contact Us" webpage.  Call it "FAQs" or whatever.  It's great that you can hire the best and the brightest, but that can backfire:  you can create the impression that you think you're too good for the rest of us.  It wouldn't be terribly smart to generate unnecessary resentment, would it?

It had never occurred to me, until today, to search for something that I have now searched for and found.  Yes, as it turns out, there does exist something called IHateGoogle.org.  I'm not really sure what it's about.  I'm not resentful enough to dig into it.  But, Google, keep it up:  maybe someday I will be.  You seem to be making a good start at it:  today you tell me that as many as 1.4 million webpages convey that sort of feeling toward you and your actions.

Obviously, I am not the only person who has attempted to communicate with Google along these lines.  People rarely get resentful when they feel they are being respected.  If Google cannot make its own programs work together -- Chrome and Blogger, in this case -- it is welcome to keep them in beta.  But forcing me to use them when I don't want to:  at this point, that is a problem.  Not just a software problem.  As presented in this post, it is an indication of larger and more worrisome things.

Monday, June 14, 2010

Making a Post Look Right on Blogger

For several years, I had been using Blogger.com to host this blog. Blogger had a nasty habit of screwing up my formatting: it would insert spaces where I didn't want them, and would invent and repeat various codes that I did not want in my HTML.  That is, even if things did look fine and function reasonably in the Compose view in Blogger, they could be a complete wreck in the Edit HTML view -- and in the final outcome.  Blogger had often messed up my final posts, so that they would look different from how they had looked while I was editing them.  Sometimes, things went to bizarre extremes.  Here's an example:


<b></b>
<b></b></span></span>
<div style="text-align: center;">


<span style="font-size: x-small;">
<span style="font-size: x-small;">
<b><b>Conclusion</b></b>
<b>
</b>
<b><span class="Apple-style-span" style="font-weight: normal;"></span></b>
<b><span class="Apple-style-span" style="font-weight: normal;"></span></b>
<b><span class="Apple-style-span" style="font-weight: normal;"></span></b>
<b><span class="Apple-style-span" style="font-weight: normal;"></span></b>
<b><span class="Apple-style-span" style="font-weight: normal;"></span></b></span></span>
<div style="text-align: left;">


<span style="font-size: x-small;">
<span style="font-size: x-small;">
<b><span class="Apple-style-span" style="font-weight: normal;"><b></b></span></b>
<b><span class="Apple-style-span" style="font-weight: normal;"><b></b></span></b>
<b><span class="Apple-style-span" style="font-weight: normal;"><b></b></span></b>
<b><span class="Apple-style-span" style="font-weight: normal;"><b></b></span></b></span></span>
<div style="display: inline ! important; text-align: center;">


<span style="font-size: x-small;">
<span style="font-size: x-small;">
</span></span>
<div style="display: inline ! important;">


<span style="font-size: x-small;">
<span style="font-size: x-small;"></span></span></div>
</div>
</div>
</div>


Now, what was that all about?  In normal view, it was just one word, "Conclusion," surrounded by a bunch of blank lines -- which, incidentally, I had not inserted; the webpage just took it upon itself to introduce all that junk into my *final* product.  (Note that, even here, the foregoing example is surrounded by extra spaces that I did not insert.)

I posted a question about this in a Google help forum.  One response pointed me toward a Blogger post on this Blogger problem.  That post offered these tips:

  • Before creating your post, arrange your paragraphs in a word processor like MS Word, then paste the text into Blogger.
  • Minimize changes (e.g., repositioning paragraphs) in the editor.
  • Preview before publishing to view the initial output, and make your changes there.

I felt that these were imperfect tips.  On the first one, I had discovered that Word would add its own HTML codes when I pasted material directly into Blogger.  I found it was better to copy the Word text into Notepad, which would tend to strip away that stuff, and then move it from Notepad to Blogger (or just do the editing in Notepad).  The third tip was not very helpful to me either, as attempts to fix problems observed in Preview would sometimes just make things worse.

It seemed to me that the second of those three tips was the heart of the matter.  Blogger's internal editor was just not up to the task of managing text editing.  What I needed to do, I believed, was to try using some kind of HTML editing program, get my text all set up there, and then copy it over and post it without any further changes.  I say an HTML editor, rather than a word processor, because I would want to insert links and make sure that the formatting was right in HTML terms.

About.com, which I had found to be a good general-purpose source of information, had a list of the 10 Best Windows WYSIWYG Editors.  The ones topping the list, from Adobe and Microsoft, cost hundreds of dollars.  In the past, I had used Microsoft FrontPage, which was part of Microsoft Office up through 2003 but seemed to have vanished thereafter.  I was still running Office 2003, so I could have gone with that.  I was trying to get away from relying on Microsoft software, however, and in any event About.com also listed four freeware HTML editors at the bottom of its Top 10 list.  Of those four, SeaMonkey ranked highest, and had versions for Linux and Mac as well as Windows.  It was also described as being appropriate for HTML newcomers.  I was not that, but I was certainly not the alternative in their description either, i.e., a professional web designer.  About.com also ranked SeaMonkey third in its list of Linux-specific HTML editors.  On another list, though, SeaMonkey ranked only 17th, well behind Amaya (for professional web developers), which was ninth on the About.com list.  Between the two, as I looked at their webpages, I felt that SeaMonkey was probably more the direction I wanted to go.

Before pursuing that choice further, I recalled that there was another possibility, namely, to use a blogging tool.  Wikipedia offered a list of the blogging software used by what it described as the Top 20 blogs.  Just two programs, Movable Type and WordPress, dominated that list.  The latter could be confusing:  it was also the name of a blog hosting website, like Blogger, through which bloggers could easily use the WordPress blogging tool to prepare their own blogs.  That last observation raised the question of why someone would use Blogger instead of WordPress to host a blog.  The answer, in my case, was that I had started out with Blogger, I had a lot of stuff on there, and anyway I already had a WordPress blog, with a different orientation, and found it convenient to host this other blog somewhere else.  I was already familiar with the WordPress website, though, and a brief glance at features suggested that it might be more approachable than Movable Type.  That said, a search for blogging tools introduced me to a number of factoids:  that Twitter, among others, was considered a micro-blogging tool; that there was a free version of CoffeeCup, which had appeared on that About.com list; and that PC Magazine offered a comparison of blogging tools in which Xanga -- the only one they rated with four stars (out of five) -- was free and seemed to offer good features, but turned out to be just another place to host a blog.

So now, I felt, my choice was down to the free tools (not blog hosting sites) offered by SeaMonkey, WordPress, and CoffeeCup. SeaMonkey came out sounding pretty good in Smashing Magazine's list, whereas one of the comments posted in response to their list said that CoffeeCup had the very problem I was experiencing, of having a lot of superfluous code generation.  Another comment said positive things about SeaMonkey.  As some of the foregoing comments suggest, WordPress seemed to be for web designers above my ability or interest level -- intended, among other things, for a direct link between the blogging tool and the blog host website.  That seemed like a potentially more complex arrangement than I could justify at present.

I took one last look around and decided not to explore htmlArea's long list of WYSIWYG HTML editors.  Instead, for purposes of my dual-boot and VMware-based machines, I downloaded the Windows version of SeaMonkey and installed the Ubuntu version via Synaptic.  The latter was then available at Applications > Internet > SeaMonkey.  I opened the program and selected File > New > Composer Page.  (I wasn't able to install the Windows version right away because I had other programs running in my virtual machine, but I suspected the functionality was largely the same.)  I went over to the webpage I was composing in Blogger and then realized that I wasn't sure whether I should copy my HTML or my normal text from Blogger into SeaMonkey.  I went back to SeaMonkey and clicked the "HTML Tags" tab at the bottom.  And -- whoa -- the program vanished.  Not good!  I went back to Applications > Internet > SeaMonkey and started it again.  This time, I clicked on the HTML Source tab at the bottom of the screen.  No problem:  it opened a nearly empty HTML editing page.

I went over to Blogger, went into its HTML view, copied everything, and pasted it into that SeaMonkey page, between the opening and closing "body" codes.  I clicked on the Normal tab and there it was, in more or less normal layout.  I went back to the HTML Source tab and saw that it had not actually removed any of those excess codes.  I cut and pasted the code out of there, put it into gedit (similar to Notepad), did a bunch of Find and Replace operations to remove the excess codes, and then pasted it back into the HTML Source tab.  That removed all of my paragraph breaks, so I had to reinsert them manually.  The location of paragraph breaks was easier to spot in the HTML Source tab.  But then, when I flipped to the Normal view and back, the paragraph breaks that I had inserted by just using an extra Enter keystroke were gone.  So I had to either do my paragraph breaks in Normal view or else use <p> codes to break my paragraphs in HTML Source view.  Also, lines were wrapped in a weird way.  I didn't know how to remove unwanted line breaks in gedit or OpenOffice Writer, so I saved the HTML code as a text file, went into my virtual machine, and used Microsoft Word to replace ^p with a spacebar space, and then brought it back into the HTML Source tab -- but then it just reverted to how it was before.  Also, the print in SeaMonkey was a bit small for good proofreading, but I couldn't figure out how to make it larger.
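
Much of that gedit find-and-replace work could have been scripted.  Here is a minimal sed sketch of the sort of cleanup I was doing by hand, assuming the exported HTML has been saved to a hypothetical file named post.html; the patterns are taken from the junk shown above, and stripping opening spans this way leaves orphaned </span> codes that still need attention:

sed -i 's/<b><\/b>//g' post.html                              # remove empty bold pairs
sed -i 's/<span class="Apple-style-span"[^>]*>//g' post.html  # strip the Apple-style-span wrappers
sed -i 's/<span style="font-size: x-small;">//g' post.html    # drop the repeated font-size spans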

When I was done tinkering, I flipped back and forth between the Normal and HTML Source tabs a couple of times.  It looked good; it did not stop looking good; and it did not seem to be inserting any unwanted new codes.  Now, I needed to get all that nice HTML from SeaMonkey back into my Blogger webpage.  SeaMonkey had a Publish Page option, where it appeared that I could have just sent the result directly to Blogger.  I wasn't sure how to do that, and anyway I wanted to look at the result in Blogger before publishing it.  So I switched into HTML Source view, in SeaMonkey, and copied everything into the Edit HTML tab in Blogger (having deleted whatever was there before).  I switched to Blogger's Compose view.  This did not look too good.  I clicked on Blogger's Preview button.  It was a train wreck, mostly because lines were ending all over the place -- after one word, two words, three words, whatever.  I looked again at the Edit HTML view in Blogger.  The weird line endings and extra blank lines were there too.  But they definitely weren't in the HTML in SeaMonkey.

I tried a different approach.  I wiped out everything in Blogger and copied over again from SeaMonkey.  This time, though, I copied from SeaMonkey's Normal view (i.e., not HTML Source) to Blogger's Compose view (i.e., not Edit HTML), and then I clicked on Blogger's Preview button.  This was much better.  Now lines were wrapping in sensible places.  The problem now seemed to be that I was just getting line breaks (i.e., just the start of a new line) instead of paragraph breaks (i.e., with a blank line between paragraphs).  So, OK, in SeaMonkey's HTML Source view, I tried doing a global replace (Ctrl-F) of <p> with <p><br>.  I looked at the result in SeaMonkey's Normal view.  Now most of my paragraph breaks were extra wide.  Again, I copied this Normal view into Blogger's Compose view.  But that wasn't it, either.  Eventually I figured out that what I needed was to forget about <p> and just use <br>, so that took another global replacement in SeaMonkey, followed by manual adjustment of a lot of paragraph breaks.  Later, I found SeaMonkey's Edit > Preferences > Composer option that said, "Return in a paragraph always creates a new paragraph."  That was not checked, but I checked it.  It would take another project to determine whether that would help.
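
The same change could be made outside SeaMonkey.  A hedged one-liner doing what I eventually settled on (again assuming a hypothetical post.html, and remembering that any closing </p> codes have to go too):

sed -i 's/<p>/<br>/g; s/<\/p>//g' post.html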

The next problem was that, in Blogger's Preview, the first two paragraphs were in a larger typeface.  They looked fine everywhere else; but in Preview they were wrong.  I took a look in Blogger's Edit HTML view.  Sure enough, Blogger had inserted this before the first word of my code:  <span class="Apple-style-span" style="font-family: Arial; font-size: medium;"><span class="Apple-style-span" style="font-size: 16px;">.  Don't ask me why.  I deleted it and took another look in Preview.  That fixed that.

Now I had a few remaining random line breaks.  The problem appeared to be that, somehow, some of my ordinary spacebar spaces had gotten replaced with nonbreaking space codes (i.e., "&nbsp;").  In SeaMonkey's HTML Source view, I did a temporary global replace of the space-plus-nonbreaking-space combination (i.e., " &nbsp;") with @@@, and likewise when the two appeared in reverse order.  Then I did a search for all remaining occurrences of nonbreaking spaces, intending to replace them with regular spaces -- though ideally not the ones at the starts or ends of lines:  I had noticed that SeaMonkey did not search correctly for items wrapping over its own line breaks, so if a space occurred at the end of one line and a &nbsp; occurred at the start of the next, its find-and-replace would not find and replace the pair.  But making those decisions manually, for the hundreds of occurrences that may appear in a long post, was beyond my patience at this point, so I just made the replacement global.  Finally, I replaced the @@@ with the space-plus-nonbreaking-space combination again, and then took a look in SeaMonkey's Normal view.  It looked good.
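
For anyone scripting it, here is a minimal sed version of that placeholder trick, again assuming a hypothetical post.html; this sketch uses two placeholders (where I used one) so that the reverse-order pairs are restored in the right order:

sed -i 's/ &nbsp;/@@A@@/g' post.html    # protect space-then-&nbsp; pairs
sed -i 's/&nbsp; /@@B@@/g' post.html    # protect &nbsp;-then-space pairs
sed -i 's/&nbsp;/ /g' post.html         # make all remaining nonbreaking spaces ordinary
sed -i 's/@@A@@/ \&nbsp;/g' post.html   # restore the protected pairs
sed -i 's/@@B@@/\&nbsp; /g' post.html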

I belatedly realized that I had not yet tried SeaMonkey's Preview view, so I tried that now.  It still looked good.  I copied it from Normal view over to Blogger's Compose view again.  I had to go into Blogger's Edit HTML view again to remove that funky starting font code again.

After hours of futzing around, it was done, and I posted it.  Even with the aid of SeaMonkey, it was a hassle.  SeaMonkey was definitely better than Blogger:  my changes made an improvement each time, it did not undo things that I had just fixed, and I did wind up with a satisfactory result.  There seemed nonetheless to be some glitches in the way SeaMonkey worked, and I thought that I might want to try a different HTML editor next time.  The foregoing review suggested that WordPress might be the weapon of choice, unless something new came along in the meantime.

Thursday, December 31, 2009

Ubuntu 9.04: Backing Up and Copying Webpages and Websites

As described in a previous post, I had been using rsync to make backups of various files.  This strategy was not working so well in the case of webpages and websites, or at least I wasn't finding much guidance that I could understand.  (Incidentally, I had also tried the Windows program HTTrack Website Copier, but had found it to be complicated and frustrating.  It seemed to want either to download the entire Internet or nothing at all.)

The immediate need driving this investigation was that I wanted to know how to back up a blog.  I used the blog on which I am posting this note as my test bed.

Eventually, I discovered that maybe what I needed to use was wget, not rsync.  The wget manual seemed thorough if a bit longwinded and complex, so I tried the Wikipedia entry.  That, and another source, gave me the parts of the command I used first:

wget -r -l1 -np -A.html -N -w5 http://raywoodcockslatest.blogspot.com/search?max-results=1000 --directory-prefix=/media/Partition1/BlogBackup1

The parts of this wget command have the following meanings:

  • -r means that wget should recurse, i.e., it should go through the starting folder and all folders beneath it (e.g., www.website.com/topfolder and also www.website.com/topfolder/sub1 and sub2 and sub3 . . .).
  • -l1 (that's an L-one) means stay at level number one.  That is, don't download linked pages.
  • -np means "no parent" (i.e., stay at this level or below; don't go up to the parent directory).
  • -A.html means Accept only files with this extension (i.e., only .html files).
  • -N is short for Newer (i.e., only download files that are newer than what's already been downloaded).  In other words, it turns on timestamping.
  • -w5 means wait five seconds between files.  This is because large downloads can overload the servers you are downloading from, in which case an irritated administrator may penalize you.
  • The URL shown in this command is the URL of this very blog, plus the additional information needed to download all of my posts in one html file.  But it didn't work that way.  What I got, with this command, was each of the posts as a separate html file, which is what I preferred anyway.
  • --directory-prefix indicates where I want to put the download.  If you don't use this option, everything will go into the folder where wget is running from.  I came across a couple of suggestions on what to do if your path has spaces in it, but I hadn't gotten that far yet; a quoting example appears just after this list.
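
For what it's worth, quoting the path seems to be the usual answer.  A hedged example, in which the folder name containing a space is a hypothetical illustration rather than one of my actual paths:

wget -r -l1 -np -A.html -N -w5 "http://raywoodcockslatest.blogspot.com/search?max-results=1000" --directory-prefix="/media/Partition1/Blog Backup 1"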

Incidentally, I also ran across another possibility that I didn't intend to use now, but that seemed potentially useful for the future.  Someone asked if there was a way to save each file with a unique name, so that every time you run the wget script, you get the current state of the webpage.  One answer involved using mktemp; a sketch appears below.  Also, it seemed potentially useful to know that I could download all of the .jpg files from a webpage by using something like this:  wget -e robots=off -r -l1 --no-parent -A.jpg http://www.server.com/dir/
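
I didn't test the mktemp idea, but a sketch of what that answer seemed to be driving at might look like this (the path and URL are illustrative, and -N is omitted since each run starts with an empty folder anyway):

OUTDIR=$(mktemp -d /media/Partition1/snapshot-XXXXXX)  # create a uniquely named folder per run
wget -r -l1 -np -A.html -w5 http://raywoodcockslatest.blogspot.com --directory-prefix="$OUTDIR"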

The first download was pretty good, but I had learned some more things in the meantime, and had some questions, so I decided to try again.  Here's the script I used for my second try:
wget -A.html --level=1 -N -np -p -r -w5 http://raywoodcockslatest.blogspot.com --directory-prefix=/media/Partition1/BlogBackup2

This time, I arranged the options (or at least the short ones) in alphabetical order.  The -p option indicated that images and style sheets would be downloaded too.  I wasn't sure I needed this -- the basic html pages looked pretty good in my download as they were -- but I thought it might be interesting to see how much larger that kind of download would be.  I used a shorter version of the source URL and I designated a different output directory.

I could have added -k (long form:  --convert-links) so that the links among the downloaded html pages would be modified to refer to the other downloaded pages, not to the webpage where I had downloaded them from; but then I decided that the purpose of the download was to give me a backup, not a local copy with full functionality; that is, I wanted the links to work properly when posted as webpages online, not necessarily when backed up on my hard drive.  I used the long form for the "level" option, just to make things clearer.  Likewise, with a bit of learning, I decided against using the -erobots=off option.  There were probably a million other options I could have considered, in the long description of wget in the official manual, but these were the ones that others seemed to mention most.
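
Had I wanted a fully self-contained local copy, the command would presumably have looked something like this (untested by me; the BlogBackupLocal folder name is just an illustration):

wget -r -l1 -np -A.html -N -w5 -k http://raywoodcockslatest.blogspot.com --directory-prefix=/media/Partition1/BlogBackupLocal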

The results of this second try were mixed.  For one thing, I was getting a lot of messages of this form:

2010-01-01 01:43:03 (137 KB/s) - `/[target directory]/index.html?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=MONTHLY-1196485200000&toggleopen=MONTHLY-1259643600000' saved [70188]

Removing /[target directory]/index.html?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=MONTHLY-1196485200000&toggleopen=MONTHLY-1259643600000 since it should be rejected.

I didn't know what this meant, or why I hadn't gotten these kinds of messages when I ran the first version of the command (above).  It didn't seem likely that the mere rearrangement of options on the wget command line would be responsible.  To find out, I put it out of its misery (i.e., I ran "pkill wget" in a separate Terminal session) and took a closer look.

Things got a little confused at this point.  Blame it on the late hour.  I thought, for a moment, that I had found the answer.  A quick glance at the first forum that came up in response to my search led me to recognize that, of course, my command was contradictory:  it told wget to download style sheets (-p), but it also said that only html files would be accepted (-A.html).  But then, unless I muddled it somehow, it appeared that, in fact, I had not included the -p option after all.  I tried re-running version 2 of the command (above), this time definitely excluding the -p option.  And no, that wasn't it; I still got those same funky messages (above) about removing index.html.  So the -p option was not the culprit.

I tried again.  This time, I reverted to using exactly the command I had used in the first try (above), changing only the output directory.  Oh, and somewhere in this process, I shortened the target URL.  This gave me none of those funky messages.  So it seemed that the order of options on the command line did matter, and that the order used in the first version (above) was superior to that in the second version.  To sum up, then, the command that worked best for me, for purposes of backing up my Blogger.com (blogspot) blog, was this:

wget -r -l1 -np -A.html -N -w5 http://raywoodcockslatest.blogspot.com --directory-prefix=/media/Partition1/BlogBackup1

Since there are other blog hosts out there, I wanted to see if exactly the same approach would work elsewhere.  I also had a WordPress blog.  I tried the first version of the wget command (above), changing only the source URL and target folder, as follows:

wget -r -l1 -np -A.html -N -w5 http://raywoodcock.wordpress.com/ --directory-prefix=/media/Partition1/WordPressBackup

This did not work too well.  The script repeatedly produced messages saying "Last-modified header missing -- time-stamps turned off," so then wget would download the page again.  As far as I could tell from the pages I examined in a search, there was no way around this; apparently WordPress did not maintain time stamps.

The other problem was that it did not download all of the pages.  It would download only one index.html file for each month.  That index.html file would contain an actual post, which was good, but what about all the other posts from that month?  I modified the command to specify the year and month (e.g., http://raywoodcock.wordpress.com/2009/03/).  This worked.  Now the index.html file at the top of the subtree (e.g., http://raywoodcock.wordpress.com/2009/03/index.html) would display all of the posts from that month, and beneath it (in e.g., .../2009/03/01) there were subfolders named for each post, each of which contained the index.html file displaying that particular post.  So at this rate, I would have to write wget lines for each month in which I had posted blog entries.  Then I found that removing the -A.html option solved the problem -- but if I ran it at the year level, it worked only for some months, and skipped others.  I tried what appeared to be the suggestion of running it twice at the top level (i.e., at .../wordpress.com/ with an ending slash), with --save-cookies=cookies.txt --load-cookies=cookies.txt --keep-session-cookies.  That didn't seem to make a difference.  So the best I could do with a WordPress blog, at this point, was to enter separate wget commands for each month, like this:

wget -r -l1 -np -N -A.html -w5 http://raywoodcock.wordpress.com/2009/01 --directory-prefix=/media/Partition1/WordPressBackup

I added back the -A.html option, as shown, because it didn't seem to hurt anything; html pages were the only ones that had been downloaded anyway.

Since these monthly commands would re-download everything, I would run the older ones only occasionally, to pick up the infrequent revision of an older post.  I created a bunch of these, for the past and also for some months into the future.  I put the historical ones in a script called backup-hist.sh, which I planned to run only occasionally, and I put the current and future ones into my backup-day.sh, to run daily.
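
A loop could generate those monthly commands instead of typing each one by hand.  A minimal sketch of what backup-hist.sh might contain (the year range shown is an illustrative assumption, not necessarily the range I used):

for y in 2008 2009; do
  for m in 01 02 03 04 05 06 07 08 09 10 11 12; do
    wget -r -l1 -np -N -A.html -w5 http://raywoodcock.wordpress.com/$y/$m --directory-prefix=/media/Partition1/WordPressBackup
  done
done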

But, ah, not so fast.  When I tried this on another, unrelated WordPress blog, it did not consistently download all posts for each month.  I also noticed that it duplicated some posts, in the sense that the higher-level (e.g., month-level) index.html file seemed to contain everything that would appear on the month-level webpage on WordPress.  So, for example, if you had your WordPress blog set up to show a maximum of three posts per page, this higher-level webpage would show all three of those.  The pages looked good; it was just that I was not sure how I would use this mix in an effective backup-and-restore operation.  This raised the question for my own blog:  if I ever did have to restore my blog, was I going to examine the HTML for each webpage manually, to re-post only those portions of text and code that belonged on a given blog page?

I decided to combine approaches.  First, since it was year-end, I made a special-case backup of all posts in each blog.  I did this by setting the blogs to display 999 posts on one page, and then printed that page as a year-end backup PDF.  Second, I noticed that rerunning these scripts seemed to catch additional posts on the subsequent passes.  So instead of separating the current and historical posts, I decided to stay with the original idea of running one command to download each WordPress post.  I would hope that this got most of them and, for any that fell through the cracks, I would refer to the most recent PDF-style copy of the posts.  The command I decided to use for this purpose was of this form:

wget -r -l1 -np -N -A.html -w5 [URL] --directory-prefix=/media/Backups/Blogs/WordPress

I had recently started one other blog.  This one was on Livejournal.com.  I tried the following command with that:

wget -r -l1 -np -N -A.html -w5 http://rwclippings.livejournal.com/ --directory-prefix=/media/Backups/Blogs/LiveJournal

This was as far as I was able to get into this process at this point.