15 comments
Comment from: Kochise [Visitor]
Kochise

You could add, at the end of each line in robots.txt, the date after which the file should no longer be referenced :) I think it’s easy to change a text file format (CSV); what are parsers for otherwise?

Kochise

09/28/06 @ 09:08
Comment from: [Member]

Hum… Kochise, are you sure? Where did you find that?

09/28/06 @ 09:33
Comment from: Danny Sullivan [Visitor]
Danny Sullivan

It’s very easy. You simply put a meta noarchive tag on each page you don’t want to have archived. That means Google will index the page, so you can find it in a search, but you won’t be able to view a cached copy ever. After two months, if you put the page in a registration-required area, it will even drop out of Google entirely.
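
For anyone who wants to check whether a page already carries the tag, here is a minimal Python sketch. It assumes the usual form `<meta name="robots" content="noarchive">` (or `name="googlebot"`); the class name, function name, and sample page are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def has_noarchive(html: str) -> bool:
    """Return True if the page asks crawlers not to show a cached copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


# Hypothetical page, for illustration only.
PAGE = (
    '<html><head><meta name="robots" content="noarchive"></head>'
    "<body>Article text</body></html>"
)
print(has_noarchive(PAGE))
```

This only tells you what the page requests; how a given search engine honors it is up to that engine.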

09/28/06 @ 11:58
Comment from: Peter [Visitor]
Peter

Yes, Danny is right.

09/28/06 @ 17:24
Comment from: [Member]

Danny’s got to be right! :)

I asked him by email whether there might be side effects to telling Google not to cache. I mean, Google needs the cache to determine exact relevancy at search time. Not being in the cache could restrict you to the supplemental results only.

Danny answered that people were worried about this some time ago, but that he hasn’t seen any worry like that for a while.

It is possible that meta noarchive just hides the archive link in Google but Google still caches internally. In that case everything is okay.

Now I wonder: why don’t we all use meta noarchive? What good can it do to have content publicly available from Google’s cache instead of the original site? ;)

09/28/06 @ 17:34
Comment from: Angie Medford [Visitor]
Angie Medford

There have been plenty of cases where a page will be removed by the webmaster and get 404ed, but if you search Google with text that appeared on the original page, the cache copy shows up in Google. So Google is serving an index of “fresh” pages + caches of removed pages that webmasters and the content owners have removed from the Internet completely. That isn’t right. It’s harmful. And pretty evil.

09/28/06 @ 17:36
Comment from: [Member]

Hum… I also wonder how Google responds to a “410 Gone” response (instead of “404 Not Found”).

The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server’s site. It is not necessary to mark all permanently unavailable resources as “gone” or to keep the mark for any length of time – that is left to the discretion of the server owner.

I think it should definitely unindex and uncache the page in that event.

The remaining hacky solution would be to replace the gone page with a blank page containing a meta noarchive tag ;)
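
To make the 404 versus 410 difference concrete, here is a rough sketch using Python’s standard `http.server`. The paths and page content are hypothetical, and it says nothing about how Google actually reacts to a 410; it only shows a server answering “Gone” for pages that were removed on purpose.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths, for illustration only.
GONE_PATHS = {"/archive/old-article.html"}
LIVE_PAGES = {"/": b"<html><body>Home</body></html>"}


def status_for(path: str) -> HTTPStatus:
    """410 for intentionally removed pages, 200 for live ones, 404 otherwise."""
    if path in GONE_PATHS:
        return HTTPStatus.GONE
    if path in LIVE_PAGES:
        return HTTPStatus.OK
    return HTTPStatus.NOT_FOUND


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = status_for(self.path)
        # Live pages return their content; everything else returns the status phrase.
        body = LIVE_PAGES.get(self.path, status.phrase.encode("ascii"))
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To try it locally:
# HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

Keeping the decision in `status_for` means the list of removed pages can be checked without starting a server.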

09/28/06 @ 17:42
Comment from: [Member]

Thanks for pointing that out Ben. I think it pretty much makes it clear that the papers could have fixed their issue without going to court! ;)

Another thing I understood from the ruling is that the papers were pretty much pissed off by the fact that Google never listened to their complaint in the first place. I can believe that… since they didn’t show up in court! :>

Maybe Google could afford a little tech support… even to Belgians ;)

As said in the comment on Reddit, you practically need to be an SEO to know there is a solution. (That also applies to Danny Sullivan above I guess ;)

09/28/06 @ 19:05
Comment from: nordsieck [Visitor]
nordsieck

Sounds like the solution for Google is to totally ban all links to the affected newspapers.

I find issues of copyright and the web nonsensical - you had to make about 15 copies of this comment (along with the rest of the text on the page) in order to view it, between the inter-router hops, your browser’s cache, the in-memory version of the article, etc.

The fact that (at least in the US) everything is automatically copyrighted, and very few websites specifically grant people the right to copy flies in the face of their actions - stuff is on the web (generally) to be viewed (and that means copied) by everyone, as much as they want.

The entire situation simply doesn’t make sense.

09/28/06 @ 23:50
Comment from: Kochise [Visitor]
Kochise

@Angie Medford : “So Google is serving an index of “fresh” pages + caches of removed pages that webmasters and the content owners have removed from the Internet completely. That isn’t right. It’s harmful. And pretty evil.”

Google isn’t that bad: if you land on a 404, the page can’t harm people anymore. By that logic, WebArchive may harm far more people around the world than Google!

Kochise

09/29/06 @ 09:23
Comment from: John [Visitor]
John

In defense of the Google (or other) cache: Caches are VERY useful and provide a positive good to society in exposing hypocritical sites that post something controversial, then withdraw that posting (which is OK) but then try to claim that they *never* posted the material in question in the first place (which is REALLY evil).

Also, most people here are missing the main point of Google’s objection to the ruling: their home page is ‘sacred’. It is a key part of what makes them Google - the home page is simple and uncluttered.

There’s no reason the court couldn’t have compromised and permitted them to simply add a prominent link from their home page to the settlement. There’s no reason the text of the settlement itself has to appear on the home page.

10/02/06 @ 08:08
Comment from: Nikhil [Visitor]
Nikhil

In response to Angie Medford, there was once a case of a server crashing at a major university. This server had a database of rare, historical documents. Because of Google’s system of caching this information, the university was able to recover a large percentage of the information from Google’s cache. So I would not really call the system of caching webpages /websites an “evil” practice. Any resource or utility can be used for good or bad purposes. It all depends on an individual’s or institution’s intentions.

10/02/06 @ 21:12
Comment from: Stefaan Vanderheyden [Visitor]
Stefaan Vanderheyden

Why should Google be held liable for another company’s inability to correctly manage their own web content?

It’s sad that Google did not appear in court. The judge’s ruling seems irrelevant in light of the fact that Google has always provided a technical means whereby the Belgian newspapers could easily prevent users from linking to a cached version of their copyrighted articles. There are many sites which correctly use the meta noarchive tag to do just that.

It looks like the judge was simply defending a group of incompetent publishers’ right to continue being totally incompetent…

This fact becomes even clearer when you note that certain “archived” articles are available for a “1 credit” charge via LeSoir’s search box, but the same article remains accessible for free via another link on exactly the same website:

link
or
link

Luckily, I do not own shares in Rossel et Cie SA (editor of Le Soir Magazine), ‘cause it seems to me that they do not know what the hell they are doing…

12/13/06 @ 18:10
Comment from: Click [Visitor]
Click

Seems to me like the papers, and possibly even Belgium, are just trying to say something to the effect of “we demand to be taken seriously,” and would have gone to court (and returned) over this even if they knew how to fix the problem technically. A judge from Belgium would be almost guaranteed to rule in favor of the papers every time, as it is in the best interest of Belgium, even if it seems a little unfair from an international perspective.

02/22/07 @ 05:19
Comment from: Thor Ingason [Visitor]
Thor Ingason

As a person battling to get an old website out of Google cache, which someone else submitted, I feel this scenario is REVERSED. It is the site webmasters who should put a meta tag in to HAVE sites archived, not the other way around. Google should NOT archive people’s sites without being asked to.

07/07/09 @ 17:04

