The Tragedy of Google Books (2017)

lispybanana | 503 points

These Google scans are also available in the HathiTrust [1], an organization built from the big academic libraries that participated in early book digitization efforts. The HathiTrust is better about letting the public read books that have actually fallen into the public domain. I have found many books that are "snippet view" only on Google Books but freely visible on HathiTrust.

If you are a student or researcher at one of the participating HathiTrust institutions, you can also get access to scans of books that are still in copyright.

The one advantage Google Books still has is that its search tools are much faster and sometimes better, so it can be useful to search for phrases or topics on Google Books and then jump over to HathiTrust to read specific books surfaced by the search.

[1] https://www.hathitrust.org/

philipkglass | 10 months ago

> Dan Clancy, the Google engineering lead on the project who helped design the settlement, thinks that it was a particular brand of objector—not Google’s competitors but “sympathetic entities” you’d think would be in favor of it, like library enthusiasts, academic authors, and so on—that ultimately flipped the DOJ.

I was at Google in 2009 on a team adjacent to Dan Clancy when he was most excited about the Authors’ Guild negotiations to publish orphan works and create a portal to pay copyright holders who signed up, and I recall that one opponent that he was frustrated with was Brewster Kahle of the Internet Archive, who filed a jealous amicus brief (https://docs.justia.com/cases/federal/district-courts/new-yo...) complaining that the Authors’ Guild settlement would not grant him access to publishing orphan works too. In my opinion Kahle was wrong; the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing, which is what actually happened in the 15 years since then. Instead of one company selling out-of-print but in-copyright books, or multiple organizations, no one is allowed to sell them today.

Since then, of course, Brewster Kahle launched an e-library of copyrighted books without legal authorization anyway, which will probably be the death of the current organization that runs the Internet Archive. Tragic all around.

yonran | 10 months ago

I worked at the Library of Congress on their Digital Preservation Project, circa 2001-2003. The stated goal was to "digitize all of the Library's collections" and while most people think of books, I was in the Motion Picture Broadcast and Recorded Sound Division.

In our collection were Thomas Edison's first motion pictures, wire spool recordings from reporters at D-Day, and LPs of some of the greatest musicians of all time. And that was just our Division. Others - like American Heritage - had photos from the US Civil War and more.

Anyway, while the Rights information is one big, ugly tangled web, the other side is the hardware to read the formats. Much of the media is fragile and/or dangerous to use, so you have to be exceptionally careful. Then you have to document all the settings you used, because imagine that three months from now you learn some filter you used was wrong or the hardware was misconfigured: you need to go back and understand what was affected and how.

Cool space. I wish I'd worked there longer.

caseysoftware | 10 months ago

“Page had always wanted to digitize books. Way back in 1996, the student project that eventually became Google—a “crawler” that would ingest documents and rank them for relevance against a user’s query—was actually conceived as part of an effort “to develop the enabling technologies for a single, integrated and universal digital library.” The idea was that in the future, once all books were digitized, you’d be able to map the citations among them, see which books got cited the most, and use that data to give better search results to library patrons. But books still lived mostly on paper. Page and his research partner, Sergey Brin, developed their popularity-contest-by-citation idea using pages from the World Wide Web.”

Larry Page had some cool ideas… can’t imagine Books will ever be resurrected, unfortunately.

ErikAugust | 10 months ago

O'Reilly, for whom I've been a lead author and co-author, did this: https://www.oreilly.com/pub/pr/1042

They call it Founder's Copyright. They also use Creative Commons. The goal is to make out-of-print books available at no cost.

Zigurd | 10 months ago

This seems to be the fate of knowledge/content that stays in institutions which were built with the idea of collecting and growing it, but which have turned into walled gardens/crypts of sorts: rot, rust, and be forgotten.

A very cynical and dark view is that the New things/people need that oblivion in order to feel great, by not having to compare with the old great-er ones. Rewriting history as the current powers-that-be see fit is easier this way.

Or maybe it's just collective stupidity? Or societal immaturity?

(I am coming from a completely different killed project on a different continent, but the idea is the same)

svilen_dobrev | 10 months ago

With Library Genesis, who needs Google Books anymore? I buy books physically to support the author(s) and download an EPUB version from said site to my Kindle. The physical books I hardly read; they are for my shelf. I love the feeling of printed books, but I read in bed, and it’s easier to hold an ebook. I also read when I commute. It’s lighter to have my Kindle Oasis with me with tons of books on it.

submeta | 10 months ago

IMO if a work is out of print (or equivalent depending on the medium) for more than a few years, it should be released into the public domain. Or maybe into something like the public domain, but requiring attribution.

thayne | 10 months ago

A huge proportion of this corpus is found in the Hathi Trust (see https://www.hathitrust.org/the-collection/). We have had a grant to crawl and derive an index on it via their supercomputing resources. I'm sure they are looking to LLM proposals, though they are exceedingly careful about the copyright issues.

https://www.hathitrust.org/

xipho | 10 months ago

Of course someone needs to scan/digitise those books but for those which already are, there is Anna’s Archive.

https://en.wikipedia.org/wiki/Anna%27s_Archive

boramalper | 10 months ago

Programmers, not lawmakers, really control what does and doesn't go online.

BitTorrent, IPFS, etc. are nice, but things would be better if there was a large static archive with desktop clients exchanging chunks in a complex modular way.

Say: I have pages 1-15 of file 123456, you have page 16 but are looking for page 1 of doc 2345. If I can obtain that page, a fast exchange is possible. If not, a different module can issue an IOU that means either I owe something, you are owed something, or both. Other modules could create groups that aim to store part of the archive without duplication among members. Spam-driven modules could also be interesting.

The archive can be organized by how dubious the copyright is so that one can limit participation to 50 or 100+ year old publications and/or living or dead authors.

It's not unlike living on a faraway island with the British Empire seeking to control every aspect of your life without sufficient means of force.

theendisney4 | 10 months ago

For Kagi users, I recommend putting books.google.com as a pinned domain. This way, you'll often be presented with some of the best sources for any search query. Then it's a matter of finding the ePub file of that book. To read on macOS, FBReader is a high-quality app.

carlosjobim | 10 months ago

We need a Copyright Term Reduction Act.

It's time. 50 years, renewal is possible but expensive.

Animats | 10 months ago

Let’s rewrite copyright law:

1. The author gets to say, “I produced this”, and to control if it gets published.

2. Exclusive copyright for 15-year terms.

3. Renewal possible if author still alive. Non-human rights holders (corporations, etc.) limited to 30 years total (one renewal) from date of first publication, regardless of item ownership. Failure to renew automatically opens up the product.

4. Existing copyright can be overridden if demand isn’t being adequately serviced (sliding scale, challenger must capture minimum % of existing market demand to prove). Pricing of overriding attempts must be reasonable, only cost of production can be directly paid for, everything else goes into an escrow account until the attempt is concluded. This is where anti-abuse rules for both sides are most extensive.

Information and knowledge must be free. Our civilization depends vitally upon that freedom.

rekabis | 10 months ago

I’m sure the lawyers will eventually figure out a way to train an LLM on them.

senkora | 10 months ago

Written from a capitalist perspective, extolling "market forces" and legitimizing corporate and government limitations on copying.

"between 1923 and 1963 ... copyrights back then had to be renewed, and often the rightsholder wouldn’t bother filing the paperwork" - oh no, how terrible. How lucky we are that in these modern times one doesn't even have to file paperwork in order to prevent you from copying information.

And they go on to suck up to Google and decry how they didn't get to legitimize their control over a large swath of human knowledge and cultural heritage.

"It certainly seems unlikely that someone is going to spend political capital—especially today—trying to change the licensing regime for books, let alone old ones." <- copyright regime, licensing regime - all of this stuff is illegitimate a priori. Poetry, literature, music, software, papers and books - we cannot and must not tolerate restrictions on their dissemination.

Whatever arrangements the commercial and governmental entities come to, our "arrangement" should be that everything gets disseminated widely and without restriction, so that curtailment, censorship, commercial control etc. just fail.

einpoklum | 10 months ago

James Somers writes beautifully; https://www.newyorker.com/contributors/james-somers has some of his other writing

shadytrees | 10 months ago

> Copyright terms have been radically extended in this country largely to keep pace with Europe, where the standard has long been that copyrights last for the life of the author plus 50 years. But the European idea, “It’s based on natural law as opposed to positive law,” Lateef Mtima, a copyright scholar at Howard University Law School, said. “Their whole thought process is coming out of France and Hugo and those guys that like, you know, ‘My work is my enfant,’” he said, “and the state has absolutely no right to do anything with it—kind of a Lockean point of view.” As the world has flattened, copyright laws have converged, lest one country be at a disadvantage by freeing its intellectual products for exploitation by the others. And so the American idea of using copyright primarily as a vehicle, per the constitution, “to promote the Progress of Science and useful Arts,” not to protect authors, has eroded to the point where today we’ve locked up nearly every book published after 1923.

This is disingenuous: the article doesn’t mention that the biggest proponents of prolonging copyright terms were Americans (e.g., Walt Disney Corp and Jack Valenti; see “Mickey Mouse Protection Act” for more), not Europeans.

mcepl | 10 months ago

The tragedy is that Google is tasked with this at all. It would be cool if public libraries could work together on a massive public digital library. This shouldn't be Google's responsibility.

2OEH8eoCRo0 | 10 months ago

Would it not be a viable solution to let Google scan and sell books, but force them to give the profit from the sales to the government?

mparnisari | 10 months ago

I've never seen an explicit mention of whether or not the Google Books corpus was actually used for training LLMs…

Does anyone know more about it?

DrNosferatu | 10 months ago

> “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”

Greeted with a paywall on the source. Hypocrisy...

kbbgl87 | 10 months ago

> what happened with piano rolls, with records, with radio, and with cable—isn’t that copyright holders squash the new technology. Instead, they cut a deal and start making money from it.

> “History has shown that time and market forces often provide equilibrium in balancing interests,” Wu writes.

It is completely braindead to argue that market forces had anything to do with compulsory licensing. It is a matter determined by courts in the public interest.

tempfile | 10 months ago

Sad and criminal.

anoncow | 10 months ago

Ironically behind a paywall (and below a political ad)

afh1 | 10 months ago

[dead]

LisaDziuba | 10 months ago

TL;DR: bye bye Google

geniium | 10 months ago

[flagged]

pluc | 10 months ago

Thanks Paul!

datadrivenangel | 10 months ago

Google must be tempted to put them in an LLM.

andrewstuart | 10 months ago

Good. It’s important that free access not be permitted. We don’t know what personal data might be contained within. We should only allow those works after a human (appropriately certified) has verified that no personal data exists within.

If it exists within, the book must be destroyed in its entirety. Too many works of so-called scholarship have relied on the personal letters of dead people.

We should not reward grave robbing. The most important thing is the personal data. We must protect the personal data.

renewiltord | 10 months ago