We're losing our digital history. Can the Internet Archive save it?

yamrzou | 53 points

> Research shows 25% of web pages posted between 2013 and 2023 have vanished.

I've been working on a project for the past year that addresses this exact issue: https://linkwarden.app

An open-source [1] bookmarking tool to collect, organize, and preserve content from the internet.

[1]: https://github.com/linkwarden/linkwarden

daniel31x13 | a day ago

I've been trying to download various blogs hosted on blogspot.com and wordpress.com, as well as a couple that now exist only on archive.org, using Linux CLI tools. I cannot make it work. Everything either misses the CSS, follows links to the wrong depth, stops arbitrarily, or has some other problem.

If I had a couple of days to devote to it entirely, I think I could make it work, but I've only been able to work on it sporadically, and it has still cost me a ton of time cumulatively. I've tried wget, httrack, and a couple of other more obscure tools -- all with various options and parameters, of course.
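For concreteness, the kind of invocation I keep coming back to is roughly the following (driven from Python here just so it's easy to tweak; the flags are standard GNU wget options and the blog URL is a placeholder). One guess about the missing CSS: Blogger in particular serves a lot of its styling from other Google-owned hosts, so --span-hosts together with --domains may also be needed.

    # a minimal sketch: mirroring one blog with GNU wget driven from Python
    # (the flags are standard wget options; the URL is a placeholder)
    import subprocess

    cmd = [
        "wget",
        "--mirror",              # recurse with timestamping, infinite depth
        "--page-requisites",     # also fetch the CSS/images/JS each page needs
        "--convert-links",       # rewrite links so the mirror browses locally
        "--adjust-extension",    # save text/html responses with a .html suffix
        "--no-parent",           # never climb above the starting directory
        "--wait=1",
        "--random-wait",         # politeness; also helps avoid rate-limit cutoffs
        "-e", "robots=off",      # many blog hosts disallow mirroring in robots.txt
        "https://example.blogspot.com/",
    ]
    subprocess.run(cmd, check=True)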

One issue is that blog content is duplicated across URLs -- you might get domainname.com/article/article.html, domainname.com/page/1, and domainname.com/2015/10/01, all of which contain the same links. Could there be some vicious circularity taking place, confusing the downloader about what it has done and what it has yet to do? I wouldn't think so, but static, non-blog pages are obviously much simpler than blogs.
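As far as I can tell, wget will not refetch a URL it has already downloaded within a single run, so literal circularity shouldn't be the problem; the trouble is the same content living under many distinct URLs. One partial workaround I've been experimenting with is rejecting the pagination and date-archive views outright -- though that only works when every post is still reachable some other way (for example through older/newer-post links), so the result needs spot-checking:

    # continuing the sketch above: reject the archive/pagination views so each
    # post is only discovered via its permalink
    # (the regex is a guess at typical Blogger/WordPress URL layouts -- check it
    #  against the real site, or some posts may never be discovered at all)
    cmd.insert(-1, "--reject-regex")
    cmd.insert(-1, r"(/page/[0-9]+|/[0-9]{4}(/[0-9]{2}){0,2})/?$")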

Anyway, is there a known, standardized way to download blogs? I haven't yet found one. But it seems such a common use case! Does anybody have any advice?

geye1234 | a day ago

I've been trying to extract historycommons.org from the Wayback Machine, and it is an uphill battle even to grab the ~198 pages it says it collected. Even back in the days after 9/11, when the site rose to prominence, I shuddered at its dynamically served implementation. Those were the days of Java, and sites like it loaded the server down with CPU time when it could have been serving static items... from REAL directories, with REAL If-Modified-Since support and virtual file attributes set from the combined database update times... a practice that seems to have gone by the wayside on the Internet completely.
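For the Wayback side of it specifically, going through the CDX index has been less painful for me than crawling web.archive.org's rewritten pages. Roughly like this (the CDX endpoint and the id_ raw-capture URL form are documented Wayback features; the dump layout, file naming, and one-second delay are just choices made for the sketch):

    # rough sketch: list historycommons.org captures via the Wayback CDX API,
    # then pull each one in its original, un-rewritten form (the "id_" flag)
    import json
    import os
    import time
    import urllib.parse
    import urllib.request

    cdx = "https://web.archive.org/cdx/search/cdx?" + urllib.parse.urlencode({
        "url": "historycommons.org/*",   # everything captured under the domain
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "urlkey",            # one capture per distinct original URL
        "fl": "timestamp,original",
    })
    rows = json.load(urllib.request.urlopen(cdx))[1:]   # row 0 is the header

    os.makedirs("dump", exist_ok=True)
    for timestamp, original in rows:
        raw = f"https://web.archive.org/web/{timestamp}id_/{original}"
        # the unique part of these URLs was after the "?", so keep the whole
        # thing (query string included) as the file name
        name = urllib.parse.quote(original, safe="") + ".html"
        with urllib.request.urlopen(raw) as resp, \
                open(os.path.join("dump", name), "wb") as out:
            out.write(resp.read())
        time.sleep(1)                    # be gentle with archive.org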

Everything everywhere is now Last-Modified today, now, just for YOU! Even if it hasn't changed. Doesn't that make you happy? Do you have a PROBLEM with that??
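You can watch it happen with a conditional request: fetch a page, hand its own Last-Modified back as If-Modified-Since, and see whether you get a 304 or the whole body again. A plain-stdlib sketch, with a placeholder URL:

    # does the server honor conditional requests, or is everything "modified" now?
    import urllib.error
    import urllib.request

    url = "https://example.com/some/page"       # placeholder
    first = urllib.request.urlopen(url)
    stamp = first.headers.get("Last-Modified")  # may be absent entirely

    if stamp is None:
        print("no Last-Modified at all -- a crawler can never skip this page")
    else:
        req = urllib.request.Request(url, headers={"If-Modified-Since": stamp})
        try:
            urllib.request.urlopen(req)
            print("200 with a full body again -- the conditional GET was ignored")
        except urllib.error.HTTPError as err:
            if err.code == 304:
                print("304 Not Modified -- the server behaves; re-crawls stay cheap")
            else:
                raise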

Everything unique on the site lived after the ? in the URL, and there was more than one way to get 'there', 'there' being anywhere.

I suspect that many people tried to whack the site and finally gave up. I got a near-successful whack once after lots of experimenting, but said to myself at the time, "This thing will go away, and it's sad."

That treasure is not reliably archived.

Suggestion: even if the whole site is spawned from a database, choose a view that presents everything once and only once, and present to the world a group of pages that completely divulge the content using slash separators only -- /x/y/z/xxx.(html|jpg|etc) -- with no duplicate tangents, so that nothing is lost even if the crawler ignores everything after the ?. And place actual static items in a hierarchy. The most satisfying crawl is one where you can do this, knowing that the archive will be complete and relevant, with no need to 'attack' the server side with process-spawning.
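Concretely, that can be as small as a script that walks the database and writes each item to exactly one path, setting the file's mtime from the row's update time so the web server's Last-Modified and If-Modified-Since mean something again. A minimal sketch with a made-up schema (an articles table, updated_at stored as a Unix timestamp):

    # dump every article to exactly one static path: public/<category>/<slug>.html
    import os
    import sqlite3

    db = sqlite3.connect("site.db")                  # placeholder database
    rows = db.execute(
        "SELECT category, slug, body_html, updated_at FROM articles"
    )

    for category, slug, body_html, updated_at in rows:
        path = os.path.join("public", category, slug + ".html")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as out:
            out.write(body_html)
        # let the web server derive Last-Modified from the real update time
        os.utime(path, (updated_at, updated_at))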

HocusLocus | a day ago

One question seems obvious:

With AIs and stuff, are we saving humanity's digital history, or are we saving a swarm of potentially biased, auto-generated content published by the few who can afford large-scale deployment of LLMs?

alganet | a day ago