Scrapism

cjlm | 121 points

A trick I think would be useful to include here is running scrapers in GitHub Actions that write their results back to the repository.

This is free(!) to host, and the commit log gives an enormous amount of detail about how the scraped resource changed over time.

I wrote more about this trick here: https://simonwillison.net/2020/Oct/9/git-scraping/

Here are 267 repos that are using it: https://github.com/topics/git-scraping?o=desc&s=updated
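For anyone curious what the moving parts look like, here's a rough sketch of the scraper half in Python (the URL and output filename are placeholders): a scheduled GitHub Actions workflow runs the script on a cron, then commits the output file if it changed, so the repo history becomes a change log for the scraped resource.

    # scrape.py -- minimal git-scraping sketch (hypothetical URL and filename)
    # A scheduled GitHub Actions workflow runs this, then commits data.json,
    # so every change to the source shows up as a diff in the repo history.
    import json
    import urllib.request

    URL = "https://example.com/data.json"  # placeholder endpoint

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)

    # Pretty-print with sorted keys so commit-to-commit diffs stay readable.
    with open("data.json", "w") as f:
        json.dump(data, f, indent=2, sort_keys=True)

The workflow side is just a cron trigger, a checkout, running the script, and a commit step that only fires when there are changes to the output file.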

simonw | 2 years ago

I usually write small scraping scripts for my daily life:

- Downloading all of a comic's images and converting them into an e-book for my Kindle.

- Gathering info for buying a new house.

- Helping my wife collect data for her new writing.

- Transferring all my Facebook fan page posts to my personal blog.

And I've really enjoyed this journey of scraping things to make my life easier and more full of joy.

quyleanh | 2 years ago

Hi Sam,

It might be worth adding a section on distributed anonymous scrapers that use some form of messaging middleware to distribute the URLs to scrape. Regarding the anonymous aspect (independent of job distribution, of course), you could walk them through using https://github.com/aaronsw/pytorctl or even a rotating tor proxy. This is how I scraped all those Instagram locations + metadata we discussed about five years ago. Hope you’re doing well!
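To make that concrete, here's a rough worker sketch in Python, assuming Redis as the messaging middleware (the middleware choice is open, this is just one option) and a local Tor SOCKS proxy on its default port; the queue and result names are made up for the example.

    # Worker: pull URLs from a shared queue and fetch them through Tor.
    # Assumes a Redis list named "urls" as the message queue and Tor
    # listening on its default SOCKS port, 9050.
    import redis        # pip install redis
    import requests     # pip install requests[socks]

    TOR_PROXIES = {
        "http": "socks5h://127.0.0.1:9050",   # socks5h: resolve DNS via Tor too
        "https": "socks5h://127.0.0.1:9050",
    }

    queue = redis.Redis()
    while True:
        # Block until some producer pushes a URL onto the "urls" list.
        _, url = queue.blpop("urls")
        resp = requests.get(url.decode(), proxies=TOR_PROXIES, timeout=60)
        queue.hset("results", url, resp.text)  # stash the raw response

Run as many of these workers as you like, on as many machines as you like; the queue handles the job distribution, and Tor handles the anonymity (a rotating Tor proxy or pytorctl-style circuit rotation slots in where the proxy settings are).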

helsinki | 2 years ago

This is from 2020. Besides a small change to the "Introduction to the Command Line" section, it has not been updated.

Back in 2015, the author reported using CasperJS to scrape public LinkedIn profiles, and said it was a PITA.

Here the author recommends using WebDriver implementations, e.g., chromedriver or geckodriver, in addition to scripting-language frameworks such as Puppeteer and Selenium. Is scraping LinkedIn still a PITA?

Because the examples given are always relatively simple, i.e., not LinkedIn, I am skeptical when I see "web scraping" tutorials using Python frameworks and cURL as the only recommended options for automated public data/information retrieval from the www.[FN1,2] I use none of the above. For small tasks like the examples given in these tutorials, the approaches I use are not nearly as sophisticated/complicated, and yet they are faster and use fewer resources than Python and/or cURL. They are also easier to update if something changes. That is in part because (1) the binaries I use are smaller, (2) I do not rely on scripting languages[FN3] or third-party libraries (so much less code is involved), (3) the programs I use start working immediately whereas Python takes seconds to start up, and (4) compared to the programs I use, cURL as a means of sending HTTP requests is inflexible, e.g., one is limited to whatever "options" cURL provides, and cURL has no option for HTTP/1.1 pipelining.

1. LinkedIn's so-called "technological measures" to prevent retrieval of public information have failed. Similarly, its attempts to prevent retrieval of public information through intimidation, e.g., cease-and-desist letters and threats of CFAA claims, have failed. Tutorials on "web scraping" that extol Python frameworks should use LinkedIn as an example instead of trivial examples for which using Python is, IMHO, overkill.

2. What would be more interesting is a Rosetta Code for "web scraping" tasks. There are many, many ways to do public data/information retrieval from the www. Using scripting languages such as Python, Ruby, NodeJS, etc., and their frameworks is one way. That approach may be ideally suited for large-scale jobs, like those undertaken by what the author calls "internet companies". But for smaller tasks undertaken by individual www users for noncommercial purposes, e.g., this author's concept of "scrapism", there are also faster, less complicated and more efficient options.

3. Other than the Almquist shell.

1vuio0pswjnm7 | 2 years ago

Good guide!

The "Scraping XHR" [1] explains how to inspect network requests and reproduce them with Python. I actually built har2requests [2] to automate that process!

[1]: https://scrapism.lav.io/scraping-xhr/

[2]: https://github.com/louisabraham/har2requests
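For anyone who hasn't done it by hand, the manual version of that process looks roughly like this (the endpoint, params, and headers below are placeholders you'd copy from the browser's network inspector, or let har2requests generate from a HAR export):

    # Replaying an XHR request found in the browser's network tab.
    # URL, params, and headers are hypothetical; copy the real ones
    # from the request the page actually makes.
    import requests

    API_URL = "https://example.com/api/search"

    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        # Some endpoints also need cookies, CSRF tokens, or auth headers
        # copied from the original browser session.
    }
    params = {"q": "scrapism", "page": 1}

    resp = requests.get(API_URL, headers=headers, params=params)
    resp.raise_for_status()
    print(resp.json())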

Labo333 | 2 years ago

Hi! This is a guide that I started during the pandemic but never quite finished. I’m in the process of re-writing/re-recording some parts of it to bring it back up to date, and adding in the bits that are still missing.

saaaam | 2 years ago

I'm bothered that this doesn't mention any of the ethics involved, such as checking the robots.txt file and so forth.

More than half of my traffic is from bots, so I'm paying something like half my operational expenses to support them. And we've had to do a lot of work to mitigate what would otherwise be DoS attacks from badly written (or badly intended!) bots. I think that at least a tip of the hat to avoiding damage would be appropriate in a piece like this.
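For what it's worth, the robots.txt check is only a few lines with Python's standard library; a quick sketch (the site and user-agent string are placeholders):

    # Check robots.txt before fetching a page -- placeholder site and bot name.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser("https://example.com/robots.txt")
    robots.read()

    if robots.can_fetch("my-scraper-bot", "https://example.com/some/page"):
        print("allowed; fetch politely and rate-limit yourself")
    else:
        print("disallowed by robots.txt; skip it")

That, plus a delay between requests, would go a long way toward the kind of tip of the hat I mean.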

CWuestefeld | 2 years ago

Tangentially related question:

Is Python still the most common tool used for web scraping, and if so, what's the advantage over jsdom/cheerio or, say, a headless-browser-based tool like Puppeteer?

I've been using these tools for years, but I grew up in the JS world, so I'd be curious to hear from people with different backgrounds/biases than mine :)

rpastuszak | 2 years ago

Was happy to find that the person behind it is Sam Lavigne, one of the people behind Stupid Hackathon.

fjallstrom | 2 years ago