Archiving the web, one site at a time.
There are different ways of going about this business of stuffing your computer all at once: saving sites for future reference and/or offline reading. I haven't settled on an approach yet.
ArchiveBox: host your own personal and private internet archive on your own server. Free and open source; view or read pages offline. archivebox.io https://github.com/ArchiveBox/ArchiveBox https://ostechnix.com/self-host-internet-archive-with-archivebox/
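Getting started looks roughly like this (a minimal sketch based on the project's README; the directory and URL are placeholders):
pip install archivebox
mkdir ~/archive && cd ~/archive        # an empty directory to hold the collection
archivebox init                        # create the collection layout and index
archivebox add 'https://example.com'   # snapshot one URL
archivebox server 0.0.0.0:8000         # browse the archive through the web UI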
Grab-site: an easy, preconfigured web crawler designed for backing up websites.
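Basic usage is meant to be a one-liner (a sketch, assuming grab-site is already installed; the URLs are placeholders):
grab-site 'https://example.com/'           # crawl the whole site into a WARC file
grab-site --1 'https://example.com/page'   # grab a single page and its requisites only
gs-server                                  # dashboard for watching running crawls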
Shot-scraper: just automates capturing how a site renders: https://shot-scraper.datasette.io/en/stable/ ; can also save as a PDF (a quick usage sketch follows below).
The author's site: https://til.simonwillison.net/chrome/headless (uses headless Chrome).
Discussion: https://news.ycombinator.com/item?id=39810378
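Typical invocations, sketched from its docs (the URL and filenames are placeholders):
pip install shot-scraper
shot-scraper install                                   # fetch the headless browser it drives
shot-scraper https://example.com/ -o example.png       # full-page screenshot
shot-scraper pdf https://example.com/ -o example.pdf   # save the page as a PDF instead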
Offline Internet Archive server: crawls Internet Archive collections to a local server, serves that content locally, caches content while browsing, and moves content between servers by sneakernet […] https://github.com/internetarchive/dweb-mirror
HTTrack: the classic. https://www.httrack.com/page/1/en/index.html
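A typical mirror command, per the HTTrack docs (output directory and filter are placeholders):
httrack "https://www.example.com/" -O ./example-mirror "+*.example.com/*" -v
# -O sets the output directory, the "+..." filter keeps the crawl on the site's own domain, -v is verbose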
Heritrix: archive.org's crawler; I don't know if it's still current. Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/index.html
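If it is still current, it runs as a Java web console; starting it looks roughly like this (a sketch, assuming an unpacked Heritrix 3 release and a JDK on the machine):
./bin/heritrix -a admin:admin     # start the engine with a login for the web console
# then create and launch crawl jobs from the UI at https://localhost:8443/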
Command-line tools:
Monolith: a CLI tool for saving complete web pages as a single HTML file. https://github.com/Y2Z/monolith
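Usage is a one-liner; something like this (URL and filename are placeholders):
monolith https://example.com/article -o article.html   # CSS, JS and images inlined into one .html file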
WGET
A shitload of examples: http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/
wget -m -k -K -E http://url/du/beau/site
Limit the speed, pause between each request, and use/fake a user agent:
wget -r -p -U Mozilla --wait=10 --limit-rate=36K https://www.theinternet.com
The same with more semantic switches:
wget --mirror --convert-links --backup-converted --adjust-extension http://url/du/beau/site
-m, --mirror Turns on recursion and time-stamping, sets infinite
recursion depth, and keeps FTP directory listings.
-p, --page-requisites Get all images, etc. needed to display HTML page.
-E, --adjust-extension Save HTML/CSS files with .html/.css extensions.
-k, --convert-links Make links in downloaded HTML point to local files.
-np, --no-parent Don't ascend to the parent directory when retrieving
recursively. This guarantees that only the files below
a certain hierarchy will be downloaded. Requires a slash
at the end of the directory, e.g. example.com/foo/.
Not certain about this one: pasted it a bit messily here.
Wayback Machine Downloader: run it with the desired domain and an optional timestamp to pull a site back out of the Internet Archive.
sudo gem install wayback_machine_downloader   # it's a Ruby gem
mkdir example
cd example
wayback_machine_downloader http://example.com --timestamp 19700101000000   # optional timestamp narrows which snapshots are fetched (check --help for the current flag names)