Tools

Archiving the web, one site at a time.

There are different ways of going about this business of stuffing your computer all at once, saving things for future and/or offline reference; I haven't made a choice yet.

ArchiveBox: host your own personal and private internet archive on your own server. Free, open source; view or read pages offline. archivebox.io https://github.com/ArchiveBox/ArchiveBox https://ostechnix.com/self-host-internet-archive-with-archivebox/
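A minimal sketch of the usual workflow, assuming a pip install (the example.com URL and directory name are just placeholders):

pip install archivebox
mkdir archivebox-data && cd archivebox-data
archivebox init
archivebox add 'https://example.com'
archivebox server 0.0.0.0:8000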

Grab-site: an easy, preconfigured web crawler designed for backing up websites.
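A sketch of the basic invocation, assuming grab-site is installed; it crawls the site and writes a WARC into the current directory (placeholder URL):

grab-site https://example.com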

Shot-scraper: just automates capturing how the site renders: https://shot-scraper.datasette.io/en/stable/ ; it can also save to PDF.
Author's site: https://til.simonwillison.net/chrome/headless (uses headless Chrome).
Discussion: https://news.ycombinator.com/item?id=39810378
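A quick sketch, assuming a pip install (shot-scraper install fetches the headless browser it drives; the URL and output filenames are placeholders):

pip install shot-scraper
shot-scraper install
shot-scraper https://example.com -o example.png
shot-scraper pdf https://example.com -o example.pdf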

Offline Internet Archive server: crawls Internet Archive collections to a local server, serves that content locally, caches content while browsing, moves content between servers by sneakernet […] https://github.com/internetarchive/dweb-mirror

HTTrack: the classic. https://www.httrack.com/page/1/en/index.html
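A hedged example of the command-line form (there is also a GUI); -O sets the output directory and the quoted pattern keeps the crawl on the site's own domain (placeholder URL and paths):

httrack "https://example.com/" -O "./example-mirror" "+*.example.com/*" -v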

Heritrix: the archive.org crawler; I don't know whether it's still current. Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/index.html

Command-line tools:

Monolith: CLI tool for saving complete web pages as a single HTML file. https://github.com/Y2Z/monolith
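A minimal sketch; -o names the single output file (URL and filename are placeholders):

monolith https://example.com -o example.html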

WGET

A shitload of examples: http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/

wget -m -k -K -E http://url/du/beau/site

Limit the speed, wait between each download, and use/fake a user agent:
wget -r -p -U Mozilla --wait=10 --limit-rate=36K https://www.theinternet.com

More semantic switches:
wget --mirror --convert-links --backup-converted --adjust-extension http://url/du/beau/site

-m, --mirror            Turns on recursion and time-stamping, sets infinite 
                          recursion depth, and keeps FTP directory listings.
-p, --page-requisites   Get all images, etc. needed to display HTML page.
-E, --adjust-extension  Save HTML/CSS files with .html/.css extensions.
-k, --convert-links     Make links in downloaded HTML point to local files.
-np, --no-parent        Don't ascend to the parent directory when retrieving 
                        recursively. This guarantees that only the files below 
                        a certain hierarchy will be downloaded. Requires a slash 
                        at the end of the directory, e.g. example.com/foo/.
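Since -np shows up in that list but not in the examples above, here's a sketch of mirroring only one subtree (hypothetical URL, note the trailing slash):

wget --mirror --no-parent --page-requisites --convert-links --adjust-extension https://example.com/foo/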

Not certain: pasted somewhat haphazardly here.

Wayback Machine Downloader: run it with the desired domain and an optional timestamp to pull a site back out of the Internet Archive.

sudo gem install wayback_machine_downloader
mkdir example
cd example
wayback_machine_downloader http://example.com --timestamp 19700101000000 
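If memory serves, the gem also accepts --from and --to timestamps to narrow the range of snapshots it grabs; treat the flags below as an assumption to check against its README (placeholder domain and dates):

wayback_machine_downloader http://example.com --from 20100101 --to 20151231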
