For grabbing PDFs from ICRA 2022
Noëlle abf4cc8477
Add caching and resuming
2 years ago
html Finalize main.py, Readme, and CSS/JS 2 years ago
.gitignore Finalize main.py, Readme, and CSS/JS 2 years ago
LICENSE Initial commit 2 years ago
README.md Add caching and resuming 2 years ago
config.ini Initialize 2 years ago
empty-config.ini add empty-config 2 years ago
main.py Add caching and resuming 2 years ago
requirements.txt Update main & requirements 2 years ago
scraper.py Initialize 2 years ago

README.md

pdf-grabber

For grabbing PDFs from ICRA 2022!

Usage

Make sure you have Python 3.6 or later, create a virtual environment if you like, then run these commands from a command line:

  • pip3 install -r requirements.txt
  • python3 main.py

This script will create a sub-directory, pdfs/, where it will store the PDFs it downloads. PDFs are named according to the presentation’s name and the PDF’s file number.
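As a rough illustration of the naming scheme described above, here is a minimal sketch. The function names `pdf_filename` and `pdf_path`, the underscore separator, and the character-cleaning rule are assumptions made for this example; the actual format used by main.py may differ.

```python
import re
from pathlib import Path


def pdf_filename(presentation_name: str, file_number: int) -> str:
    """Build a filesystem-safe PDF name (hypothetical naming scheme)."""
    # Collapse each run of characters other than word characters, dots,
    # or hyphens into a single underscore, then trim stray underscores.
    safe = re.sub(r"[^\w.-]+", "_", presentation_name).strip("_")
    return f"{safe}_{file_number}.pdf"


def pdf_path(presentation_name: str, file_number: int, out_dir: str = "pdfs") -> Path:
    """Return the full path inside the output directory, creating it if needed."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    return target / pdf_filename(presentation_name, file_number)
```

Under these assumptions, a presentation called "Robot Grasping: A Survey" with file number 2 would be stored as pdfs/Robot_Grasping_A_Survey_2.pdf.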

You can use the -e flag (e.g. python3 main.py -e 88) to choose which event ID to scan for presentations that have PDFs. By default, this is event 88. (The number is unfortunate; it’s the event this was written for, the 39th IEEE International Conference on Robotics and Automation, and bears no other symbolism here.)

You can use the -s flag (e.g. python3 main.py -s) to save the HTML content of each page along with the PDF. This is mostly for diagnostic purposes. The CSS and JavaScript files required by the HTML files are included here, but you may have to move them somewhere else to get them to work properly (where depends on your system).

You can use the -c flag (e.g. python3 main.py -c) to disable caching. By default, this script caches which pages it’s seen, so it doesn’t try to download them again. With -c, the script ignores the cache and starts the whole process over every time you launch it. Note that this will not destroy an existing cache.
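The caching behavior can be pictured with a small sketch like the one below. It is not taken from main.py: the cache file name `cache.json`, the JSON format, and the helper names `load_cache` and `save_cache` are all assumptions for illustration.

```python
import json
from pathlib import Path

CACHE_FILE = Path("cache.json")  # hypothetical file name


def load_cache(path: Path = CACHE_FILE, use_cache: bool = True) -> set:
    """Return the set of page IDs that were already processed."""
    # With caching disabled (the -c case), start from an empty set
    # but leave any existing cache file on disk untouched.
    if not use_cache or not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_cache(seen: set, path: Path = CACHE_FILE, use_cache: bool = True) -> None:
    """Write the seen set to disk, unless caching is disabled."""
    if use_cache:
        path.write_text(json.dumps(sorted(seen)))
```

In this sketch, disabling the cache only skips reading and writing it, which matches the note that -c does not destroy an existing cache.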

You can use the -w flag (e.g. python3 main.py -w 76105) to start from a given session ID (not a presentation ID). This is useful if you’ve used the -c flag and had to interrupt the process, or if you used the script before I added caching. Note that if you have caching enabled (this is the default) and you specify a session ID past where you last left off, the script will add the intermediate sessions to the cache.
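One way to picture how -w interacts with the cache is the sketch below. The function name `plan_sessions` is hypothetical, and the sketch assumes sessions are visited in ascending ID order, which the README does not state.

```python
def plan_sessions(session_ids, start_at=None, seen=None, use_cache=True):
    """Return the session IDs still to visit, in ascending order.

    Sessions below `start_at` are skipped; when caching is enabled they
    are also added to `seen`, so a later run will not revisit them.
    """
    seen = set() if seen is None else seen
    todo = []
    for sid in sorted(session_ids):
        if start_at is not None and sid < start_at:
            if use_cache:
                seen.add(sid)  # mark the skipped session as already handled
            continue
        if use_cache and sid in seen:
            continue  # already processed on an earlier run
        todo.append(sid)
    return todo
```

For example, with sessions 1 through 5, a cache containing only session 1, and a start of 4, this sketch would visit sessions 4 and 5 and add sessions 2 and 3 to the cache.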

Note that you can use any combination of these flags! python3 main.py -e 74 -s -w 24601 is just fine.
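For readers curious how such flags are typically wired up, here is a minimal sketch using Python’s standard argparse module. Only the short flags and the default event of 88 come from this README; the `dest` names, help strings, and `build_parser` function are assumptions, and main.py may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags described in this README."""
    parser = argparse.ArgumentParser(description="Grab PDFs from a conference event.")
    # The dest names below are hypothetical, chosen for readability.
    parser.add_argument("-e", dest="event", type=int, default=88,
                        help="event ID to scan (default: 88)")
    parser.add_argument("-s", dest="save_html", action="store_true",
                        help="also save each page's HTML")
    parser.add_argument("-c", dest="no_cache", action="store_true",
                        help="disable caching")
    parser.add_argument("-w", dest="start_session", type=int, default=None,
                        help="session ID to start from")
    return parser


# Equivalent to running: python3 main.py -e 74 -s -w 24601
args = build_parser().parse_args(["-e", "74", "-s", "-w", "24601"])
```

Because each flag is an independent optional argument, any combination (or none) parses cleanly, and omitted flags fall back to their defaults.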

Please contact the author by email at noelle AT noelle.codes or by Mastodon at chat.noelle.codes/@noelle if you have questions or trouble.