JPL, Meet PDF.
While NASA's Jet Propulsion Laboratory (JPL) is known for driving rovers on Mars and deploying spacecraft to study planets in the solar system, JPL’s latest project is more down to earth: compiling the world’s largest publicly available archive of PDF files for security research purposes.
PDF files are the most popular form of digital document in the world. And while they may look like scans of paper documents, they are actually collections of text, images, movies, and active content that aren’t as secure as they should be and are scattered all over the place. To address this concern, JPL has partnered with the nonprofit PDF Association to develop a new archive of files that will help researchers analyze potential threats across an extensive library of real PDFs.
The project involves compiling nearly 8 million PDF files totaling more than 8 terabytes of data from various online sources. This effort is part of a Defense Advanced Research Projects Agency (DARPA) initiative called SafeDocs, which aims to make digital documents safe from malicious code and other security concerns.
“PDF files are used everywhere and are important for contracts, legal documents, 3D engineering designs, and many other purposes,” said JPL data scientist Tim Allison in a statement. “Unfortunately, they are complex and can be hacked to hide malicious code or to maliciously present different information to different users. To address these and other PDF challenges, a large sample of real-world PDFs must be collected from the Internet to create a shared, freely available resource for software experts.”
Using Common Crawl’s freely available public repository of web crawl data as a starting point, the JPL researchers identified PDFs to add to the collection, including those left incomplete by Common Crawl’s 1 MB limit per downloaded file. JPL then accessed the PDF file URLs directly to download the complete documents, ensuring a fully representative archive of the PDF file types accessible on the Web.
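The collection step described above can be sketched roughly as follows. This is an illustrative Python sketch, not JPL's actual pipeline: the record layout (a dict with `url` and `body` keys), the function names, and the truncation heuristic are all assumptions made for the example. It flags stored PDF bodies that appear to have been cut off at the 1 MB limit and lists the URLs that would need to be downloaded again in full.

```python
from urllib.request import Request, urlopen

# Assumption: the crawl keeps at most 1 MiB of each response body.
CRAWL_LIMIT_BYTES = 1024 * 1024


def looks_truncated(body: bytes, limit: int = CRAWL_LIMIT_BYTES) -> bool:
    """Heuristic: the body hit the size cap, or has no %%EOF marker near its end."""
    if len(body) >= limit:
        return True
    # A complete PDF normally ends with an %%EOF marker; check the last 1 KiB.
    return b"%%EOF" not in body[-1024:]


def urls_to_refetch(records):
    """Return the URLs of PDF records whose stored body looks incomplete.

    Each record is a hypothetical dict with 'url' and 'body' keys.
    """
    return [
        r["url"]
        for r in records
        if r["body"].startswith(b"%PDF-") and looks_truncated(r["body"])
    ]


def fetch_full(url: str, timeout: float = 30.0) -> bytes:
    """Download the complete document directly from its original URL."""
    req = Request(url, headers={"User-Agent": "pdf-archive-sketch/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

In use, each URL returned by `urls_to_refetch` would be passed to `fetch_full`. A real crawler would also need retries, rate limiting, and handling for URLs that no longer resolve.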
By making the collection publicly available, JPL hopes researchers can use and analyze the PDF files to identify better ways to secure the information these documents contain.