WebNinjaDeveloper.com

Python 3 PyPDF2 Script to Extract All Hyperlinks From PDF Document in Terminal GUI Desktop App

Posted on October 6, 2022

Welcome, folks! In this blog post we will extract all the hyperlinks from a PDF document in the terminal using Python. The full source code of the application is shown below.

Get Started

To get started, install the PyPDF2 library from the terminal using the command shown below:

pip install pypdf2
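
PyPDF2 renamed several classes and methods between releases (for example, PdfFileReader was removed in PyPDF2 3.0 in favor of PdfReader), so it can help to confirm which version pip actually installed. Here is a minimal sketch that uses only the standard library; the helper name installed_version is just illustrative and not part of PyPDF2:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the PyPDF2 version string, or None if the install step was skipped
    print(installed_version("PyPDF2"))
```

If this prints None, the pip command above most likely ran against a different Python environment than the one running the script.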

 

 

PyPDF2 is a widely used library for reading PDF documents and extracting their text, which includes any hyperlinks written out in that text. First of all, we need to import the library as shown below.

Python
import PyPDF2
import re

Here we import the two packages we need: PyPDF2, and re, Python's built-in regular expression module.

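Python
# Open the PDF in binary mode and hand the file object to PyPDF2
file = open("newfile.pdf", "rb")
readPDF = PyPDF2.PdfReader(file)

Next, we use the built-in open() function to open the PDF file in binary mode. We then pass the file object to PyPDF2's PdfReader class (named PdfFileReader in older PyPDF2 releases, and removed under that name in PyPDF2 3.0), which parses the document, and we store the resulting reader in the readPDF variable.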

Python
for page_no in range(len(readPDF.pages)):
    page = readPDF.pages[page_no]
    # Extract the text from the page
    text = page.extract_text()
    # Print the URLs found on this page
    print(find_url(text))

Now we use a for loop to go through every page of the PDF document. For each page, we first extract its text with extract_text() (called extractText() in older PyPDF2 releases) and then pass that text to a helper function, find_url(), which finds the hyperlinks in it. We still need to write this function, and in the finished script its definition must come before this loop so that it exists when the loop runs.

Python
def find_url(string):
    # Find every substring that matches the URL pattern
    regex = r"(https?://\S+)"
    urls = re.findall(regex, string)
    # Return all matches as a list (empty if the text has no URLs)
    return urls

Here the function receives the text of a PDF page as a string. We use a regular expression that matches anything starting with http:// or https:// up to the next whitespace character, and pass it to re.findall(), which returns every match as a list. The function returns that whole list, so the loop prints all the URLs found on each page rather than only the first one.
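
Because the pattern grabs everything up to the next whitespace, a URL at the end of a sentence can pick up a trailing comma or period, and the same link may be reported several times. The sketch below shows one possible way to tidy the matches. The name find_clean_urls and the set of characters treated as trailing punctuation are assumptions made for illustration, not part of the original script:

```python
import re

# Same idea as the pattern above: "http://" or "https://" followed by non-space characters
URL_PATTERN = re.compile(r"https?://\S+")

# Characters that often trail a URL in running text (an assumption, adjust as needed)
TRAILING_PUNCTUATION = ".,;:!?)]}>\"'"


def find_clean_urls(text):
    """Return the unique URLs in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that was glued onto the end of the match
        url = match.rstrip(TRAILING_PUNCTUATION)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = "See https://example.com/docs, then http://example.org. Again: https://example.com/docs"
print(find_clean_urls(sample))
# ['https://example.com/docs', 'http://example.org']
```

One trade-off of this approach: a URL that genuinely ends with a closing parenthesis or similar character will lose it, so the punctuation set is worth adjusting to the documents being processed.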

 

 

Python
file.close()

Finally, we call file.close() to close the file and release the resources it was holding.
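
The text-based search only finds URLs that are printed in the page text. Clickable links in a PDF are usually stored separately as link annotations, so a link whose visible label is something like "click here" would be missed. The following is a rough sketch of reading those annotations, assuming a PyPDF2 release that provides PdfReader and a PDF that uses the standard /Annots, /A and /URI keys. The function names uris_from_annotations and link_targets are hypothetical. It also opens the file in a with block, which closes the file automatically instead of requiring a file.close() call:

```python
def uris_from_annotations(annotations):
    """Return the /URI targets of the link annotations in one page's /Annots list."""
    uris = []
    for annotation in annotations:
        # PyPDF2 may hand back indirect references; resolve them when possible
        if hasattr(annotation, "get_object"):
            annotation = annotation.get_object()
        # Only link annotations with an action dictionary are of interest
        if "/Subtype" not in annotation or annotation["/Subtype"] != "/Link":
            continue
        if "/A" not in annotation:
            continue
        action = annotation["/A"]
        if "/URI" in action:
            uris.append(str(action["/URI"]))
    return uris


def link_targets(path):
    """Collect link annotation targets from every page of the PDF at path."""
    import PyPDF2  # third-party; imported here so the helper above has no dependency

    targets = []
    # The with block closes the file when the block ends, even if an error occurs
    with open(path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            if "/Annots" in page:
                targets.extend(uris_from_annotations(page["/Annots"]))
    return targets
```

Calling link_targets("newfile.pdf") would return a list of link targets, which can be combined with the URLs found by the regular expression. Internal links (jumps to another page) have no /URI entry and are skipped by this sketch.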

 
