Welcome, folks! In this blog post we will extract all the hyperlinks from a PDF document in the terminal using Python. The full source code of the application is shown below.
Get Started
To get started, we need to install the PyPDF2 library from the terminal using the command shown below:
pip install pypdf2
PyPDF2 is a popular library for reading PDF documents and extracting their text, which includes any hyperlinks written out in that text. First of all, we need to import the libraries as shown below:
import PyPDF2
import re
Here we are importing the two packages we need: PyPDF2, and re, which is Python's built-in regular expression module.
file = open("newfile.pdf", 'rb')
readPDF = PyPDF2.PdfFileReader(file)
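Now we use the built-in open() function to open the PDF file in binary mode. After that, we pass the file object to PdfFileReader(), which reads the contents of the file, and we store the resulting reader in the readPDF variable. Note that PdfFileReader is the class name in PyPDF2 releases before 3.0; from 3.0 onward it is called PdfReader.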
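To see what binary mode gives us, here is a small sketch that uses only the standard library. It assumes a hypothetical helper called looks_like_pdf (it is not part of PyPDF2) and relies on the fact that PDF files begin with the bytes %PDF-. Treat it as a quick sanity check, not a full validation of the file.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    # 'rb' gives us raw bytes, with no text decoding or newline translation
    with open(path, "rb") as f:
        header = f.read(5)
    return header == b"%PDF-"


# Demo on a throwaway file so the snippet runs without a real PDF
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n% placeholder bytes, not a complete PDF\n")
    demo_path = tmp.name

print(looks_like_pdf(demo_path))
os.remove(demo_path)
```

A check like this can be run before handing the file to the PDF reader, so that a wrong file is caught early.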
for page_no in range(readPDF.numPages):
    page = readPDF.getPage(page_no)
    # Extract the text from the page
    text = page.extractText()
    # Print all URLs found on the page
    print(find_url(text))
Now we use a for loop to go through all the pages inside the PDF document. For each page, we first extract its text using extractText() and then pass that text to a separate helper function, find_url(), which finds the hyperlinks in it. (numPages, getPage() and extractText() are the PyPDF2 names used before version 3.0; in 3.0 and later the equivalents are len(readPDF.pages), readPDF.pages[page_no] and page.extract_text().) Now we need to write this function:
def find_url(string):
    # Find all the substrings that match the URL pattern
    regex = r"(https?://\S+)"
    urls = re.findall(regex, string)
    # Return every match, not just the first one
    return urls
Here you can see the function receives the text of a PDF page as a string. The regular expression https?://\S+ matches http:// or https:// followed by a run of non-whitespace characters, and re.findall() returns a list of every such match in the string. The result is then returned to the caller, which prints it inside the page loop.
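As a standalone sketch of the matching step, the snippet below runs the same pattern against a plain string instead of a PDF page. The function name find_urls, the sample text, and the trailing punctuation cleanup are assumptions added for illustration; they are not part of the script above.

```python
import re

# Same pattern as in the post: http:// or https:// followed by non-space characters
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return every URL-like match in text as a list (hypothetical helper)."""
    matches = URL_PATTERN.findall(text)
    # \S+ also swallows punctuation that touches the URL, so trim it.
    # This is an assumption about what is wanted: a URL that really
    # ends in ')' would be shortened as well.
    return [m.rstrip(".,;:!?)\"'") for m in matches]


sample = "Docs: https://example.com/guide, mirror at http://example.org/pdf."
for url in find_urls(sample):
    print(url)
```

Returning the whole list, rather than returning from inside a loop, keeps every match and leaves it to the caller to decide how to print or store them.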
file.close()
Finally, we call the file.close() method to close the file and release the resources it holds.
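If you would rather not call close() by hand, Python's with statement closes the file automatically, even when an error occurs partway through. The sketch below shows this with a throwaway file created by the standard library; the file contents are placeholders, and the comment marks where the PDF reading code would go.

```python
import os
import tempfile

# Create a throwaway file so the example runs without a real PDF
fd, path = tempfile.mkstemp(suffix=".pdf")
os.write(fd, b"placeholder bytes")
os.close(fd)

with open(path, "rb") as file:
    # The PDF reading and URL extraction would go here
    data = file.read()

# Leaving the with block has already closed the file
print(file.closed)

os.remove(path)
```

With this pattern there is no separate close() call to forget.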