Welcome folks! In this blog post we will extract the text and meta information from a PDF document in Node.js using the pdf-parse library, running the script from the command line. The full source code of the application is shown below.
Get Started
To get started, initialize a new Node.js project and install the pdf-parse library by executing the commands below.
npm init -y
npm i pdf-parse
Basic Example
Now we will write a basic example for this Node.js module. Just copy and paste the code below into the index.js file of your project.
index.js
const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');
As you can see, we are importing the pdf-parse module and also the fs module, which handles reading and writing files in the local file system. We then read the PDF file from the provided path with fs.readFileSync. The dataBuffer variable holds the contents of the PDF file as a Buffer object.
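A missing or wrong file path is a likely failure at this step, so it can help to wrap the read in a small helper. The sketch below is only an illustration and not part of pdf-parse: `readPdfBuffer` is a hypothetical helper name, and the `%PDF-` signature check is assumed here as a cheap sanity test before the buffer is handed to the parser.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of pdf-parse): read a file into a Buffer
// and fail early if it does not look like a PDF.
function readPdfBuffer(filePath) {
  // readFileSync throws (for example with code ENOENT) if the path does not exist
  const buffer = fs.readFileSync(filePath);

  // PDF files normally begin with the ASCII signature "%PDF-"
  const signature = buffer.subarray(0, 5).toString('latin1');
  if (signature !== '%PDF-') {
    throw new Error(`Not a PDF file: ${filePath}`);
  }
  return buffer;
}
```

The returned buffer can be used in place of dataBuffer in the rest of the example.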
Now we need to read the meta information and extract the text from the PDF document, as shown below.
pdf(dataBuffer).then(function(data) {

    // number of pages
    console.log(data.numpages);
    // number of rendered pages
    console.log(data.numrender);
    // PDF info
    console.log(data.info);
    // PDF metadata
    console.log(data.metadata);
    // PDF.js version
    // check https://mozilla.github.io/pdf.js/getting_started/
    console.log(data.version);
    // PDF text
    console.log(data.text);

});
As you can see, we pass the buffer holding the PDF document to the pdf-parse function (it is a plain function, not a constructor). It returns a promise, and the parsed result is passed to the then callback as the data argument. From this data object we get the number of pages in the PDF document through the numpages property. The document's meta information is available in the info and metadata properties. Finally, we extract the text of the PDF document using the text property, as shown above.
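The same flow can also be written with async/await, which keeps the steps in reading order. The sketch below is a minimal illustration, assuming the function-style pdf-parse API used in this post (`pdf(buffer)` returning a promise). `summarizePdf` and its `parse` parameter are hypothetical names introduced here. The parser is passed in as an argument only so that the function can be tried with a stand-in parser.

```javascript
const fs = require('fs');

// Hypothetical helper: parse a PDF and return only the fields discussed above.
// `parse` defaults to pdf-parse, which is loaded lazily, only when no parser is passed in.
async function summarizePdf(filePath, parse = require('pdf-parse')) {
  // Read the whole file into a Buffer, as in the basic example
  const dataBuffer = fs.readFileSync(filePath);

  // Wait for the parser's promise to resolve with the data object
  const data = await parse(dataBuffer);

  return {
    pages: data.numpages,   // number of pages
    info: data.info,        // PDF info
    text: data.text.trim(), // extracted text without surrounding whitespace
  };
}
```

With pdf-parse installed, calling summarizePdf with the path to a PDF file and handling the returned promise with then and catch will print or report the result in one place. A read or parse failure rejects the promise, so a single catch covers both cases.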