WebNinjaDeveloper.com


Node.js PDF Parse Library Tutorial to Extract Text & Meta Information From PDF Document in Command Line

Posted on November 15, 2022

 

 

Welcome folks! In this blog post we will extract the text and meta information from a PDF document in Node.js using the pdf-parse library, running the script from the command line. The full source code of the application is shown below.

Get Started

 

 

To get started, initialize a new Node.js project and then install the pdf-parse package by running the two commands below:

npm init -y

npm i pdf-parse

Basic Example

 

 

Now we will write a basic example for this Node.js module. Copy and paste the code below into the index.js file of your project.

index.js

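JavaScript
const fs = require('fs');          // built-in file system module
const pdf = require('pdf-parse');  // extracts text and meta information

// Read the whole PDF file into memory as a Buffer
const dataBuffer = fs.readFileSync('path to PDF file...');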

As you can see, we import the pdf-parse module along with the built-in fs module, which we use here to read files from the local file system. We then read the PDF file at the given path with fs.readFileSync, which returns the file contents as a Buffer object that we store in the dataBuffer variable.
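Since we run the script from the command line, it can be handy to take the PDF path as an argument instead of hard-coding it. The snippet below is only a minimal sketch of that idea: the helper names pdfPathFromArgs, looksLikePdf and readPdfBuffer are hypothetical (they are not part of pdf-parse), and passing the path as the first argument is an assumption. The check relies on the fact that PDF files start with the %PDF- signature.

```javascript
const fs = require('fs');

// Hypothetical helper: pick the PDF path from the command line arguments.
// argv[0] is the node binary, argv[1] is the script, argv[2] is the first real argument.
function pdfPathFromArgs(argv) {
    const filePath = argv[2];
    if (!filePath) {
        throw new Error('Usage: node index.js <file.pdf>');
    }
    return filePath;
}

// Hypothetical helper: true when the buffer starts with the "%PDF-" signature.
function looksLikePdf(buffer) {
    return buffer.length >= 5 && buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

// Hypothetical helper: read a file into a Buffer and reject obvious non-PDF input.
function readPdfBuffer(filePath) {
    const buffer = fs.readFileSync(filePath); // throws if the file cannot be read
    if (!looksLikePdf(buffer)) {
        throw new Error(`${filePath} does not start with the %PDF- signature`);
    }
    return buffer;
}
```

With these helpers, the buffer could be created with readPdfBuffer(pdfPathFromArgs(process.argv)) and the script started with node index.js followed by the path to your PDF file.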

 

Next, we read the meta information and extract the text from the PDF document as shown below.

JavaScript
pdf(dataBuffer).then(function(data) {
    // number of pages
    console.log(data.numpages);
    // number of rendered pages
    console.log(data.numrender);
    // PDF info
    console.log(data.info);
    // PDF metadata
    console.log(data.metadata);
    // PDF.js version
    // check https://mozilla.github.io/pdf.js/getting_started/
    console.log(data.version);
    // PDF text
    console.log(data.text);
});

As you can see, we pass the buffer to the pdf function exported by pdf-parse (it is a plain function, not a constructor). It returns a promise, and the callback given to then receives a result object as its argument. From this object we read the number of pages in the document (numpages), the number of pages that were rendered (numrender), the document information (info), the metadata (metadata) and the PDF.js version (version). Finally, we extract the text of the PDF document from the text property.
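The same flow can also be written with async/await and a bit of error handling. What follows is a rough sketch and not the library's official usage: summarize and extractSummary are hypothetical helper names, the parser is passed in as the parse parameter (in practice this would be the pdf function from the code above), and reading info.Title assumes the document carries a title in its info dictionary.

```javascript
// Hypothetical helper: condense a pdf-parse result object into a few fields.
function summarize(data) {
    const info = data.info || {};
    const text = data.text || '';
    // Split on whitespace and drop the empty strings left by leading/trailing spaces
    const words = text.split(/\s+/).filter(Boolean);
    return {
        pages: data.numpages,
        title: info.Title || '(untitled)', // assumption: title stored under "Title"
        wordCount: words.length,
        preview: text.trim().slice(0, 80), // first 80 characters of the text
    };
}

// Hypothetical helper: `parse` is expected to be the function exported by pdf-parse.
async function extractSummary(dataBuffer, parse) {
    try {
        const data = await parse(dataBuffer);
        return summarize(data);
    } catch (err) {
        // Re-throw with some context so the command line output is easier to read
        throw new Error(`Could not parse PDF: ${err.message}`);
    }
}
```

In index.js this could be called as extractSummary(dataBuffer, pdf).then(console.log).catch(console.error). Passing the parser in as a parameter keeps the helper easy to try out with a fake parser.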

 
