WebNinjaDeveloper.com

Node.js Tutorial to Parse & Extract Text & Tables From PDF Document Using pdfreader Library in Javascript

Posted on March 20, 2023

 

 

Welcome, folks. In this blog post we will parse and extract text and tables from a PDF document using the pdfreader library in JavaScript. The full source code of the application is shown below.

 

 

Get Started

 

 

To get started, create a new Node.js project using the npm command below

 

 

npm init -y

 

 

Next, install the pdfreader library using the command below

 

 

npm i pdfreader

 

 

After that, create the index.js file and copy and paste the code from the index.js section below into it. Because that code uses the import syntax, you also need to open the package.json file and set "type" to "module", so that Node.js treats index.js as an ES module.

 

 

package.json
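
A minimal sketch of how package.json could look after this change, assuming the defaults generated by npm init -y. The name, version and main values are placeholders, and the dependencies entry that npm i pdfreader adds is left out here. The only line that matters for this tutorial is "type".

```json
{
  "name": "pdf-parser-demo",
  "version": "1.0.0",
  "main": "index.js",
  "type": "module"
}
```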

 

 

 

 

index.js

 

 

JavaScript
import { PdfReader } from "pdfreader";

// Parse file.pdf; the callback runs once per item and one last time at the end
new PdfReader().parseFileItems("file.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file"); // no item means parsing is done
  else if (item.text) console.log(item.text); // print each piece of text
});

 

 

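As you can see, we import PdfReader from the pdfreader library at the top and pass the path of the PDF file to parseFileItems(). The callback is called once for every item found in the file, and we print the text of each text item in the terminal. When the callback receives no item, the end of the file has been reached.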
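
Each text item also carries x and y coordinates, which is what makes table extraction possible. The snippet below is a minimal sketch, not the library's own table parser: it assumes that all cells of a table row share exactly the same y value, and groupItemsIntoRows and sampleItems are hypothetical names made up for this example. It runs on plain objects shaped like pdfreader items, so it works without a PDF file.

```javascript
// Group pdfreader-style text items ({ x, y, text }) into rows of cell strings.
// Assumption: items with the same y value belong to the same table row.
function groupItemsIntoRows(items) {
  const rowsByY = new Map();
  for (const item of items) {
    if (!item.text) continue; // skip page and metadata items
    if (!rowsByY.has(item.y)) rowsByY.set(item.y, []);
    rowsByY.get(item.y).push(item);
  }
  return [...rowsByY.entries()]
    .sort(([y1], [y2]) => y1 - y2) // rows in ascending y order
    .map(([, rowItems]) =>
      rowItems
        .sort((a, b) => a.x - b.x) // cells in ascending x order
        .map((cell) => cell.text)
    );
}

// Sample items shaped like the ones passed to the parseFileItems callback
const sampleItems = [
  { x: 10, y: 2, text: "30" },
  { x: 1, y: 1, text: "Name" },
  { x: 10, y: 1, text: "Age" },
  { x: 1, y: 2, text: "Alice" },
];

console.log(groupItemsIntoRows(sampleItems));
```

To use it with a real file, push every item that has a text property into an array inside the parseFileItems callback and call groupItemsIntoRows on that array once the callback receives no item. If the rows of your PDF do not line up exactly, you may need to round the y values before grouping.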

 

 

Parsing a password-protected PDF file

 

We can also parse the contents of a password-protected PDF file, as shown below

 

 

JavaScript
import { PdfReader } from "pdfreader";

// Pass the password of the PDF file to the PdfReader constructor
new PdfReader({ password: "YOUR_PASSWORD" }).parseFileItems(
  "test/sample-with-password.pdf",
  function (err, item) {
    if (err) console.error(err);
    else if (!item) console.warn("end of file"); // no item means parsing is done
    else if (item.text) console.log(item.text); // print each piece of text
  }
);

 

 

As you can see in the above code, we pass the password property to the PdfReader constructor and then call the parseFileItems() method to get the text contents of the PDF file, which we print on the command line.
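
Hard-coding the password in index.js is fine for a quick test, but you may prefer to read it from the environment. Here is a minimal sketch of that idea. The readerOptionsFromEnv function and the PDF_PASSWORD variable name are hypothetical and are not part of pdfreader. The function only builds the options object that is passed to the constructor.

```javascript
// Build the options object for new PdfReader(options).
// PDF_PASSWORD is a hypothetical variable name chosen for this example.
function readerOptionsFromEnv(env) {
  const password = env.PDF_PASSWORD;
  // Only set the password property when a non-empty value is present
  return password ? { password } : {};
}

console.log(readerOptionsFromEnv({ PDF_PASSWORD: "YOUR_PASSWORD" }));
console.log(readerOptionsFromEnv({}));
```

In index.js you would then write new PdfReader(readerOptionsFromEnv(process.env)) and start the script with the variable set, for example PDF_PASSWORD=YOUR_PASSWORD node index.js on Linux or macOS.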

 

 

Raw PDF reading from a PDF buffer

 

 

We can also read the PDF file into a buffer first and then parse that buffer, printing the text content in the terminal as shown below

 

 

JavaScript
import fs from "fs";
import { PdfReader } from "pdfreader";

fs.readFile("test/sample.pdf", (readErr, pdfBuffer) => {
  // Stop early if the file could not be read
  if (readErr) return console.error("error:", readErr);
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {
    if (err) console.error("error:", err);
    else if (!item) console.warn("end of buffer"); // no item means parsing is done
    else if (item.text) console.log(item.text); // print each piece of text
  });
});
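
Because the callback fires once per item, getting the whole text as a single string means waiting for the final call that has no item. The snippet below is a minimal sketch of one way to do that with a Promise. collectText, startParsing and fakeParse are hypothetical names made up for this example and are not part of pdfreader. The fake parser only imitates the callback pattern used above, so the snippet runs without a PDF file.

```javascript
// Collect every text item delivered to a pdfreader-style callback into one string.
// startParsing is a function that receives the (err, item) callback and starts parsing.
function collectText(startParsing) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    startParsing((err, item) => {
      if (err) reject(err);
      else if (!item) resolve(chunks.join("\n")); // no item: parsing is done
      else if (item.text) chunks.push(item.text); // keep text, ignore other items
    });
  });
}

// Stand-in for a real parser: calls the callback the same way the examples above expect
const fakeParse = (callback) => {
  callback(null, { text: "Hello" });
  callback(null, { page: 1 });
  callback(null, { text: "World" });
  callback(null, undefined);
};

collectText(fakeParse).then((text) => console.log(text));
```

With a real buffer you would call collectText((callback) => new PdfReader().parseBuffer(pdfBuffer, callback)) and use the resolved string once the promise settles.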

 
