PDFLoader

Compatibility

Only available on Node.js.

This notebook provides a quick overview for getting started with the PDFLoader document loader. For detailed documentation of all PDFLoader features and configurations, head to the API reference.

Overview​

Integration details​

| Class | Package | Compatibility | Local | PY support |
| --- | --- | --- | --- | --- |
| PDFLoader | @langchain/community | Node-only | ✅ | 🟠 (See note below) |

Note: the Python package has many PDF loaders to choose from. See the Python documentation for a full list of Python document loaders.

Setup​

To use the PDFLoader document loader, you'll need to install the @langchain/community integration package along with the pdf-parse package.

Credentials​

Installation​

The LangChain PDFLoader integration lives in the @langchain/community package:

yarn add @langchain/community @langchain/core pdf-parse

Instantiation​

Now we can instantiate our loader object and load documents:

import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const nike10kPdfPath = "../../../../data/nke-10k-2023.pdf";

const loader = new PDFLoader(nike10kPdfPath);

Load​

const docs = await loader.load();
docs[0];
Document {
pageContent: 'Table of Contents\n' +
'UNITED STATES\n' +
'SECURITIES AND EXCHANGE COMMISSION\n' +
'Washington, D.C. 20549\n' +
'FORM 10-K\n' +
'(Mark One)\n' +
'☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n' +
'FOR THE FISCAL YEAR ENDED MAY 31, 2023\n' +
'OR\n' +
'☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n' +
'FOR THE TRANSITION PERIOD FROM TO .\n' +
'Commission File No. 1-10635\n' +
'NIKE, Inc.\n' +
'(Exact name of Registrant as specified in its charter)\n' +
'Oregon93-0584541\n' +
'(State or other jurisdiction of incorporation)(IRS Employer Identification No.)\n' +
'One Bowerman Drive, Beaverton, Oregon 97005-6453\n' +
'(Address of principal executive offices and zip code)\n' +
'(503) 671-6453\n' +
"(Registrant's telephone number, including area code)\n" +
'SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:\n' +
'Class B Common StockNKENew York Stock Exchange\n' +
'(Title of each class)(Trading symbol)(Name of each exchange on which registered)\n' +
'SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:\n' +
'NONE\n' +
'Indicate by check mark:YESNO\n' +
'•if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.þ ̈\n' +
'•if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. ̈þ\n' +
'•whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding\n' +
'12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the\n' +
'past 90 days.\n' +
'þ ̈\n' +
'•whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T\n' +
'(§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files).\n' +
'þ ̈\n' +
'•whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company or an emerging growth company. See the definitions of “large accelerated filer,”\n' +
'“accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\n' +
'Large accelerated filerþAccelerated filer☐Non-accelerated filer☐Smaller reporting company☐Emerging growth company☐\n' +
'•if an emerging growth company, if the registrant has elected not to use the extended transition period for complying with any new or revised financial\n' +
'accounting standards provided pursuant to Section 13(a) of the Exchange Act.\n' +
' ̈\n' +
"•whether the registrant has filed a report on and attestation to its management's assessment of the effectiveness of its internal control over financial\n" +
'reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit\n' +
'report.\n' +
'þ\n' +
'•if securities are registered pursuant to Section 12(b) of the Act, whether the financial statements of the registrant included in the filing reflect the\n' +
'correction of an error to previously issued financial statements.\n' +
' ̈\n' +
'•whether any of those error corrections are restatements that required a recovery analysis of incentive-based compensation received by any of the\n' +
"registrant's executive officers during the relevant recovery period pursuant to § 240.10D-1(b).\n" +
' ̈\n' +
'•\n' +
'whether the registrant is a shell company (as defined in Rule 12b-2 of the Act).☐þ\n' +
"As of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:\n" +
'Class A$7,831,564,572 \n' +
'Class B136,467,702,472 \n' +
'$144,299,267,044 ',
metadata: {
source: '../../../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: [Object],
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
},
id: undefined
}
console.log(docs[0].metadata);
{
source: '../../../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
}
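
The output above suggests that each returned document carries its page number and the file's total page count in `metadata`. As a minimal sketch of how those fields might be used, the helper below builds a short per-page summary. `LoadedPdfPage` and `summarizePages` are hypothetical names introduced here for illustration, not part of LangChain; the type only mirrors the fields visible in the printed output, and `metadata` is kept loosely typed.

```typescript
// Hypothetical structural type mirroring the fields printed above.
interface LoadedPdfPage {
  pageContent: string;
  metadata: Record<string, any>;
}

interface PageSummary {
  page: number | undefined;
  totalPages: number | undefined;
  characters: number;
  firstLine: string;
}

// Build a compact summary per document without modifying the originals.
function summarizePages(docs: LoadedPdfPage[]): PageSummary[] {
  return docs.map((doc) => ({
    page: doc.metadata.loc?.pageNumber,
    totalPages: doc.metadata.pdf?.totalPages,
    characters: doc.pageContent.length,
    firstLine: (doc.pageContent.split("\n", 1)[0] ?? "").trim(),
  }));
}

// Stand-in data shaped like the output above; with the loader you would
// pass the result of `await loader.load()` instead.
const samplePages: LoadedPdfPage[] = [
  {
    pageContent: "Table of Contents\nUNITED STATES",
    metadata: {
      source: "example.pdf",
      pdf: { totalPages: 107 },
      loc: { pageNumber: 1 },
    },
  },
];

console.log(summarizePages(samplePages));
```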

Usage, one document per file​

import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const singleDocPerFileLoader = new PDFLoader(nike10kPdfPath, {
  splitPages: false,
});

const singleDoc = await singleDocPerFileLoader.load();
console.log(singleDoc[0].pageContent.slice(0, 100));
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
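
If documents were already loaded one per page, a similar single-text result can be approximated by merging them by hand. The sketch below is an illustration under stated assumptions: `PerPageDoc` and `mergePages` are hypothetical names, and the blank-line separator is chosen for this example rather than taken from the loader.

```typescript
// Hypothetical structural type for documents loaded one per page.
interface PerPageDoc {
  pageContent: string;
  metadata: Record<string, any>;
}

// Sort a copy of the pages by page number (when present), then join the text.
// The default separator is an assumption made for this example.
function mergePages(pages: PerPageDoc[], separator = "\n\n"): string {
  return [...pages]
    .sort(
      (a, b) =>
        (a.metadata.loc?.pageNumber ?? 0) - (b.metadata.loc?.pageNumber ?? 0)
    )
    .map((page) => page.pageContent)
    .join(separator);
}

// Stand-in pages, deliberately out of order.
const unorderedPages: PerPageDoc[] = [
  { pageContent: "second page", metadata: { loc: { pageNumber: 2 } } },
  { pageContent: "first page", metadata: { loc: { pageNumber: 1 } } },
];

console.log(mergePages(unorderedPages));
```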

Usage, custom pdfjs build​

By default, the loader uses the pdfjs build bundled with pdf-parse, which is compatible with most environments, including Node.js and modern browsers. To use a more recent version or a custom build of pdfjs-dist, provide a custom pdfjs function that returns a promise resolving to the PDFJS object.

In the following example we use the “legacy” build of pdfjs-dist (see the pdfjs docs), which includes several polyfills not included in the default build.

yarn add pdfjs-dist
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const customBuildLoader = new PDFLoader(nike10kPdfPath, {
  // you may need to add `.then(m => m.default)` to the end of the import
  pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});
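
The comment in the snippet above notes that the dynamic import may need `.then(m => m.default)`. Whether it does depends on how your bundler or runtime exposes the module. One way to cover both cases is a small normalizing helper, sketched below; `unwrapDefault` is a hypothetical name introduced here and is not part of LangChain or pdfjs-dist.

```typescript
// Hypothetical helper: return `mod.default` when the module object exposes
// a defined default export, otherwise return the module object itself.
function unwrapDefault<T>(mod: unknown): T {
  if (mod !== null && typeof mod === "object" && "default" in mod) {
    const value = (mod as { default?: T }).default;
    if (value !== undefined) {
      return value;
    }
  }
  return mod as T;
}

// Stand-in module objects illustrating the two shapes an import may resolve to.
type FakePdfjs = { version: string };
const withDefault = { default: { version: "with-default" } };
const withoutDefault = { version: "without-default" };

console.log(unwrapDefault<FakePdfjs>(withDefault).version);
console.log(unwrapDefault<FakePdfjs>(withoutDefault).version);
```

Under these assumptions, the loader option could then be written as `pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js").then((m) => unwrapDefault(m))`.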

Eliminating extra spaces​

PDFs come in many varieties, which makes reading them a challenge. By default, the loader parses individual text elements and joins them with a space. If you see excessive spaces in the output, you can override the separator with an empty string:

import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const noExtraSpacesLoader = new PDFLoader(nike10kPdfPath, {
  parsedItemSeparator: "",
});

const noExtraSpacesDocs = await noExtraSpacesLoader.load();
console.log(noExtraSpacesDocs[0].pageContent.slice(100, 250));
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE FISCAL YEAR ENDED MAY 31, 2023
OR
☐ TRANSITI
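
To make the effect of the separator concrete, here is a simplified illustration of joining text fragments with and without a space. It is not the loader's actual code; `joinTextItems` and the sample fragments are made up for this example.

```typescript
// Simplified illustration only: join extracted text fragments with a separator.
function joinTextItems(items: string[], separator = " "): string {
  return items.join(separator);
}

// Made-up fragments, as a PDF might split a line into several text elements.
const fragments = ["SECURI", "TIES AND EXCHANGE", "COMMISSION"];

console.log(joinTextItems(fragments));     // SECURI TIES AND EXCHANGE COMMISSION
console.log(joinTextItems(fragments, "")); // SECURITIES AND EXCHANGECOMMISSION
```

The second call shows the trade-off: an empty separator repairs words that were split across fragments, but it may also glue together neighbouring words when the fragments carry no spacing of their own.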

Loading directories​

import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const exampleDataPath =
  "../../../../../../examples/src/document_loaders/example_data/";

/* Load all PDFs within the specified directory */
const directoryLoader = new DirectoryLoader(exampleDataPath, {
  ".pdf": (path: string) => new PDFLoader(path),
});

const directoryDocs = await directoryLoader.load();

console.log(directoryDocs[0]);

/* Additional steps: Split text into chunks with any TextSplitter. You can then use it as context or save it to memory afterwards. */
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const splitDocs = await textSplitter.splitDocuments(directoryDocs);
console.log(splitDocs[0]);
Unknown file type: Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt
Unknown file type: example.txt
Unknown file type: notion.md
Unknown file type: bad_frontmatter.md
Unknown file type: frontmatter.md
Unknown file type: no_frontmatter.md
Unknown file type: no_metadata.md
Unknown file type: tags_and_frontmatter.md
Unknown file type: test.mp3
Document {
pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System\n' +
'Satoshi Nakamoto\n' +
'satoshin@gmx.com\n' +
'www.bitcoin.org\n' +
'Abstract. A purely peer-to-peer version of electronic cash would allow online \n' +
'payments to be sent directly from one party to another without going through a \n' +
'financial institution. Digital signatures provide part of the solution, but the main \n' +
'benefits are lost if a trusted third party is still required to prevent double-spending. \n' +
'We propose a solution to the double-spending problem using a peer-to-peer network. \n' +
'The network timestamps transactions by hashing them into an ongoing chain of \n' +
'hash-based proof-of-work, forming a record that cannot be changed without redoing \n' +
'the proof-of-work. The longest chain not only serves as proof of the sequence of \n' +
'events witnessed, but proof that it came from the largest pool of CPU power. As \n' +
'long as a majority of CPU power is controlled by nodes that are not cooperating to \n' +
"attack the network, they'll generate the longest chain and outpace attackers. The \n" +
'network itself requires minimal structure. Messages are broadcast on a best effort \n' +
'basis, and nodes can leave and rejoin the network at will, accepting the longest \n' +
'proof-of-work chain as proof of what happened while they were gone.\n' +
'1.Introduction\n' +
'Commerce on the Internet has come to rely almost exclusively on financial institutions serving as \n' +
'trusted third parties to process electronic payments. While the system works well enough for \n' +
'most transactions, it still suffers from the inherent weaknesses of the trust based model. \n' +
'Completely non-reversible transactions are not really possible, since financial institutions cannot \n' +
'avoid mediating disputes. The cost of mediation increases transaction costs, limiting the \n' +
'minimum practical transaction size and cutting off the possibility for small casual transactions, \n' +
'and there is a broader cost in the loss of ability to make non-reversible payments for non-\n' +
'reversible services. With the possibility of reversal, the need for trust spreads. Merchants must \n' +
'be wary of their customers, hassling them for more information than they would otherwise need. \n' +
'A certain percentage of fraud is accepted as unavoidable. These costs and payment uncertainties \n' +
'can be avoided in person by using physical currency, but no mechanism exists to make payments \n' +
'over a communications channel without a trusted party.\n' +
'What is needed is an electronic payment system based on cryptographic proof instead of trust, \n' +
'allowing any two willing parties to transact directly with each other without the need for a trusted \n' +
'third party. Transactions that are computationally impractical to reverse would protect sellers \n' +
'from fraud, and routine escrow mechanisms could easily be implemented to protect buyers. In \n' +
'this paper, we propose a solution to the double-spending problem using a peer-to-peer distributed \n' +
'timestamp server to generate computational proof of the chronological order of transactions. The \n' +
'system is secure as long as honest nodes collectively control more CPU power than any \n' +
'cooperating group of attacker nodes.\n' +
'1',
metadata: {
source: '/Users/bracesproul/code/lang-chain-ai/langchainjs/examples/src/document_loaders/example_data/bitcoin.pdf',
pdf: {
version: '1.10.100',
info: [Object],
metadata: null,
totalPages: 9
},
loc: { pageNumber: 1 }
},
id: undefined
}
Document {
pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System\n' +
'Satoshi Nakamoto\n' +
'satoshin@gmx.com\n' +
'www.bitcoin.org\n' +
'Abstract. A purely peer-to-peer version of electronic cash would allow online \n' +
'payments to be sent directly from one party to another without going through a \n' +
'financial institution. Digital signatures provide part of the solution, but the main \n' +
'benefits are lost if a trusted third party is still required to prevent double-spending. \n' +
'We propose a solution to the double-spending problem using a peer-to-peer network. \n' +
'The network timestamps transactions by hashing them into an ongoing chain of \n' +
'hash-based proof-of-work, forming a record that cannot be changed without redoing \n' +
'the proof-of-work. The longest chain not only serves as proof of the sequence of \n' +
'events witnessed, but proof that it came from the largest pool of CPU power. As \n' +
'long as a majority of CPU power is controlled by nodes that are not cooperating to',
metadata: {
source: '/Users/bracesproul/code/lang-chain-ai/langchainjs/examples/src/document_loaders/example_data/bitcoin.pdf',
pdf: {
version: '1.10.100',
info: [Object],
metadata: null,
totalPages: 9
},
loc: { pageNumber: 1, lines: [Object] }
},
id: undefined
}
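
When a directory contains several PDFs, the loaded or split documents can be traced back to their files through `metadata.source`, as the output above shows. A minimal sketch of counting documents per source file follows; `SourcedDoc` and `countBySource` are hypothetical names introduced for illustration.

```typescript
// Hypothetical structural type; only `metadata.source` is used here.
interface SourcedDoc {
  pageContent: string;
  metadata: Record<string, any>;
}

// Count how many documents (pages or chunks) came from each source path.
function countBySource(docs: SourcedDoc[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const doc of docs) {
    const source =
      typeof doc.metadata.source === "string"
        ? doc.metadata.source
        : "(unknown source)";
    counts.set(source, (counts.get(source) ?? 0) + 1);
  }
  return counts;
}

// Stand-in data; with the loader you would pass `directoryDocs` or `splitDocs`.
const sampleChunks: SourcedDoc[] = [
  { pageContent: "chunk 1", metadata: { source: "a.pdf" } },
  { pageContent: "chunk 2", metadata: { source: "a.pdf" } },
  { pageContent: "chunk 3", metadata: { source: "b.pdf" } },
];

for (const [source, count] of countBySource(sampleChunks)) {
  console.log(`${source}: ${count}`);
}
```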

API reference​

For detailed documentation of all PDFLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_pdf.PDFLoader.html

