How to scan a PDF for text using C#

reZach

27 Oct 2023 • 1 min read

A sample PDF that we will scan for text

You can scan a PDF for specific text in C# with very little code. This walkthrough uses .NET Core, and the only NuGet package you'll need to install is itext7.

Code

using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

namespace ScanTextInPDFs
{
    internal class Program
    {
        public static async Task Main(string[] args)
        {
            string executingDirectory = AppContext.BaseDirectory;

            // Path.Combine picks the correct directory separator on every platform.
            byte[] bytes = await File.ReadAllBytesAsync(Path.Combine(executingDirectory, "PDFs", "Brochure.pdf"));
            string textToFind = "Lorem ipsum";
            bool foundText = false;

            using (MemoryStream memoryStream = new MemoryStream(bytes))
            {
                using PdfReader pdfReader = new PdfReader(memoryStream);
                using PdfDocument pdfDocument = new PdfDocument(pdfReader);

                for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
                {
                    PdfPage pdfPage = pdfDocument.GetPage(page);
                    string pageText = PdfTextExtractor.GetTextFromPage(pdfPage, new SimpleTextExtractionStrategy());

                    if (pageText.Contains(textToFind, StringComparison.Ordinal))
                    {
                        foundText = true;
                        break; // no need to scan the remaining pages
                    }
                }
            }

            if (foundText)
                Console.WriteLine($"Found '{textToFind}' in the pdf.");
            else
                Console.WriteLine($"Did not find '{textToFind}' in the pdf.");
        }
    }
}

A simple .NET Core application that searches a PDF for specific text

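In addition, create a folder named "PDFs" in your project and place the PDF file in it. Because the code reads the file relative to AppContext.BaseDirectory (typically the build output folder), the file also needs to be copied to the output directory when the project builds.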

We have a PDFs folder in our project that contains a PDF file named "Brochure.pdf"

Explanation

We first get the directory the application is running from with AppContext.BaseDirectory. This makes it easier to build the path to the PDFs folder we created and to the PDF inside it.
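As a rough sketch of that path-building step, the helper below combines AppContext.BaseDirectory with the "PDFs" folder name from this post. The class name PdfPaths and its two methods are hypothetical and only here for illustration. Path.Combine is used so the separator is right on any operating system.

```csharp
using System;
using System.IO;

public static class PdfPaths
{
    // Builds the full path to a file inside the "PDFs" folder that sits
    // next to the running application. "PDFs" is the folder name used in
    // this post; the class and method names are hypothetical.
    public static string GetPdfPath(string fileName)
    {
        return Path.Combine(AppContext.BaseDirectory, "PDFs", fileName);
    }

    // Returns true when the file is actually present at that location,
    // which is a quick way to confirm it was copied to the output folder.
    public static bool PdfExists(string fileName)
    {
        return File.Exists(GetPdfPath(fileName));
    }
}
```

Calling PdfPaths.PdfExists("Brochure.pdf") before reading the bytes would give a clearer failure than the exception thrown when the file is missing.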

Next, using the itext7 library, we wrap the bytes of the PDF file in a MemoryStream, pass that stream to a PdfReader, and then create a PdfDocument from that PdfReader. The PdfDocument is made up of PdfPages; using the PdfTextExtractor, we pull the text out of each page and check whether it contains the text we're looking for.
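One thing to watch for is that extracted page text may contain line breaks or extra spaces where a phrase wraps in the PDF, so a plain Contains check can miss a match. The sketch below is one possible way to handle that, not part of the original solution. It assumes you have already collected the text of each page (for example from PdfTextExtractor.GetTextFromPage) into a sequence of strings. The class name PageSearch and its methods are hypothetical. It collapses whitespace, ignores case, and returns the page numbers that match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageSearch
{
    // Collapses every run of whitespace (spaces, tabs, line breaks)
    // into a single space and trims the ends.
    private static string NormalizeWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Returns the 1-based numbers of the pages whose text contains
    // textToFind, ignoring case and differences in whitespace.
    // pageTexts is assumed to hold one string per page, in page order.
    public static List<int> FindPages(IEnumerable<string> pageTexts, string textToFind)
    {
        string needle = NormalizeWhitespace(textToFind);
        var matches = new List<int>();
        int pageNumber = 0;

        foreach (string pageText in pageTexts)
        {
            pageNumber++;
            string haystack = NormalizeWhitespace(pageText);
            if (haystack.Contains(needle, StringComparison.OrdinalIgnoreCase))
            {
                matches.Add(pageNumber);
            }
        }

        return matches;
    }
}
```

Returning page numbers instead of a single bool means every page is scanned, which is a trade-off: slower on large documents, but it tells you where the text appears.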

GitHub

You can view the whole solution here.