How to scan a PDF for text using C#

You can easily scan PDFs for specific text in C#; here's how you might do that. We'll use .NET Core as our framework, and the only NuGet package you'll need to install is itext7.
Code
using System;
using System.IO;
using System.Threading.Tasks;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

namespace ScanTextInPDFs
{
    internal class Program
    {
        public static async Task Main(string[] args)
        {
            // Build the path to the PDF relative to the running executable.
            // Path.Combine uses the correct separator on every operating system.
            string executingDirectory = AppContext.BaseDirectory;
            string pdfPath = Path.Combine(executingDirectory, "PDFs", "Brochure.pdf");
            byte[] bytes = await File.ReadAllBytesAsync(pdfPath);

            string textToFind = "Lorem ipsum";
            bool foundText = false;

            using (MemoryStream memoryStream = new MemoryStream(bytes))
            {
                using PdfReader pdfReader = new PdfReader(memoryStream);
                using PdfDocument pdfDocument = new PdfDocument(pdfReader);

                // iText page numbers start at 1.
                for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
                {
                    PdfPage pdfPage = pdfDocument.GetPage(page);
                    string pageText = PdfTextExtractor.GetTextFromPage(pdfPage, new SimpleTextExtractionStrategy());

                    // Case-sensitive match; stop at the first page that contains the text.
                    if (pageText.Contains(textToFind, StringComparison.Ordinal))
                    {
                        foundText = true;
                        break;
                    }
                }
            }

            if (foundText)
                Console.WriteLine($"Found '{textToFind}' in the pdf.");
            else
                Console.WriteLine($"Did not find '{textToFind}' in the pdf.");
        }
    }
}
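A simple .NET Core application that scans a PDF for specific text
In addition, you'll want to create a folder named "PDFs" in your project and place the file there. Because the code looks for the file relative to AppContext.BaseDirectory, the file also has to be copied to the build output directory (for example, by setting its "Copy to Output Directory" property in Visual Studio).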

Explanation
We first get the directory our application is running from with AppContext.BaseDirectory. This makes it easier to build the path to the PDFs folder we created and the PDF inside it.
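As a small sketch of that path lookup, the helper below builds the path with Path.Combine, which picks the right directory separator for the operating system, and fails with a clear message if the file is missing from the output folder. The names PdfPathResolver and Resolve are hypothetical and not part of the original program.

```csharp
using System;
using System.IO;

public static class PdfPathResolver
{
    // Hypothetical helper: returns the full path to a file inside the
    // "PDFs" folder that sits next to the running executable.
    public static string Resolve(string fileName)
    {
        string path = Path.Combine(AppContext.BaseDirectory, "PDFs", fileName);

        // Fail early if the file was not copied to the output folder.
        if (!File.Exists(path))
            throw new FileNotFoundException($"Expected a PDF at '{path}'.", path);

        return path;
    }
}
```

Calling PdfPathResolver.Resolve("Brochure.pdf") would then give a path you can pass to File.ReadAllBytesAsync.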
Next, using the itext7 library, we read the bytes of the PDF file into a PdfReader, and then create a PdfDocument from that PdfReader. The PdfDocument has PdfPages; using the PdfTextExtractor, we pull the text out of each page and check whether it contains the text we're looking for.
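If you also want to know where the text occurs, one possible variation is sketched below. It opens the PDF straight from a file path instead of a MemoryStream, collects the text of each page, and returns the 1-based numbers of the pages that match, ignoring case. The class and method names (PdfTextSearch, ExtractPageTexts, FindPages) are made up for this example; the iText calls are the same ones used in the program above, plus the PdfReader constructor that takes a file name.

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfTextSearch
{
    // Hypothetical helper: extracts the text of every page, in page order.
    public static List<string> ExtractPageTexts(string pdfPath)
    {
        var pageTexts = new List<string>();

        using var pdfReader = new PdfReader(pdfPath);
        using var pdfDocument = new PdfDocument(pdfReader);

        for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
        {
            // A fresh strategy per page keeps text from accumulating across pages.
            pageTexts.Add(PdfTextExtractor.GetTextFromPage(
                pdfDocument.GetPage(page), new SimpleTextExtractionStrategy()));
        }

        return pageTexts;
    }

    // Hypothetical helper: returns the 1-based page numbers whose text
    // contains textToFind, compared without regard to case.
    public static List<int> FindPages(IReadOnlyList<string> pageTexts, string textToFind)
    {
        var matchingPages = new List<int>();

        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (pageTexts[i].Contains(textToFind, StringComparison.OrdinalIgnoreCase))
                matchingPages.Add(i + 1);
        }

        return matchingPages;
    }
}
```

For example, PdfTextSearch.FindPages(PdfTextSearch.ExtractPageTexts(pdfPath), "Lorem ipsum") would list every page containing the phrase. Keep in mind that a phrase split across a line break or a page boundary may not match, and scanned pages that contain only images yield no text to search.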
GitHub
You can view the whole solution here.