Converting the contents of a PDF into structured, searchable text is a common but deceptively complex task. Traditional OCR engines struggle with mixed layouts, low-contrast scans, and non-standard fonts. By combining Ruby, MiniMagick, and OpenAI’s GPT-4o-mini, you can achieve far higher accuracy and preserve the original document structure—headings, lists, tables, and all.
This article presents a step-by-step guide to creating a production-ready pipeline that:
- Splits a PDF into high-resolution images (one per page).
- Transcribes each image with GPT-4o-mini, using a carefully crafted prompt that enforces Markdown structure.
- Embeds the cleaned text for semantic search or retrieval-augmented generation (RAG) workflows.
## 1. Pipeline at a Glance
| Stage | Technology | Purpose |
|---|---|---|
| Page-to-Image | MiniMagick (`convert`) | Render each PDF page as a 300 DPI PNG |
| OCR & Rewrite | GPT-4o-mini | Detect and rewrite all visible text, retaining layout |
| Embedding | OpenAI Embeddings API / pgvector | Generate fixed-length vectors for fast semantic lookup |
## 2. Prerequisites
- Ruby 3.2+ (Bundler recommended)
- ImageMagick 6 with PDF reading enabled
- The `mini_magick` and `openai` gems
- An OpenAI API key with access to GPT-4o-mini
- (Optional) A vector database such as PostgreSQL + pgvector or Pinecone
Install the core gems:
```shell
bundle add mini_magick
bundle add openai
```
## 3. Enable PDF Reading in ImageMagick
Most Linux distributions disable PDF processing in ImageMagick for security reasons. Confirm—or add—the following line in /etc/ImageMagick-6/policy.xml:
```xml
<policy domain="coder" rights="read|write" pattern="PDF" />
```
Verify it:
```shell
grep PDF /etc/ImageMagick-6/policy.xml
# <policy domain="coder" rights="read|write" pattern="PDF" />
```
## 4. Converting PDF Pages to Images
Create a small service object that converts every page to a separate PNG at 300 DPI. A higher resolution dramatically improves GPT’s accuracy on small fonts and superscripts.
```ruby
# lib/pdf_to_images.rb
# frozen_string_literal: true

require "mini_magick"
require "fileutils"

class PdfToImages
  DENSITY = 300 # DPI

  # @param pdf_path [String] Absolute path to the PDF file
  # @param output_dir [String] Directory for generated images
  # @return [Array<String>] Sorted list of image file paths
  def self.call(pdf_path:, output_dir:)
    FileUtils.mkdir_p(output_dir)

    MiniMagick::Tool::Convert.new do |convert|
      convert.density(DENSITY)
      convert.background("white")
      convert.flatten
      convert << pdf_path
      convert << File.join(output_dir, "page-%03d.png")
    end

    Dir.glob(File.join(output_dir, "page-*.png")).sort
  end
end
```
## 5. Prompt Engineering: Getting Markdown You Can Trust
GPT models are sensitive to prompt clarity. The following prompt consistently yields Markdown that mirrors the source layout while avoiding code fences, making the text immediately storable or indexable.
💬 Recommended Prompt
```text
You are an expert document transcriber.

**Objective**
Transcribe *exactly* what you see in the PDF page image. Preserve wording, punctuation, and structure.

**Instructions**
1. Detect every text element—headings, paragraphs, lists, tables, footnotes.
2. Rewrite in GitHub-flavoured Markdown:
   - `#` Main heading, `##` Subheading, `###` Sub-subheading
   - Unordered list → `-`, ordered list → `1.` `2.` …
   - Tables → `| col1 | col2 |` with `|---|---|` separators
3. Special cases:
   - **Bold** → `**bold**`, *italic* → `*italic*`
   - Blockquotes → `>`
   - Unclear text → `[unclear text]`
4. Output **plain text only** (no code fences).
5. Do not translate mixed-language content.

Context (do not include in output):
- **Document title**: <Title here>
- **Abstract**: <Short description here>

Begin.
```
## 6. Embedding the Transcribed Output
Once GPT returns the Markdown text for a page, generate embeddings:
```ruby
require "openai"

client = OpenAI::Client.new

embedding = client.embeddings(
  parameters: {
    model: "text-embedding-3-large",
    input: markdown_text
  }
).dig("data", 0, "embedding")
```
Store the resulting vector alongside the raw Markdown, typically in a pgvector column or an external vector store.
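As a minimal sketch of that persistence step, the following stores a page with the `pg` gem and pgvector. The table name, column names, and schema here are assumptions, not part of the pipeline above; pgvector accepts vectors as `[x,y,z]` text literals, which is what `to_pgvector` produces.

```ruby
# Assumed schema (run once):
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE pdf_pages (
#     id          bigserial PRIMARY KEY,
#     page_number integer,
#     content     text,
#     embedding   vector(3072)  -- text-embedding-3-large returns 3072 dims
#   );

# Formats a Ruby float array as a pgvector input literal, e.g. "[0.1,0.2,0.3]".
def to_pgvector(embedding)
  "[#{embedding.join(',')}]"
end

# conn is a PG::Connection (require "pg" and PG.connect(...) in real use).
def store_page(conn, page_number, markdown, embedding)
  conn.exec_params(
    "INSERT INTO pdf_pages (page_number, content, embedding) VALUES ($1, $2, $3)",
    [page_number, markdown, to_pgvector(embedding)]
  )
end
```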
## 7. End-to-End Orchestration Script
A minimal orchestrator ties the pieces together:
```ruby
# scripts/pdf_pipeline.rb
require_relative "../lib/pdf_to_images"
require "openai"
require "base64"

pdf_path   = ARGV[0] || "whitepaper.pdf"
output_dir = "tmp/pdf_images"
prompt     = File.read("prompts/pdf_ocr_prompt.md")

# 1. Render images
pages = PdfToImages.call(pdf_path: pdf_path, output_dir: output_dir)

# 2. OCR each image with GPT-4o-mini
client = OpenAI::Client.new

markdown_pages = pages.map do |img|
  # The vision API expects a URL or a base64 data URI, not a local file path.
  image_data = Base64.strict_encode64(File.binread(img))

  client.chat(
    parameters: {
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "You are a precise PDF transcriber." },
        {
          role: "user",
          content: [
            { type: "text", text: prompt },
            { type: "image_url",
              image_url: { url: "data:image/png;base64,#{image_data}" } }
          ]
        }
      ]
    }
  ).dig("choices", 0, "message", "content")
end

# 3. Embed each page
markdown_pages.each do |md|
  vector = client.embeddings(
    parameters: { model: "text-embedding-3-large", input: md }
  ).dig("data", 0, "embedding")

  # Persist both md and vector to the database...
end
```
Run the script:
```shell
ruby scripts/pdf_pipeline.rb path/to/document.pdf
```
## 8. Troubleshooting
| Problem | Root Cause | Solution |
|---|---|---|
| `attempt to perform an operation not allowed by the security policy` | PDF disabled in ImageMagick | Confirm the `<policy>` entry in `policy.xml` |
| Blurry or pixelated text | Default 72 DPI render | Use `-density 300` (or higher) |
| GPT response cuts off mid-sentence | Token limit exceeded | Chunk the page or raise `max_tokens` |
| Vector size mismatch in DB | Embedding model inconsistency | Embed all pages with the same model ID |
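The chunking fix for truncated responses can be sketched as a splitter that breaks on paragraph boundaries so Markdown structure survives. The character budget is an assumption; tune it to your model's token limit.

```ruby
# Splits markdown into chunks of at most max_chars characters, breaking only
# on blank lines (paragraph boundaries) so headings, lists, and tables stay
# intact. A single paragraph longer than max_chars is emitted whole. The
# 4_000-character default is an assumption -- adjust to your token budget.
def chunk_markdown(markdown, max_chars: 4_000)
  chunks  = []
  current = +""

  markdown.split(/\n{2,}/).each do |para|
    if !current.empty? && current.length + para.length + 2 > max_chars
      chunks << current.strip
      current = +""
    end
    current << para << "\n\n"
  end

  chunks << current.strip unless current.strip.empty?
  chunks
end
```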
## 9. Where to Go from Here
- Parallel processing: Off-load each page to Sidekiq, Resque, or `Parallel` to speed up large batches.
- Post-processing: Normalize heading levels, trim blank lines, or merge tables across page boundaries.
- User-facing search: Combine pgvector with `pg_trgm` to support both semantic and keyword queries.
- RAG chatbots: Feed the embedded corpus into a retrieval-augmented generation pipeline for question-answering or summarisation.
## 10. Conclusion
By integrating a high-quality rendering step, a GPT-powered transcription prompt, and robust embedding, you can transform even the most complex PDFs into machine-readable assets without sacrificing layout fidelity. The resulting pipeline is flexible—swap GPT-4o-mini for a local model, choose any vector store, or extend the prompt rules as your use case evolves.