Converting the contents of a PDF into structured, searchable text is a common but deceptively complex task. Traditional OCR engines struggle with mixed layouts, low-contrast scans, and non-standard fonts. By combining Ruby, MiniMagick, and OpenAI’s GPT-4o-mini, you can achieve far higher accuracy and preserve the original document structure—headings, lists, tables, and all.
This article presents a step-by-step guide to creating a production-ready pipeline that:
- Splits a PDF into high-resolution images (one per page).
- Transcribes each image with GPT-4o-mini, using a carefully crafted prompt that enforces Markdown structure.
- Embeds the cleaned text for semantic search or retrieval-augmented generation (RAG) workflows.
## 1. Pipeline at a Glance
| Stage | Technology | Purpose |
|---|---|---|
| Page-to-Image | MiniMagick (`convert`) | Render each PDF page as a 300 DPI PNG |
| OCR & Rewrite | GPT-4o-mini | Detect and rewrite all visible text, retaining layout |
| Embedding | OpenAI Embeddings API / pgvector | Generate fixed-length vectors for fast semantic lookup |
## 2. Prerequisites
- Ruby 3.2+ (Bundler recommended)
- ImageMagick 6 with PDF reading enabled
- The `mini_magick` and `openai` gems
- An OpenAI API key with access to GPT-4o-mini
- (Optional) A vector database such as PostgreSQL + pgvector or Pinecone
Install the core gems:
```shell
bundle add mini_magick
bundle add openai
```
## 3. Enable PDF Reading in ImageMagick
Most Linux distributions disable PDF processing in ImageMagick for security reasons. Confirm—or add—the following line in /etc/ImageMagick-6/policy.xml:
```xml
<policy domain="coder" rights="read|write" pattern="PDF" />
```
Verify it:
```shell
grep PDF /etc/ImageMagick-6/policy.xml
# <policy domain="coder" rights="read|write" pattern="PDF" />
```
## 4. Converting PDF Pages to Images
Create a small service object that converts every page to a separate PNG at 300 DPI. A higher resolution dramatically improves GPT’s accuracy on small fonts and superscripts.
```ruby
# lib/pdf_to_images.rb
# frozen_string_literal: true

require "mini_magick"
require "fileutils"

class PdfToImages
  DENSITY = 300 # DPI

  # @param pdf_path [String] Absolute path to the PDF file
  # @param output_dir [String] Directory for generated images
  # @return [Array<String>] Sorted list of image file paths
  def self.call(pdf_path:, output_dir:)
    FileUtils.mkdir_p(output_dir)

    MiniMagick::Tool::Convert.new do |convert|
      convert.density(DENSITY)
      convert.background("white")
      convert.flatten
      convert << pdf_path
      convert << File.join(output_dir, "page-%03d.png")
    end

    Dir.glob(File.join(output_dir, "page-*.png")).sort
  end
end
```
## 5. Prompt Engineering: Getting Markdown You Can Trust
GPT models are sensitive to prompt clarity. The following prompt consistently yields Markdown that mirrors the source layout while avoiding code fences, making the text immediately storable or indexable.
💬 Recommended Prompt
```text
You are an expert document transcriber.

**Objective**
Transcribe *exactly* what you see in the PDF page image. Preserve wording, punctuation, and structure.

**Instructions**
1. Detect every text element—headings, paragraphs, lists, tables, footnotes.
2. Rewrite in GitHub-flavoured Markdown:
   - `#` Main heading, `##` Subheading, `###` Sub-subheading
   - Unordered list → `-`, ordered list → `1.` `2.` …
   - Tables → `| col1 | col2 |` with `|---|---|` separators
3. Special cases:
   - **Bold** → `**bold**`, *italic* → `*italic*`
   - Blockquotes → `>`
   - Unclear text → `[unclear text]`
4. Output **plain text only** (no code fences).
5. Do not translate mixed-language content.

Context (do not include in output):
- **Document title**: <Title here>
- **Abstract**: <Short description here>

Begin.
```
## 6. Embedding the Transcribed Output
Once GPT returns the Markdown text for a page, generate embeddings:
```ruby
require "openai"

client = OpenAI::Client.new

embedding = client.embeddings(
  parameters: {
    model: "text-embedding-3-large",
    input: markdown_text
  }
).dig("data", 0, "embedding")
```
Store the resulting vector alongside the raw Markdown, typically in a pgvector column or an external vector store.
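As a minimal sketch of that persistence step, the following stores a page with the `pg` gem and pgvector. The table name, column names, and schema here are assumptions, not part of the pipeline above; pgvector accepts vectors as `[x,y,z]` text literals, which is what `to_pgvector` produces.

```ruby
# Assumed schema (run once):
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE pdf_pages (
#     id          bigserial PRIMARY KEY,
#     page_number integer,
#     content     text,
#     embedding   vector(3072)  -- text-embedding-3-large returns 3072 dims
#   );

# Formats a Ruby float array as a pgvector input literal, e.g. "[0.1,0.2,0.3]".
def to_pgvector(embedding)
  "[#{embedding.join(',')}]"
end

# conn is a PG::Connection (require "pg" and PG.connect(...) in real use).
def store_page(conn, page_number, markdown, embedding)
  conn.exec_params(
    "INSERT INTO pdf_pages (page_number, content, embedding) VALUES ($1, $2, $3)",
    [page_number, markdown, to_pgvector(embedding)]
  )
end
```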
## 7. End-to-End Orchestration Script
A minimal orchestrator ties the pieces together:
```ruby
# scripts/pdf_pipeline.rb
require_relative "../lib/pdf_to_images"
require "openai"
require "base64"

pdf_path   = ARGV[0] || "whitepaper.pdf"
output_dir = "tmp/pdf_images"
prompt     = File.read("prompts/pdf_ocr_prompt.md")

# 1. Render images
pages = PdfToImages.call(pdf_path: pdf_path, output_dir: output_dir)

# 2. OCR each image with GPT-4o-mini
client = OpenAI::Client.new

markdown_pages = pages.map do |img|
  # The vision API expects a URL or a base64 data URI, not a local file path.
  image_data = Base64.strict_encode64(File.binread(img))

  client.chat(
    parameters: {
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "You are a precise PDF transcriber." },
        {
          role: "user",
          content: [
            { type: "text", text: prompt },
            { type: "image_url",
              image_url: { url: "data:image/png;base64,#{image_data}" } }
          ]
        }
      ]
    }
  ).dig("choices", 0, "message", "content")
end

# 3. Embed each page
markdown_pages.each do |md|
  vector = client.embeddings(
    parameters: { model: "text-embedding-3-large", input: md }
  ).dig("data", 0, "embedding")

  # Persist both md and vector to the database...
end
```
Run the script:
```shell
ruby scripts/pdf_pipeline.rb path/to/document.pdf
```
## 8. Troubleshooting
| Problem | Root Cause | Solution |
|---|---|---|
| `attempt to perform an operation not allowed by the security policy` | PDF disabled in ImageMagick | Confirm the `<policy>` entry in `policy.xml` |
| Blurry or pixelated text | Default 72 DPI render | Use `-density 300` (or higher) |
| GPT response cuts off mid-sentence | Token limit exceeded | Chunk the page or raise `max_tokens` |
| Vector size mismatch in DB | Embedding model inconsistency | Embed all pages with the same model ID |
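The chunking fix for truncated responses can be sketched as a splitter that breaks on paragraph boundaries so Markdown structure survives. The character budget is an assumption; tune it to your model's token limit.

```ruby
# Splits markdown into chunks of at most max_chars characters, breaking only
# on blank lines (paragraph boundaries) so headings, lists, and tables stay
# intact. A single paragraph longer than max_chars is emitted whole. The
# 4_000-character default is an assumption -- adjust to your token budget.
def chunk_markdown(markdown, max_chars: 4_000)
  chunks  = []
  current = +""

  markdown.split(/\n{2,}/).each do |para|
    if !current.empty? && current.length + para.length + 2 > max_chars
      chunks << current.strip
      current = +""
    end
    current << para << "\n\n"
  end

  chunks << current.strip unless current.strip.empty?
  chunks
end
```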
## 9. Where to Go from Here
- Parallel processing: Off-load each page to Sidekiq, Resque, or `Parallel` to speed up large batches.
- Post-processing: Normalize heading levels, trim blank lines, or merge tables across page boundaries.
- User-facing search: Combine pgvector with `pg_trgm` to support both semantic and keyword queries.
- RAG chatbots: Feed the embedded corpus into a retrieval-augmented generation pipeline for question-answering or summarisation.
## 10. Conclusion
By integrating a high-quality rendering step, a GPT-powered transcription prompt, and robust embedding, you can transform even the most complex PDFs into machine-readable assets without sacrificing layout fidelity. The resulting pipeline is flexible—swap GPT-4o-mini for a local model, choose any vector store, or extend the prompt rules as your use case evolves.