Extracting Images and Adjacent Text from PDFs Using Fitz

What will you learn?

In this tutorial, you will learn how to extract images, together with the text next to them, from PDF files using PyMuPDF, the Python library imported as fitz. The technique is useful wherever figures need to be processed along with their captions or descriptions.

Introduction to Problem and Solution

PDF documents often mix text and images, and important information frequently sits in the captions or text next to an image. Extracting both programmatically takes some work, but it is valuable for data analysis, archiving, and machine learning. The aim here is not just to isolate the images but also to capture any text positioned directly beside them.

To do this, we use PyMuPDF, a Python library for working with PDFs whose import name is fitz. The script visits each page of a PDF file, finds the images and the text blocks on that page, and collects the text blocks that lie next to each image. This means analyzing the page layout: comparing bounding boxes to work out the spatial relationships between pieces of content.
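The idea of a "neighboring" block can be sketched without any PDF library. The example below is a minimal illustration, assuming rectangles are plain (x0, y0, x1, y1) tuples in page points; the helper names and the 20-point margin are hypothetical choices for this sketch, not part of PyMuPDF.

```python
# Rectangles are plain (x0, y0, x1, y1) tuples in page coordinates (points).

def expand(rect, margin):
    """Return rect grown by margin on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)

def overlaps(a, b):
    """True if rectangles a and b share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def is_adjacent(image_rect, block_rect, margin=20):
    """True if block_rect lies within margin points of image_rect."""
    return overlaps(expand(image_rect, margin), block_rect)

# Invented coordinates for illustration only.
image = (100, 100, 300, 250)
caption = (100, 260, 300, 280)        # 10 points below the image
far_paragraph = (100, 500, 300, 560)  # much further down the page

print(is_adjacent(image, caption))        # True
print(is_adjacent(image, far_paragraph))  # False
print(overlaps(image, caption))           # False: close, but not overlapping
```

The last line shows why a plain overlap test is not enough for captions: a block can sit right under an image without intersecting it, so the image rectangle is grown by a margin first.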

Code

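import fitz  # Import the PyMuPDF library

def extract_images_and_text(pdf_path, margin=20):
    # margin (in points) is an assumed default; tune it to your layout
    doc = fitz.open(pdf_path)  # Open the provided PDF path

    for page_number in range(len(doc)):  # Iterate through each page
        page = doc.load_page(page_number)
        image_list = page.get_images(full=True)

        if image_list:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type);
            # block_type 0 is text, 1 is an image
            block_list = page.get_text("blocks")

            for img_index, img_info in enumerate(image_list):
                xref = img_info[0]

                # extract_image() returns image bytes and metadata but no
                # position, so ask the page where the image is drawn.
                # The same image may be placed more than once on a page.
                for img_rect in page.get_image_rects(xref):
                    # Grow the box so text next to the image also counts,
                    # not only text that overlaps it
                    search_rect = fitz.Rect(
                        img_rect.x0 - margin, img_rect.y0 - margin,
                        img_rect.x1 + margin, img_rect.y1 + margin,
                    )

                    adjacent_texts = []

                    for block in block_list:
                        if block[6] != 0:  # Skip any image blocks
                            continue

                        block_rect = fitz.Rect(block[:4])  # Block's bounding box

                        if block_rect.intersects(search_rect):  # Text is near/overlapping image
                            adjacent_texts.append(block[4].strip())

                    print(f"Image {img_index + 1} on Page {page_number + 1}:")
                    print("\n".join(adjacent_texts))

    doc.close()

extract_images_and_text("path/to/your/document.pdf")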

Explanation

Understanding How This Works:

  • Opening the Document: Open the target PDF with fitz.open(), which gives access to its pages.
  • Iterating Through Pages: Loop over the pages, since images and text are located page by page.
  • Getting Images: Call get_images(full=True) on each page to list the images it uses.
  • Extracting Text Blocks: Call get_text("blocks") to get the page's blocks; each is a tuple holding a bounding box, the block's text, and a type flag (0 for text, 1 for an image).
  • Identifying Adjacent Text: Compare each text block's rectangle (fitz.Rect) with the image's bounding box on the page. The bounding box must come from the page, for example via page.get_image_rects(xref), because the dictionary returned by doc.extract_image() holds the image bytes and metadata, not the image's position.
  • Outputting Results: Print the text found alongside each image, labeled with the image index and page number.
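One detail in the steps above is easy to miss: depending on the PyMuPDF version and extraction flags, the list returned by get_text("blocks") may contain image blocks as well as text blocks. The sketch below separates the two. It assumes the tuple layout (x0, y0, x1, y1, text, block_no, block_type) described in the PyMuPDF documentation; split_blocks is a hypothetical helper, and the sample data is hand-written rather than taken from a real PDF.

```python
def split_blocks(blocks):
    """Separate get_text("blocks")-style tuples into text and image blocks.

    Assumed layout per tuple: (x0, y0, x1, y1, text, block_no, block_type),
    with block_type 0 for text and 1 for images.
    """
    text_blocks = [b for b in blocks if b[6] == 0]
    image_blocks = [b for b in blocks if b[6] == 1]
    return text_blocks, image_blocks

# Hand-written sample standing in for page.get_text("blocks") output.
sample_blocks = [
    (72.0, 72.0, 300.0, 90.0, "Figure 1: Sales by region\n", 0, 0),
    (72.0, 100.0, 300.0, 250.0, "<image: DeviceRGB, width: 640, height: 480, bpc: 8>", 1, 1),
    (72.0, 260.0, 300.0, 300.0, "Sales rose in every region.\n", 2, 0),
]

text_blocks, image_blocks = split_blocks(sample_blocks)

# Print each text block's number, bounding box, and cleaned-up text.
for x0, y0, x1, y1, text, block_no, block_type in text_blocks:
    print(block_no, (x0, y0, x1, y1), text.strip())
```

Filtering on the type flag keeps placeholder strings for image blocks out of the "adjacent text" results.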

The code treats each page as a two-dimensional space and uses the overlap or closeness of bounding boxes as a signal that a text block belongs with an image. Proximity is a heuristic: a nearby block is likely, but not guaranteed, to be a caption or description of the image.
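Because proximity is only a signal, it can help to rank candidate blocks by how far they are from the image instead of accepting every block that passes a yes/no test. The following is a minimal sketch of that idea in plain Python; gap, nearest_text, and the 30-point cutoff are hypothetical names and values chosen for illustration, and the coordinates are invented.

```python
import math

def gap(a, b):
    """Shortest distance between rectangles a and b; 0 if they overlap or touch."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return math.hypot(dx, dy)

def nearest_text(image_rect, text_blocks, max_gap=30):
    """Return (gap, text) pairs within max_gap points of the image, closest first."""
    scored = [(gap(image_rect, b[:4]), b[4].strip()) for b in text_blocks]
    return sorted((g, t) for g, t in scored if g <= max_gap)

# Invented data: an image with a caption below, a heading above, and distant body text.
image_rect = (100, 100, 300, 250)
blocks = [
    (100, 60, 300, 80, "Section heading\n", 0, 0),
    (100, 255, 300, 270, "Figure 1: Sales by region\n", 1, 0),
    (100, 400, 300, 500, "Unrelated body text.\n", 2, 0),
]

for distance, text in nearest_text(image_rect, blocks):
    print(f"{distance:5.1f}  {text}")
```

In this sample the caption comes first because it is the closest block, and the distant paragraph is dropped. Whether the closest block really is the caption still depends on the document's layout.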

Frequently Asked Questions

  1. How does Fitz differ from other Python libraries?

     Fitz (PyMuPDF) stands out for its broad feature set: it can not only read PDF files but also write to them, from extracting content (text, images) to modifying documents by adding annotations or encryption.

  2. Is Fitz free?

     Yes. PyMuPDF is open source; current releases are distributed under the GNU AGPL v3, with a commercial license available from Artifex for projects that cannot meet the AGPL terms. Check the license of the version you install before using it in a commercial product.

  3. Can I extract specific fonts with this method?

     Not with the code shown, which selects text by layout position rather than by font style or size. Font details are available, though: get_text("dict") returns each text span with attributes such as its font name and size, which can be analyzed further.

  4. Does this work with scanned documents?

     The approach works best with digitally created PDFs. Scanned documents usually need OCR (Optical Character Recognition) first, to turn the image-based text into selectable, searchable text before proceeding as described above.

  5. What formats can extracted images be saved as?

     doc.extract_image() returns a dictionary containing the image bytes and the file extension of the image's format, so the bytes can be written to a file directly. To save in another common format such as JPEG, PNG, or GIF, convert the image with a standard imaging library, e.g. PIL (Pillow).
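The last answer mentions saving extracted images, which the main script never does. Here is a minimal sketch of one way to do it, assuming doc is a document already opened with fitz.open() and relying on the "image" and "ext" keys of the dictionary returned by extract_image(). The function name save_images and the file naming scheme are hypothetical choices for this example.

```python
from pathlib import Path

def save_images(doc, out_dir):
    """Write every image used by the pages of an open PyMuPDF document.

    doc is expected to be the object returned by fitz.open(); the file
    names are an arbitrary choice for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    seen = set()  # an image reused on several pages is written only once

    for page_number, page in enumerate(doc, start=1):
        for img_info in page.get_images(full=True):
            xref = img_info[0]
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
            target = out_dir / f"page{page_number}_xref{xref}.{info['ext']}"
            target.write_bytes(info["image"])
            saved.append(target)

    return saved
```

With PyMuPDF installed, it could be called as save_images(fitz.open("path/to/your/document.pdf"), "extracted_images"). The files keep whatever format the PDF stored; converting them to a different format would be a separate step with a library such as Pillow.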

Conclusion

Being able to extract content from complex document structures opens up many possibilities, from automated data entry and research tooling to helping visually impaired users make sense of mixed-content materials. With a tool like PyMuPDF, you can keep images and their related text together so that nothing important is lost during conversion or analysis.
