James Smith @Floppy

**Simon Willison** @simon@simonwillison.net · Mar 30, 2024

I built a new tool: https://tools.simonwillison.net/ocr - it runs OCR against images and PDFs entirely in your browser (no file upload needed) using Tesseract.js and PDF.js

I wrote more about the tool and how I built it (with copious amounts of Claude 3 Opus and a little bit of ChatGPT) here: https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/

**Simon Willison** @simon@simonwillison.net · Mar 30, 2024

Something I really like about this tool is that the entire thing is 226 lines of combined HTML, CSS and JavaScript (plus the PDF.js and Tesseract.js dependencies, loaded from a CDN)

The code is a little untidy but at 226 lines it honestly doesn't matter https://github.com/simonw/tools/blob/9fb049424f4ec8f8ffb91a59ab7111cad56088fc/ocr.html

**Simon Willison** @simon@simonwillison.net · Mar 30, 2024

Also neat is that the enabling libraries here - Tesseract.js and PDF.js - are both pretty old at this point:

First commit to Tesseract.js was Jun 26, 2015 https://github.com/naptha/tesseract.js/commit/906ce3cadbffaf5f7317a4418f282c4b78bf8385

First to PDF.js was Apr 25, 2011 https://github.com/mozilla/pdf.js/commit/6dc1770bba7a417ce5664c0305469e5bb7ea76bd

**Simon Willison** @simon@simonwillison.net · Mar 30, 2024

My other OCR project from yesterday: textract-cli, the thinnest possible CLI wrapper around AWS's Textract API, built out of frustration at how hard that is to use!

https://github.com/simonw/textract-cli

It only works with JPEGs and PNGs up to 5MB in size, reflecting limitations in Textract’s synchronous API - anything more than that has to go to S3 first.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt
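
For readers curious what a thin wrapper like this involves, here is a rough Python sketch of a synchronous Textract call with the limits described above checked up front. It is not the textract-cli source. It assumes boto3 is installed and AWS credentials are configured, and the helper names (`check_sync_limits`, `ocr_image`) and the reading of "5MB" as 5 * 1024 * 1024 bytes are illustrative assumptions.

```python
from pathlib import Path

# Assumption: "5MB" is interpreted as 5 * 1024 * 1024 bytes.
MAX_SYNC_BYTES = 5 * 1024 * 1024
# File types the post says work with the synchronous API.
SYNC_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_sync_limits(path: Path) -> None:
    """Raise ValueError if the file falls outside the limits described above."""
    if path.suffix.lower() not in SYNC_SUFFIXES:
        raise ValueError(f"{path.name}: only JPEG and PNG files are accepted")
    size = path.stat().st_size
    if size > MAX_SYNC_BYTES:
        raise ValueError(
            f"{path.name}: {size} bytes is over the 5MB limit; "
            "larger files have to go to S3 first"
        )


def ocr_image(path: Path) -> str:
    """Send one image to Textract and return the detected lines as plain text."""
    import boto3  # imported lazily so check_sync_limits works without boto3

    check_sync_limits(path)
    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": path.read_bytes()})
    # Keep only LINE blocks; WORD blocks repeat the same text word by word.
    lines = [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
    return "\n".join(lines)
```

Calling `print(ocr_image(Path("image.jpeg")))` would then roughly mirror the command above, with the size and type check failing early instead of waiting for an API error.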

Sym Roe @symroe@mastodon.me.uk

@simon The (oddly hard to find) Textractor Python library does this nicely, with an async interface too:

> pip[x] install amazon-textract-textractor
> textractor detect-document-text your_file.png output.json

https://aws-samples.github.io/amazon-textract-textractor/commandline.html

But maybe it's processing the output into something useful that you needed? Parsing their JSON can be tricky, but that library also has a Document class with handy `to_markdown` or `to_pandas` methods
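
To illustrate why the JSON can be tricky, here is a minimal Python sketch that pulls plain text out of a saved response without the Textractor library. It assumes `output.json` has the shape of a raw Textract `DetectDocumentText` response (a top-level `Blocks` list whose entries carry `BlockType`, `Text` and, for multi-page input, `Page`); the file the Textractor CLI writes may be shaped differently. The function names are hypothetical.

```python
import json
from collections import defaultdict


def lines_by_page(response: dict) -> dict[int, list[str]]:
    """Group the text of LINE blocks by page number, in response order."""
    pages: dict[int, list[str]] = defaultdict(list)
    for block in response.get("Blocks", []):
        # PAGE and WORD blocks are skipped: LINE blocks already hold the text.
        if block.get("BlockType") == "LINE":
            # Single-image responses may omit "Page", so default to page 1.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)


def load_lines(json_path: str) -> dict[int, list[str]]:
    """Read a saved response from disk and group its lines by page."""
    with open(json_path, encoding="utf-8") as fh:
        return lines_by_page(json.load(fh))
```

This covers plain lines only; tables and form fields are linked through block relationships, which is where a helper such as the `Document` class mentioned above is likely to save the most effort.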

Mar 31, 2024, 07:21 AM
