I built a new tool: tools.simonwillison.net/ocr - it runs OCR against images and PDFs entirely in your browser (no file upload needed) using Tesseract.js and PDF.js

I wrote more about the tool and how I built it (with copious amounts of Claude 3 Opus and a little bit of ChatGPT) here: simonwillison.net/2024/Mar/30/

Something I really like about this tool is that the entire thing is 226 lines of combined HTML, CSS and JavaScript (plus the PDF.js and Tesseract.js dependencies, loaded from a CDN).

The code is a little untidy, but at 226 lines it honestly doesn't matter: github.com/simonw/tools/blob/9

GitHub: tools/ocr.html at 9fb049424f4ec8f8ffb91a59ab7111cad56088fc · simonw/tools

My other OCR project from yesterday: textract-cli, the thinnest possible CLI wrapper around AWS's Textract API, built out of frustration at how hard that is to use!

github.com/simonw/textract-cli

It only works with JPEGs and PNGs up to 5MB in size, reflecting limitations in Textract’s synchronous API - anything more than that has to go to S3 first.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt
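
For anyone curious what the synchronous limits mentioned above look like in code, here is a minimal Python sketch. It is not the textract-cli implementation, just an illustration. The helper names `check_sync_limits` and `extract_text` are made up for this example, and treating "5MB" as 5 * 1024 * 1024 bytes is an assumption. The only AWS call used is boto3's `detect_document_text`, which takes the image bytes directly.

```python
from pathlib import Path

# Assumption: "5MB" is read here as 5 * 1024 * 1024 bytes.
MAX_BYTES = 5 * 1024 * 1024
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_sync_limits(path: str) -> None:
    """Raise ValueError if the file does not fit the limits described above."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"{p.name}: only JPEG and PNG files are supported")
    size = p.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"{p.name}: {size} bytes is over the {MAX_BYTES} byte limit")


def extract_text(path: str) -> str:
    """Hypothetical helper: send one image to Textract and return its lines of text."""
    import boto3  # imported lazily so the limit check works without boto3 installed

    check_sync_limits(path)
    client = boto3.client("textract")  # relies on already-configured AWS credentials
    response = client.detect_document_text(
        Document={"Bytes": Path(path).read_bytes()}
    )
    # Keep only LINE blocks; WORD blocks repeat the same text word by word.
    return "\n".join(
        block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"
    )
```

Anything that fails the check would need the S3-based asynchronous route instead, which this sketch does not cover.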

Sym Roe

@simon The (oddly hard to find) Textractor Python library does this nicely, with an async interface too:

> pip[x] install amazon-textract-textractor
> textractor detect-document-text your_file.png output.json

aws-samples.github.io/amazon-t

But maybe it's processing the output into something useful that you needed? Parsing their JSON can be tricky, but that library also has a Document class with handy `to_markdown` or `to_pandas` methods
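
To give a rough idea of what "parsing their JSON" involves, here is a small Python sketch that pulls plain text lines out of a Textract response. It does not use Textractor's Document class. It assumes the JSON has the usual `DetectDocumentText` response shape (a top-level `Blocks` list whose items carry `BlockType` and `Text`), and the function name `lines_from_response` is invented for this example.

```python
import json


def lines_from_response(response: dict) -> list[str]:
    """Return the text of every LINE block, in the order Textract listed them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# A tiny hand-written sample in the assumed shape (not real Textract output).
sample = json.loads("""
{
  "Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Hello world"},
    {"BlockType": "WORD", "Text": "Hello"},
    {"BlockType": "WORD", "Text": "world"},
    {"BlockType": "LINE", "Text": "Second line"}
  ]
}
""")

print(lines_from_response(sample))  # ['Hello world', 'Second line']
```

The real responses also carry geometry, confidence scores and block relationships, which is where the trickier parsing (tables, forms) comes in.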

aws-samples.github.io: CLI - amazon-textract-textractor 1.0.0 documentation

@symroe well that would have saved me a bit of time! Thanks for the link, I'll add that to the textract-cli README

@simon I also built about 70% of a DIY solution before finding it! 🤷‍♂️