Free • Open Source
What is Textricator?
Textricator is a tool to extract text from documents and generate structured data.
If you have a bunch of PDFs with the same format (or one big, consistently formatted PDF) and you want to extract the data to CSV or JSON, Textricator can help! It can even work on OCR'ed documents!
Textricator is released under the GNU Affero General Public License Version 3.
Textricator is published to Maven Central under the group and artifact coordinates io.mfj:textricator.
This application is actively used and developed by Measures for Justice. We welcome feedback, bug reports, and contributions. Create an issue, send a pull request, or email us at firstname.lastname@example.org. If you use Textricator, please let us know. Send us your mailing address and we will mail you a sticker.
io.mfj.textricator.Textricator is the main entry point for library usage.
io.mfj.textricator.cli.TextricatorCli is the command-line interface.
The CLI has three subcommands, one for each of Textricator's three main features:
text - Extract text from the PDF and generate JSON.
table - Parse the text that is in columns and rows. See Table section.
form - Parse the text with a configured finite state machine. See Form section.
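To give a feel for what parsing text with a finite state machine means, here is a minimal, generic sketch in Python. It is not Textricator's code or configuration format: the state names, the "Name:" and "Charge:" line prefixes, the record fields, and the parse function are all hypothetical, chosen only to show how states and transitions can turn a stream of text lines into structured records.

```python
# Illustrative only: a tiny finite state machine over lines of text.
# State names, patterns, and record fields are hypothetical; this is
# NOT Textricator's configuration format or API.
import re

# For each state, the (pattern, next_state) transitions tried in order.
TRANSITIONS = {
    "START": [(r"^Name: ", "NAME")],
    "NAME": [(r"^Charge: ", "CHARGE")],
    "CHARGE": [(r"^Charge: ", "CHARGE"), (r"^Name: ", "NAME")],
}


def parse(lines):
    """Walk the lines through the state machine and collect records."""
    records = []
    state = "START"
    current = None
    for line in lines:
        for pattern, next_state in TRANSITIONS[state]:
            match = re.match(pattern, line)
            if match:
                value = line[match.end():].strip()
                if next_state == "NAME":
                    # Entering NAME starts a new record.
                    current = {"name": value, "charges": []}
                    records.append(current)
                elif next_state == "CHARGE":
                    # Entering CHARGE adds to the record in progress.
                    current["charges"].append(value)
                state = next_state
                break
        # Lines that match no transition from the current state are skipped.
    return records
```

Calling parse on lines such as "Name: Ada", "Charge: A", "Charge: B" yields one record with two charges; a later "Name:" line starts the next record. The resulting list of dictionaries can then be written out as JSON or CSV with the standard library.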