Extract Structured Data • Documind Documentation

Extracting data with Documind

Let’s run a sample extraction with the Documind open-source package.

import { extract } from 'documind';

const result = await extract({
  file: 'https://example.com/bank_statement.pdf',
  schema: [
    {
      "name": "accountNumber",
      "type": "string",
      "description": "The account number of the bank statement."
    },
    {
      "name": "openingBalance",
      "type": "number",
      "description": "The opening balance in the account."
    },
    {
      "name": "transactions",
      "type": "array",
      "description": "A list of transactions in the account.",
      "children": [
        {
          "name": "date",
          "type": "string",
          "description": "The date of the transaction."
        },
        {
          "name": "creditAmount",
          "type": "number",
          "description": "The amount credited in the transaction."
        },
        {
          "name": "debitAmount",
          "type": "number",
          "description": "The amount debited in the transaction."
        },
        {
          "name": "description",
          "type": "string",
          "description": "A short note about the transaction."
        }
      ]
    },
    {
      "name": "closingBalance",
      "type": "number",
      "description": "The closing balance in the account."
    },
    {
      "name": "highValueAccount",
      "type": "boolean",
      "description": "Closing balance is more than 50000."
    },
    {
      "name": "statementType",
      "type": "enum",
      "description": "The type of document",
      "values": ["Current Account", "Savings Account"]
    }
  ]
});

console.log(result);

file (string, required): The URL of the file to process.

schema (object[], required): The schema that defines the structure of the data you want to extract. See the schema documentation for more on schema definitions.
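A quick structural check can catch schema mistakes before an extraction is attempted. The following is only a sketch: findSchemaProblems is a hypothetical helper, not part of Documind, and its rules (every field has a name and a type, enum fields carry values, array fields carry children) are inferred solely from the field shapes in the sample schema on this page. Documind's own validation may differ.

```javascript
// Hypothetical validator, inferred only from the field shapes in the sample
// schema on this page; Documind's actual validation rules may differ.
function findSchemaProblems(fields, path = '') {
  const problems = [];
  for (const field of fields) {
    const where = path + (field.name ?? '(unnamed)');
    if (typeof field.name !== 'string') problems.push(`${where}: missing name`);
    if (typeof field.type !== 'string') problems.push(`${where}: missing type`);
    // Assumption: enum fields list their allowed values.
    if (field.type === 'enum' && !Array.isArray(field.values)) {
      problems.push(`${where}: enum needs values`);
    }
    // Assumption: array fields describe their items through children.
    if (field.type === 'array') {
      if (Array.isArray(field.children)) {
        problems.push(...findSchemaProblems(field.children, `${where}.`));
      } else {
        problems.push(`${where}: array needs children`);
      }
    }
  }
  return problems;
}

// A deliberately broken schema used to exercise the checks.
const brokenSchema = [
  { name: 'accountNumber', type: 'string', description: 'The account number.' },
  { name: 'statementType', type: 'enum', description: 'The type of document.' },
  {
    name: 'transactions',
    type: 'array',
    description: 'A list of transactions.',
    children: [{ name: 'date', description: 'The date of the transaction.' }]
  }
];

// Reports two problems: the enum without values and the child without a type.
console.log(findSchemaProblems(brokenSchema));
```

An empty array means the schema passed these particular checks, nothing more.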

Currently, only URLs are accepted. Ensure your document is hosted and accessible via a public URL.
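Because only URLs are accepted, it can help to check the file value before calling extract. The sketch below uses the standard URL constructor; isHttpUrl is a hypothetical helper, not part of Documind, and it checks only the form of the value, not whether the document is actually reachable.

```javascript
// Hypothetical helper (not part of Documind): returns true when `value`
// parses as an absolute http(s) URL.
function isHttpUrl(value) {
  try {
    const url = new URL(value);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false; // new URL() throws on input it cannot parse
  }
}

console.log(isHttpUrl('https://example.com/bank_statement.pdf')); // true
console.log(isHttpUrl('./bank_statement.pdf')); // false
```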

Example Output

Once the extraction is complete, extract returns a structured JSON object containing the extracted data:

{
  "success": true,
  "pages": 1,
  "data": {
    "accountNumber": "100002345",
    "openingBalance": 3200,
    "transactions": [
      {
        "date": "2021-05-12",
        "creditAmount": null,
        "debitAmount": 100,
        "description": "transfer to Tom"
      },
      {
        "date": "2021-05-12",
        "creditAmount": 50,
        "debitAmount": null,
        "description": "For lunch the other day"
      },
      {
        "date": "2021-05-13",
        "creditAmount": 20,
        "debitAmount": null,
        "description": "Refund for voucher"
      },
      {
        "date": "2021-05-13",
        "creditAmount": null,
        "debitAmount": 750,
        "description": "May's rent"
      }
    ],
    "closingBalance": 2420,
    "highValueAccount": false,
    "statementType": "Savings Account"
  },
  "fileName": "bank_statement.pdf"
}

success (boolean): Indicates whether the extraction was successful.

pages (number): The number of pages processed in the document.

data (object): The extracted data, structured according to the schema.

fileName (string): The name of the processed file.
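A caller would typically check success before using data. The sketch below does that and then reconciles the extracted balances against the extracted transactions, using a trimmed copy of the sample output. summarizeStatement is a hypothetical helper, and treating a null amount as zero is an assumption based on how the sample output represents a missing credit or debit.

```javascript
// Hypothetical helper: checks the success flag, then reconciles the
// extracted balances against the extracted transactions.
function summarizeStatement(result) {
  if (!result.success) {
    throw new Error(`Extraction failed for ${result.fileName}`);
  }
  const { openingBalance, closingBalance, transactions } = result.data;
  // Assumption: a missing credit or debit is null, as in the sample output.
  const net = transactions.reduce(
    (sum, t) => sum + (t.creditAmount ?? 0) - (t.debitAmount ?? 0),
    0
  );
  return { net, reconciles: openingBalance + net === closingBalance };
}

// Trimmed copy of the sample output.
const sample = {
  success: true,
  pages: 1,
  data: {
    openingBalance: 3200,
    transactions: [
      { creditAmount: null, debitAmount: 100 },
      { creditAmount: 50, debitAmount: null },
      { creditAmount: 20, debitAmount: null },
      { creditAmount: null, debitAmount: 750 }
    ],
    closingBalance: 2420
  },
  fileName: 'bank_statement.pdf'
};

console.log(summarizeStatement(sample)); // { net: -780, reconciles: true }
```

The exact equality works here because the sample amounts are integers; for fractional amounts, comparing within a small tolerance would be safer.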

Configurations

Specify the model you want to use to perform an extraction.

If you have a custom BASE_URL set in your environment, you can choose between the llava and llama3.2-vision models. Otherwise, you can use gpt-4o or gpt-4o-mini.

import { extract } from 'documind';

const result = await extract({
  file: 'https://example.com/document.pdf',
  model: 'llama3.2-vision' // Specify a model or use the default (Llava)
});

console.log(result);
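The rule about which models are available could be encoded in a small helper so that an unsupported model name is never passed along. This is a sketch only: pickModel is hypothetical, the fallback to the first listed model is this helper's own policy rather than Documind's default, and the BASE_URL value in the usage lines is a placeholder.

```javascript
// Hypothetical helper: picks a model name based on whether BASE_URL is set.
// The model names come from this page; the selection policy is an assumption.
function pickModel(env, preferred) {
  const allowed = env.BASE_URL
    ? ['llava', 'llama3.2-vision']
    : ['gpt-4o', 'gpt-4o-mini'];
  // Fall back to the first allowed model when the preference does not apply.
  return allowed.includes(preferred) ? preferred : allowed[0];
}

// Placeholder BASE_URL value, used only to show the branch being taken.
console.log(pickModel({ BASE_URL: 'http://localhost:11434/v1' }, 'llama3.2-vision')); // llama3.2-vision
console.log(pickModel({}, 'llava')); // gpt-4o
```

In a Node.js script, process.env would be passed as the first argument, and the returned name would go into the model option.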

To use a template, pass the template name to the extract function:

import { extract } from 'documind';

const result = await extract({
  file: 'https://example.com/document.pdf',
  template: 'invoice' // Specify the template name
});

console.log(result);

To generate a schema automatically, set autoSchema to true. Documind will create and apply a schema based on the content of the document.

import { extract } from 'documind';

const result = await extract({
  file: 'https://example.com/document.pdf',
  autoSchema: true
});

console.log(result);

You can only select one of template, schema, or autoSchema.
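This constraint can be checked before the options object reaches extract. The following is a minimal sketch; schemaSources and assertSingleSchemaSource are hypothetical helpers, not part of Documind, and treating autoSchema: false as "not selected" is an assumption.

```javascript
// Hypothetical pre-flight check (not part of Documind): lists which of
// `template`, `schema`, and `autoSchema` are set on the options object.
function schemaSources(options) {
  return ['template', 'schema', 'autoSchema'].filter(
    // Assumption: undefined and false both mean "not selected".
    (key) => options[key] !== undefined && options[key] !== false
  );
}

// Throws when more than one schema source is set; otherwise returns the
// options unchanged so the call can be wrapped inline.
function assertSingleSchemaSource(options) {
  const used = schemaSources(options);
  if (used.length > 1) {
    throw new Error(`Pass only one of: ${used.join(', ')}`);
  }
  return options;
}

const options = assertSingleSchemaSource({
  file: 'https://example.com/document.pdf',
  template: 'invoice'
});
console.log(schemaSources(options)); // [ 'template' ]
```

The checked object can then be passed to extract as usual.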
