Commit dc0bdc1

Merge pull request #251 from SyncfusionExamples/1022877-DataExtraction_MDSamples
Documentation (1022877) – Added PDF to Markdown Extraction Samples
2 parents c803683 + 4de7217 commit dc0bdc1

17 files changed

Lines changed: 262 additions & 0 deletions

Binary file not shown.
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
# Extract Structured Data from PDF
2+
3+
The Syncfusion® [Smart Data Extractor](https://www.syncfusion.com/document-sdk/net-pdf-data-extraction) is a .NET library that extracts document structures such as hierarchies, text blocks, images, headers, and footers from PDFs and scanned images by analyzing visual layout patterns like lines, boxes, and alignment. It returns structured JSON with per-field confidence scores.
4+
5+
## Steps to Extract Structured Data from PDF Files
6+
7+
Step 1: **Create a new project:** Begin by setting up a new C# Console Application project.
8+
9+
Step 2: **Install the NuGet package:** Add the [Syncfusion.SmartDataExtractor.Net.Core](https://www.nuget.org/packages/Syncfusion.SmartDataExtractor.Net.Core) package to your project from [NuGet.org](https://www.nuget.org/).
10+
11+
Step 3: **Include necessary namespaces:** Add these namespaces in your Program.cs file:
12+
13+
```csharp
14+
using System.IO;
15+
using System.Text;
16+
using Syncfusion.SmartDataExtractor;
17+
18+
```
19+
20+
Step 4: **Extract the data:** Add the following code to your Program.cs file to extract data from the PDF as JSON:
21+
22+
```csharp
23+
// Open the input PDF file as a stream.
24+
using (FileStream stream = new FileStream(Path.GetFullPath("Input.pdf"), FileMode.Open, FileAccess.Read))
25+
{
26+
// Initialize the Smart Data Extractor.
27+
DataExtractor extractor = new DataExtractor();
28+
    // Extract the document data as JSON.
29+
string data = extractor.ExtractDataAsJson(stream);
30+
// Save the extracted JSON data into an output file.
31+
File.WriteAllText(Path.GetFullPath(@"Output.json"), data, Encoding.UTF8);
32+
}
33+
34+
```
35+
For a complete working example, download it from [GitHub](https://github.com/SyncfusionExamples/PDF-Examples/tree/master/Data-Extraction/Smart-Data-Extractor/Extract-data-as-JSON-from-PDF/.NET).
36+
37+
More information about extracting data from PDF files can be found in this [documentation](https://help.syncfusion.com/document-processing/data-extraction/smart-data-extractor/overview) section.
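
The README says the extractor returns structured JSON but does not show its schema. One way to explore an unfamiliar `Output.json` is to list every property path it contains. The sketch below uses only `System.Text.Json`. The `CollectPaths` helper and the sample JSON string are hypothetical placeholders, and the sample's field names are not the extractor's actual schema.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: collect every property path in a JSON document
// without assuming anything about its schema.
static List<string> CollectPaths(JsonElement element, string path, List<string> paths)
{
    switch (element.ValueKind)
    {
        case JsonValueKind.Object:
        {
            foreach (JsonProperty property in element.EnumerateObject())
            {
                string childPath = path.Length == 0 ? property.Name : path + "." + property.Name;
                paths.Add(childPath);
                CollectPaths(property.Value, childPath, paths);
            }
            break;
        }
        case JsonValueKind.Array:
        {
            int index = 0;
            foreach (JsonElement item in element.EnumerateArray())
            {
                CollectPaths(item, path + "[" + index + "]", paths);
                index++;
            }
            break;
        }
    }
    return paths;
}

// Placeholder input (NOT the extractor's schema); in practice, pass
// File.ReadAllText("Output.json") instead.
string json = "{\"example\":[{\"name\":\"value\",\"score\":0.5}]}";

using JsonDocument document = JsonDocument.Parse(json);
List<string> found = CollectPaths(document.RootElement, "", new List<string>());
foreach (string foundPath in found)
{
    Console.WriteLine(foundPath);
}
```

Once the real property names are known, the generic walk can be replaced with typed classes and `JsonSerializer.Deserialize`.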
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
2+
Microsoft Visual Studio Solution File, Format Version 12.00
3+
# Visual Studio Version 18
4+
VisualStudioVersion = 18.5.11716.220 stable
5+
MinimumVisualStudioVersion = 10.0.40219.1
6+
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Extract-data-as-MD-from-PDF", "Extract-data-as-MD-from-PDF\Extract-data-as-MD-from-PDF.csproj", "{29872D0F-18F6-6AB7-0892-D538C0E179BA}"
7+
EndProject
8+
Global
9+
GlobalSection(SolutionConfigurationPlatforms) = preSolution
10+
Debug|Any CPU = Debug|Any CPU
11+
Release|Any CPU = Release|Any CPU
12+
EndGlobalSection
13+
GlobalSection(ProjectConfigurationPlatforms) = postSolution
14+
{29872D0F-18F6-6AB7-0892-D538C0E179BA}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
15+
{29872D0F-18F6-6AB7-0892-D538C0E179BA}.Debug|Any CPU.Build.0 = Debug|Any CPU
16+
{29872D0F-18F6-6AB7-0892-D538C0E179BA}.Release|Any CPU.ActiveCfg = Release|Any CPU
17+
{29872D0F-18F6-6AB7-0892-D538C0E179BA}.Release|Any CPU.Build.0 = Release|Any CPU
18+
EndGlobalSection
19+
GlobalSection(SolutionProperties) = preSolution
20+
HideSolutionNode = FALSE
21+
EndGlobalSection
22+
EndGlobal
@@ -0,0 +1,24 @@
1+
<Project Sdk="Microsoft.NET.Sdk">
2+
3+
<PropertyGroup>
4+
<OutputType>Exe</OutputType>
5+
<TargetFramework>net8.0</TargetFramework>
6+
<RootNamespace>Extract_data_as_MD_from_PDF</RootNamespace>
7+
<ImplicitUsings>enable</ImplicitUsings>
8+
<Nullable>enable</Nullable>
9+
</PropertyGroup>
10+
11+
<ItemGroup>
12+
<PackageReference Include="Syncfusion.SmartDataExtractor.Net.Core" Version="*" />
13+
</ItemGroup>
14+
15+
<ItemGroup>
16+
<None Update="Data\Input.pdf">
17+
<CopyToOutputDirectory>Always</CopyToOutputDirectory>
18+
</None>
19+
<None Update="Output\.gitkeep">
20+
<CopyToOutputDirectory>Always</CopyToOutputDirectory>
21+
</None>
22+
</ItemGroup>
23+
24+
</Project>

Data-Extraction/Smart-Data-Extractor/Extract-data-as-MD-from-PDF/.NET/Extract-data-as-MD-from-PDF/Output/.gitkeep

Whitespace-only changes.
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
using Syncfusion.SmartDataExtractor;
2+
using System.Text;
3+
4+
//Open the input PDF file as a stream.
5+
using (FileStream stream = new FileStream(Path.GetFullPath(Path.Combine("Data", "Input.pdf")), FileMode.Open, FileAccess.Read))
6+
{
7+
//Initialize the Smart Data Extractor.
8+
DataExtractor extractor = new DataExtractor();
9+
//Extract data as Markdown.
10+
string data = extractor.ExtractDataAsMarkdown(stream);
11+
//Save the extracted Markdown data into an output file.
12+
    File.WriteAllText(Path.GetFullPath(Path.Combine("Output", "Output.md")), data, Encoding.UTF8);
13+
}
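
The sample resolves the `Data` and `Output` folders relative to the current working directory, while the project file copies them to the build output directory. If the program is started from another directory, the two locations can differ. A minimal sketch of one alternative is to anchor both paths at `AppContext.BaseDirectory`. The `ResolveFromAppDirectory` helper is a hypothetical name, and the sketch uses only the .NET base class library, not the Syncfusion API.

```csharp
using System;
using System.IO;

// Hypothetical helper: build a path under the application's base directory,
// joining segments with the current platform's directory separator.
static string ResolveFromAppDirectory(params string[] segments)
{
    string relative = Path.Combine(segments);
    return Path.Combine(AppContext.BaseDirectory, relative);
}

string inputPath = ResolveFromAppDirectory("Data", "Input.pdf");
string outputPath = ResolveFromAppDirectory("Output", "Output.md");

// Create the output folder if it is missing; this does nothing when it already exists.
Directory.CreateDirectory(Path.GetDirectoryName(outputPath)!);

Console.WriteLine(File.Exists(inputPath)
    ? $"Input found: {inputPath}"
    : $"Input missing: {inputPath}");
Console.WriteLine($"Output will be written to: {outputPath}");
```

The resulting `inputPath` and `outputPath` can then be passed to `FileStream` and `File.WriteAllText` in place of the relative paths.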
Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
# Extract Structured Data from PDF as Markdown
2+
3+
The Syncfusion® [Smart Data Extractor](https://www.syncfusion.com/document-sdk/net-pdf-data-extraction) is a .NET library that extracts document structures such as hierarchies, text blocks, images, headers, and footers from PDFs and scanned images by analyzing visual layout patterns like lines, boxes, and alignment.
4+
5+
## Steps to Extract Data as Markdown from PDF Files
6+
7+
Step 1: **Create a new project:** Begin by setting up a new C# Console Application project.
8+
9+
Step 2: **Install the NuGet package:** Add the [Syncfusion.SmartDataExtractor.Net.Core](https://www.nuget.org/packages/Syncfusion.SmartDataExtractor.Net.Core) package to your project from [NuGet.org](https://www.nuget.org/).
10+
11+
Step 3: **Include necessary namespaces:** Add these namespaces in your Program.cs file:
12+
13+
```csharp
14+
using System.IO;
15+
using System.Text;
16+
using Syncfusion.SmartDataExtractor;
17+
18+
```
19+
20+
Step 4: **Extract the data:** Add the following code to your Program.cs file to extract data from the PDF as Markdown:
21+
22+
```csharp
23+
//Open the input PDF file as a stream.
24+
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
25+
{
26+
//Initialize the Smart Data Extractor.
27+
DataExtractor extractor = new DataExtractor();
28+
//Extract data as Markdown.
29+
string data = extractor.ExtractDataAsMarkdown(stream);
30+
//Save the extracted Markdown data into an output file.
31+
File.WriteAllText("Output.md", data, Encoding.UTF8);
32+
}
33+
34+
```
35+
More information about extracting data from PDF files can be found in this [documentation](https://help.syncfusion.com/document-processing/data-extraction/smart-data-extractor/overview) section.
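
After `Output.md` is written, a quick way to check the extracted document hierarchy is to list its headings. The sketch below assumes the Markdown uses ATX headings (`#` through `######` followed by a space), which this README does not confirm. `ListHeadings` and the sample string are hypothetical, and the helper does not skip `#` lines inside fenced code blocks.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: return the text of every ATX heading in a Markdown string.
static List<string> ListHeadings(string markdown)
{
    var headings = new List<string>();
    foreach (string rawLine in markdown.Split('\n'))
    {
        string line = rawLine.Trim();

        // Count the leading '#' characters.
        int level = 0;
        while (level < line.Length && line[level] == '#')
        {
            level++;
        }

        // A heading has 1 to 6 '#' characters followed by a space.
        if (level >= 1 && level <= 6 && level < line.Length && line[level] == ' ')
        {
            headings.Add(line.Substring(level + 1).Trim());
        }
    }
    return headings;
}

// Placeholder Markdown; in practice, pass File.ReadAllText("Output.md") instead.
string sample = "# Title\nBody text\n## Section\n#NotAHeading";
foreach (string heading in ListHeadings(sample))
{
    Console.WriteLine(heading);
}
```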
Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
# Recognize Form Data from PDF using C#
2+
3+
The Syncfusion® [Smart Form Recognizer](https://www.syncfusion.com/document-sdk/net-pdf-data-extraction) is a .NET library that detects form regions and extracts text fields, checkboxes, radio buttons, and signatures by interpreting visual patterns such as boxes and selection markers. The extracted results are returned as normalized JSON with confidence scores, enabling applications to process form data automatically.
4+
5+
## Steps to Recognize Form Data from PDF Files
6+
7+
Step 1: **Create a new project:** Begin by setting up a new C# Console Application project.
8+
9+
Step 2: **Install the NuGet package:** Add the [Syncfusion.SmartFormRecognizer.Net.Core](https://www.nuget.org/packages/Syncfusion.SmartFormRecognizer.Net.Core) package to your project from [NuGet.org](https://www.nuget.org/).
10+
11+
Step 3: **Include necessary namespaces:** Add these namespaces in your Program.cs file:
12+
13+
```csharp
14+
using System.IO;
15+
using Syncfusion.SmartFormRecognizer;
16+
```
17+
18+
Step 4: **Recognize the form:** Add the following code to your Program.cs file to recognize form data from the PDF:
19+
20+
```csharp
21+
// Read the input PDF file as a stream.
22+
using (FileStream inputStream = new FileStream(Path.GetFullPath(@"Input.pdf"), FileMode.Open, FileAccess.Read))
23+
{
24+
// Initialize the Form Recognizer.
25+
FormRecognizer smartFormRecognizer = new FormRecognizer();
26+
// Recognize the form and get the output as JSON string.
27+
string outputJson = smartFormRecognizer.RecognizeFormAsJson(inputStream);
28+
// Save the output JSON to file.
29+
    File.WriteAllText(Path.GetFullPath(@"Output.json"), outputJson);
30+
}
31+
```
32+
For a complete working example, download it from [GitHub](https://github.com/SyncfusionExamples/PDF-Examples/tree/master/Data-Extraction/Smart-Form-Recognizer/Recognize-forms-using-JSON/.NET).
33+
34+
More information about the Smart Form Recognizer can be found in this [documentation](https://help.syncfusion.com/document-processing/data-extraction/smart-form-recognizer/overview) section.
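
The sample writes the recognizer's JSON string to disk exactly as returned. If a more readable file is preferred, one option is to re-indent the string before saving it. This is a sketch using only `System.Text.Json`. The `Indent` helper and the sample input are hypothetical, and the sample's field names are not the recognizer's actual schema.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: parse a JSON string and serialize it again with indentation.
static string Indent(string json)
{
    using JsonDocument document = JsonDocument.Parse(json);
    var options = new JsonSerializerOptions { WriteIndented = true };
    return JsonSerializer.Serialize(document.RootElement, options);
}

// Placeholder input (NOT the recognizer's schema); in practice, pass the string
// returned by the recognizer before calling File.WriteAllText.
string compact = "{\"field\":\"value\",\"checked\":true}";
Console.WriteLine(Indent(compact));
```

With the default options, the serializer may write some characters as `\uXXXX` escapes. The JSON stays equivalent, but the text can differ from the original string.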
