Where can I get the "API URL for LLM Sherpa"?
The endpoint gives a connection pool timeout, so I am not sure whether this can be used in production.
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ceb443fa590>, 'Connection to readers.llmsherpa.com timed out. (connect timeout=None)')': /api/document/developer/parseDocument?renderFormat=all
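For transient timeouts like the one in the warning above, one common client-side mitigation is to retry the call with exponential backoff. The following is only a minimal sketch: `call_with_retries` and its parameters are hypothetical helpers, not part of llmsherpa, and retrying does not help if the endpoint is unreachable for long periods.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises, wait and retry, doubling the delay each time.

    `sleep` is injectable so the helper can be exercised without real waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # in real code, catch only timeout/connection errors
            last_error = error
            if attempt < attempts - 1:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A call such as `call_with_retries(lambda: reader.read_pdf(path))` would wrap the parse request, assuming `reader` is an already constructed LayoutPDFReader.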
The article is very well documented!
Hi. I have a question.
I used LayoutPDFReader to extract the contents of a PDF file. Then I used the doc object, created by following the Google Colab examples, to save its contents in a JSON file.
When I print the contents of the JSON file, parts of the output are often repeated.
For example, if the output was:
{
"tag": "header",
"level": 3,
"text": "2.1.2 Surface Reconstruction for Unoriented Point Clouds",
"children": [
{
"tag": "para",
"level": 4,
"text": "Attributed to the inherent challenge of acquiring normal information, numerous point clouds, particularly."
},
{
"tag": "para",
"level": 4,
"text": "In recent years, some research has been proposed on handling unoriented point clouds based on traditional methods."
}
]
},
{
"tag": "header",
"level": 3,
"text": "2.2 Normal Consistent Orientations for Point Clouds",
"children": [
{
"tag": "para",
"level": 4,
"text": "Calculating high-quality normals for unoriented point clouds is an important topic in geometric modeling and computer graphics.\nConverting unoriented point clouds into the oriented representation provides a novel approach to surface reconstruction.\nOverall, we can divide them into optimization-based and learning-based methods."
},
{
"tag": "para",
"level": 4,
"text": "Optimization-based Methods The research on optimizationbasedmethods has a long history, which can date back to the last century.\nThe pioneering work uses principal component analysis (PCA) to initialize normal orientations by Hoppe et al.\n[1992].\nAlthough many propagation-based methods have emerged, these methods always fail to handle complex or separated point clouds."
},
{
"tag": "para",
"level": 4,
"text": "Over the years, many technologies have been proposed tomitigate the problems of relying solely on local information.\nDipole [Metzer et al.\n2021] uses dipole propagation across patches iteratively.\nVIPSS [Huang et al.\n2019] minimizes the Duchon’s energy with the L-BFGS algorithm.\nXu et al.\n[2023b] propose a smooth nonlinear objective function to characterize the requirements of an acceptable winding-number field [McIntyre and Cairns 1993] and turn the problem into an unconstrained optimization problem."
},
{
"tag": "para",
"level": 4,
"text": "𝜕Φ(𝒙 −𝒚) 𝜕𝒏(𝒚)"
},
{
"tag": "para",
"level": 4,
"text": "d𝑆 (𝒚) = 𝜒 (𝒙) = Learning-based Methods Learning-based methods often treat oriented normal estimation as a classification or regression task where the normals are directly regressed from the feature"
},
{
"tag": "para",
"level": 4,
"text": "Due to the tight link between orientation and surface reconstruction, some proposed state-of-the-art methods can simultaneously accomplish orientation and surface reconstruction tasks.\nFor example, as reconstruction methods, iPSR [Hou et al.\n2022] and PGR [Lin et al.\n2023]"
}
]
}
]
}
]
},
then at the end of some blocks, seemingly at random, the output starts again from a block that was already printed, i.e.:
{
"tag": "header",
"level": 3,
"text": "2.1.2 Surface Reconstruction for Unoriented Point Clouds",
"children": [
{
"tag": "para",
"level": 4,
"text": "Attributed to the inherent challenge of acquiring normal information, numerous point clouds, particularly."
},
{
"tag": "para",
"level": 4,
"text": "In recent years, some research has been proposed on handling unoriented point clouds based on traditional methods."
}
]
},
{
"tag": "header",
"level": 3,
"text": "2.2 Normal Consistent Orientations for Point Clouds",
"children": [
{
"tag": "para",
"level": 4,
"text": "Calculating high-quality normals for unoriented point clouds is an important topic in geometric modeling and computer graphics.\nConverting unoriented point clouds into the oriented representation provides a novel approach to surface reconstruction.\nOverall, we can divide them into optimization-based and learning-based methods."
},
{
"tag": "para",
"level": 4,
"text": "Optimization-based Methods The research on optimizationbasedmethods has a long history, which can date back to the last century.\nThe pioneering work uses principal component analysis (PCA) to initialize normal orientations by Hoppe et al.\n[1992].\nAlthough many propagation-based methods have emerged, these methods always fail to handle complex or separated point clouds."
},
{
"tag": "para",
"level": 4,
"text": "Over the years, many technologies have been proposed tomitigate the problems of relying solely on local information.\nDipole [Metzer et al.\n2021] uses dipole propagation across patches iteratively.\nVIPSS [Huang et al.\n2019] minimizes the Duchon’s energy with the L-BFGS algorithm.\nXu et al.\n[2023b] propose a smooth nonlinear objective function to characterize the requirements of an acceptable winding-number field [McIntyre and Cairns 1993] and turn the problem into an unconstrained optimization problem."
},
{
"tag": "para",
"level": 4,
"text": "𝜕Φ(𝒙 −𝒚) 𝜕𝒏(𝒚)"
},
{
"tag": "para",
"level": 4,
"text": "d𝑆 (𝒚) = 𝜒 (𝒙) = Learning-based Methods Learning-based methods often treat oriented normal estimation as a classification or regression task where the normals are directly regressed from the feature"
},
{
"tag": "para",
"level": 4,
"text": "Due to the tight link between orientation and surface reconstruction, some proposed state-of-the-art methods can simultaneously accomplish orientation and surface reconstruction tasks.\nFor example, as reconstruction methods, iPSR [Hou et al.\n2022] and PGR [Lin et al.\n2023]"
}
]
}
]
}
]
},
{
"tag": "para",
"level": 4,
"text": "𝜕Φ(𝒙 −𝒚) 𝜕𝒏(𝒚)"
},
{
"tag": "para",
"level": 4,
"text": "d𝑆 (𝒚) = 𝜒 (𝒙) = Learning-based Methods Learning-based methods often treat oriented normal estimation as a classification or regression task where the normals are directly regressed from the feature"
},
{
"tag": "para",
"level": 4,
"text": "Due to the tight link between orientation and surface reconstruction, some proposed state-of-the-art methods can simultaneously accomplish orientation and surface reconstruction tasks.\nFor example, as reconstruction methods, iPSR [Hou et al.\n2022] and PGR [Lin et al.\n2023]"
}
]
}
]
}
]
},
Why does this happen? And is there a way to resolve it?
Hello, I am encountering the same problem with the duplicate chunks. Were you able to figure out what was causing this?
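The cause of the repetition is not established in this thread. One possibility, stated here only as an assumption, is that the tree is serialized from several starting points, so a section is written once on its own and again as part of its parent. As a post-processing workaround, the sketch below removes repeated subtrees from a list of nodes shaped like the JSON above (`tag`, `level`, `text`, `children`). The function `dedupe_tree` is hypothetical and not part of llmsherpa.

```python
def dedupe_tree(nodes, seen=None):
    """Return a copy of `nodes` that skips any node already emitted earlier.

    A node counts as a repeat when its (tag, level, text) triple was seen before.
    Caveat: a document that legitimately repeats identical text loses the repeats.
    """
    if seen is None:
        seen = set()
    result = []
    for node in nodes:
        key = (node.get("tag"), node.get("level"), node.get("text"))
        if key in seen:
            continue  # drop the repeated node together with its subtree
        seen.add(key)
        kept = dict(node)  # shallow copy so the input is left unmodified
        if "children" in node:
            kept["children"] = dedupe_tree(node["children"], seen)
        result.append(kept)
    return result
```

Loading the saved file with `json.load`, passing the top-level list to `dedupe_tree`, and writing the result back with `json.dump` would apply it to an existing export.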
Hi Ambika, thanks for the article.
When I upload a PDF from my local machine, I get the error message "LocationValueError: No host specified." Is there any other way to work with local files here?
Thanks,
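urllib3 raises "No host specified" when the URL it is asked to connect to has no host, so one thing worth checking is the API URL passed to the reader. The sketch below validates the URL before parsing a local file. It rests on assumptions: the endpoint is copied from the urllib3 warning quoted in this thread, `has_host` and `read_local_pdf` are hypothetical helpers, and `read_pdf` accepting a local path reflects my understanding of the llmsherpa client rather than anything confirmed here.

```python
from urllib.parse import urlparse

# Assumption: endpoint taken from the urllib3 warning quoted in this thread.
LLMSHERPA_API_URL = (
    "https://readers.llmsherpa.com/api/document/developer/parseDocument"
    "?renderFormat=all"
)


def has_host(url):
    """Return True if `url` has both a scheme and a host part."""
    parsed = urlparse(url or "")
    return bool(parsed.scheme and parsed.netloc)


def read_local_pdf(pdf_path, api_url=LLMSHERPA_API_URL):
    """Parse a local PDF, failing early if the API URL is empty or malformed."""
    if not has_host(api_url):
        raise ValueError(f"API URL has no host: {api_url!r}")
    # Imported lazily so the URL check works even without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(api_url)
    # Assumption: read_pdf takes a local file path as well as a URL.
    return reader.read_pdf(pdf_path)
```

If the check passes and the error persists, the problem lies elsewhere and is worth raising as an issue with the exact call that fails.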
Hi Ambika Sukla, nice article. Thanks for sharing it.
In the PDFs we used for testing, we find that some sections are not properly tagged/classified as sections. Such a section is treated as a list item or a paragraph of the previous section.
The section headings are in bold. Visually, I cannot find any distinction between the sections that Layout Parser parses properly and the problem sections.
Any idea how to correct this, or how to figure out why Layout Parser treats such a heading as a list item/paragraph instead of a section (which it actually is)?
Is there an offline version of LayoutPDFReader or a model that is downloadable to process documents locally offline?
Hi Tony. We have a private option available on Azure Marketplace: https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nlmaticscorp1686371242615.layout_pdf_parser?tab=Overview. There are a couple of different pricing options, and in this mode your data will remain within your tenancy. Hope this helps.
Nice article. I have complex documents at hand. Here are my questions, if someone can provide directions:
1) How can I know whether the content on the current page and the next page is related?
2) How can I exclude headers and footers when their margins are not consistent across the document, and when some documents have no headers or footers at all?
3) Multi-column text on a page?
4) Tables within a section?
5) Tables split across multiple pages, with no headers in the continued parts?
Does LayoutPDFReader help in these scenarios?
Hi Aravind,
Headers and footers are removed, and content from multiple pages is joined together. Multiple columns are also handled. We currently do not connect tables that run across multiple pages (in the works). Please try your PDFs and raise an issue on GitHub if you see one.