JinRongExtractScript

File-splitting scripts for the Financial Publishing House (金融出版社)

Runtime environment:

Python 2.7

Python 3

JDK 8

Data extraction workflow:

1. Collect all source PDF files, rename each file to the name of its containing folder, and write the reference file PDF_File.txt. [GetAllPDF.py]

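2. Collect all table-of-contents (mulu) XML files, rename each file to the name of its containing folder, and write the reference file XML_MULU_File.txt. [GetAllMuluXML.py]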

3. Collect all chapter XML data, rename each folder to the name of the current folder, and write the reference file XML_Chapter_File.txt. [GetChapterXML.py]

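4. From the collected table-of-contents XML, extract the entry information (entry title, start page, file name, entry level) and write the reference file parseMuluXML.txt. [GetMuluInfo.py]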

5. Compute the entry levels from MULU_File.txt, derive the title of each lower-level entry's parent entry, and write the reference file MULU_Level_File.txt. [GetLevelName.py]

6. Generate the review file JiaoShenFile.txt, which also generates a foreign key. [GetJiaoShenFile.py]

7. Generate the page-offset check file Check_Page_Offset_File.txt, which also generates a foreign key. [GetPageOffset.py]

8. Have the database build a proofreading table from the two files; once proofreading is complete, generate Page_Offset_File.txt.

1) Split first, cutting one extra page, then review.

2) Generate page numbers by default, proofread the page numbers, then split.

9. Collect all PDF file names and generate a unique GUID for each entry and each PDF file. [GetAllBookName.py; GUIDs generated by a Java program]

10. Write a script that splits the PDF files according to the start offsets. [GenFile.py]

11. Find all tags. [GetAllTag.py]

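12. From the fragmented XML files, extract the body text of chapters, sections, and subsections (stripping extra tags) and write the temporary file Check_Content_File.txt. [GetContent.py]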

13. Post-process the extracted text by removing Chinese spaces while keeping English spaces (use Python 3), and write the reference file Content_File.txt. [DealContentFile.py]

14. Use Python 3 to load the data into the database (escaping the text data). [CreateInsertContentSQL.py]

15. Store the data using hashing. Open question: use multi-level hashing, or hash by bookName?

16. Collect the metadata XML for each PDF and write the reference file XML_BookINFO_File.txt. [GetAllBookXML.py]

17. Parse the PDF metadata XML, extract the metadata fields, and write the reference file BookInfo_File.txt. [GetBookInfo.py]

18. Proofread the information and load it into the database.

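19. Extract image and chart information from the XML (to be decided; in future only charts may be extracted, depending on how many types there are: formula, table) and write the reference file All_Image_Info.txt. [All_Image_Info.py]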

20. Collect all source image files and write the reference file ALL_IMAGE_File.txt. [GetAllImage.py]

21. Proofread the image information and make sure the counts match.

23. Extract the text data of individual PDFs, write the reference file Content_PDF_File.txt, and load it into the database with Python 3. [same as step 14]
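Step 1 could be sketched roughly as follows. This is not the code in GetAllPDF.py; the function name `collect_pdfs`, the tab-separated manifest layout, and the assumption of one PDF per folder are all illustrative guesses.

```python
import os
import shutil


def collect_pdfs(src_root, dest_dir, manifest_path):
    """Copy every PDF under src_root into dest_dir, named after its folder.

    Assumptions (not from the original scripts): each folder holds at most
    one PDF, dest_dir lies outside src_root, and the manifest is a
    tab-separated list of "original path<TAB>new file name".
    """
    os.makedirs(dest_dir, exist_ok=True)
    rows = []
    for folder, _subdirs, files in sorted(os.walk(src_root)):
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue
            new_name = os.path.basename(os.path.normpath(folder)) + ".pdf"
            target = os.path.join(dest_dir, new_name)
            if os.path.exists(target):
                # Two PDFs in one folder would collide; fail loudly.
                raise FileExistsError(target)
            shutil.copy2(os.path.join(folder, name), target)
            rows.append((os.path.join(folder, name), new_name))
    with open(manifest_path, "w", encoding="utf-8") as fh:
        for original, renamed in rows:
            fh.write(original + "\t" + renamed + "\n")
    return rows
```

Copying instead of renaming in place keeps the source tree untouched, so the step can be rerun after a failed pass.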
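The parent lookup in step 5 can be illustrated with a minimal sketch. It assumes the entries arrive in document order as (title, level) pairs with level 1 at the top; the function name `parent_names` and this input shape are hypothetical and may differ from what GetLevelName.py actually reads.

```python
def parent_names(entries):
    """Return (title, level, parent_title) for each (title, level) entry.

    Assumes entries are in document order and that level 1 is the top
    level; top-level entries get None as their parent.
    """
    last_at_level = {}  # most recent title seen at each level
    result = []
    for title, level in entries:
        parent = last_at_level.get(level - 1)
        last_at_level[level] = title
        # Drop deeper levels so later entries cannot attach to a stale ancestor.
        for deeper in [lv for lv in last_at_level if lv > level]:
            del last_at_level[deeper]
        result.append((title, level, parent))
    return result


toc = [
    ("Chapter 1", 1),
    ("Section 1.1", 2),
    ("Subsection 1.1.1", 3),
    ("Section 1.2", 2),
    ("Chapter 2", 1),
    ("Section 2.1", 2),
]
for row in parent_names(toc):
    print(row)
```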
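For steps 8 and 10, the page ranges to cut can be computed before any PDF library is involved. The sketch below follows option 1) of step 8 (cut one extra page) and assumes that a PDF page number equals the printed page number plus a per-book offset; that meaning of "offset", like the function name `page_ranges`, is an assumption rather than something taken from GenFile.py.

```python
def page_ranges(start_pages, offset, total_pages):
    """Return inclusive, 1-based (first, last) PDF page ranges per entry.

    start_pages: printed start page of each entry, in ascending order.
    offset: assumed difference between PDF page and printed page.
    Each range runs through the next entry's first page (the "one extra
    page"), because an entry may end partway down that page.
    """
    ranges = []
    for i, start in enumerate(start_pages):
        first = start + offset
        if i + 1 < len(start_pages):
            last = start_pages[i + 1] + offset
        else:
            last = total_pages  # the final entry runs to the end of the book
        ranges.append((first, min(last, total_pages)))
    return ranges
```

Called as `page_ranges([1, 5, 12], offset=10, total_pages=30)`, it returns `[(11, 15), (15, 22), (22, 30)]`; the shared boundary pages are the extra pages a reviewer would check.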
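Step 13 might look something like the sketch below. It reads "Chinese spaces" as spaces, tabs, or full-width spaces (U+3000) that sit between two CJK characters; that reading, the character ranges, and the name `strip_cjk_spaces` are assumptions, and DealContentFile.py may behave differently.

```python
import re

# Assumed definition of "Chinese": CJK Unified Ideographs plus CJK and
# full-width punctuation (U+3000 itself is treated as a space instead).
_CJK = r"\u4e00-\u9fff\u3001-\u303f\uff00-\uffef"

# Runs of spaces/tabs/full-width spaces with a CJK character on both sides.
_CJK_GAP = re.compile(r"(?<=[%s])[ \t\u3000]+(?=[%s])" % (_CJK, _CJK))


def strip_cjk_spaces(text):
    """Remove spaces between CJK characters; leave other spaces alone."""
    return _CJK_GAP.sub("", text)


print(strip_cjk_spaces("金 融 出 版 社 PDF split script"))
```

Line breaks are deliberately not matched, so the line structure of Check_Content_File.txt would be preserved.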

Repository files:

.gitignore
ComputeFileLines.py
CreateInsertContentSQL.py
CreateInsertLawBookInfoSql.py
DealContentFile.py
DeleteImg.py
ExtractMagazineCData.py
GenFile.py
GenMagazineFile.py
GetAllBookName.py
GetAllBookXML.py
GetAllImage.py
GetAllImageFromReferDatabase.py
GetAllMagazineImage.py
GetAllMuluXML.py
GetAllPDF.py
GetAllTag.py
GetAuthor.py
GetAuthorTitleRefer.py
GetBlankPDFDirSet.py
GetBookAuthorRefer.py
GetBookCatecoryStr.py
GetBookInfo.py
GetChapterContent.py
GetChapterPageOffset.py
GetChapterXML.py
GetContent.py
GetDiffImg.py
GetDiffItems.py
GetDiffMagazine.py
GetDiffMagazineItems.py
GetFirstLetter.py
GetImageInfo.py
GetImageRelateContent.py
GetImageType.py
GetJiaoShenFile.py
GetLawInfo.py
GetLevelName.py
GetLevelNameFor2and3.py
GetMagazine.py
GetMagazineCatalog.py
GetMagazineColumnMapping.py
GetMagazineContent.py
GetMagazineImageInfo.py
GetMagazineImageRelateContent.py
GetMagazineName.py
GetMagazinePDF.py
GetMagazinePDFFileName.py
GetMagazinePageOffset.py
GetMagazinePageOffsetGUID.py
GetMagazineTitle.py
GetMagazineXml.py
GetMuluInfo.py
GetMuluInfoFromChapterXML.py
GetMuluInfoFromChapterXMLAccordingToLevel.py
GetNoCoverBookName.py
GetOrgInfo.py
GetPageOffset.py
GetSinglePDFContent.py
GetSpecialInfoFromBookAndTitle.py
GetTitlesFromContent.py
GetTotalPages.py
GetZhuanTiBaikeDocs.py
GetZhuanTiBaikeHotWords.py
GetZhuanTiBaikeSpider.py
GetZhuanTiBookInfo.py
GetZhuanTiImages.py
GetZhuanTiLawContent.py
GetZhuanTiResources.py
GetZhuanTiTongJiColorImage.py
GetZhuanTiTongJiImagesInfo.py
GetZhuantiLaw.py
README.md
RenameCoverFile.py
RenameOrgName.py

Repository: dumin199101/JinRongExtractScript
