hpr3596 :: Extracting text, tables and images from docx files using Python
Summary: In this episode, I describe how I used two Python libraries to extract important data from docx files
Series: A Little Bit of Python
Source: [http://hackerpublicradio.org/eps.php?id=3596](http://hackerpublicradio.org/eps.php?id=3596)
Original audio: [http://archive.org/download/hpr3596/hpr3596_source.flac](http://archive.org/download/hpr3596/hpr3596_source.flac)
Tools to extract data from docx files:
docx2txt
python-docx2txt
python-docx
Code Snippets

    import docx2txt

    # src: path to the .docx file; img_dest: directory for the extracted images
    text = docx2txt.process(src, img_dest)
    with open("data.txt", "wt") as f:
        f.write(text)

    import csv
    import docx

    document = docx.Document(src)
    tables = document.tables

    # Collect each table as a list of rows, each row a list of cell texts
    data = []
    for table in tables:
        table_data = []
        for row in table.rows:
            row_data = []
            for cell in row.cells:
                row_data.append(cell.text)
            table_data.append(row_data)
        data.append(table_data)

    # Write each table to its own CSV file: 0.csv, 1.csv, ...
    for i, table_data in enumerate(data):
        with open(f"{i}.csv", "wt", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(table_data)

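The second argument to docx2txt.process in the first snippet is the directory where the images end up. As a rough illustration of what happens there: a docx file is a ZIP archive, and embedded images normally sit under word/media/ inside it. The sketch below uses only the standard library; the helper name extract_images is made up for this example and is not part of docx2txt or python-docx.

```python
import zipfile
from pathlib import Path


def extract_images(docx_path, dest_dir):
    """Copy every file under word/media/ in a .docx into dest_dir.

    Hypothetical helper for illustration only; not an API of
    docx2txt or python-docx.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Embedded media is stored under word/media/ in the package;
            # skip any directory entries.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = dest / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved
```

Calling extract_images(src, img_dest) returns the list of paths it wrote, which can be handy for checking that a document really contained the images you expected.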