How to skip the rows above a table in Python pandas

Bilbo Swaggins • 4 years ago • 331 views

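I'm using pandas read_html to read tables from several HTML files, and pandas' ExcelWriter to put them into an Excel file.

The problem I'm running into is that each file has 14 rows of junk data above the table that I want to remove. I found some threads suggesting skiprows, which does remove the data above the table but also removes the first 14 rows of the table itself.

  • Does anyone have suggestions for removing the rows above the table without losing any rows in the table?
  • Also, I used index_col=0 to get rid of the index on the rows, but I can't find the syntax to get rid of the index on the columns.

Any help or advice would be greatly appreciated.

Here is my read_html call: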

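import os
import pandas as pd

df_list = []
for i in os.listdir(dl):  # dl: directory holding the exported HTML files
    if "Export" in i:
        # skiprows=14 is applied to every table found in the file
        for df in pd.read_html(os.path.join(dl, i), skiprows=14, index_col=0):
            df_list.append(df)
dfs = pd.concat(df_list)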

This is the format of my files, with a few lines of junk data followed by the table:

===============================================

GPF Purchase Order Forecasts

Generation Date: 2018-08-30
Order Date: 2018-09-08
Delivery Date 0000-00-00

Vendor No.: ALL

Warehouse: ALL

===============================================

Warehouse  Item No.  Item Description  UPC No.  Pack Size  Forecast

XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX

The first 100 lines of the HTML file:

<!-- For export to excel style needs to be written on the page-->

<style type="text/css">

    .Header

    {

        font-weight: bold;

    }

    .HeadUnderline

    {

        font-weight: bold;

        text-decoration: underline;

    }

</style>

</head>

<body id="portal">

<form name="frmMain" method="post" action="Export.aspx?DcNbr=0&amp;VendorNbr=0&amp;OrdDate=2018-09-01&amp;GenDate=2018-08-30&amp;DivNbr=0&amp;DelDate=0000-00-00" id="frmMain">

<div>

<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTg0NDMyMzg5OGQYAQUJZ3ZSZXN1bHRzDzwrAAwBCAIBZC77FhJcYYUB/Yk3jdfFNSAWWS9MSP5BghZFEKqOFLXh" />

<!-- c1under - to use this page as a popup window without the header change the id from rlHeader 

    to rlStyle. The rlFooter literal could be removed if you do not want the footer on the popup window.

     -->

<div id="main-content-area" style="vertical-align: top;">

    <table width="100%" border="0" bordercolor="#FFCC00" cellpadding="0" cellspacing="0" align="center" style="vertical-align: top">

        <tr style="vertical-align: top" align="center">

            <td style="vertical-align: top; border: solid 2 black;" align="center" colspan="8">

                <span id="lblAppTitle" class="HeadUnderline">GPF Purchase Order Forecasts</span>

            </td>

        </tr>

        <tr>

            <td colspan="8">

                &nbsp;

            </td>

        </tr>

        <tr style="height: 27px">

            <td align='right' colspan="8">

                <span id="lblGenDate" class="Header">Generation Date:</span>&nbsp;

                <span id="lblGenDateValue">2018-08-30</span>

            </td>

        </tr>

        <tr>

            <td colspan="8">

                <span id="lblOrderDate" class="Header">Order Date:</span>&nbsp;

                <span id="lblOrderDateValue">2018-09-01</span>

            </td>

        </tr>

        <tr>

            <td colspan="8">

                <span id="lblDeliveryDate" class="Header">Delivery Date</span>&nbsp;

                <span id="lblDeliveryDateValue">0000-00-00</span>

            </td>

        </tr>

        <tr>

            <td colspan="8">

                &nbsp;

            </td>

        </tr>

        <tr style="height: 27px">

            <td align="right" colspan="7">

                <span id="lblVendorNumber" class="Header">Vendor No.:</span>&nbsp;

            </td>

            <td align="left">

                <span id="lblVendorNumberValue">ALL</span>

            </td>

        </tr>

        <tr>

            <td id="vendorAddress" align="right"></td>



            <td colspan="7">

            </td>

        </tr>

        <tr>

            <td colspan="8">

                &nbsp;

            </td>

        </tr>

        <tr style="height: 27px">

            <td align='right' colspan="7">

                <span id="lblWarehouse" class="Header">Warehouse:</span>&nbsp;

            </td>

            <td align="left">

                <span id="lblWarehouseValue">ALL</span>

            </td>

        </tr>

        <tr>

            <td id="depotAddress" align="left" colspan="8"></td>



        </tr>

        <tr>

            <td colspan="8">

                &nbsp;

            </td>

        </tr>

    </table>

    <table cellspacing="0" cellpadding="0" border="0">
Xukrao • 5 years ago

Try this:

import os
import pandas as pd

df_list = []
for i in os.listdir(dl):  # dl: directory holding the exported HTML files
    if "Export" in i:
        # Read out all html tables into list of dataframes
        data = pd.read_html(os.path.join(dl, i), index_col=0)

        # Drop first table containing junk data
        data = data[1:]

        # Merge with already existing list of dataframes
        df_list += data

dfs = pd.concat(df_list)
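
For anyone who wants to check this without the original export files, here is a minimal, self-contained sketch of the same idea. The inline HTML is a made-up stand-in for the real export (the column names and values are assumptions, not taken from the actual data): one layout table holding the report header, followed by one table holding the data. It also assumes an HTML parser that pandas can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for one exported file:
# a layout table with the report header, then the real data table.
SAMPLE_HTML = """
<table>
  <tr><td colspan="2">GPF Purchase Order Forecasts</td></tr>
  <tr><td>Warehouse:</td><td>ALL</td></tr>
</table>
<table>
  <tr><th>Warehouse</th><th>Item No.</th><th>Forecast</th></tr>
  <tr><td>W1</td><td>1001</td><td>10</td></tr>
  <tr><td>W2</td><td>1002</td><td>20</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element.
tables = pd.read_html(StringIO(SAMPLE_HTML))

# Drop the first table (the report header) and keep the rest.
data_tables = tables[1:]

# Combine the remaining tables into a single DataFrame.
dfs = pd.concat(data_tables, ignore_index=True)

print(len(tables))
print(dfs)
```

Because the junk sits in its own table, slicing the list removes it without touching any data rows, whereas skiprows is applied inside every table that read_html finds. As for the second question, if the unwanted labels are the ones written to Excel, DataFrame.to_excel accepts index=False and header=False; whether that is what was meant by the index on the columns is only a guess.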