ENH: read-html fixes #3616

cpcloud · 2013-05-15T22:17:07Z

Some updates and bug fixes. See release notes for more details.

~~vbench stuff~~ sort of pointless right now since we don't really have control over the speed of the parsing library
~~Figure out why lxml chooses to ignore things~~ reported a bug w/ example to lxml people
~~Figure out why bs4's thead.find_all(['th', 'td']) parses differently than lxml's thead.xpath('.//thead//th|.//thead//td') even when lxml is the bs4 backend.~~ same as above

jreback · 2013-05-16T11:28:08Z

let me know when you need merging on any PRs

jreback · 2013-05-19T17:24:07Z

this closes #3606, right?

cpcloud · 2013-05-19T17:27:11Z

yep, that is fixed already. i might be able to get to the rest of this today, i know the 0.11.1 rls is due today...the annoyances of the parsing may have to wait tho or i might just open up the flavor argument to allow one of 'lxml', 'bs4', 'html5lib' since html5lib is a bit more WYSIWYG than the others even if it's drastically slower.

jreback · 2013-05-19T17:29:45Z

the main issue is the import errors.....

cpcloud · 2013-05-19T17:30:06Z

that is also fixed in this.

cpcloud · 2013-05-19T17:30:49Z

u r talking about #3605/#3607 right?

jreback · 2013-05-19T17:31:51Z

yep...

jreback · 2013-05-19T17:32:01Z

take your time...btw
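
The import errors mentioned above usually come from optional dependencies that are missing on the test machine. A minimal sketch of one common way to guard such tests follows; it is not necessarily how #3605/#3607 were fixed. The helper `has_module` and the class `TestNeedsBs4` are hypothetical names introduced only for illustration.

```python
import unittest
from importlib.util import find_spec


def has_module(name):
    """Return True if the top-level module `name` can be found, without importing it."""
    return find_spec(name) is not None


# Skip the whole test class when bs4 is absent instead of failing at import time.
@unittest.skipUnless(has_module("bs4"), "bs4 is not installed")
class TestNeedsBs4(unittest.TestCase):
    def test_import(self):
        import bs4  # only imported when the guard above passed

        self.assertTrue(hasattr(bs4, "BeautifulSoup"))
```

With this shape, a machine without bs4 reports the tests as skipped rather than erroring during collection.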

cpcloud · 2013-05-19T17:33:00Z

ok thanks. i'm working on a cmdline interface to store neurophys data and it's due tmrw so pandas may have to wait...

cpcloud · 2013-05-20T00:39:09Z

@jreback @y-p what do u think about removing the pure lxml option (the options would then be lxml and html5lib, which correspond to the backend of bs4 to use)? i think i was a little gung-ho about lxml being fast, but in reality the best is bs4 + html5lib. Even though it's very slow, it gives correct output where lxml does not.

jreback · 2013-05-20T01:34:43Z

@cpcloud I would rather see correct and slow than wrong but fast!

let's see... premature optimization is evil

Can always add it back in 0.12 (or after) if you discover how to fix it. And you have the flavor option, so sort of 'easy' to add it. (course have to edit stuff to take it out...docs,install docs,docstrings...)

of course if there are cases where lxml can do better (and is correct), but bombs on other cases, then you could always raise on those (but that may be more trouble than it's worth)
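
The "raise on the cases it can't handle" idea can be sketched as a small fallback loop: try a fast parser that raises on input it does not trust, then fall back to a slower, more lenient one. This is only an illustration of the pattern, not the code in html.py; `parse_with_fallback`, `strict` and `lenient` are hypothetical stand-ins, not real lxml or html5lib calls.

```python
def parse_with_fallback(text, parsers):
    """Try each (name, parse) pair in order; return (name, result) for the first success."""
    errors = []
    for name, parse in parsers:
        try:
            return name, parse(text)
        except ValueError as exc:
            # Remember why this parser refused, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise ValueError("no parser succeeded: " + "; ".join(errors))


def strict(text):
    """Stand-in for a fast parser that refuses unclosed tables."""
    if "</table>" not in text:
        raise ValueError("unclosed <table>")
    return text


def lenient(text):
    """Stand-in for a slow parser that accepts anything."""
    return text.strip()


parsers = [("strict", strict), ("lenient", lenient)]
print(parse_with_fallback("<table></table>", parsers)[0])
print(parse_with_fallback("<table>", parsers)[0])
```

The first call is handled by the strict stand-in; the second falls through to the lenient one because the strict stand-in raises.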

cpcloud · 2013-05-20T02:17:08Z

i think the xpath implementation of lxml might be broken... :(

@jreback can i leave the code for lxml/bs4 + lxml in html.py? i would just make only html5lib available until i either figure out the issue or decide that it's not worth it...

jreback · 2013-05-20T02:56:35Z

ok

cpcloud · 2013-05-20T03:44:19Z

@jreback this is ready 2 go as soon as travis passes.

cpcloud · 2013-05-20T04:04:52Z

that is odd. travis is not running arg

cpcloud · 2013-05-20T04:12:20Z

ah there we go

jreback · 2013-05-20T11:45:17Z

@cpcloud thanks...this is great...

I edited the v0.11.1 a bit (as this is new, just announcing it). I think an example is warranted. Maybe take a df, df.to_html, then read it in? just to give an example of how to do it (i do this with read_csv a little lower in the file)

do a separate PR
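
A rough sketch of the round trip suggested above: write a small frame with `DataFrame.to_html`, then read it back with `read_html`. It assumes a current pandas plus either bs4 with html5lib, or lxml; the helpers `pick_flavor` and `roundtrip` are hypothetical names used only here, and the function returns None when the optional pieces are missing rather than raising.

```python
from importlib.util import find_spec
from io import StringIO


def pick_flavor():
    """Prefer bs4 + html5lib (slower, more forgiving), then lxml; None if neither is present."""
    def has(name):
        return find_spec(name) is not None

    if has("bs4") and has("html5lib"):
        return "bs4"
    if has("lxml"):
        return "lxml"
    return None


def roundtrip():
    """Write a small frame to HTML and read it back; None if dependencies are missing."""
    flavor = pick_flavor()
    if find_spec("pandas") is None or flavor is None:
        return None
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})
    html = df.to_html()
    # read_html returns a list of frames, one per <table> found in the input.
    tables = pd.read_html(StringIO(html), flavor=flavor, index_col=0)
    return tables[0]


result = roundtrip()
print(result)
```

Passing `index_col=0` turns the index column written by `to_html` back into the index, so the columns that come back are just `a` and `b`.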

jreback · 2013-05-20T13:01:08Z

see this:

https://www.travis-ci.org/pydata/pandas/jobs/7320947

I don't think travis was actually testing html5lib stuff....(I just added it in)
it's taken out right now

add in ci/install.sh (right after bs4)....and test

jreback · 2013-05-20T13:23:31Z

going to put in a separate issue

cpcloud · 2013-05-20T13:24:34Z

ok.

This was referenced May 15, 2013

BUG/TST: fix failing html tests #3607

Closed

Importing bs4 in tests #3605

Closed

read_html: fails to parse column #3606

Closed

ENH: read-html fixes

a8723a4

jreback merged commit a8723a4 into pandas-dev:master May 20, 2013
