Skip to content

ENH: read-html fixes #3616

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 20, 2013
Merged

ENH: read-html fixes #3616

merged 1 commit into from
May 20, 2013

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented May 15, 2013

Some updates and bug fixes. See release notes for more details.

  • vbench stuff sort of pointless right now since we don't really have control over the speed of the parsing library
  • Figure out why lxml chooses to ignore things reported a bug w/ example to lxml people
  • Figure out why bs4's thead.find_all(['th', 'td']) parses differently than lxml's thead.xpath('.//thead//th|.//thead//td') even when lxml is the bs4 backend. same as above

@jreback
Copy link
Contributor

jreback commented May 16, 2013

let me know when you need merging on any PR's

@jreback
Copy link
Contributor

jreback commented May 19, 2013

this closes #3606, right?

@cpcloud
Copy link
Member Author

cpcloud commented May 19, 2013

yep, that is fixed already. i might be able to get to the rest of this today, i know the 0.11.1 rls is due today...the annoyances of the parsing may have to wait tho or i might just open up the flavor argument to allow one of 'lxml', 'bs4', 'html5lib' since html5lib is a bit more WYSIWIG than the others even if it's drastically slower.

@jreback
Copy link
Contributor

jreback commented May 19, 2013

the main issue is the import errors.....

@cpcloud
Copy link
Member Author

cpcloud commented May 19, 2013

that is also fixed in this.

@cpcloud
Copy link
Member Author

cpcloud commented May 19, 2013

u r talking about #3605/#3607 right?

@jreback
Copy link
Contributor

jreback commented May 19, 2013

yep...

@jreback
Copy link
Contributor

jreback commented May 19, 2013

take your time...btw

@cpcloud
Copy link
Member Author

cpcloud commented May 19, 2013

ok thanks. i'm working on a cmdline interface to store neurophys data and it's due tmrw so pandas may have to wait...

@cpcloud
Copy link
Member Author

cpcloud commented May 20, 2013

@jreback @y-p what do u think about removing the pure lxml option (then options would then be lxml and html5lib which correspond to the backend of bs4 to use)? i think i was a little gung-ho about lxml being fast, but in reality the best is bs4 + html5lib. Even though it's very slow, it gives correct output where lxml does not.

@jreback
Copy link
Contributor

jreback commented May 20, 2013

@cpcloud I would rather see correct and slow then wrong but fast!

let's see premature optimization is evil

Can always add it back in 0.12 (or after) if you discover how to fix it. And you have the flavor option, so sort of 'easy' to add it. (course have to edit stuff to take it out...docs,install docs,docstrings...)

of course if there are cases where lxml can do better (and is correct), but bombs on other cases, then you could always raise on those (but that may be more trouble than its worth)

@cpcloud
Copy link
Member Author

cpcloud commented May 20, 2013

i think the xpath implementation of lxml might be broken... :(

@jreback can i leave the code for lxml/bs4 + lxml in html.py? i would just make only html5lib available until i either figure out the issue or decide that it's not worth it...

@jreback
Copy link
Contributor

jreback commented May 20, 2013

ok

@cpcloud
Copy link
Member Author

cpcloud commented May 20, 2013

@jreback this ready 2 go as soon as travis passes.

@cpcloud
Copy link
Member Author

cpcloud commented May 20, 2013

that is odd. travis is not running arg

@cpcloud
Copy link
Member Author

cpcloud commented May 20, 2013

ah there we go

@jreback jreback merged commit a8723a4 into pandas-dev:master May 20, 2013
@jreback
Copy link
Contributor

jreback commented May 20, 2013

@cpcloud thanks...this is great...

I edited the v0.11.1 a bit (as this is new, just announcing it). I think an example is warranted. Maybe take a df, df.to_html, them read it in ? just to give an example of how to do it (i do this with read_csv a little lower in the file)

do a separate PR

@jreback
Copy link
Contributor

jreback commented May 20, 2013

see this:

https://www.travis-ci.org/pydata/pandas/jobs/7320947

I don't think travis was actually testing html5lib stuff....(I just added it in)
its taken out right now

add in ci/install.sh (right after bs4)....and test

@jreback
Copy link
Contributor

jreback commented May 20, 2013

going to put in a separate issue

@cpcloud
Copy link
Member Author

cpcloud commented May 20, 2013

ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants