Saturday, 10 August 2013

HTML Cleaner + XPath Not Working in Android App

I'm building a simple news reader app and I am using HTMLCleaner to
retrieve and parse the data. I've successfully gotten the data I need using
the command-line version of HTMLCleaner and using xmllint, for example:
java -jar htmlcleaner-2.6.jar src=http://www.reuters.com/home
nodebyxpath=//div[@id=\"topStory\"]
and
curl www.reuters.com | xmllint --html --xpath //div[@id='"topStory"'] -
Both return the data I want. But when I make the same request using
HTMLCleaner in my code, I get no results. Even more troubling, even
basic queries like //div return only 8 nodes in my app, while the command
line reports 70+, which is correct.
Here is the code I have now. It is in an Android class extending AsyncTask,
so it's performed in the background. The final code will actually get the
text data I need, but I'm having trouble just getting it to return a
result. When I log the title nodes, the node count is zero.
I've tried every manner of escaping the XPath query strings, but it makes
no difference. The HTMLCleaner code is in a separate source folder in my
project and is (at least I think) compiled to Dalvik with the rest of my
app, so an incompatible jar file shouldn't be the problem.
I've tried to dump the HTMLCleaner output, but it doesn't work well with
LogCat, and a lot of the page markup is missing when I dump it. That made me
think that HTMLCleaner was parsing incorrectly and discarding most of the
page, but how can that be the case when the command-line version works fine?
Also, the app does not crash and I'm not logging any exceptions.
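Since the LogCat dump is missing markup, one thing that may be worth ruling out is LogCat itself cutting off long messages (a single entry is capped at roughly 4 KB). Below is a minimal sketch, assuming the serialized markup is already available as a String. `LogChunker` and `chunk` are hypothetical names, not part of HTMLCleaner or the Android SDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of HtmlCleaner or Android.
final class LogChunker {
    private LogChunker() {}

    // Splits text into consecutive pieces of at most maxLen characters,
    // preserving order, so each piece can be logged as its own entry.
    static List<String> chunk(String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxLen) {
            int end = Math.min(text.length(), start + maxLen);
            parts.add(text.substring(start, end));
        }
        return parts;
    }
}
```

In the app, each piece would be passed to `Log.d("LoadMain", piece)` in a loop. If the pieces together still show only a fraction of the page, the truncation is not coming from LogCat.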
protected Void doInBackground(URL... argv) {
    final HtmlCleaner cleaner = new HtmlCleaner();
    TagNode lNode = null;
    try {
        lNode = cleaner.clean(argv[0].openConnection().getInputStream());
        Log.d("LoadMain", argv[0].toString());
    } catch (IOException e) {
        // getMessage() can be null, so log the exception itself.
        Log.d("LoadMain", "Failed to fetch page", e);
        return null; // nothing to query without a parsed document
    }
    final String lTitle = "//div[@id=\"topStory\"]";
    // final String lBlurp = "//div[@id=\"topStory\"]//p";
    try {
        Object[] x = lNode.evaluateXPath(lTitle);
        // Object[] y = lNode.evaluateXPath(lBlurp);
        Log.d("LoadMain", "Title Nodes: " + x.length);
        // Log.d("LoadMain", "Title Nodes: " + y.length);
        // this.mBlurbs.add(new BlurbView(this.mContext,
        //     x.getText().toString(), y.getText().toString()));
    } catch (XPatherException e) {
        Log.d("LoadMain", "XPath evaluation failed", e);
    }
    return null;
}
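One difference between the two environments that could be checked is the request itself. The command-line tools and the Android runtime send different default headers, and a server may return different markup depending on them. This is only a guess and nothing above confirms it. Below is a minimal sketch of setting an explicit User-Agent before reading the stream. `RequestHeaders`, `openWithUserAgent`, and the User-Agent string are hypothetical and chosen for illustration.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper for illustration only.
final class RequestHeaders {
    // Placeholder value; any desktop-style string could be tried here.
    static final String DESKTOP_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)";

    private RequestHeaders() {}

    // Creates the connection object and sets the header. No network
    // traffic happens until the caller connects or reads the stream.
    static URLConnection openWithUserAgent(URL url, String userAgent)
            throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }
}
```

With this in place, the stream handed to `cleaner.clean(...)` would come from `RequestHeaders.openWithUserAgent(argv[0], RequestHeaders.DESKTOP_USER_AGENT).getInputStream()`. If the node counts then match the command line, the difference was in the served page, not in the parser.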
Any help is greatly appreciated. Thank you.
