Can classic ASP use mshtml (or another component) to parse HTML and analyze the DOM tree?

Using ASP I can fetch a webpage's source code. I would like to analyze that source in ASP and locate the body text of the page.

A common way to implement this is to analyze the text density of the page to find the body region.
In Java there are open-source parsers such as HtmlParser that build a DOM tree so the nodes can be analyzed.

How can ASP find the body text of a webpage?
Can ASP use mshtml or some other component to parse HTML and analyze the DOM tree?

Other methods or ideas are also welcome; please don't hold back your criticism. Thank you.

Started by Carr at February 07, 2016 - 1:06 AM

Isn't HTML already a DOM tree?

Posted by Basil at February 14, 2016 - 1:53 AM

Is valid HTML well-formed XML? At least XHTML is.

Posted by Basil at February 15, 2016 - 2:11 AM

There are third-party COM components for this, but they are not very capable.

Posted by Frances at February 21, 2016 - 2:40 AM

HTML is a long way from the XML specification, so XMLDOM cannot parse it.


"Isn't HTML already a DOM tree?"
Yes, it is already a tree; I just don't know how to use ASP to pick the fruit off that tree. I probably need an HTML parser, but I don't know whether ASP itself has one, or whether there is a Microsoft or third-party component. Recommendations welcome.

Posted by Carr at March 04, 2016 - 3:33 AM

I don't know how capable they are; if you know of one, please recommend it. Thanks.

Posted by Carr at March 17, 2016 - 4:03 AM

Have you tried MSXML?

Posted by Basil at March 18, 2016 - 4:49 AM

Fetch the page, then use regular expressions to extract the text.

Posted by Abelard at March 23, 2016 - 5:35 AM

Different sites use different templates, so the regular expressions would differ.

I hope to find a method that can collect a Sina page and also capture Sohu, without changing the program or the regex; it should identify the body text automatically.

Posted by Carr at March 27, 2016 - 6:07 AM

Set xml = Server.CreateObject("Microsoft.XMLDOM")
xml.async = False                       ' a Boolean, not the string "false"
xml.loadXML(Http.responseText)          ' only succeeds if the source is well-formed XML/XHTML; plain HTML will fail to parse


I don't know where to go from here.

Posted by Carr at April 05, 2016 - 6:28 AM

I can only tell you that this idea is not very realistic. Let me ask you: even if you can parse the page into a tree, how do you tell the parts apart? Page contents all differ; how do you know which part is the body? Anything can be placed inside HTML tags; it is not necessarily body text.

Posted by Tony at April 12, 2016 - 6:36 AM

Your requirement is too ambitious; it is not realistic.

Posted by Abelard at April 14, 2016 - 7:31 AM

Thanks to everyone for the replies.

I think it is feasible to some degree. There are at least three preliminary studies showing the idea can be realized. In fact the algorithms are fairly mature; for example, search-engine crawlers and some hobbyist projects can complete the data cleaning automatically without any collection rules being configured.

The common approach is to analyze the DOM tree with text-density and link-density algorithms. A node whose text is dense, say with a text length above a threshold of 100 characters, is given a weight according to its text density and link density. Sibling weights are summed into the parent node's weight, and the parent with the highest weight is most likely the body area. Once that region is roughly determined, you examine the siblings under the same parent to lock onto the body. Combined with some boundary thresholds, SCRIPT blocks and advertisements can be removed. Reportedly, combining this with a neural-network algorithm lets the program tune those boundary thresholds automatically, so that even the comments below the article are filtered out. If two or more regions exceed the weight threshold, all of them are treated as body text.
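
The weighting idea described above can be sketched in a few lines. This is an illustrative Python sketch, not ASP: it assumes well-formed markup so the standard-library XML parser can build the tree, and the scoring formula and `min_chars` threshold are simplified stand-ins for the thresholds mentioned in the post.

```python
# Score nodes by text density vs. link density; the node with the
# highest score is the body-text candidate. Sketch only: assumes
# well-formed markup, which real HTML pages usually are not.
import xml.etree.ElementTree as ET

def text_len(node):
    # total length of all stripped text chunks under this node
    return sum(len(t.strip()) for t in node.itertext())

def link_text_len(node):
    # how much of that text sits inside <a> elements
    return sum(text_len(a) for a in node.iter("a"))

def body_candidate(root, min_chars=100):
    best, best_score = None, -1.0
    for node in root.iter():
        total = text_len(node)
        if total < min_chars:       # text-length threshold from the post
            continue
        linked = link_text_len(node)
        score = total - 2 * linked  # penalize link-dense (navigation) nodes
        if score > best_score:
            best, best_score = node, score
    return best

html = """<html><body>
  <div id="nav"><a href="#">Home</a><a href="#">News</a></div>
  <div id="story"><p>Long article paragraph with well over one hundred
  characters of plain running text so it clears the density threshold
  used in this example.</p></div>
</body></html>"""

best = body_candidate(ET.fromstring(html))
print(best.get("id"))  # → story
```

The link penalty is what separates the article container from navigation blocks: both may hold text, but the navigation text is almost entirely inside anchors.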


Many "thief" scraper programs already work this way, with no collection rules to configure.
From a single page they can grab the body text. The accuracy is not 100%, but unless the body has an unusually special structure it rarely escapes detection. My hope is that during the collection phase the program analyzes several pages, infers the site's template, and generates collection rules automatically; then, when the generic algorithm cannot cope, it falls back to the site's rules to filter against the template, striving to push accuracy from 99.9999% toward 100%.


Http.responseText already gives me the HTML source; I just don't know how to analyze it as a tree in ASP. Advice appreciated.

Posted by Carr at April 22, 2016 - 8:00 AM

Fetching the page and using regex is the right way; regex is much more efficient than parsing with an HTML DOM module.

Posted by Ellis at May 05, 2016 - 8:40 AM

Search-engine crawlers, or hobbyist projects:
they extract the text content of the whole page, not the body text you are describing. If all you want is the page's text content, then your idea is realistic.

Posted by Abelard at May 12, 2016 - 9:33 AM

I don't see how a regular expression could find the body text within a content page.
If regex can do it, that would of course be best.

Posted by Carr at May 25, 2016 - 10:33 AM

Getting the whole text content should be relatively simple: use regex to delete everything from <a to </a>, then strip whatever remains between < and >. But that is not what I want.
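
The crude approach just described, delete anchors and then strip the remaining tags, looks roughly like this in Python (illustrative only; as the post says, it yields the whole page's text, not the body text):

```python
# Regex tag stripping: remove scripts and anchors wholesale, then
# strip every remaining tag and collapse whitespace. Crude by design.
import re

def strip_to_text(html):
    html = re.sub(r"(?is)<script.*?</script>", " ", html)  # scripts carry no readable text
    html = re.sub(r"(?is)<a\b.*?</a>", " ", html)          # drop link text entirely
    html = re.sub(r"(?s)<[^>]+>", " ", html)               # strip all other tags
    return re.sub(r"\s+", " ", html).strip()

print(strip_to_text('<p>Hello <a href="#">link</a> world</p>'))
# → Hello world
```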

Posted by Carr at May 31, 2016 - 10:47 AM

Regex can find it; the problem is only writing the regex, and it can certainly be done with regex.
You can also locate the content by the tag's id or class.
Tools that grab news or novels are usually configured with regexes this way: one site, one regex.
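
A one-regex-per-site rule keyed to a known container id might look like this (Python for illustration; `story` is a hypothetical id, and the lazy `.*?` breaks on nested divs, which is exactly why per-site rules are fragile):

```python
# Per-site extraction rule: match a <div> with a known id and capture
# its inner HTML. Note: the lazy .*? stops at the FIRST </div>, so a
# container with nested divs needs a smarter rule.
import re

def extract_by_id(html, container_id):
    pattern = r'(?is)<div[^>]*\bid="%s"[^>]*>(.*?)</div>' % re.escape(container_id)
    m = re.search(pattern, html)
    return m.group(1).strip() if m else None

html = '<div id="story"><p>Body text</p></div><div id="ads">x</div>'
print(extract_by_id(html, "story"))
# → <p>Body text</p>
```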

Posted by Ellis at June 12, 2016 - 11:10 AM

These id and class attributes differ from site to site, so they may be hard to use generically. I have been running this collection system for over a dozen years, optimizing it continuously, and now I want to achieve collection without per-site rules. I have no idea how to do that with regex. Regex could probably do a line-by-line analysis, but I don't know how to extract node information that way. If nothing exists, I would have to write an HTML parsing routine in ASP according to the W3C spec, and I expect that would not be easy.

Posted by Carr at June 20, 2016 - 11:20 AM

I don't understand what you mean by "collection without rules". Even if you analyze the node structure of the whole document, you still need rules to locate things,
and as you said yourself, "ids and classes differ from site to site".

Posted by Ellis at November 10, 2016 - 5:53 PM

In post #12 above, in my reply to the moderator, I already explained the text-density algorithm; there is also a simpler algorithm you can look at.
So far nobody on CSDN seems to have understood what I am describing, so I won't elaborate further. I only want to solve the following problem:

I don't know how to analyze a DOM tree from ASP; is there any component that can do it? Please help.

Posted by Carr at November 18, 2016 - 6:15 PM

You could use the IE component, InternetExplorer.Application.
Have a look if you haven't; it can also analyze the DOM structure.

Posted by Ellis at December 01, 2016 - 6:25 PM

You could download one of those "thief" scraper programs and look at its collection rules; then you'll see what I mean.

As for the crawlers you mention: each has its own set of algorithms, and a site has to fit that algorithm. Many websites, in order to be indexed and ranked higher by search engines, deliberately lay out their body text according to those rules, which is why that kind of body-text extraction works so well for them. A site that doesn't follow the rules may simply not get indexed, or gets a low weight; you can see this from how the same site is indexed and weighted differently by different search engines. But what you want to do is collect arbitrary pages from others. If the body text of a site you fetch doesn't conform to the rules you've set, do you collect it or not? If you do, what you collect may not be the body text at all.

Posted by Tony at December 15, 2016 - 7:14 PM

Of course, search engines more or less share some similar or identical algorithms; you could say there is some commonality across the web.

Posted by Tony at December 26, 2016 - 7:43 PM

Yes, it can only rely on common characteristics. Whatever the algorithm, accuracy should be improvable, so I plan to try this: during the collection phase, analyze the one or more templates a site uses, and fall back to template matching whenever the program gets confused.
For now I am only running experiments; it may well not succeed, since the problem is not that simple.


What I mainly want to solve now is this:
Http.responseText already gives me the HTML source; I don't know how to analyze it as a tree in ASP. Is there a ready-made or third-party component? Advice appreciated.

Posted by Carr at December 30, 2016 - 8:33 PM

I still don't know how to analyze it.

Posted by Carr at January 01, 2017 - 9:26 PM

If you want to locate tags by density, you can split the HTML document into a one-dimensional array by line,
use regex to record the tag positions, and then analyze density around those positions:
<\w+[^>]*>   start tag
</\w+\s*>    end tag
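
The tag-position idea can be sketched like this (Python for illustration): scan the HTML once with a tag regex, record where each tag sits, and measure the plain-text run between consecutive tags as the density signal.

```python
# Record (offset, length) of every plain-text run between tags.
# Long runs mark dense text regions; short runs mark markup-heavy ones.
import re

TAG = re.compile(r"</?\w+[^>]*>")  # matches both start and end tags

def text_runs(html):
    runs, last = [], 0
    for m in TAG.finditer(html):
        gap = html[last:m.start()].strip()
        if gap:
            runs.append((last, len(gap)))  # (offset in source, text length)
        last = m.end()
    return runs

html = "<div><p>short</p><p>a much longer run of body text here</p></div>"
for offset, length in text_runs(html):
    print(offset, length)
# → 8 5, then 20 35
```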

Posted by Ellis at January 04, 2017 - 10:35 PM

Regex can handle some simple analysis. But HTML is not predictable: tags are not fixed, in <div XXXXX style=... the attributes are uncertain, > may be nested, and paired versus unpaired tags need handling. It feels very hard, while analyzing a DOM might be easier. So again: is there a component for analyzing a DOM tree from ASP? Help, please.

Posted by Carr at January 05, 2017 - 12:28 AM

I still recommend regex.
Regex can help analyze nesting, but the JS/VBS regex flavors handle nested pairing poorly, so you may have to do the pairing yourself:
first capture each tag's name and position, then loop over them again to match start and end tags.

As for components, you can use InternetExplorer.Application, though note it will also load img, CSS, JS, and iframe resources.

Posted by Ellis at January 07, 2017 - 12:37 AM

You could post it as a project on CSDN's outsourcing board, if you don't want to do it yourself.

Posted by Hamiltion at January 08, 2017 - 11:32 PM

Thank you for the reply! Searching Baidu for internetexplorer.application really turned up nothing, and I have little experience parsing HTML with VBS, so this still feels fairly troublesome.

You are one of the few people who understands what I am trying to do; thank you very much. I hope other experts can help too.

Posted by Carr at January 10, 2017 - 10:11 PM