Inner text of node? #101

no-realm · 2017-04-15T01:23:15Z

Hi,
I am trying to get the inner text of a node.

<a href="http://example-com">Link Name</a>

I tried different means to get the 'Link Name' part, but I always get NULL back.

myhtml_node_text(); // Returns NULL
myhtml_node_string(); // Returns an object with length == 0
myhtml_token_node_text(); // Returns NULL
myhtml_token_node_string(); // Returns an object with length == 0

no-realm · 2017-04-15T01:35:11Z

Ah, never mind.
I had to first get the child node and then get the text with myhtml_node_text().
I am basing my program on some C# code, which is why I thought that the node with the tag contained the link name.

But myhtml works a bit differently, I guess 😄
A C++ wrapper would be nice... just saying.

lexborisov · 2017-04-15T07:39:05Z

@Randshot
Yea,

<a href="http://example-com">Link Name</a>

creates this tree:

<a href="http://example-com">
    -text: Link Name

To get the text from the <a> node, use myhtml_node_child and myhtml_node_text,
or use a collection:

myhtml_collection_t *nodes = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);
/* the second argument is an optional size_t* that receives the text length */
myhtml_node_text( myhtml_node_child(nodes->list[0]), NULL );

or use the serialization functions, the equivalent of innerText in JS:

myhtml_serialization_tree_callback(a_node->child, callback, NULL);
// or buffer
mycore_string_raw_t str = {0};
myhtml_serialization_tree_buffer(a_node->child, &str);

see example

or get all the text nodes at once

myhtml_collection_t *nodes = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG__TEXT, NULL);
myhtml_node_text( nodes->list[0], NULL );
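
The tree shape described above (an element node whose text lives in a separate child text node) can be sketched with a toy model. The toy_node type and the toy_* functions below are hypothetical stand-ins written for illustration, not myhtml's real structures or API; they only mirror the roles of myhtml_node_child and myhtml_node_text.

```c
#include <assert.h>
#include <stddef.h>

/* Toy node type for illustration only; NOT myhtml's real node struct. */
typedef struct toy_node {
    const char      *tag;    /* e.g. "a" or "-text" */
    const char      *text;   /* non-NULL only for text nodes */
    struct toy_node *child;  /* first child, or NULL */
    struct toy_node *next;   /* next sibling, or NULL */
} toy_node;

/* Plays the role of myhtml_node_text(): the node's own text, if it has any. */
const char *toy_node_text(const toy_node *node)
{
    return node ? node->text : NULL;
}

/* Plays the role of myhtml_node_child(): the node's first child. */
toy_node *toy_node_child(const toy_node *node)
{
    return node ? node->child : NULL;
}

/* Text of an element's first child; NULL if there is no child or no text. */
const char *toy_first_child_text(const toy_node *element)
{
    assert(element != NULL);
    return toy_node_text(toy_node_child(element));
}
```

In this model, asking the <a> element itself for text yields NULL, while asking its first child yields "Link Name", which matches the behaviour reported at the top of the thread.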

Use Modest to search for nodes by CSS selectors (see the example); it's much easier than walking the tree by hand.

P.S.: Yes, a C++ wrapper is needed. Who would write it?!

no-realm · 2017-04-15T18:38:49Z

I have started working on one.
My C++ skills aren't the best but it should be sufficient in most cases.
For more intensive usage, the C API should be used.

lexborisov · 2017-04-15T18:48:55Z

Thanks!
When it's done, could you send me a link to your wrapper?

no-realm · 2017-04-15T20:00:49Z

@lexborisov
Yeah sure.
I plan to implement it as a single header wrapper which has various classes for myhtml.
I am still unsure about some design aspects though.

For example, I have a Node class which contains a protected pointer to the myhtml node struct and various methods for reading and modifying the node.
Should I read all node properties when the Node object is initialized, or only get each property on demand by using the provided functions (e.g. myhtml_node_text)?

lexborisov · 2017-04-15T20:09:43Z

@Randshot
You do not need to store the data in the class. It may become stale, which can cause confusion later.
I think it should look like this, for example:

node->next();

/* class node... */
next() {
    return node->next; /* read from the C structure, or call myhtml_node_next(node) */
}
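
The same idea (keep only the pointer and read through it on every call) can be sketched in plain C. The raw_node and node_handle types below are hypothetical and invented for this illustration; they are not part of myhtml or of any existing wrapper.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the C library's node struct (hypothetical). */
typedef struct raw_node {
    struct raw_node *next;
} raw_node;

/* Thin handle: stores only the pointer, never a copy of the node's fields. */
typedef struct {
    raw_node *raw;
} node_handle;

node_handle node_handle_wrap(raw_node *raw)
{
    node_handle h = { raw };
    return h;
}

/* Reads the sibling from the underlying struct at call time,
 * so the result always reflects the current state of the tree. */
node_handle node_handle_next(node_handle h)
{
    assert(h.raw != NULL);
    return node_handle_wrap(h.raw->next);
}
```

Because nothing is cached, a handle created before the tree is modified still reports the new sibling afterwards; a wrapper that copied the next pointer at construction time would keep returning the old one.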

hbakhtiyor · 2017-05-03T03:57:20Z

@Randshot any updates on your wrapper?

no-realm · 2017-05-03T09:50:05Z

@hbakhtiyor I haven't had any time for it lately. I will update you when I have some progress.

fariouche · 2018-01-12T09:34:35Z

Hi,
I have a similar issue: I cannot extract text from a <script> tag.
The page I'm testing is google.com.
I'm calling get_child_node() on the <script> node, and it returns NULL (it works fine with a <title> node).
Did I miss something?

lexborisov · 2018-01-12T09:46:01Z

Hi,
Can you show me the HTML page (the HTML code)?

fariouche · 2018-01-12T10:27:09Z

dump.log
This is the Google page I got, exactly what I passed to myhtml_parse:
myhtml_parse(pCtx->tree, MyENCODING_UTF_8, (char*)html_buffer, html_buffer_size);
No error returned.
Thanks

lexborisov · 2018-01-12T10:46:59Z

Works fine for me.
Code:

    myhtml_parse(tree, MyENCODING_UTF_8, res.html, res.size);
    myhtml_collection_t *collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_SCRIPT, NULL);
    
    for (size_t i = 0; i < collection->length; i++) {
        mycore_string_raw_t str = {0};
        if(collection->list[i]->child == NULL) {
            printf("Oh, God! This not work, I can't believe this is not working\n");
            exit(1);
        }
        
        myhtml_serialization_tree_buffer(collection->list[i]->child, &str);
        
        printf("%s\n", str.data);
        
        mycore_string_raw_destroy(&str, false);
    }

lexborisov · 2018-01-12T10:50:50Z

Also, there is no get_child_node() function; the function is myhtml_node_child().

fariouche · 2018-01-12T11:22:19Z

Thanks...

Yes, myhtml_node_child(), not get_child_node() (typo).
Strange... I'm not using a collection. And tokenizer_colorize_high_level() seems to work.
I just do the following:
myhtml_parse()
node = myhtml_node_child()
Verify that the tag is TAG_HTML.
node = myhtml_node_child(node)
Verify that the tag is TAG_HEAD.
node = myhtml_node_child(node)
while(node)
    parse_node(node)
    node = myhtml_node_next(node)

At some point, my parse_node() function handles a TAG_SCRIPT node, and this is where myhtml_node_child(node) returns NULL.

fariouche · 2018-01-12T11:25:32Z

This may be linked to
myhtml_tree_parse_flags_set(tree,
MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN|
MyHTML_TREE_PARSE_FLAGS_WITHOUT_DOCTYPE_IN_TREE);

I just tried the parse_without_whitespace example, and I see that <script> is empty.

fariouche · 2018-01-12T11:30:28Z

I confirm that this is because of MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN.

Is a script considered whitespace?

lexborisov · 2018-01-12T11:32:21Z

I think there's a bug with MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN flag
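
For reference, here is a minimal sketch of what "whitespace-only" would mean for a text token, assuming the flag is intended to drop only text that consists entirely of HTML whitespace. That intent is an assumption based on the flag's name, not something checked against the myhtml source, and the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* HTML whitespace characters: tab, line feed, form feed, carriage return, space. */
bool is_html_whitespace(char c)
{
    return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

/* True if every byte in data[0..length) is HTML whitespace.
 * An empty buffer counts as whitespace-only. */
bool is_whitespace_only(const char *data, size_t length)
{
    assert(data != NULL || length == 0);
    for (size_t i = 0; i < length; i++) {
        if (!is_html_whitespace(data[i]))
            return false;
    }
    return true;
}
```

Under that assumption, the body of a non-empty <script> is not whitespace-only, so losing it when the flag is set looks like a bug rather than intended behaviour.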

donglu · 2018-04-04T16:21:27Z

myhtml_collection_t *text = myhtml_get_nodes_by_tag_id_in_scope(tree, NULL, classname_list->list[i]->child, MyHTML_TAG__TEXT, NULL);

const char *title = myhtml_node_text(text->list[0], NULL);
printf("%s\n", title);

Azq2 · 2018-05-23T21:00:59Z

If you want a "true" analog of innerText (which is not the same as textContent), I have an example written according to the spec: https://github.com/Azq2/perl-html5-dom/blob/f57c11343a3c8ab77a5162083791560de7d6746b/DOM.xs#L282

If you want the simpler textContent: https://github.com/Azq2/perl-html5-dom/blob/f57c11343a3c8ab77a5162083791560de7d6746b/DOM.xs#L252
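
As a rough illustration of the simpler textContent-style behaviour (concatenating the text of all descendants in document order), here is a sketch over a toy tree. The txt_node type and both functions are hypothetical; this is not the code from the links above and it does not use myhtml's real node struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy node for illustration only; NOT myhtml's real node struct. */
typedef struct txt_node {
    const char      *text;   /* non-NULL only for text nodes */
    struct txt_node *child;  /* first child, or NULL */
    struct txt_node *next;   /* next sibling, or NULL */
} txt_node;

/* Appends the text of node and of all its descendants to out, in document order.
 * Requires *used <= cap - 1; truncates instead of overflowing. */
void txt_collect(const txt_node *node, char *out, size_t cap, size_t *used)
{
    if (node == NULL)
        return;
    if (node->text != NULL) {
        size_t len  = strlen(node->text);
        size_t room = cap - 1 - *used;  /* bytes left before the terminator */
        if (len > room)
            len = room;
        memcpy(out + *used, node->text, len);
        *used += len;
        out[*used] = '\0';
    }
    for (const txt_node *c = node->child; c != NULL; c = c->next)
        txt_collect(c, out, cap, used);
}

/* Writes the concatenated text under node into out (always NUL-terminated)
 * and returns the number of bytes written, excluding the terminator. */
size_t txt_content(const txt_node *node, char *out, size_t cap)
{
    size_t used = 0;
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    txt_collect(node, out, cap, &used);
    return used;
}
```

A spec-conforming innerText additionally has to account for rendering details such as hidden elements and line breaks, which this sketch deliberately ignores.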
