Missing text content #2248
Replies: 1 comment
Hi,

Well, take a look at the source code (View Source + Line wrap). You'll see all the page content is in a giant JavaScript blob:

<script nonce="+B0Jpd/rO1dvg+rwHgeYiFk9FLKplTkeKjLnUkEAJPk=" type="text/javascript">
window.__DATA__ = {"assets":
{"-----------------------.js":"https://dac-static.atlassian.com/_static/
-----------------------.80866af279244d8a0ff3.bundle.js","-
...snip
|\n|------------|---------------------------------|\n| Applicable |
Confluence Server 5.5 - 8.5 \u003cbr> Confluence Data Center 5.6 and
later|\n\nThe Confluence Server and Data Center REST API is for admins who
want to script interactions with Confluence Server or Confluence Data Center
and developers who want to integrate with or build on top of the Confluence
platform.\n\n\u003cbr>\n\u003cdiv style=\"color: green;
background-color: #f0f0f0; padding: 10px;\">\nFor REST API documentation,
see \u003ca href=\"/server/confluence/rest/v900/intro\">Confluence Server
and Data Center REST API reference\u003c/a>.\n\u003c/div>\n\nUsing Cloud?
Find out about the [Confluence Cloud REST API]
(/cloud/confluence/rest).\n\n\n## CRUD Operations\n\nConfluence's REST APIs
provide access to resources (data entities) via URI paths. To use a REST
API, your application will make an HTTP request and parse the response. By
default, the response format is JSON. Your methods will be the standard HTTP
methods: GET, PUT, POST and DELETE.
</script>
<title>The Confluence Data Center REST API</title>

It looks like they store the content as a form of Markdown inside a script and then render it client side, which also gives the page noticeable rendering jank when you visit it.

Recall that jsoup is an HTML parser, not a JavaScript executor. One approach to get the specific content in this case using jsoup would be to select that script element, read its data, and parse the JSON assigned to window.__DATA__ yourself.

Or, use a full headless browser like Playwright, which will be more general, but will necessarily have a higher resource overhead.
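
For the jsoup route, here is a minimal sketch of the step that comes after jsoup has handed you the script's text: cutting the JSON object literal out of the `window.__DATA__ = {...}` assignment so it can be passed to a JSON parser of your choice. The class and method names (`ScriptDataExtractor`, `extractJsonObject`) are made up for illustration, and the jsoup calls shown in the comment are an untested suggestion for this particular page, not something verified against it.

```java
public class ScriptDataExtractor {

    /**
     * Returns the JSON object literal that follows {@code marker}
     * (for example "window.__DATA__") in a script body, or null if
     * no balanced object is found. Braces inside JSON strings are ignored.
     */
    public static String extractJsonObject(String scriptBody, String marker) {
        int at = scriptBody.indexOf(marker);
        if (at < 0) return null;
        int start = scriptBody.indexOf('{', at + marker.length());
        if (start < 0) return null;

        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = start; i < scriptBody.length(); i++) {
            char c = scriptBody.charAt(i);
            if (inString) {
                if (escaped) escaped = false;          // char after a backslash is literal
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) return scriptBody.substring(start, i + 1);
            }
        }
        return null; // unbalanced: the literal was truncated
    }

    public static void main(String[] args) {
        // Assumption: with jsoup on the classpath, the script body could be
        // obtained with something like
        //   Element script = doc.selectFirst("script:containsData(window.__DATA__)");
        //   String scriptBody = script.data();
        // A small stand-in string is used here so the sketch runs on its own.
        String scriptBody =
            "window.__DATA__ = {\"title\":\"a } b\",\"nested\":{\"k\":1}};";
        System.out.println(extractJsonObject(scriptBody, "window.__DATA__"));
    }
}
```

The returned string is still raw JSON, so escapes such as `\u003c` and `\n` in the Markdown body only turn into real characters once a JSON parser has decoded it.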
A lot of the text content is missing when fetching the text using the document.text() method. I'm including an example to show the discrepancy between the raw HTML and the extracted text content.
example web page: https://developer.atlassian.com/server/confluence/rest/v920/Intro
Extracted content using jsoup:
The Confluence Data Center REST API Support for Server products ended Feb. 15, 2024. Learn what this means for you. Confluence Data Center Guides Reference Resources Changelog Search Support Log in REST API Modules Java API Switch to classic view REST API About Confluence Data Center REST API Advanced Searching using CQL Confluence REST API examples Content properties in the REST API Custom actions with the blueprint API Expansions in the REST API Pagination in the REST API Access Mode Admin Group Admin User Attachments Backup and Restore Category Child Content Content Blueprint Content Body Content Descendant Content Labels Content Property Content Resource Content Restrictions Content Version Content Watchers GlobalColorScheme Group Instance Metrics Label Long Task Search Server Information Space Space Label Space Permissions Space Property Space Watchers SpaceColorScheme User User Group User Watch Webhooks Other operations Rate this page: Unusable Poor Okay Good Excellent Changelog System status Privacy Notice at Collection Developer Terms Trademark Cookie preferences © 2024 Atlassian