The Ultimate XPath Cheat Sheet. How to Easily Write Powerful Selectors.

Mihai Maxim on December 16, 2022

An XPath cheat sheet?

Have you ever had to write a CSS selector that does not depend on class names? If not, consider yourself lucky. If so, our XPath cheat sheet is just what you need. The web is teeming with data. Entire businesses depend on stitching some of that data together to offer new services to the world. APIs are a great help, but not every website has an open API. Sometimes you have to get what you need the old-fashioned way: you have to build a scraper for the website. Modern websites fend off scraping by renaming their CSS classes, so it is better to write selectors that rely on something more stable. In this article, you will learn how to write selectors based on the DOM node layout of the page.

What is XPath and how can I try it out?

XPath stands for XML Path Language. It uses a path notation (as in URLs) to provide a flexible way of pointing to any part of an XML document.

XPath is mainly used in XSLT, but can also be used as a much more powerful way of navigating through the DOM of any XML-like language document using XPath expressions, such as HTML and SVG, instead of relying on the Document.getElementById() or Document.querySelectorAll() methods, the Node.childNodes properties, and other DOM Core features. XPath | MDN (mozilla.org)

A path notation?

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>

There are two kinds of paths: relative and absolute.

The unique path (or absolute path) to My third paragraph. is /html/body/div/div/p

A relative path to My third paragraph. is //body/div/div/p
For My Second Heading => //body/div/h2
For My first paragraph. => //body/p

Notice that I use //body. Relative paths use // to jump straight to the desired element.

The usage of //<path> also implies that it should look for all occurrences of <path> in the document, regardless of what came before <path>.

For example, //div/p returns both My second paragraph. and My third paragraph.
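The paths above can also be checked outside the browser. Here is a minimal sketch using Python's standard xml.etree.ElementTree module, which implements only a subset of XPath and evaluates every path relative to an element. Because of that, /html/body/... is written as ./body/... (starting from the <html> root) and //x as .//x. The helper name texts is an assumption made for this illustration, and the DOCTYPE line is left out of the embedded sample.

```python
import xml.etree.ElementTree as ET

# The sample page from this section, without the DOCTYPE line.
HTML = """<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>"""

root = ET.fromstring(HTML)  # root is the <html> element


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [element.text for element in root.findall(path)]


unique = texts("./body/div/div/p")  # stands in for /html/body/div/div/p
heading = texts(".//body/div/h2")   # stands in for //body/div/h2
first = texts(".//body/p")          # stands in for //body/p
both = texts(".//div/p")            # stands in for //div/p

print(unique, heading, first, both)
```

The last path is the interesting one: it matches a <p> under every <div>, so it returns two paragraphs instead of one.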

You can test this example in your browser to get a better feel for it!

Paste the code into an .html file and open it in your browser. Open the developer tools and press Ctrl + F. Paste the XPath locator into the small input bar and press Enter.

You can also get the XPath of any tag by right-clicking the tag in the Elements tab and choosing "Copy XPath".

Notice how the search lets you step between My second paragraph. and My third paragraph. when a locator such as //div/p matches both.

Also, another important thing to know is that it is not necessary for a path to contain // in order to return multiple elements. Let's see what happens when I add another <p> in the last <div>.

/html/body/div/div/p is no longer a unique path: it now matches both <p> tags in the last <div>.
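To see this happen in code, here is a small sketch with Python's standard xml.etree.ElementTree, assuming the sample page from above without its DOCTYPE line. ElementTree paths are relative to the root element, so /html/body/div/div/p is written as ./body/div/div/p. The extra paragraph and its text are hypothetical, added only for this illustration.

```python
import xml.etree.ElementTree as ET

HTML = """<html lang="en">
<head><title>Nothing to see here</title></head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>"""

root = ET.fromstring(HTML)
path = "./body/div/div/p"  # stands in for /html/body/div/div/p

# With the original page the path matches exactly one paragraph.
before = [p.text for p in root.findall(path)]

# Add another <p> to the last <div> (hypothetical content).
last_div = root.find("./body/div/div")
extra = ET.SubElement(last_div, "p")
extra.text = "My fourth paragraph."

# The same path, without any //, now matches two paragraphs.
after = [p.text for p in root.findall(path)]

print(before)
print(after)
```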

If you have followed me this far, congratulations: you are well on your way to mastering XPath. You are now ready to dive into the fun stuff.

The square brackets

You can use square brackets to select specific elements.

In this case, //body/div/div[2]/p[3] only selects the last <p> tag.
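A positional predicate counts from 1 among siblings with the same tag name. The sketch below tries this with Python's standard xml.etree.ElementTree. The markup in DOC is a hypothetical reconstruction of the kind of page this section refers to (the class names and texts are taken from the expressions in this article, the exact layout is an assumption), and //body/... is written as .//body/... because ElementTree paths are relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout, assumed for illustration only.
DOC = """<html>
<head><title>Nothing to see here</title></head>
<body>
  <div class="no-content"></div>
  <div class="content">
    <div class="p-children">
      <p class="not-important">I am the first child</p>
      <p>I am the second child</p>
    </div>
    <div class="p-children" id="important">
      <p>I am the first child</p>
      <p>I am the second child</p>
      <p>I am the third child</p>
    </div>
  </div>
</body>
</html>"""

root = ET.fromstring(DOC)

# //body/div/div[2]/p[3]: third <p> of the second nested <div>.
third_p = [p.text for p in root.findall(".//body/div/div[2]/p[3]")]

# //body/div/div/p[1]: first <p> of every nested <div>.
first_ps = [p.text for p in root.findall(".//body/div/div/p[1]")]

print(third_p)
print(first_ps)
```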

Attributes

You can also use attributes to select your elements.

//body//p[@class="not-important"] => select all the <p> tags that are inside a <body> tag and have the "not-important" class.

//div[@id] => select all the <div> tags that have an id attribute.

//div[@class="p-children"][@id="important"]/p[3] => select the third <p> that is within a <div> tag that has both class="p-children" and id="important"

//div[@class="p-children" and @id="important"]/p[3] => same as above

//div[@class="p-children" or @id="important"]/p[3] => select the third <p> that is within a <div> that has class="p-children" or id="important"

Note that @ marks the beginning of an attribute.
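The attribute selectors above can be tried with Python's standard xml.etree.ElementTree, as a rough sketch. Its XPath subset supports [@attr] and [@attr='value'] and lets you chain predicates, but it has no and / or operators, so the or case is emulated in plain Python here. DOC is a hypothetical page assumed for this illustration, and //x is written as .//x because ElementTree paths are relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout, assumed for illustration only.
DOC = """<html>
<head><title>Nothing to see here</title></head>
<body>
  <div class="no-content"></div>
  <div class="content">
    <div class="p-children">
      <p class="not-important">I am the first child</p>
      <p>I am the second child</p>
    </div>
    <div class="p-children" id="important">
      <p>I am the first child</p>
      <p>I am the second child</p>
      <p>I am the third child</p>
    </div>
  </div>
</body>
</html>"""

root = ET.fromstring(DOC)

# //body//p[@class="not-important"]
not_important = [p.text for p in root.findall(".//body//p[@class='not-important']")]

# //div[@id]
divs_with_id = [d.get("id") for d in root.findall(".//div[@id]")]

# //div[@class="p-children"][@id="important"]/p[3]
third = [
    p.text
    for p in root.findall(".//div[@class='p-children'][@id='important']/p[3]")
]

# //div[@class="p-children" or @id="important"], emulated with a Python filter.
either = [
    d for d in root.iter("div")
    if d.get("class") == "p-children" or d.get("id") == "important"
]

print(not_important, divs_with_id, third, len(either))
```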

Functions

XPath offers a number of useful functions that you can use inside the square brackets.

position() => returns the index of the element
Ex: //body/div[position()=1] selects the first <div> in the <body>

last() => returns the number of nodes in the current context, which is also the position of the last one
Ex: //div/p[last()] selects the last <p> child of every <div> tag

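count(node-set) => returns the number of nodes in the set
Ex: count(//body/div) returns the number of child <div> tags inside the <body> (XPath 1.0 does not allow a function call as a path step, so //body/count(div) only works in XPath 2.0 and later)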

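node() or * => node() matches any node, including text nodes and comments; * matches element nodes only
Ex: //div/* selects all the element children of all the <div> tags; //div/node() also returns their text and comment children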

text() => selects the text nodes that are direct children of the element
Ex: //p/text() returns the text nodes of all the <p> elements

concat(string1, string2) => joins string1 and string2

contains(string, "value") => returns true if string contains "value"; the first argument can be an attribute (@attribute) or text()
Ex:
//p[contains(text(),"I am the third child")] selects all the <p> tags whose text contains "I am the third child".

starts-with(@attribute, "value") => returns true if @attribute starts with "value"
ends-with(@attribute, "value") => returns true if @attribute ends with "value" (XPath 2.0 and later; not available in XPath 1.0, which is what browsers implement)

substring(string, start, length) => returns the part of string that begins at position start (counting from 1) and is length characters long
Example:
//p[substring(text(),3,12)="am the third"] => matches a <p> whose text() is "I am the third child"

normalize-space() => returns the string value with leading and trailing whitespace removed and inner runs of whitespace collapsed into a single space
Example: normalize-space(" example ") = "example"

string-length() => returns the length of the text
Ex: //p[string-length()=20] returns all the <p> tags that have the text length of 20
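Here is a rough sketch of several of these functions in Python. The standard xml.etree.ElementTree module only understands the positional forms [1] and [last()], so those are used directly, while count(), contains(), string-length() and normalize-space() are emulated with ordinary Python (len, in, str.split). DOC is a hypothetical page assumed for this illustration, and paths are written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout, assumed for illustration only.
DOC = """<html>
<head><title>Nothing to see here</title></head>
<body>
  <div class="no-content"></div>
  <div class="content">
    <div class="p-children">
      <p class="not-important">I am the first child</p>
      <p>I am the second child</p>
    </div>
    <div class="p-children" id="important">
      <p>I am the first child</p>
      <p>I am the second child</p>
      <p>I am the third child</p>
    </div>
  </div>
</body>
</html>"""

root = ET.fromstring(DOC)

# //body/div[position()=1], written with the short form [1].
first_div = [d.get("class") for d in root.findall("./body/div[1]")]

# //div/p[last()]: the last <p> child of every <div> that has one.
last_ps = [p.text for p in root.findall(".//div/p[last()]")]

# count(//body/div), emulated with len().
div_count = len(root.findall("./body/div"))

# //div/*: element children of all <div> tags.
div_children = root.findall(".//div/*")

# //p/text(): here every <p> holds a single text node.
p_texts = [p.text for p in root.iter("p")]

# Emulations of contains(), string-length() and normalize-space().
with_third = [t for t in p_texts if "third" in t]
length_20 = [t for t in p_texts if len(t) == 20]
normalized = " ".join("  example   text ".split())

print(first_div, last_ps, div_count, len(div_children))
print(with_third, length_20, repr(normalized))
```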

The functions can be a bit hard to remember. Luckily, The Ultimate XPath Cheat Sheet comes with a helpful example:

//p[text()=concat(substring(//p[@class="not-important"]/text(),1,15), substring(text(),16,20))]

//p[text()=<expression_return_value>] will select all the <p> elements that have the text value equal to the return value of the condition.

//p[@class="not-important"]/text() returns the text values of all the <p> tags that have class="not-important".

If there is only one <p> tag that satisfies this condition, then we can pass the return_value to the substring function.

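substring(return_value,1,15) returns the first 15 characters of the return_value string.

substring(text(),16,20) returns up to 20 characters starting at position 16 (the third argument is a length, not an end index). For a 20-character string, that is the last 5 characters of the same text() value that we used in //p[text()=<expression_return_value>].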

Finally, concat() will merge the two substrings and create the return value of <expression_return_value>.
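The arithmetic behind substring() is easy to get wrong, so here is a small Python emulation of it. This is only a sketch for integer arguments: the helper names xpath_substring and xpath_concat are made up for this illustration, and the sample string is the paragraph text used in the examples above.

```python
def xpath_substring(text: str, start: int, length: int) -> str:
    """Emulate XPath 1.0 substring(text, start, length) for integer arguments.

    Positions count from 1, and the third argument is a length.
    """
    first = max(start, 1)   # positions below 1 do not exist
    end = start + length    # first position that is NOT included
    return text[first - 1:max(end - 1, first - 1)]


def xpath_concat(*parts: str) -> str:
    """Emulate XPath concat(): join all arguments in order."""
    return "".join(parts)


text = "I am the third child"  # 20 characters long

middle = xpath_substring(text, 3, 12)  # substring(text(),3,12)
head = xpath_substring(text, 1, 15)    # substring(...,1,15)
tail = xpath_substring(text, 16, 20)   # substring(text(),16,20)
rebuilt = xpath_concat(head, tail)     # concat(head, tail)

print(repr(middle), repr(head), repr(tail), repr(rebuilt))
```

Because the string is only 20 characters long, asking for 20 characters from position 16 simply runs to the end of the string.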

Path nesting

XPath supports path nesting. That's cool, but what exactly do I mean by path nesting?

Let's try something new: /html/body/div[./div[./p]]

You can read it as "Select all the <div> children of the <body> that have a <div> child. Also, those children must themselves be parents of a <p> element."

If you don't care about the parent of the <p> element, you can write: /html/body/div[.//p]

This now translates to "Select all the div children of the body that have a <p> descendant"

In this particular example, /html/body/div[./div[./p]] and /html/body/div[.//p] yield the same result.

You are surely wondering what the dots in ./ and .// are about.

The dot stands for the self element. When used inside a pair of brackets, it refers to the specific tag that opened them. Let's dig a little deeper.

In our example, /html/body/div returns two divs:
<div class="no-content"> and <div class="content">

/html/body/div[.//p] translates to:

   /html/body/div[1][/html/body/div[1]//p]
and /html/body/div[2][/html/body/div[2]//p]

/html/body/div[2][/html/body/div[2]//p] is true, so it returns /html/body/div[2]

In our case, the dot ensures that /html/body/div and /html/body/div//p refer to the same <div>

Now let's look at what would have happened if that had not been the case.

/html/body/div[/html/body/div//p] would return both 
<div class="no-content">  and <div class="content">

Why? Because /html/body/div//p evaluates to true for both /html/body/div[1] and /html/body/div[2].

/html/body/div[/html/body/div//p] actually translates to "Select all the div children of the <body> if /html/body/div//p is true. /html/body/div//p is true if the body has a <div> child, and that child has a <p> descendant". In our case, this statement is always true.
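The effect of the dot can be reproduced in a few lines of Python. The standard xml.etree.ElementTree module does not support nested path predicates, so this sketch emulates them: the test inside the brackets is run either relative to each candidate <div> (the dot version) or against the whole document (the version without the dot). DOC is a hypothetical page assumed for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout, assumed for illustration only.
DOC = """<html>
<head><title>Nothing to see here</title></head>
<body>
  <div class="no-content"></div>
  <div class="content">
    <div class="p-children">
      <p class="not-important">I am the first child</p>
      <p>I am the second child</p>
    </div>
    <div class="p-children" id="important">
      <p>I am the first child</p>
      <p>I am the second child</p>
      <p>I am the third child</p>
    </div>
  </div>
</body>
</html>"""

root = ET.fromstring(DOC)
top_divs = root.findall("./body/div")  # stands in for /html/body/div

# /html/body/div[.//p]: the test is evaluated relative to each candidate <div>.
with_dot = [d.get("class") for d in top_divs if d.find(".//p") is not None]

# /html/body/div[/html/body/div//p]: the test ignores the candidate and asks
# the whole document, so it has the same answer for every <div>.
document_has_match = root.find("./body/div//p") is not None
without_dot = [d.get("class") for d in top_divs if document_has_match]

# /html/body/div[./div[./p]]: a <div> child that itself has a <p> child.
nested = [
    d.get("class")
    for d in top_divs
    if any(inner.find("./p") is not None for inner in d.findall("./div"))
]

print(with_dot, without_dot, nested)
```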

It is a shame that other XPath cheat sheets do not mention nesting. I think it is amazing. It lets you search the document for different patterns and return something else. The only downside is that queries written this way can become hard to follow. The good news is that there are other ways to do this.

The axes

You can use axes to locate nodes relative to other context nodes.

Let's explore some of them.

The four main axes

//p/ancestor::div => selects all the divs that are ancestors of <p>

How I read it: Get all the <p> tags, for each <p> look through its ancestors. If you find <div> tags, select them.

//p/parent::div => selects all the <div> tags that are parents of <p>

How I read it: Get all the <p> tags and of all their parents, if the parent is a <div>, select it.

//div/child::p => selects all the <p> tags that are children of <div> tags.

How I read it: Get all the <div> tags and their children, if the child is a <p>, select it.

//div/descendant::p => selects all the <p> tags that are descendants of <div> tags.

How I read it: Get all the <div> tags and their descendants, if the descendant is a <p>, select it.

Now it is time to rewrite the previous expressions:

/html/body/div[./div[./p]] is equivalent to /html/body/div/div/p/parent::div/parent::div

But /html/body/div[.//p] is NOT equivalent to /html/body/div//p/ancestor::div, because ancestor::div also returns the nested <div> tags that sit between a <p> and the top-level <div>.

The good news is that we can tweak it a little.

/html/body/div//p/ancestor::div[last()] is equivalent to /html/body/div[.//p]. ancestor is a reverse axis, so [last()] picks the outermost <div> ancestor of each <p>.
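Python's standard xml.etree.ElementTree has no axis support, but the parent and ancestor axes are easy to emulate with a child-to-parent map, which makes the comparison above concrete. This is a sketch under assumptions: DOC is a hypothetical page, and the helper names ancestors, div_ancestors and label exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout, assumed for illustration only.
DOC = """<html>
<head><title>Nothing to see here</title></head>
<body>
  <div class="no-content"></div>
  <div class="content">
    <div class="p-children">
      <p class="not-important">I am the first child</p>
      <p>I am the second child</p>
    </div>
    <div class="p-children" id="important">
      <p>I am the first child</p>
      <p>I am the second child</p>
      <p>I am the third child</p>
    </div>
  </div>
</body>
</html>"""

root = ET.fromstring(DOC)
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    """Yield ancestors from the nearest to the farthest (a reverse axis)."""
    while element in parent_of:
        element = parent_of[element]
        yield element


def div_ancestors(element):
    return [a for a in ancestors(element) if a.tag == "div"]


def label(div):
    return (div.get("class"), div.get("id"))


paragraphs = root.findall("./body/div//p")  # /html/body/div//p

# .../parent::div: the parent of each <p>, if it is a <div>.
parents = {label(parent_of[p]) for p in paragraphs if parent_of[p].tag == "div"}

# .../ancestor::div: every <div> above any <p>, nested ones included.
all_ancestors = {label(a) for p in paragraphs for a in div_ancestors(p)}

# .../ancestor::div[last()]: the farthest <div> ancestor of each <p>.
outermost = list(dict.fromkeys(div_ancestors(p)[-1] for p in paragraphs))

# /html/body/div[.//p], emulated by testing each top-level <div>.
with_predicate = [d for d in root.findall("./body/div") if d.find(".//p") is not None]

print(parents, all_ancestors, [label(d) for d in outermost])
```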

Other important axes

//p/following-sibling::span => for each <p> tag, select its following <span> siblings.

//p/preceding-sibling::span => for each <p> tag, select its preceding <span> siblings.

//title/following::span => selects all the <span> tags that appear in the DOM after the <title>.

In our example, //title/following::span selects all the <span> tags in the document.

//p/preceding::div => selects all the <div> tags that appear in the DOM before any <p> tag. But it ignores ancestors, attribute nodes and namespace nodes.

In our case, //p/preceding::div only selects <div class="p-children"> and <div class="no-content">.

Most of the <p> tags are in <div class="content">, but this <div> is not selected because it is a common ancestor for them. As I mentioned, the preceding axis ignores ancestors.

<div class="p-children"> is selected because it is not an ancestor for the <p> tags inside <div class="p-children" id="important">
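The preceding and following-sibling axes can be emulated in Python to check this reasoning. This is a sketch only: xml.etree.ElementTree does not implement axes, so document order and the ancestor check are done by hand. DOC is a hypothetical page, the helper names are made up for this illustration, and the sibling example uses <p> tags because the assumed page contains no <span>.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout, assumed for illustration only.
DOC = """<html>
<head><title>Nothing to see here</title></head>
<body>
  <div class="no-content"></div>
  <div class="content">
    <div class="p-children">
      <p class="not-important">I am the first child</p>
      <p>I am the second child</p>
    </div>
    <div class="p-children" id="important">
      <p>I am the first child</p>
      <p>I am the second child</p>
      <p>I am the third child</p>
    </div>
  </div>
</body>
</html>"""

root = ET.fromstring(DOC)
in_document_order = list(root.iter())
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    found = []
    while element in parent_of:
        element = parent_of[element]
        found.append(element)
    return found


def preceding(element, tag):
    """Elements with this tag that start before `element`, minus its ancestors."""
    skip = set(ancestors(element))
    index = in_document_order.index(element)
    return [n for n in in_document_order[:index] if n.tag == tag and n not in skip]


def following_siblings(element, tag):
    """Later siblings of `element` that have this tag."""
    siblings = list(parent_of[element])
    return [s for s in siblings[siblings.index(element) + 1:] if s.tag == tag]


# //p/preceding::div, collected over every <p> without duplicates.
selected = list(dict.fromkeys(d for p in root.iter("p") for d in preceding(p, "div")))
labels = [(d.get("class"), d.get("id")) for d in selected]

# following-sibling::p for the first <p> inside the div with id="important".
first_p = root.find(".//div[@id='important']/p[1]")
later = [p.text for p in following_siblings(first_p, "p")]

print(labels)
print(later)
```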

Summary

Congratulations, you made it. You have added a brand-new tool to your selector toolbox! If you are building a web scraper or automating web tests, this XPath cheat sheet will come in very handy. If you are looking for an easier way to traverse the DOM, you have come to the right place. Either way, XPath is definitely worth a try. Who knows, you might even discover more use cases for it.
Does the concept of web scraping sound interesting to you? You can contact us here: WebScrapingAPI - Contact. If you want to scrape the web, we will be happy to help you along the way. In the meantime, you can try WebScrapingAPI - Product for free.