Parsing Instagram with Java jsoup returns page source instead of Elements


Question

I'm trying to get a Reels video URL with jsoup, using Java in Android Studio. I want the elements I see in the browser's Inspect view, but the code returns the page source. I have used jsoup in other projects on different web pages and never encountered this situation. Can you tell me what I am doing wrong and how I can get the elements shown in Inspect? Thank you.

public class fetchData extends AsyncTask<Void, Void, Void> {
    Document doc = null;
    String str;

    @Override
    protected void onPostExecute(Void aVoid) {
        super.onPostExecute(aVoid);
        MainActivity.textView.setText(str);
    }

    @Override
    protected Void doInBackground(Void... voids) {
        try {
            doc = Jsoup.connect("https://www.instagram.com/reel/CDok74FJzHp/?igshid=cam8ylb7okl7").get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        str = doc.toString();
        return null;
    }
}

Answer 1

Score: 0


If you inspect the video element in the browser's developer tools, you'll find:

<video class="tWeCl"
  playsinline=""
  poster="https://instagram.flhr4-2.fna.fbcdn.net/v/t51.2885-15/e35/117157253_120443486171759_7332785595039685871_n.jpg?_nc_ht=instagram.flhr4-2.fna.fbcdn.net&amp;_nc_cat=111&amp;_nc_ohc=aX7rVh9IbGoAX_lj74j&amp;oh=ba74c5c8ad97ba14c35710addd523dfd&amp;oe=5F363C59"
  preload="none"
  type="video/mp4"
  src="https://instagram.flhr4-2.fna.fbcdn.net/v/t50.2886-16/117284962_313567919762486_3343704909021624596_n.mp4?_nc_ht=instagram.flhr4-2.fna.fbcdn.net&amp;_nc_cat=102&amp;_nc_ohc=3wvoN4vNzkUAX_DLFTR&amp;oe=5F3659EF&amp;oh=7a38d593469a99239a7cb07050cc47f2">
</video>

If you then search the HTML returned by the server for the mp4 URL, you'll find it inside one of the <script> tags, where it is delivered as a JSON value. By splitting the script text on " = " and taking the second part, you get the raw JSON, which can then be parsed for "video_url" using Jayway's JsonPath.read method.

It would seem the <video> tag is therefore generated by the JavaScript at runtime. jsoup does not execute JavaScript, which is why it doesn't appear possible to filter the fetched HTML for any <video> elements.

import com.jayway.jsonpath.JsonPath;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Instagram {

    private final String url;

    public Instagram(String url) {
        this.url = url;
    }

    public void start() {
        Document doc = getHtmlPage(url);
        if (doc == null) {
            System.out.println("Could not load page: " + url);
            return;
        }

        // The fetched HTML has no <video> tag; the mp4 URL sits inside a <script> tag.
        Elements scriptElements = getScriptElements(doc);
        List<String> scriptsWithMp4Url = getScriptsContainingMp4Url(scriptElements);
        if (scriptsWithMp4Url.isEmpty()) {
            System.out.println("No script containing an mp4 URL was found.");
            return;
        }

        System.out.println("Video Url: " + getVideoUrl(scriptsWithMp4Url.get(0)));
    }

    // Collects the contents of every <script> element that mentions "mp4".
    private List<String> getScriptsContainingMp4Url(Elements scriptElements) {
        List<String> scriptsWithMp4Url = new ArrayList<>();

        for (Element element : scriptElements) {
            if (element.data().contains("mp4")) {
                scriptsWithMp4Url.add(element.data());
            }
        }

        return scriptsWithMp4Url;
    }

    private Elements getScriptElements(Document doc) {
        return doc.select("script");
    }

    private String getVideoUrl(String scriptData) {
        // The script is expected to look like "<variable> = <json>"; keep what follows the first " = ".
        String jsonResponse = scriptData.split(" = ", 2)[1];
        // $.. is JsonPath's deep scan: it finds video_url at any depth - you may need to play with this
        List<String> videoUrl = JsonPath.read(jsonResponse, "$..video_url");
        return videoUrl.get(0);
    }

    // Returns null if the page could not be fetched.
    private Document getHtmlPage(String url) {
        try {
            return Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    public static void main(String[] args) {
        new Instagram("https://www.instagram.com/reel/CDok74FJzHp/?igshid=cam8ylb7okl7").start();
    }
}
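
To try the extraction step (split on " = ", then look up "video_url") without a network call or the JsonPath dependency, here is a minimal sketch that uses only the standard library. It swaps JsonPath for a regular expression, so it is not the method used in the answer's code. The class name VideoUrlSketch, the sample script string, and the two escapes it undoes (an escaped slash and the JSON escape for "&") are assumptions for illustration, not taken from a real Instagram response.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoUrlSketch {

    // Matches "video_url":"<value>" and captures the raw, still JSON-escaped value.
    private static final Pattern VIDEO_URL =
            Pattern.compile("\"video_url\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the first video_url found after the first " = " in the script text, if any. */
    public static Optional<String> extractVideoUrl(String scriptData) {
        // Limit of 2: split only on the first " = " so the JSON part stays in one piece.
        String[] parts = scriptData.split(" = ", 2);
        if (parts.length < 2) {
            return Optional.empty();
        }
        Matcher matcher = VIDEO_URL.matcher(parts[1]);
        return matcher.find() ? Optional.of(unescape(matcher.group(1))) : Optional.empty();
    }

    // Undoes only the two escapes assumed in this sketch: escaped slashes and escaped ampersands.
    private static String unescape(String raw) {
        return raw.replace("\\/", "/").replace("\\u0026", "&");
    }

    public static void main(String[] args) {
        // Hypothetical script content, made up for this example.
        String sample = "window._sharedData = {\"media\":{\"video_url\":"
                + "\"https:\\/\\/example.com\\/clip.mp4?a=1\\u0026b=2\"}};";
        System.out.println(extractVideoUrl(sample).orElse("not found"));
    }
}
```

Running main prints https://example.com/clip.mp4?a=1&b=2. A regular expression is more brittle than a real JSON parser, so this is best suited to checking the idea quickly; for anything beyond that, a JSON library such as the JsonPath call above is the safer choice.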

huangapple
  • Posted on August 10, 2020 at 22:39:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63342403.html