Parsing Instagram with Java jsoup returns the page source, not the inspected elements


Question

I'm trying to get a Reels video URL with jsoup, using Java in Android Studio. I want the elements I see in the browser's inspector, but the code returns the page source instead. I have used jsoup in other projects on different web pages and never run into this. Can you tell me what I'm doing wrong and how I can get the elements shown in the inspector? Thank you.

    import android.os.AsyncTask;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.io.IOException;

    public class fetchData extends AsyncTask<Void, Void, Void> {
        Document doc = null;
        String str;

        @Override
        protected void onPostExecute(Void aVoid) {
            super.onPostExecute(aVoid);
            // Show the fetched markup in the activity's TextView
            MainActivity.textView.setText(str);
        }

        @Override
        protected Void doInBackground(Void... voids) {
            try {
                // Fetch and parse the page off the main thread
                doc = Jsoup.connect("https://www.instagram.com/reel/CDok74FJzHp/?igshid=cam8ylb7okl7").get();
            } catch (IOException e) {
                e.printStackTrace();
            }
            // doc is still null if the request failed, so guard before using it
            str = (doc != null) ? doc.toString() : "";
            return null;
        }
    }

Answer 1

Score: 0

If you inspect the video element in the browser's developer tools, you'll find:

    <video class="tWeCl"
           playsinline=""
           poster="https://instagram.flhr4-2.fna.fbcdn.net/v/t51.2885-15/e35/117157253_120443486171759_7332785595039685871_n.jpg?_nc_ht=instagram.flhr4-2.fna.fbcdn.net&amp;_nc_cat=111&amp;_nc_ohc=aX7rVh9IbGoAX_lj74j&amp;oh=ba74c5c8ad97ba14c35710addd523dfd&amp;oe=5F363C59"
           preload="none"
           type="video/mp4"
           src="https://instagram.flhr4-2.fna.fbcdn.net/v/t50.2886-16/117284962_313567919762486_3343704909021624596_n.mp4?_nc_ht=instagram.flhr4-2.fna.fbcdn.net&amp;_nc_cat=102&amp;_nc_ohc=3wvoN4vNzkUAX_DLFTR&amp;oe=5F3659EF&amp;oh=7a38d593469a99239a7cb07050cc47f2">
    </video>

If you then search the HTML that jsoup downloads for the mp4 URL, you'll find it inside one of the <script> tags, where it is delivered as a JSON value. So by splitting the script text on " = " and taking the second part, you get the raw JSON, which can then be parsed for "video_url" using Jayway's JsonPath.read method.

It would seem the <video> tag is therefore added to the page by JavaScript after it loads. jsoup does not execute JavaScript, which is why selecting <video> elements from the fetched HTML finds nothing.

    import com.jayway.jsonpath.JsonPath;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class Instagram {

        private final String url;

        public Instagram(String url) {
            this.url = url;
        }

        public void start() {
            Document doc = getHtmlPage(url);
            if (doc == null) {
                System.err.println("Could not fetch " + url);
                return;
            }
            Elements videoElement = getScriptElementContainingVideoUrl(doc);
            List<String> relevantTagWithMp4Url = getSingleScriptElementWithVideoUrl(videoElement);
            if (relevantTagWithMp4Url.isEmpty()) {
                System.err.println("No script tag mentions an mp4 URL");
                return;
            }
            String scriptInnerHtml = relevantTagWithMp4Url.get(0);
            System.out.println("Video Url: " + getVideoUrl(scriptInnerHtml));
        }

        // Keep only the script bodies that mention "mp4"
        private List<String> getSingleScriptElementWithVideoUrl(Elements scriptElements) {
            List<String> relevantTagWithMp4Url = new ArrayList<>();
            for (Element element : scriptElements) {
                if (element.data().contains("mp4")) {
                    relevantTagWithMp4Url.add(element.data());
                }
            }
            return relevantTagWithMp4Url;
        }

        // Returns every <script> element; the filtering happens in the method above
        private Elements getScriptElementContainingVideoUrl(Document doc) {
            return doc.select("script");
        }

        private String getVideoUrl(String videoElement) {
            // Split only on the first " = " so the JSON itself is never cut
            String jsonResponse = videoElement.split(" = ", 2)[1];
            // $.. is JsonPath's deep-scan operator: it matches video_url at any depth - you may need to play with this
            List<String> videoUrl = JsonPath.read(jsonResponse, "$..video_url");
            return videoUrl.get(0);
        }

        // Returns the parsed page, or null if the request failed
        private Document getHtmlPage(String url) {
            try {
                return Jsoup.connect(url).get();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return null;
        }

        public static void main(String[] args) {
            new Instagram("https://www.instagram.com/reel/CDok74FJzHp/?igshid=cam8ylb7okl7").start();
        }
    }
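
If adding JsonPath to an Android project is not an option, the same split-and-extract step can be sketched with the standard library alone. This is only a rough illustration, not the answer's method: it swaps the JSON parser for a regular expression, and it assumes the script text is a single assignment of a JSON object that ends with a semicolon, and that the URL uses only the two escapes handled below. The class name VideoUrlSketch, its two helpers, and the sample script string are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoUrlSketch {

    // Matches "video_url":"<value>" and captures the value, allowing escaped characters inside it.
    private static final Pattern VIDEO_URL =
            Pattern.compile("\"video_url\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the text after the first " = ", without a trailing semicolon. */
    public static String extractJson(String scriptText) {
        // Limit 2: any later " = " inside the JSON stays intact.
        String[] parts = scriptText.split(" = ", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("no assignment found in script text");
        }
        String json = parts[1].trim();
        return json.endsWith(";") ? json.substring(0, json.length() - 1) : json;
    }

    /** Returns the first video_url value in the JSON text, or null if there is none. */
    public static String findVideoUrl(String json) {
        Matcher m = VIDEO_URL.matcher(json);
        if (!m.find()) {
            return null;
        }
        // Assumption: only the escaped ampersand and the escaped slash occur in the URL.
        return m.group(1).replace("\\u0026", "&").replace("\\/", "/");
    }

    public static void main(String[] args) {
        // Hypothetical script content, shaped like an assignment of a JSON object.
        String script = "window._sharedData = {\"media\":{\"video_url\":"
                + "\"https://example.com/v.mp4?a=1\\u0026b=2\"}};";
        System.out.println(findVideoUrl(extractJson(script)));
    }
}
```

In the Instagram class above, findVideoUrl(extractJson(scriptInnerHtml)) would stand in for getVideoUrl(scriptInnerHtml). A real JSON parser is the more robust choice whenever one is available, since a regular expression will not cope with every valid JSON encoding.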

huangapple
  • Published on August 10, 2020, 22:39:38
  • Please keep this link when reposting: https://go.coder-hub.com/63342403.html