What is the best way to scrape data from multiple websites?

Question

What I have tried: I used a Volley request for the first website, and inside its response callback I created a separate thread for each website. Inside each thread I used jsoup's connect method to scrape rather than Volley. It gets the job done, and is actually faster. The problem is that the app freezes while the data is being scraped, until everything is fully loaded. Even the progress bar freezes, and I am having trouble finding the cause.

Here's the code I have implemented. It's a bit lengthy.

// Checking the connection
final StringRequest request = new StringRequest("https://www.google.com/", new Response.Listener<String>() {
    @Override
    public void onResponse(String response) {
        relativeLayout.setVisibility(View.GONE);
        // instances for each required website
        final HimalayanTimes himalayanTimes = new HimalayanTimes(getContext());
        final GsmArena gsmArena = new GsmArena();
        final CinemaBlend cinemaBlend = new CinemaBlend();
        final KathmanduPost kathmanduPost = new KathmanduPost(getContext());
        final GlobalNews globalNews = new GlobalNews();
        final NepaliTimes nepaliTimes = new NepaliTimes(getContext());
        final GoalNepal goalNepal = new GoalNepal(getContext());
        final GadgetByte gadgetByte = new GadgetByte();
        final TechLekh techLekh = new TechLekh();
        final OnlineKhabar onlineKhabar = new OnlineKhabar();
        final NepaliSansar nepaliSansar = new NepaliSansar();
        final CricketingNepal cricketingNepal = new CricketingNepal();
        // thread for each website
        // thread for thehimalayantimes
        Thread thread = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> himalyannews;
                    himalyannews = himalayanTimes.getNews();
                    news.addAll(himalyannews);
                    for (int i = 0; i < 4; i++) {
                        finalHeadlines.add(himalyannews.get(i));
                    }
                } catch (Exception ignored) {
                }
            }
        });
        thread.start();
        // thread for gsmArena
        Thread thread1 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> gsmarenanews;
                    gsmarenanews = gsmArena.getNews();
                    news.addAll(gsmarenanews);
                    for (int i = 0; i < 3; i++) {
                        headlines.add(gsmarenanews.get(i));
                    }
                } catch (Exception ignored) {
                }
            }
        });
        thread1.start();
        // thread for cinemaBlend
        Thread thread2 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> cinemablendnews;
                    cinemablendnews = cinemaBlend.getNews();
                    news.addAll(cinemablendnews);
                    for (int i = 0; i < 4; i++) {
                        headlines.add(cinemablendnews.get(i));
                    }
                } catch (Exception ignored) {
                }
            }
        });
        thread2.start();
        // thread for kathmanduPost
        Thread thread3 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> kathmandupostnews;
                    kathmandupostnews = kathmanduPost.getNews();
                    news.addAll(kathmandupostnews);
                    for (int i = 0; i < 3; i++) {
                        finalHeadlines.add(kathmandupostnews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread3.start();
        // thread for globalNews
        Thread thread4 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> globalnewsnews;
                    globalnewsnews = globalNews.getNews();
                    news.addAll(globalnewsnews);
                    for (int i = 0; i < 5; i++) {
                        finalHeadlines.add(globalnewsnews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread4.start();
        // thread for nepaliTimes
        Thread thread5 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> nepalitimesnews;
                    nepalitimesnews = nepaliTimes.getNews();
                    news.addAll(nepalitimesnews);
                    for (int i = 0; i < 3; i++) {
                        finalHeadlines.add(nepalitimesnews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread5.start();
        // thread for GoalNepal
        Thread thread6 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> goalNepalNews;
                    goalNepalNews = goalNepal.getNews();
                    news.addAll(goalNepalNews);
                    for (int i = 0; i < 4; i++) {
                        headlines.add(goalNepalNews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread6.start();
        // thread for GadgetByteNepal
        Thread thread7 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> gadgetbytenews;
                    gadgetbytenews = gadgetByte.getNews();
                    news.addAll(gadgetbytenews);
                    for (int i = 0; i < 3; i++) {
                        headlines.add(gadgetbytenews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread7.start();
        // thread for Techlekh
        Thread thread8 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> techlekhnews;
                    techlekhnews = techLekh.getNews();
                    news.addAll(techlekhnews);
                    for (int i = 0; i < 3; i++) {
                        headlines.add(techlekhnews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread8.start();
        // thread for onlinekhabar
        Thread thread9 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> onlineKhabarnews;
                    onlineKhabarnews = onlineKhabar.getNews();
                    news.addAll(onlineKhabarnews);
                    for (int i = 0; i < 4; i++) {
                        finalHeadlines.add(onlineKhabarnews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread9.start();
        // thread for nepalisansar
        Thread thread11 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> nepalisansarnews;
                    nepalisansarnews = nepaliSansar.getNews();
                    news.addAll(nepalisansarnews);
                    for (int i = 0; i < 4; i++) {
                        finalHeadlines.add(nepalisansarnews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread11.start();
        // thread for cricketingNepal
        Thread thread12 = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ArrayList<NewsItem> cricketnews;
                    cricketnews = cricketingNepal.getNews();
                    news.addAll(cricketnews);
                    for (int i = 0; i < 4; i++) {
                        headlines.add(cricketnews.get(i));
                    }
                } catch (IOException ignored) {
                }
            }
        });
        thread12.start();
        // main thread waits for each thread to finish
        try {
            thread.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread1.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread2.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread3.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread4.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread5.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread6.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread7.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread8.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread8.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread9.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread11.join();
        } catch (InterruptedException ignored) {
        }
        try {
            thread12.join();
        } catch (InterruptedException ignored) {
        }
        for (NewsItem item : news) {
            if (item.tag.contains("kathmandu"))
                nepal.add(item);
            if (item.tag.contains("cricket"))
                sports.add(item);
            if (item.tag.contains("football"))
                sports.add(item);
            switch (item.tag) {
                case "nepal":
                    nepal.add(item);
                    break;
                case "world":
                    world.add(item);
                    break;
                case "sports":
                    sports.add(item);
                    break;
                case "tech":
                    tech.add(item);
                    break;
                case "entertainment":
                    entertainment.add(item);
                    break;
            }
        }
        // putting each news item to the main container
        Collections.shuffle(headlines);
        Collections.shuffle(finalHeadlines);
        finalHeadlines.addAll(headlines);
        Collections.shuffle(nepal);
        Collections.shuffle(world);
        Collections.shuffle(sports);
        Collections.shuffle(tech);
        Collections.shuffle(entertainment);
        tab1 t1 = new tab1(finalHeadlines);
        t1.setRetainInstance(true);
        tab2 t2 = new tab2(nepal);
        t2.setRetainInstance(true);
        tab3 t3 = new tab3(world);
        t3.setRetainInstance(true);
        tab4 t4 = new tab4(sports);
        t4.setRetainInstance(true);
        tab5 t5 = new tab5(tech);
        t5.setRetainInstance(true);
        tab6 t6 = new tab6(entertainment);
        t6.setRetainInstance(true);
        assert getFragmentManager() != null;
        pagerAdapter = new PageAdapter(finalHeadlines, nepal, world, sports, tech, entertainment, getFragmentManager(), tabLayout.getTabCount());
        viewPager.setAdapter(pagerAdapter);
        shimmerFrameLayout.setVisibility(View.GONE);
    }
}, new Response.ErrorListener() {
    @Override
    public void onErrorResponse(VolleyError error) {
        Toast.makeText(getContext(), "Internet Connection Error!", Toast.LENGTH_SHORT).show();
        shimmerFrameLayout.setVisibility(View.GONE);
        tabLayout.setVisibility(View.GONE);
    }
});
queue.add(request);

For each website, I made a class. Here is one of them:

public class CinemaBlend {
    ArrayList<NewsItem> news;

    public CinemaBlend() {
        news = new ArrayList<>();
    }

    @RequiresApi(api = Build.VERSION_CODES.KITKAT)
    public ArrayList<NewsItem> getNews() throws IOException {
        String url = "https://www.cinemablend.com/news.php";
        OkHttpClient okHttpClient = new OkHttpClient();
        Request request = new Request.Builder().url(url).get().build();
        Document document = Jsoup.parse(Objects.requireNonNull(okHttpClient.newCall(request).execute().body()).string());
        Elements articles = document.select("div.order-of-type-2").select("div.story-related").select("a");
        for (Element article : articles) {
            String link = article.attr("href");
            String title = article.attr("title");
            String img = article.select("div.story-related-content").select("span.story-cover-image").select("img").attr("data-src");
            String date = article.select("span.story-related-published-date").text();
            NewsItem newsItem = new NewsItem();
            newsItem.imgsrc = img;
            newsItem.title = title;
            newsItem.link = link;
            newsItem.tag = "entertainment";
            newsItem.publisher = "cinemablend.com";
            newsItem.source_logo = "https://image.pitchbook.com/WFQVGYL17V0MevlcfQKlWjC3E8K1447542818374_200x200";
            if (!date.equals("")) {
                newsItem.date = date + " ago";
                news.add(newsItem);
            }
        }
        return news;
    }
}

Answer 1

Score: 1

Look for tutorials explaining how to perform background work. There are lots of different ways to do that: Service, Kotlin Coroutines, simple self-managed Threads, etc.

Just stay away from tutorials about AsyncTasks and Loaders (deprecated).

A good starting point is the Android developer guide: https://developer.android.com/guide/background
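
As a rough illustration of the self-managed thread option, here is a minimal sketch that uses only `java.util.concurrent`, so it also runs outside Android. `BackgroundFetch`, `NewsSource`, `fetchAll` and the `callbackExecutor` parameter are hypothetical names, not part of the question's code, and `String` stands in for `NewsItem`. The point it tries to show is that the calling thread never calls `join()` or `get()`; the merged result is delivered through a callback instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

public class BackgroundFetch {

    /** Hypothetical stand-in for one scraper class such as CinemaBlend. */
    public interface NewsSource {
        List<String> getNews() throws Exception;
    }

    /**
     * Runs every source on the worker pool and hands the merged result to
     * onDone via callbackExecutor. The calling thread is never blocked.
     */
    public static CompletableFuture<Void> fetchAll(List<NewsSource> sources,
                                                   ExecutorService workers,
                                                   Executor callbackExecutor,
                                                   Consumer<List<String>> onDone) {
        List<CompletableFuture<List<String>>> futures = new ArrayList<>();
        for (NewsSource source : sources) {
            futures.add(CompletableFuture.supplyAsync(() -> {
                try {
                    return source.getNews();
                } catch (Exception e) {
                    // A failed site contributes nothing instead of aborting the rest.
                    return new ArrayList<String>();
                }
            }, workers));
        }
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture[0]))
                .thenApplyAsync(ignored -> {
                    List<String> merged = new ArrayList<>();
                    for (CompletableFuture<List<String>> future : futures) {
                        merged.addAll(future.join()); // already complete, so this does not wait
                    }
                    return merged;
                }, workers)
                .thenAcceptAsync(onDone, callbackExecutor);
    }
}
```

On Android, `callbackExecutor` would presumably be an executor that posts to the main thread (for example the one returned by `ContextCompat.getMainExecutor(context)`), so `onDone` is where the adapter would be built and the loading view hidden. Note that `CompletableFuture` is only available from API level 24, which is higher than the KITKAT level the question targets.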

Answer 2

Score: -1

However, doing the work in an AsyncTask creates a similar problem. When I create the same threads in the background, the data is skipped and the UI doesn't get updated. Any suggestions would be greatly appreciated.

public class DownloadNews extends AsyncTask<Void, Void, Void> {
    @Override
    protected void onPreExecute() {
        shimmerFrameLayout.setVisibility(View.VISIBLE);
        relativeLayout.setVisibility(View.GONE);
        tabLayout.setVisibility(View.GONE);
    }

    @RequiresApi(api = Build.VERSION_CODES.KITKAT)
    @Override
    protected Void doInBackground(Void... voids) {
        tabLayout.setOnTabSelectedListener(new TabLayout.OnTabSelectedListener() {
            @Override
            public void onTabSelected(TabLayout.Tab tab) {
                viewPager.setCurrentItem(tab.getPosition());
            }

            @Override
            public void onTabUnselected(TabLayout.Tab tab) {
            }

            @Override
            public void onTabReselected(TabLayout.Tab tab) {
            }
        });
        viewPager.addOnPageChangeListener(new TabLayout.TabLayoutOnPageChangeListener(tabLayout));
        final RequestQueue queue = Volley.newRequestQueue(Objects.requireNonNull(getContext()));
        // Checking the connection
        final StringRequest request = new StringRequest("https://www.google.com/", new Response.Listener<String>() {
            @Override
            public void onResponse(String response) {
                // instances for each required website
                final HimalayanTimes himalayanTimes = new HimalayanTimes(getContext());
                final GsmArena gsmArena = new GsmArena();
                final CinemaBlend cinemaBlend = new CinemaBlend();
                final KathmanduPost kathmanduPost = new KathmanduPost(getContext());
                final GlobalNews globalNews = new GlobalNews();
                final NepaliTimes nepaliTimes = new NepaliTimes(getContext());
                final GoalNepal goalNepal = new GoalNepal(getContext());
                final GadgetByte gadgetByte = new GadgetByte();
                final TechLekh techLekh = new TechLekh();
                final OnlineKhabar onlineKhabar = new OnlineKhabar();
                final NepaliSansar nepaliSansar = new NepaliSansar();
                final CricketingNepal cricketingNepal = new CricketingNepal();
                // thread for each website
                // thread for thehimalayantimes
                Thread thread = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            himalyannews = himalayanTimes.getNews();
                            news.addAll(himalyannews);
                            for (int i = 0; i < 4; i++) {
                                finalHeadlines.add(himalyannews.get(i));
                            }
                        } catch (Exception ignored) {
                        }
                    }
                });
                thread.start();
                // thread for gsmArena
                Thread thread1 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            gsmarenanews = gsmArena.getNews();
                            news.addAll(gsmarenanews);
                            for (int i = 0; i < 3; i++) {
                                headlines.add(gsmarenanews.get(i));
                            }
                        } catch (Exception ignored) {
                        }
                    }
                });
                thread1.start();
                // thread for cinemaBlend
                Thread thread2 = new Thread(new Runnable() {
                    @RequiresApi(api = Build.VERSION_CODES.KITKAT)
                    @Override
                    public void run() {
                        try {
                            cinemablendnews = cinemaBlend.getNews();
                            news.addAll(cinemablendnews);
                            for (int i = 0; i < 4; i++) {
                                headlines.add(cinemablendnews.get(i));
                            }
                        } catch (Exception ignored) {
                        }
                    }
                });
                thread2.start();
                // thread for kathmanduPost
                Thread thread3 = new Thread(new Runnable() {
                    @RequiresApi(api = Build.VERSION_CODES.KITKAT)
                    @Override
                    public void run() {
                        try {
                            kathmandupostnews = kathmanduPost.getNews();
                            news.addAll(kathmandupostnews);
                            for (int i = 0; i < 3; i++) {
                                finalHeadlines.add(kathmandupostnews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread3.start();
                // thread for globalNews
                Thread thread4 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            globalnewsnews = globalNews.getNews();
                            news.addAll(globalnewsnews);
                            for (int i = 0; i < 5; i++) {
                                finalHeadlines.add(globalnewsnews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread4.start();
                // thread for nepaliTimes
                Thread thread5 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            nepalitimesnews = nepaliTimes.getNews();
                            news.addAll(nepalitimesnews);
                            for (int i = 0; i < 3; i++) {
                                finalHeadlines.add(nepalitimesnews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread5.start();
                // thread for GoalNepal
                Thread thread6 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            goalNepalNews = goalNepal.getNews();
                            news.addAll(goalNepalNews);
                            for (int i = 0; i < 4; i++) {
                                headlines.add(goalNepalNews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread6.start();
                // thread for GadgetByteNepal
                Thread thread7 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            gadgetbytenews = gadgetByte.getNews();
                            news.addAll(gadgetbytenews);
                            for (int i = 0; i < 3; i++) {
                                headlines.add(gadgetbytenews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread7.start();
                // thread for Techlekh
                Thread thread8 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            techlekhnews = techLekh.getNews();
                            news.addAll(techlekhnews);
                            for (int i = 0; i < 3; i++) {
                                headlines.add(techlekhnews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread8.start();
                // thread for onlinekhabar
                Thread thread9 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            onlineKhabarnews = onlineKhabar.getNews();
                            news.addAll(onlineKhabarnews);
                            for (int i = 0; i < 4; i++) {
                                finalHeadlines.add(onlineKhabarnews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread9.start();
                // thread for nepalisansar
                Thread thread11 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            nepalisansarnews = nepaliSansar.getNews();
                            news.addAll(nepalisansarnews);
                            for (int i = 0; i < 4; i++) {
                                finalHeadlines.add(nepalisansarnews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread11.start();
                // thread for cricketingNepal
                Thread thread12 = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            cricketnews = cricketingNepal.getNews();
                            news.addAll(cricketnews);
                            for (int i = 0; i < 4; i++) {
                                headlines.add(cricketnews.get(i));
                            }
                        } catch (IOException ignored) {
                        }
                    }
                });
                thread12.start();
            }
        }, new Response.ErrorListener() {
            @Override
            public void onErrorResponse(VolleyError error) {
                Toast.makeText(getContext(), "Internet Connection Error!", Toast.LENGTH_SHORT).show();
                shimmerFrameLayout.setVisibility(View.GONE);
                relativeLayout.setVisibility(View.VISIBLE);
                tabLayout.setVisibility(View.GONE);
            }
        });
        queue.add(request);
        return null;
    }

    @Override
    protected void onPostExecute(Void aVoid) {
        tabLayout.setVisibility(View.VISIBLE);
        shimmerFrameLayout.setVisibility(View.GONE);
        for (NewsItem item : news) {
            if (item.tag.contains("kathmandu"))
                nepal.add(item);
            switch (item.tag) {
                case "nepal":
                    nepal.add(item);
                    break;
                case "world":
                    world.add(item);
                    break;
                case "sports":
                    sports.add(item);
                    break;
                case "tech":
                    tech.add(item);
                    break;
                case "entertainment":
                    entertainment.add(item);
                    break;
            }
        }
        // putting each news item to the main container
        Collections.shuffle(headlines);
        Collections.shuffle(finalHeadlines);
        finalHeadlines.addAll(headlines);
        Collections.shuffle(nepal);
        Collections.shuffle(world);
        Collections.shuffle(sports);
        Collections.shuffle(tech);
        Collections.shuffle(entertainment);
        tab1 t1 = new tab1(finalHeadlines);
        t1.setRetainInstance(true);
        tab2 t2 = new tab2(nepal);
        t2.setRetainInstance(true);
        tab3 t3 = new tab3(world);
        t3.setRetainInstance(true);
        tab4 t4 = new tab4(sports);
        t4.setRetainInstance(true);
        tab5 t5 = new tab5(tech);
        t5.setRetainInstance(true);
        tab6 t6 = new tab6(entertainment);
        t6.setRetainInstance(true);
        shimmerFrameLayout.setVisibility(View.GONE);
        assert getFragmentManager() != null;
        pagerAdapter = new PageAdapter(finalHeadlines, nepal, world, sports, tech, entertainment, getFragmentManager(), tabLayout.getTabCount());
        viewPager.setAdapter(pagerAdapter);
    }
}

The code for the adapter is the same as above.
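
One likely reason the data looks skipped in the `DownloadNews` code above is that `doInBackground` returns as soon as the Volley request is queued, so `onPostExecute` can run before any scraper thread has added its items; the threads also write to the same `ArrayList` without synchronization. A possible way around both, sketched here under assumptions, is to let the background method itself wait for all scrapers and return the merged list. `CollectAll` and `collect` are hypothetical names, `String` stands in for `NewsItem`, and the sketch uses only plain `java.util.concurrent` classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CollectAll {

    /**
     * Blocks the calling (background) thread until every task has finished,
     * then returns the merged items. Tasks that throw are skipped.
     */
    public static List<String> collect(List<Callable<List<String>>> tasks, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> merged = new ArrayList<>();
            // invokeAll returns only after all tasks have completed.
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                try {
                    // Only the calling thread touches 'merged', so no locking is needed.
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // This site failed; keep the results from the others.
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

A method like this is meant to be called from a background thread only; calling it on the main thread would freeze the UI in the same way as the `join()` calls in the question. The returned list can then be categorized and shown once control is back on the main thread.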

Posted on September 22, 2020, 21:19:49
When reposting, please keep the link to this article: https://go.coder-hub.com/64010644.html
