Post by account_disabled on Mar 11, 2024 9:12:39 GMT 3.5
These would have been crawled slowly over time, and their domains would likely have been assigned a very low DA because their linkage with other domains in the index would be minimal. But because we had a DNS outage that caused a large number of high-quality domains to be banned, we replaced them in the schedule with a lot of low-quality domains from the .pw and .cn TLDs for a period of days. These domains, though not connected to other domains in the index, were highly connected to each other.
Thus, when an index was generated with this information, a significant percentage of these domains gained enough DA to make the bug in scheduling non-benign. With lots of low-quality domains now available for scheduling, we used up a significant percentage of our crawl budget on low-quality sites. This made our crawl of high-quality sites shallower, and because the low-quality sites were either dead or very slow to respond, it also reduced the total number of pages actually crawled. Another side effect was the shape of the domains we crawled.
As noted above, domains with the .pw and .cn TLDs seem to have a different linking strategy, both externally to one another and internally to themselves, compared with North American and European sites. This unexpected data shape created hot spots in our processing cluster, which increased the time required to process the data. We fixed the originally benign bug in scheduling. This was a two-line code change to make sure that domains were correctly categorized by their Domain Authority (DA). We use DA to determine how deeply to crawl a domain. During this year we have increased our crawler fleet and added some extra checks in the scheduler.
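To make the idea of categorizing domains by DA and crawling them to a matching depth concrete, here is a minimal sketch. It is not the actual scheduler: the tier thresholds, the page limits, and the names `DEPTH_TIERS`, `Domain`, `crawl_depth`, and `schedule` are all hypothetical, chosen only to show how a DA-based depth limit and a shared crawl budget could interact.

```python
from dataclasses import dataclass

# Hypothetical DA tiers as (minimum DA, max pages to crawl per domain).
# These numbers are illustrative assumptions, not values from the post.
DEPTH_TIERS = [
    (60, 10_000),
    (30, 1_000),
    (10, 100),
    (0, 10),
]


@dataclass
class Domain:
    name: str
    da: float  # Domain Authority, assumed to be on a 0-100 scale


def crawl_depth(da: float) -> int:
    """Return the page limit for a domain with the given DA."""
    if not 0 <= da <= 100:
        raise ValueError(f"DA out of range: {da}")
    # Tiers are ordered from highest to lowest threshold, so the
    # first match is the correct category.
    for min_da, depth in DEPTH_TIERS:
        if da >= min_da:
            return depth
    return 0  # not reached for valid DA, kept for completeness


def schedule(domains: list[Domain], budget: int) -> dict[str, int]:
    """Spend a shared page budget on domains, highest DA first."""
    plan: dict[str, int] = {}
    for domain in sorted(domains, key=lambda d: d.da, reverse=True):
        pages = min(crawl_depth(domain.da), budget)
        if pages == 0:
            break  # budget exhausted
        plan[domain.name] = pages
        budget -= pages
    return plan
```

Under these assumptions, a domain that is wrongly placed in a higher tier takes budget that would otherwise go to correctly ranked sites, which is the kind of effect described above when many low-quality domains gained DA.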