I'm making a Web Crawler and I just found out that one of my methods, GetHTML, is very slow because it uses a StreamReader to get a string of the HTML out of the HttpWebResponse object.
Here is the method:
static string GetHTML(string URL)
{
    HttpWebRequest Request = (HttpWebRequest)WebRequest.Create(URL);
    Request.Proxy = null;
    HttpWebResponse Response = ((HttpWebResponse)Request.GetResponse());
    Stream RespStream = Response.GetResponseStream();
    return new StreamReader(RespStream).ReadToEnd(); // Very slow
}
I made a test with Stopwatch and used this method on YouTube.
Time it takes to get an HTTP response: 500 ms
Time it takes to convert the HttpWebResponse object to a string: 550 ms
So the HTTP request is fine, it's just the ReadToEnd() that is so slow.
Is there any alternative to the ReadToEnd() method to get an HTML string from the response object? I tried using the WebClient.DownloadString() method, but it's just a wrapper around HttpWebRequest that uses streams too.
EDIT: Tried it with Sockets and it's much faster:
static string SocketHTML(string URL)
{
    string IP = Dns.GetHostAddresses(URL)[0].ToString();
    Socket s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    s.Connect(new IPEndPoint(IPAddress.Parse(IP), 80));
    s.Send(Encoding.ASCII.GetBytes("GET / HTTP/1.1\r\n\r\n"));
    List<byte> HTML = new List<byte>();
    int Bytes = 1;
    while (Bytes > 0)
    {
        byte[] Data = new byte[1024];
        Bytes = s.Receive(Data);
        foreach (byte b in Data) HTML.Add(b);
    }
    s.Close();
    return Encoding.ASCII.GetString(HTML.ToArray());
}
The problem with using Sockets, though, is that most of the time it returns errors such as "Moved Permanently" or "Your browser sent a request that the server could not understand".
When I call this method but return String.Empty instead of the ReadToEnd result, the method takes about 500 ms.
All that says is that starting to get the response takes 500ms. Calling GetResponseStream doesn't consume all the data. ReadToEnd will also be doing conversion from the binary data to text, but I doubt that's significant - I strongly suspect it's just waiting for the data to arrive over the network. To verify that, you should add logging to every aspect of your code and run Wireshark - you should then be able to see packet-by-packet when the data arrives, and correlate it with the logging.
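As a rough sketch of what that logging could look like, the helper below reads the stream in chunks and logs the elapsed time as each chunk arrives, so the timestamps can be lined up against a Wireshark capture. The names ReadTiming and ReadWithTimings are hypothetical, the 8 KB buffer size is arbitrary, and decoding as UTF-8 is an assumption (the real encoding should come from the response headers).

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

static class ReadTiming
{
    // Hypothetical helper: reads the whole stream in chunks and reports,
    // via the supplied log callback, when each chunk was received.
    public static string ReadWithTimings(Stream stream, Action<string> log)
    {
        var stopwatch = Stopwatch.StartNew();
        var buffer = new byte[8192]; // arbitrary chunk size
        using (var collected = new MemoryStream())
        {
            int read;
            // Read returns 0 once the end of the stream has been reached.
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                collected.Write(buffer, 0, read);
                log(string.Format("{0} ms: received {1} bytes ({2} total)",
                    stopwatch.ElapsedMilliseconds, read, collected.Length));
            }
            // Assumption: the body is UTF-8 encoded.
            return Encoding.UTF8.GetString(collected.ToArray());
        }
    }
}
```

You would call it with something like ReadTiming.ReadWithTimings(response.GetResponseStream(), Console.WriteLine). If the log lines are spread across most of the 550 ms, the time is going on waiting for the network rather than on decoding the text.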
As a side issue, you should definitely have a using statement for the response:
using (var response = (HttpWebResponse)Request.GetResponse())
{
    // The stream will be disposed when the response is.
    return new StreamReader(response.GetResponseStream())
        .ReadToEnd();
}
If you don't dispose of the response, you'll tie up connections until the garbage collector finalizes them. That can lead to timeouts.