  • Bradley Kronson 58 posts 109 karma points
    May 27, 2014 @ 14:53

    Feature request: store URLs internally as hashes?

    We have a large web site with a lot of legacy URLs, many of which are stored in RSS feeds. The URL Tracker quickly racks up rows; this week it hit over 1.5 million, partly of course because no-one has done the obvious thing and fixed the links in the admin interface.

    The large number of rows has a huge performance impact on the site - SQL Server was averaging 40% CPU usage, mainly due to queries hitting the URL Tracker table. There are no indexes, and there can't be, due to the size of the URL string fields.

    What I wanted to suggest is storing an extra table field containing a hash of the URL being added - that way it can be indexed and the queries should speed up significantly. What do you think? You could still store the original URLs for reference and display, but any searches against the data would use the hash value instead.

  • Stefan Kip 1606 posts 4098 karma points c-trib
    Jun 04, 2014 @ 17:14

    Sounds like a good suggestion to me... I haven't done it before. Would you like to help out?

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 08:01

    Sure :)

  • Stefan Kip 1606 posts 4098 karma points c-trib
    Jun 05, 2014 @ 10:28

    How? Have you done this in the past?

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:05

    Creating the hash is pretty simple - SQL Server's CHECKSUM function can be used to do this when adding the rows (it returns an int):

    http://msdn.microsoft.com/en-us/library/ms189788.aspx

    So (in my head):
    - add a new field containing the hash, generated when saving records to the table;
    - add an index to that field;
    - change all existing code in the package to search on the hash instead of the URL (index-based searching on the INT hash would be materially faster than the row-by-row scan and comparison of URL strings that happens currently) - a rough sketch follows.
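
    A minimal T-SQL sketch of the idea (the computed-column approach and the OldUrlHash / IX_icUrlTracker_OldUrlHash names are my own illustration; the package could just as well compute the hash in code when saving records, as described above):

    -- Persist an INT hash of the URL; CHECKSUM is deterministic, so the
    -- computed column can be persisted and indexed.
    ALTER TABLE icUrlTracker ADD OldUrlHash AS CHECKSUM(OldUrl) PERSISTED;

    -- Index the narrow INT hash instead of the wide nvarchar column.
    CREATE NONCLUSTERED INDEX IX_icUrlTracker_OldUrlHash
        ON icUrlTracker (OldUrlHash);

    -- Look up by hash first, then compare the actual URL to rule out
    -- hash collisions.
    DECLARE @incomingUrl nvarchar(1000) = N'/some/legacy/url';
    SELECT *
    FROM icUrlTracker
    WHERE OldUrlHash = CHECKSUM(@incomingUrl)
      AND OldUrl = @incomingUrl;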

  • Stefan Kip 1606 posts 4098 karma points c-trib
    Jun 05, 2014 @ 11:14

    How about just adding an index to the OldUrl field? That might give a nice performance boost (I'd have to test performance to know)...?

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:24

    Multiple hashes, or hashes based on more than one field, can also be created if different types of searching are required.
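
    CHECKSUM accepts a list of expressions, so a single hash can cover several fields at once. A quick illustration (the second column name is hypothetical, purely for the example):

    -- One combined INT hash over two columns; indexing this supports
    -- lookups that match on both values together.
    SELECT CHECKSUM(OldUrl, RedirectRootNodeId) AS CombinedHash
    FROM icUrlTracker;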

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:28

    I tried that :)

    The problem is the fields are too big. SQL Server has a maximum index key size of 900 bytes. With the URLs stored as nvarchar, each character takes two bytes, so you would need to drop the maximum length of the URLs from 1000 to 450 characters - and that could cause issues with URLs then being too long to store in the field....
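
    To illustrate the limit (the index name is just for the example): SQL Server will actually create an index on the wide column, but only with a warning, and any row whose key value exceeds 900 bytes will then fail to insert or update.

    -- OldUrl is nvarchar(1000), i.e. up to 2000 bytes - over the 900-byte key limit.
    -- SQL Server creates the index but warns that the maximum key length is
    -- 900 bytes and that inserts/updates of longer values will fail.
    CREATE NONCLUSTERED INDEX IX_icUrlTracker_OldUrl
        ON icUrlTracker (OldUrl);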

  • Stefan Kip 1606 posts 4098 karma points c-trib
    Jun 05, 2014 @ 11:29

    Ah okay, well I'll try this when I've got some spare time and do some performance tests. Thanks for the help!

  • Stefan Kip 1606 posts 4098 karma points c-trib
    Jun 05, 2014 @ 11:32
  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:42

    The only thing I would say there is that CHECKSUM might be better than HASHBYTES, because you'd be storing an INT rather than a VARBINARY, and searching and indexing an INT should be faster than a VARBINARY.
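
    A minimal side-by-side sketch of the two (not the package's actual code):

    -- CHECKSUM returns an INT (4 bytes): cheap to index and compare,
    -- but different strings can produce the same value.
    SELECT CHECKSUM(N'/some/legacy/url') AS IntHash;

    -- HASHBYTES returns VARBINARY (20 bytes for SHA1): far fewer
    -- collisions, but a wider index key and slower comparisons.
    SELECT HASHBYTES('SHA1', N'/some/legacy/url') AS BinaryHash;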

    Good luck, looking forward to seeing it, shout if you want any assistance :)

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:43

    I suppose it's a trade-off between the reliability of the unique hashes and performance....

  • Stefan Kip 1606 posts 4098 karma points c-trib
    Jun 05, 2014 @ 11:44

    For performance, CHECKSUM is absolutely better, but CHECKSUM(N'google.com') and CHECKSUM(N'g-oogle.com') both give the same result, which is wrong in this situation.
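
    That collision can be checked directly; a hash match alone can't prove the URLs are identical, so any lookup would still need to compare the stored URL after the hash matches.

    -- Both expressions return the same checksum value (the collision
    -- described above), so the hash must be backed by a URL comparison.
    SELECT CHECKSUM(N'google.com')  AS A,
           CHECKSUM(N'g-oogle.com') AS B;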

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:55

    Yup :)

  • Stefan Kip 1606 posts 4098 karma points c-trib
    Jun 06, 2014 @ 17:35

    I just ran some tests with the index from this pull request added to the icUrlTracker table.
    The results are awesome! With a clean table, a request for a non-existing URL took 70ms. After adding 2,000,000 404 records, the load time of the page went up to ~800ms, which sucks :D
    After adding the index as suggested in the PR, the load time went back down to ~70ms. That's an awesome performance increase imo.
    So I'll release a new version with this index and call it a day :-)

  • Bradley Kronson 58 posts 109 karma points
    Jun 06, 2014 @ 19:48

    BRILLIANT news, thanks for taking the time to check it out and implement, looking forward to upgrading! :D
